We have a vision of a Network Compute Fabric where the lines between networking and computing disappear. On the journey there, edge cloud computing provides a critical stepping-stone where computing is pushed very close to where it is needed. This distribution of computing capabilities in the network creates new challenges for its management and operation. We argue that a data-centric approach that extensively uses artificial intelligence (AI) and machine learning (ML) technologies to realize specific management functions is a good candidate to tackle these challenges.
As can be seen in Figure 1, edge computing services can be provided through compute/storage resources at different locations in a network, such as on-premises at a customer/enterprise site (industrial control, for example) or at access and local/regional sites (telco operators, for example).
As the stepping stone towards the network compute fabric, the use and popularity of edge computing is growing rapidly in the industry today, driven by new applications and application areas, such as IoT and Industry 4.0, 5G, extended reality (XR) and smart devices being used as a platform for innovating new types of services and applications. Ericsson’s 5G market compass predicts that 25 percent of 5G use cases will rely on edge computing by 2023.
Challenges in managing edge cloud
Existing cloud management solutions mostly assume that the cloud platform runs on a large, highly homogeneous pool of hardware resources managed by system administrators available 24/7. However, for edge cloud environments, the situation is different. Let’s take a closer look at these key points:
- Constrained and limited resources
- Heterogeneity and dynamicity
- High performance and reliability
- Need for human intervention
Constrained and limited resources
An edge infrastructure generally has few hardware resources on which it needs to run both its management operations and the cloud application it hosts. Therefore, one of the biggest challenges for edge clouds is the efficient use of limited computing and storage resources.
Heterogeneity and dynamicity
The requirements applications put on edge cloud environments vary significantly and in several dimensions: space (for example downtown vs. a residential area), time (4am vs. 9am, for example) and type of application (let’s say IoT vs. gaming). In commercial areas, there will be a large number of vehicle-related workloads during rush hour, while in residential areas, streaming-type loads will generally be more prevalent outside of working hours. Quite a few Industry 4.0 applications require reliable low-latency communication, while best-effort services are acceptable for many non-critical IoT applications. To meet these requirements, edge cloud environments will be heterogeneous, with different types and sizes of hardware and software deployed at each location. As a consequence, the cloud management platform for edge clouds should be smart and self-adapting. Specifically, it should be capable of managing edge data centers with heterogeneous HW/SW sizes and configurations, of handling any application thrown at it, and of provisioning the correct HW/SW resources for each application at the right edge site.
High performance and reliability
One key selling point of edge clouds is the potential for lower latency compared to central clouds due to lower network delays. To take advantage of this, applications must be designed for the edge. One example is kernel bypass using DPDK, which moves packet processing out of the operating system kernel to cut latency. Beyond application design, there will also be strict requirements on fast detection, analysis and remedy of problems and issues.
When it comes to managing any hardware and software infrastructure, failures are the norm. This means that the hardware and software components that make up an edge cloud platform are expected to fail sooner or later. These failures can negatively impact the deployed applications and may lead to performance degradation and violation of SLAs. In some cases, the problems could propagate from one system to another in an edge site, or even from one edge site to another. Existing fault management solutions are mostly limited to raising alerts based on simple thresholds and often require an administrator’s help to handle the fault – often a very slow process.
In the next step, once a problem is discovered, manual troubleshooting can take time given the complexity and the size of current cloud platforms and applications. In edge cloud environments, a problem that affects application performance and risks violating SLAs needs to be detected, analyzed and remedied in a timely manner.
Need for human intervention
The infrastructure on which edge cloud platforms run is expected to be deployed across a large number of sites, some of them in remote areas. This makes it difficult for a small team of system administrators to physically access each site on a regular basis, which means that it is not possible to rely on the availability of a human administrator if and when a problem occurs or when a reconfiguration is needed. Therefore, the edge cloud management platform should be self-managing, performing most maintenance operations and handling basic problems autonomously, only requiring human intervention for critical issues beyond its capabilities.
A new breed of management solutions
The conclusion from the analysis above is that traditional management approaches and techniques used on centralized data center-based cloud platforms are not sufficient or suitable for edge cloud environments. There is a crucial need for a new breed of management solutions, capable of addressing the new challenges. Such solutions, beyond simplifying the operation of edge clouds, are also critical to ensure a smooth transition to the network compute fabric future.
AI/ML for intelligent edge management
Artificial intelligence and machine learning (AI/ML) add the ability to extract knowledge from large amounts of data to implement intelligent solutions. An AI/ML-enabled edge cloud management platform can make data-driven inferences, predictions and decisions based on what it was able to learn from existing data with limited human intervention. This leads to quick and near-optimal solutions to many kinds of problems, including those outlined in the previous section.
ML methods can be classified mainly into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, as shown in Figure 2.
- Supervised learning methods use labeled training datasets to create models that map the input data to the desired output. This function can then be applied to new input data to infer the output.
- Unsupervised learning methods use unlabeled training datasets to create models that capture some characteristics of the system.
- Semi-supervised learning methods use a small amount of labeled data, supplemented by a comparatively large amount of unlabeled data.
- Reinforcement learning methods are used to train an agent that takes actions in an environment based on its observations. This is based on an iterative process that uses the feedback from the environment to learn the correct sequence of actions and to maximize a cumulative reward.
When it comes to adopting AI/ML to edge cloud management, we have identified a few particularly relevant techniques.
- Transfer learning is a technique where knowledge from previously trained models is leveraged when training new models. The key motivation for transfer learning is the fact that most models that solve complex problems need a lot of data and computation power for training. Using transfer learning, it is possible to reduce unnecessary data collection and training effort by using pre-trained models to solve new problems.
- Distributed learning is another technique that refers to multi-node machine learning algorithms and systems that are designed to improve performance, increase accuracy, and scale to larger input data sizes. It can be used to increase parallelization when the training data required for sophisticated applications is in the order of terabytes, or when the data is inherently distributed or too big to store at a central location.
- Federated learning is a technique used to train a shared model under the coordination of a central server from a federation of participating devices. The goal is to avoid data leakage by keeping personal data on the users’ devices, thus alleviating their privacy concerns.
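The federated learning idea above can be sketched in a few lines. This is a toy illustration, assuming each client fits a one-parameter linear model on its own private data and shares only the learned weight with the server; all names and numbers are invented for the example:

```python
# Toy federated averaging (FedAvg) sketch: each client fits y = w * x on
# its private data and shares only the weight w, never the raw samples.

def local_train(w, data, lr=0.01, epochs=50):
    """One client's local training: gradient descent on squared error."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, client_datasets):
    """Server sends the global weight out and averages the returned updates."""
    local_weights = [local_train(global_w, d) for d in client_datasets]
    return sum(local_weights) / len(local_weights)

# Two clients whose private data both follow y = 2x (different samples).
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]

w = 0.0
for _ in range(5):
    w = federated_round(w, clients)
print(round(w, 2))  # converges towards 2.0
```

Each round, only model weights cross the network; the raw samples stay on the clients, which is exactly the privacy property federated learning targets.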
How is AI/ML applied in cloud management?
AI/ML techniques have already been applied to several problem areas in (edge) cloud management. Below are a few highlights:
Anomaly detection is used to identify patterns in data that differ significantly from most of the data or from the expected behavior. Based on the extent to which the labels (normal or anomalous) are available, anomaly detection techniques can operate in supervised, semi-supervised, or unsupervised modes. Anomaly detection can automate fault management and security management operations. It can be used for automated problem identification and for localizing the root cause of faults or security incidents.
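As a toy illustration of the unsupervised case, a simple deviation-from-the-mean rule already captures the core idea; real systems would use more robust models, but the principle is the same: flag points that differ strongly from learned normal behavior. The data below is invented:

```python
# Toy unsupervised anomaly detection on latency samples collected from
# an edge node: flag points far from the mean of the sample window.
from statistics import mean, stdev

def detect_anomalies(samples, threshold=2.0):
    """Return indices of samples more than `threshold` std devs from the mean."""
    mu, sigma = mean(samples), stdev(samples)
    return [i for i, x in enumerate(samples) if abs(x - mu) > threshold * sigma]

# Latency (ms) reported by an edge node; one obvious spike at index 5.
latencies = [10.2, 10.5, 9.8, 10.1, 10.3, 42.0, 10.0, 9.9]
print(detect_anomalies(latencies))  # -> [5]
```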
Clustering groups data points that are similar according to a given metric. When the dataset is small, it can be clustered by manually labeling every instance, but for larger datasets, automatic labeling of the instances is needed. Clustering can be applied to performance management operations such as workload placement and scheduling. It can also be used for workload characterization, identifying common groups (clusters) of tasks in the workloads, to efficiently utilize the edge cloud nodes’ resources.
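As an illustration of workload characterization, the following sketch clusters tasks by a single made-up feature (CPU seconds used) with a tiny one-dimensional k-means; a real system would use many features and a library implementation:

```python
# Toy 1-D k-means: group tasks into "short" and "long" classes by CPU
# usage, so a scheduler could pack the two classes differently.

def kmeans_1d(points, k=2, iters=10):
    # Spread the initial centroids across the sorted data.
    centroids = sorted(points)[::max(1, len(points) // k)][:k]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# CPU seconds per task (illustrative): three short tasks, three long ones.
cpu_seconds = [0.2, 0.3, 0.25, 8.0, 9.5, 7.8]
centroids, clusters = kmeans_1d(cpu_seconds)
print(sorted(round(c, 2) for c in centroids))  # two clear group centers
```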
Classification aims at categorizing unknown data points into categories discovered during training. This is the supervised equivalent of clustering. Classification can be applied to fault detection operations, for instance, classifying data into either normal or indicative of a faulty state.
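A minimal illustration of such a fault classifier, using a nearest-centroid rule on labeled metric vectors; the features and numbers are invented for the example:

```python
# Toy fault classification: learn one centroid per class from labeled
# (cpu_load, error_rate) vectors, then assign new points to the nearest.

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(vals) / len(points) for vals in zip(*points)]

def train(labeled):
    """labeled: {label: [feature vectors]} -> {label: centroid}."""
    return {lbl: centroid(pts) for lbl, pts in labeled.items()}

def classify(model, x):
    """Return the label whose centroid is closest (squared distance)."""
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda lbl: dist(model[lbl]))

model = train({
    "normal": [(0.3, 0.01), (0.4, 0.02)],
    "faulty": [(0.9, 0.30), (0.95, 0.25)],
})
print(classify(model, (0.88, 0.28)))  # -> "faulty"
```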
Regression is concerned with modeling the relationship between variables. It estimates how dependent variables change in value when other variables change by a certain amount. Regression can make fault management operations intelligent. For instance, it can be formulated to predict hardware or software failures in an edge cloud system. It can also be applied to performance management operations such as predicting the future resource demands of a workload.
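For example, predicting the next value of a resource metric can be sketched with ordinary least squares; the usage numbers below are illustrative, and production systems would use richer models that account for seasonality:

```python
# Toy regression for capacity planning: fit a line to recent memory
# usage and extrapolate one interval ahead.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Memory usage (GB) over the last six 5-minute intervals.
usage = [4.0, 4.5, 5.1, 5.4, 6.0, 6.6]
a, b = fit_line(list(range(len(usage))), usage)
forecast = a * len(usage) + b  # predicted usage at the next interval
print(round(forecast, 2))
```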
Reinforcement Learning (RL) can automate fault management operations. One example is suggesting the most appropriate remediation procedures for faults – both detected and predicted ones. Another example is continuous monitoring of an edge cloud system. RL can also determine the best way to adjust the resource configuration for a particular workload.
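The remediation idea can be illustrated with tabular Q-learning in a toy environment; the states, actions and rewards below are invented for the example and stand in for real telemetry and remediation procedures:

```python
# Toy Q-learning: an agent learns that restarting a degraded service is
# the rewarding remediation action. All states/rewards are illustrative.
import random

random.seed(0)
states, actions = ["degraded", "healthy"], ["restart", "wait"]
Q = {(s, a): 0.0 for s in states for a in actions}

def step(state, action):
    """Toy environment: restarting fixes a degraded service."""
    if state == "degraded" and action == "restart":
        return "healthy", 1.0
    if state == "degraded":
        return "degraded", -1.0
    return "healthy", 0.0

alpha, gamma, eps = 0.5, 0.9, 0.1
for _ in range(200):
    s = random.choice(states)
    for _ in range(5):
        # Epsilon-greedy action selection.
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in actions)
                              - Q[(s, a)])
        s = s2

best = max(actions, key=lambda a: Q[("degraded", a)])
print(best)  # the learned policy prefers "restart" when degraded
```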
Building blocks of an intelligent edge
We have described how AI/ML techniques can already be applied (and are being applied) today to solve many of the operations and management challenges of both central and edge cloud environments. However, these solutions are developed on a case-by-case basis where each implementation needs to collect its own data, implement a processing pipeline, train models and carry out inferencing. This way of developing solutions has several drawbacks.
First, the effort associated with the management of such solutions increases with their number, making it arduous to manage a large number of such systems.
Second, developing such solutions takes a lot of time, and the barrier to entry for developers is high, since all the parts need to be put in place.
Finally, the approach is wasteful in that it results in duplication of data (and hence wasted storage and processing resources) as well as duplication of efforts (wasted time in writing code, wasted resources in training models, and so on).
We believe that to be able to develop AI/ML driven solutions efficiently and with minimal effort, especially for edge cloud platforms, there is a need for a new unified framework or architecture. This architecture should fulfill the following requirements in order to simplify the development of intelligent edge management solutions.
- It should allow continuous collection of the data generated by various hardware and software components of edge clouds, providing the raw data needed for building AI/ML models and functions. The data should be stored properly and be accessible by all applications if needed.
- It should support scalable data processing and management. The amount of data collected from edge sites can grow rapidly and the data can be in various formats. The data needs to be processed so that it is relevant and in the right format, dimension and range, allowing the AI/ML models to perform training and inferencing correctly.
- It should support ML/AI model management such as the model lifecycle management, including training and inferencing. It should also enable various training/learning mechanisms, like central learning, distributed learning and ensemble learning.
- It should provide flexible interfaces for developers to easily develop and customize AI/ML based edge cloud management functions, promoting sharing of data, functions and models.
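To make the last requirement concrete, here is a hypothetical sketch of what a developer-facing interface of such a framework could look like. None of these class or function names come from an existing product; the point is the shape: shared data access, pluggable processing, and model inference behind one API.

```python
# Hypothetical developer interface: a management function is assembled
# from reusable collect / preprocess / infer parts that can be shared.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ManagementFunction:
    """One intelligent management function built from pluggable parts."""
    name: str
    collect: Callable[[], list]          # pulls raw data (shareable)
    preprocess: Callable[[list], list]   # cleans / reshapes the data
    infer: Callable[[list], Any]         # trained model's inference step

    def run(self):
        return self.infer(self.preprocess(self.collect()))

# Toy usage: an "overload detector" composed from simple parts.
detector = ManagementFunction(
    name="overload-detector",
    collect=lambda: [0.2, 0.4, 0.95, 0.3],          # CPU load samples
    preprocess=lambda xs: [x for x in xs if x >= 0],  # drop bad readings
    infer=lambda xs: max(xs) > 0.9,                 # "model": threshold rule
)
print(detector.run())  # -> True
```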
We believe that these new management functionalities need to be deployed both at a central site and geographically distributed across edge sites. These functionalities include:
- Data Monitoring: provides mechanisms to retrieve the raw data generated by the edge clouds and needed to build the AI/ML models and, as a consequence, the management functions. The main responsibility of data monitoring is to collect and store both the streaming and the static data generated by different sources in the managed edge cloud, for example, hardware and software components. Data monitoring also provides several functions to process and manage the data, including data collection, data movement (like extract, transform, load), data integration and data storage. One of the key functions is data transformation – unifying the data format to be used by AI/ML models.
- Data Management: manages the generated data from the managed edge cloud to ensure the quality of data that is the input data of the AI/ML models. The better the quality of the input data, the better the quality of the intelligent operations performed by the AI/ML models. For example, the collected data should be managed to extract the relevant features to be used for training the AI/ML models to have a better accuracy rate and reduced execution times. This component also maintains the preprocessed data to encourage data reuse.
- Model Management: provides functions to manage the AI/ML models that implement the ‘intelligence’ of the management operations. Model management includes the different steps that are required to build the AI/ML models, such as training, evaluation, deployment, storing and monitoring of the AI/ML models. Edge clouds change dynamically, so model management must adapt continuously to data changes to ensure that the AI/ML models keep performing optimally. For example, when generated edge data experiences data drift, the AI/ML model can be retrained to maintain the expected performance.
- Intelligent Operation: offers a generic execution environment or runtime for running non-AI/ML code that is specific to the solution being implemented. Intelligent operation also provides interfaces for developers to consume the above functions and to control the deployment of the code at different locations in the distributed edge cloud environment.
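The retraining trigger mentioned under Model Management can be sketched as a simple control loop. The drift test here is a naive mean-shift check on invented numbers, standing in for a proper statistical test:

```python
# Toy drift check: compare the metric distribution seen at training time
# with what the edge site reports now; on drift, trigger retraining.
from statistics import mean

def drift_detected(train_window, live_window, tolerance=0.2):
    """Flag drift when the live mean moves more than `tolerance` (relative)."""
    base = mean(train_window)
    return abs(mean(live_window) - base) > tolerance * abs(base)

train = [10.0, 11.0, 9.5, 10.5]   # metric distribution at training time
live = [14.0, 15.5, 14.8, 15.2]   # what the edge site reports now

if drift_detected(train, live):
    print("retrain")  # hand off to the model-management training step
```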
In our next post, we will present a proposed management architecture for such an intelligent edge cloud and demonstrate its capability through several use-cases.
Read our white paper, Edge computing and deployment strategies for communication service providers.
Learn more about our research on network compute fabric.
Read our blog post, What is computing fabric in the network?