In an era defined by the ever-accelerating pace of technological advancement, the management and optimisation of complex telecommunication networks have become pivotal for industries ranging from critical infrastructures such as transport and energy, to finance, healthcare, and beyond. At the heart of this transformation lies the burgeoning field of real-time telemetry analytics, which serves as the linchpin for ensuring the reliability, performance, and security of networked systems. In this age of digital interconnectedness, where vast streams of data flow through networks at unprecedented rates, the need for intelligent and agile monitoring and service assurance solutions has never been more pressing.
In the rapidly evolving landscape of networking, the convergence of Artificial Intelligence (AI) and real-time telemetry analytics is reshaping the way we perceive and optimise network operations. The range of Machine Learning (ML) approaches becomes a compass guiding us toward network quality and optimisation. Every millisecond counts, and insights that drive operational excellence can save users millions of euros in costs.
With its ability to rapidly process vast quantities of streaming data, identify subtle and often hidden patterns, and make informed, intelligent decisions in real time, AI stands poised to elevate telemetry analytics to new levels of effectiveness. This convergence of AI and telemetry not only holds the potential to uncover hidden problems and generate valuable insights within the labyrinth of network data, but also to respond pre-emptively to emerging issues by mitigating problems before customers notice any impact on the service, thus redefining the way industries operate and innovate.
Let's embark on a journey into the AI-driven future of real-time telemetry analytics, delving deep into the synergistic relationship between AI/ML, telemetry, and networks.
1. Telemetry Analytics: What, How, Why
1.1. What is network telemetry analytics?
Telemetry analytics is a critical component of today's data-driven decision-making process. It involves the collection, transmission, and analysis of data from remote sources, enabling organisations to track, analyse, and make informed decisions about their systems and processes.
Telemetry is essentially the automated measurement and transmission of data from sensors, devices, or instruments for further analysis and interpretation. Telemetry analytics takes this raw data and transforms it into useful information, knowledge, insights, and wisdom, following the levels of the DIKW (Data, Information, Knowledge, Wisdom) pyramid.
Network telemetry provides the first step to answering the critical question: “What is happening on my network?” Telemetry systems can collect, transmit and measure data from remote sources using sensors and other data collection devices, thereby providing network engineers with invaluable insights into the performance, security, and efficiency of their infrastructure.
1.2. How Is Telemetry Analytics Done Now?
The majority of modern telemetry acquisition solutions on the market work either with delayed data delivered via data buses (e.g. Apache Kafka or the proprietary J2EE-based TIBCO Cloud™ Integration Powered by BusinessWorks) or with data stored in various data lakes built on non-SQL databases (e.g. Apache Hadoop). These data lakes can also be used as offline data sources, as can various bulk-statistics data file dumps.
In contrast, streaming telemetry data is collected in real time, or near real time, from remote sources by picking data up directly from the sensors or monitoring devices with highly specialised data collectors or loggers. These sources can be distributed across vast geographical areas, making data collection a challenging but essential task. Once collected, telemetry data is transmitted to a central repository via communication protocols. Common transmission methods include satellite communication, cellular networks (4G/5G), radio frequencies (e.g. LoRa), and, of course, the internet for devices attached directly via the large family of TCP/IP protocols.
Telemetry data is typically stored in secure databases or data warehouses where it can be accessed and retrieved for analysis. Analysis of this data can be carried out using various methods, such as statistical analysis, machine learning algorithms and data visualisation tools. The goal is to identify patterns, anomalies, and trends in the data. An important component is also the visualisation of information obtained from telemetry data. Data visualisation helps make complex information easier to understand for Network Operators and can be presented in the form of dashboards, charts and reports.
Several widespread approaches exist for spotting and reacting to dangerous patterns in the telemetry of telecommunication networks and IT infrastructures. The most common one is rule-based data preparation: each message is parsed and converted to a structured form or schema. This involves creating a set of predefined rules or conditions that guide the analysis and interpretation of telemetry data. For example, Elastic Common Schema (ECS) is an open source specification developed with support from the Elastic user community and used by the ELK stack. These rules are typically based on domain knowledge and broad user expertise, and this approach “dictates” how the system should respond to specific telemetry patterns or events.
Rule-based systems are highly effective for automating responses to simple, repetitive and known issues. For example, a rule-based approach can be used to define rules that trigger an alert when network traffic exceeds a certain predefined out-of-bounds threshold, indicating a potential issue with network congestion. Similarly, rule-based systems can be employed for security purposes, where specific patterns or “fingerprints” of network traffic may indicate a security breach or an unauthorised access attempt.
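The threshold rule described above can be sketched in a few lines. This is a minimal illustration rather than any product's rule engine: the metric names, limits, and alert texts are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    metric: str       # name of the telemetry metric the rule watches
    max_value: float  # predefined out-of-bounds threshold
    message: str      # text attached to the alert


# Hypothetical rules; real limits come from domain knowledge.
RULES = [
    ThresholdRule("link_utilisation_pct", 90.0, "possible congestion"),
    ThresholdRule("interface_errors_per_min", 100.0, "interface degrading"),
]


def evaluate(sample):
    """Return one alert string per rule that the sample violates."""
    alerts = []
    for rule in RULES:
        value = sample.get(rule.metric)
        if value is not None and value > rule.max_value:
            alerts.append(f"{rule.metric}={value}: {rule.message}")
    return alerts


print(evaluate({"link_utilisation_pct": 95.0, "interface_errors_per_min": 3}))
```

Such a rule only fires for conditions somebody thought of in advance, which is exactly why it suits simple, repetitive, and known issues.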
1.3. Why do we monitor?
Network telemetry analytics is a cornerstone for ensuring network reliability and performance. The “holy grail” of network telemetry analytics is to proactively detect and address emerging issues within the network before users notice them, ultimately avoiding service disruptions and keeping any unavoidable service degradations below the level that users can perceive. By leveraging real-time data analysis, network telemetry analytics empowers organisations to spot “brewing” problems, trigger a mitigation response or dispatch engineers promptly, and maintain optimal, seamless, and resilient network environments.
With real-time network health monitoring and continuous monitoring of various network parameters, engineers can promptly detect issues such as optical impairments, flapping interfaces, packet loss, latency spikes, bandwidth congestion, routing instabilities, and many more. Through timely and vigilant identification of network anomalies, it becomes possible to pinpoint the root causes of emerging issues proactively and take corrective action immediately.
The problem of delayed corrective action is well known in control theory. In practice it is easy to illustrate with driving a car or operating a vehicle remotely: introduce a small delay, and steering becomes difficult or impossible, and the vehicle may leave the desired track completely. It is no secret that in modern networks uncorrected issues can go unnoticed for days and untreated for months, eventually leading to a “snowballing” effect or even to the equivalent of a deadly avalanche: the dreaded so-called “Priority 1 incident”, in which the majority of network users experience a service outage.
Telemetry analytics also plays a critical role in detecting unauthorised access attempts and potential security breaches. By analysing traffic patterns and anomalies, network engineers can identify suspicious activities and respond promptly to safeguard network assets. Incorporating telemetry data from various security devices and network endpoints makes it possible to create dynamic threat intelligence feeds. This data-driven approach aids in devising effective strategies to mitigate cyber threats and vulnerabilities.
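The effect of delay on a corrective loop can be reproduced with a toy simulation. The sketch below assumes the simplest possible model, which is not taken from any real network: a deviation from the desired state is corrected each step by a fixed fraction (`gain`) of the deviation that was observed `delay` steps earlier. The function name and all values are illustrative.

```python
def simulate_correction(gain, delay, steps=200, initial_error=1.0):
    """Return the deviation over time when corrections act on stale observations."""
    # Pre-fill the history so a delayed observation exists from the first step.
    errors = [initial_error] * (delay + 1)
    for _ in range(steps):
        observed = errors[-1 - delay]  # what the controller sees: `delay` steps old
        errors.append(errors[-1] - gain * observed)
    return errors


prompt = simulate_correction(gain=0.5, delay=0)
late = simulate_correction(gain=0.5, delay=5)

print(f"no delay, final deviation: {abs(prompt[-1]):.3g}")
print(f"5-step delay, recent peak deviation: {max(abs(e) for e in late[-20:]):.3g}")
```

With the same gain, the loop that reacts immediately settles to zero, while the delayed loop overshoots, oscillates, and grows without bound. This is the simulated counterpart of losing the track when steering with a lag.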
In order to improve operational efficiency, optimise resource allocation, and ensure the reliability/security of complex systems, it is critical to implement modern telemetry monitoring and analytics practices with immediate application of proper corrective actions in order to keep the network’s services on the desired track.
1.4 AI & Network Telemetry
Telemetry analytics, in combination with Artificial Intelligence (AI) and Machine Learning (ML), is revolutionising data-driven decision-making across various industries and is becoming a catalyst for the emergence of new business processes. When augmented with AI and ML, telemetry analytics goes beyond mere data processing and offers transformative operational capabilities for businesses seeking actionable insights and automation. It can facilitate predictive insights, enable preventive anomaly detection, and drive intelligent closed-loop automation, leading to more efficient operations through proactive troubleshooting. In essence, AI acts as a force multiplier for network telemetry analytics and operators, elevating its purpose from a reactive tool to a proactive and predictive strategic asset in the management of modern, complex telecommunications data transmission networks.
As ML makes it possible to analyse vast amounts of data more efficiently, automate complex tasks, and make data-driven diagnoses (ultimately leading to more effective decision making, cost savings, and competitive advantages), it is becoming more widespread and is used in almost all industries. According to Fortune Business Insights, the Machine Learning (ML) market is expected to have a CAGR of 36.2% in the coming years (from $26.03 billion in 2023 to $225.91 billion by 2030). The largest market shares in 2022 (by End-use Industry) were held by IT & Telecommunication (18.6%), Banking, Financial Services & Insurance, Automotive & Transportation, and Retail (Fortune Business Insights).
There is a clear upward trend in the popularity of ML approaches, as shown in the figure below, based on data from 2015 to 2020 collected from Google Trends by Sarker, I.H., who provided a comprehensive overview of ML algorithms in his research paper: Machine Learning: Algorithms, Real-World Applications and Research Directions.
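As a quick sanity check, the quoted growth rate can be recomputed from the two market sizes with the standard compound annual growth rate formula. The sketch assumes the forecast spans the seven years from 2023 to 2030.

```python
start_value = 26.03   # USD billion, 2023 (figure quoted above)
end_value = 225.91    # USD billion, 2030 (figure quoted above)
years = 2030 - 2023   # assumption: seven compounding periods

# CAGR = (end / start) ** (1 / years) - 1
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"CAGR: {cagr:.1%}")
```

The result lands at roughly 36%, in line with the quoted figure.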
2. Network Telemetry Characteristics, Problems and Challenges
Beneath the surface of network telemetry lies a complex landscape marked by unique characteristics of each data stream originating from a specific source element in the specific application’s domain, each presenting both opportunities and challenges. Here are some of them:
Massive Amounts of Real-Time Data:
One of the defining characteristics of network telemetry is the sheer volume and variety of real-time data being generated. Network elements and systems continuously produce vast streams of raw data (Data Overload), documenting every metric indicator, diagnostic message, etc. This torrent of data is a treasure trove of insights; however, managing and analysing such massive datasets is a daunting task. Simply collecting and processing this much data can overload monitoring systems. In addition, overloaded or misconfigured monitoring solutions can alter the raw data, making it harder to identify meaningful insights and to distinguish signal from noise. Often, unable to tell the two apart, such systems erroneously discard valuable raw data, making further intelligent analysis impossible. Network engineers and data scientists are faced with the task of creating robust data collection, storage, and processing mechanisms capable of handling the constant influx of information without inadvertently corrupting or discarding valuable data.
Stochastic Nature:
Networks are inherently stochastic, with events occurring unpredictably. This poses another major obstacle, as it is often the issues we are unaware of that end up causing the most painful outages. For instance, changes in traffic patterns, packet loss, and device failures, to name a few, can occur randomly (intermittently) rather than following a known pattern. When analysing network telemetry, we must contend with this inherent unpredictability, which calls for sophisticated streaming ML algorithms and robust models to distinguish between normal variations and genuine issues. Machine learning and statistical techniques must be able to tell meaningful signals apart from, for example, routine maintenance activities within the stochastic data sets. This requires advanced anomaly detection techniques that can identify patterns which have never been encountered before and determine whether they are regular fluctuations or real problems. The infamous 2022 Rogers outage was caused by a maintenance upgrade that caused routers to malfunction (source). This example of a “perfect storm” is an incident that predetermined remediation approaches (deterministic rule-based or less deterministic neural networks) could not respond to.
Diversity of Data Types (Structured/Unstructured Data):
Networks communicate via diverse protocols, and a network’s elements come with a myriad of different structured ‘languages’ and ‘dialects’, resulting in countless types of telemetry data. While some data is structured, such as SNMP metrics (uptime, throughput, temperature, interface errors, etc.), a substantial portion of (asynchronous) event-based telemetry is unstructured. Unstructured data such as log data (Syslogs, event logs, etc.) offers a wealth of untapped insights. Extracting meaningful information from unstructured data requires advanced natural language processing (NLP) and text analysis techniques which are robust and suitable for real-time streaming processing. Integrating and normalising data across the different sources can be challenging and sometimes barely possible, making it even more difficult to obtain a unified, holistic view of a network.
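One very simple text-analysis step for unstructured logs is to mask the variable parts of each message so that messages describing the same kind of event collapse into one template. The sketch below is a deliberately naive, regex-based stand-in for the more advanced techniques mentioned above. The log lines, the mask patterns, and the `<IP>`/`<IFACE>`/`<NUM>` tokens are all assumptions made for the example.

```python
import re
from collections import Counter

# Order matters: mask IP addresses before bare numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b[A-Za-z]+\d+(?:/\d+)+\b"), "<IFACE>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(message):
    """Replace variable fields with placeholder tokens."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message


# Hypothetical log lines, not taken from any real device.
logs = [
    "Interface GigabitEthernet0/1 changed state to down, peer 10.0.0.1 unreachable for 30 seconds",
    "Interface GigabitEthernet0/2 changed state to down, peer 10.0.0.7 unreachable for 45 seconds",
    "Fan tray 2 speed 4200 rpm",
]

template_counts = Counter(to_template(line) for line in logs)
for template, count in template_counts.items():
    print(count, template)
```

Counting templates instead of raw lines turns free text into a small set of event types whose frequency can then be tracked over time.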
Multi-Vendor Complexity:
The modern network environment is seldom homogeneous, often comprising equipment and systems from multiple vendors. Each vendor has its own (often proprietary) data structures, system logging, and telemetry formats, making it very challenging to create a unified multi-vendor telemetry monitoring system. Many standards exist (XML-based schemas such as RFC 6241 - Network Configuration Protocol (NETCONF), or legacy CORBA-based APIs), and many methodologies have been developed (NGOSS and eTOM by the TeleManagement Forum), but managing this diversity in heterogeneous telecom environments can still be very complex and requires vendor- and device-specific integrations. Accordingly, Syslogs produced by devices from different vendors cannot be directly compared with one another. A holistic monitoring system must account for the unique attributes of each vendor / system / device to make sense of the unstructured data. The integration of these disparate data sources typically necessitates translation layers and vendor-specific parsers to prepare and “normalise” the data for analysis.
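A translation layer of this kind can be pictured as a set of vendor-specific parsers that all emit the same record shape. The sketch below is illustrative only: `vendor_a` and `vendor_b`, their message formats, and the field names of the common record are invented for the example and do not correspond to real products.

```python
import re


def parse_vendor_a(line):
    # Hypothetical format: "IF_DOWN port=eth1"
    match = re.fullmatch(r"IF_(UP|DOWN) port=(\S+)", line)
    if match:
        return {"event": "link_state", "interface": match.group(2),
                "state": match.group(1).lower()}
    return None


def parse_vendor_b(line):
    # Hypothetical format: "link eth1 state changed: DOWN"
    match = re.fullmatch(r"link (\S+) state changed: (up|down)", line,
                         flags=re.IGNORECASE)
    if match:
        return {"event": "link_state", "interface": match.group(1),
                "state": match.group(2).lower()}
    return None


PARSERS = {"vendor_a": parse_vendor_a, "vendor_b": parse_vendor_b}


def normalise(vendor, line):
    """Map a vendor-specific message onto one common record shape."""
    parser = PARSERS.get(vendor)
    record = parser(line) if parser else None
    if record is None:
        # Keep what could not be parsed instead of silently dropping it.
        return {"event": "unparsed", "raw": line, "vendor": vendor}
    record["vendor"] = vendor
    return record


print(normalise("vendor_a", "IF_DOWN port=eth1"))
print(normalise("vendor_b", "link eth1 state changed: DOWN"))
```

Once both messages share one shape they can be compared directly. The price is that every new vendor, device type, or firmware release may need its own parser.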
Dynamic Nature:
Networks are dynamic entities, constantly evolving in response to changing business demands, new technological trends, and emerging threats: new devices are added, configurations change daily, and security measures are adjusted in real time. This dynamic nature adds another layer of complexity. Telemetry analytics solutions must be agile, capable of adapting to these changes without interruption or significant manual intervention. Dynamic configuration management and horizontal scalability across multiple network domains are vital to maintaining the efficacy of telemetry solutions in dynamic environments. Moreover, the adoption of virtualisation and other new technologies in recent years only aggravates the problem.
Seasonality Patterns:
Seasonality, a characteristic not often associated with network telemetry, plays a subtle yet significant role. Network traffic exhibits periodic fluctuations influenced by factors such as users’ daily routines and business hours. Recognising and accounting for these patterns is crucial in understanding network behaviour. Telemetry analytics should be equipped to handle and study these cyclical variations in order to provide accurate insights and predictions and to detect outliers which deviate from the norm.
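A minimal way to account for a daily cycle is to keep a separate baseline for each hour of the day and judge new samples against the baseline of their own hour. The sketch below assumes a simple mean and standard deviation model with a z-score threshold of 3. The traffic figures are made up, and real deployments would typically also handle weekly cycles and drifting baselines.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baseline(history):
    """history: iterable of (hour_of_day, value) -> {hour: (mean, std)}"""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(values), pstdev(values)) for hour, values in by_hour.items()}


def is_outlier(baseline, hour, value, z_threshold=3.0):
    """Flag a value that deviates strongly from what is normal for that hour."""
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical traffic in Mbit/s: quiet at 03:00, busy at 20:00.
history = [(3, v) for v in (10.0, 12.0, 11.0, 9.0, 10.0)]
history += [(20, v) for v in (90.0, 95.0, 100.0, 92.0, 98.0)]
baseline = hourly_baseline(history)

print(is_outlier(baseline, 20, 95.0))  # typical for the evening peak
print(is_outlier(baseline, 3, 95.0))   # the same value in the middle of the night
```

The same reading is unremarkable in the evening and highly suspicious at night, which a single global threshold could not express.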
3. Machine Learning Types
Let's delve deeper into the world of Machine Learning (ML) and explore its various flavours that can empower telemetry analytics to go beyond simple data processing and take monitoring to the next level.
Machine Learning can be categorised into four main types: Supervised Learning, Unsupervised Learning, Semi-supervised Learning, and Reinforcement Learning. These categories help us understand how ML algorithms work and define their wide range of applications depending on business needs. Let's take a closer look at each of these to demystify the world of ML!
3.1 Supervised Learning
Supervised learning is an ML approach based on the use of labelled datasets, which means it is provided with verified and trustworthy input-output pairs during training. Such data sets are used to create algorithms aimed at classifying data objects or accurately predicting outcomes. Using labelled inputs and outputs, the model can compare its predictions with the known outputs to measure accuracy and learn over time.
The supervised learning process consists of two primary phases: training and inference. During the training phase, the algorithm meticulously analyses the provided dataset, adjusting its internal parameters to minimise the disparity between its predictions and the actual labels. This iterative process, often facilitated by optimisation techniques, fine-tunes the algorithm's ability to generalise from the training data to make accurate predictions on new, unseen data.
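The two phases can be made concrete with a tiny example. The sketch below assumes the simplest setting, a straight-line model fitted by gradient descent on mean squared error. The data, learning rate, and epoch count are illustrative choices, and gradient descent is only one of many possible optimisation techniques.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Training phase: adjust w and b to shrink the gap between predictions and labels."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error on every labelled example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    """Inference phase: apply the learned parameters to unseen input."""
    return w * x + b


# Labelled examples generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = train(xs, ys)
print(round(w, 3), round(b, 3), round(predict(w, b, 10.0), 3))
```

Training recovers the slope and intercept hidden in the labels, and inference then extrapolates to an input that was never part of the training set.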
Supervised learning can be divided into two types (based on data mining tasks): classification and regression, which are two fundamental approaches that form the foundation for predicting outcomes from data. They have different goals and methods and are crucial for a wide range of predictive tasks.
Regression helps us understand the connection between input features and continuous, numeric outcomes. In regression, we use different mathematical tools, like linear regression, which deals with straight-line (linear) relationships between input and output. There are also polynomial regression methods for curved relationships, support vector regression (SVR), which uses support vector machines, and random forest regression, which is built on decision trees.
On the other hand, classification is about assigning input data points to predefined categories or classes. This is used in many areas, like spam detection, image recognition, sentiment analysis, and medical diagnosis, where we need to put data into different outcome groups. In classification, we have various algorithms, such as Logistic regression, Decision trees, Support vector machines (SVM), Random forest classification, and many more.
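To show what assigning inputs to predefined classes looks like in code, here is a minimal logistic regression trained with plain gradient descent. The single feature (read it as, say, a packet-loss percentage), the labels (0 = healthy, 1 = degraded), and the training settings are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, labels, learning_rate=0.5, epochs=5000):
    """Fit w and b by gradient descent on the logistic (cross-entropy) loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between predicted probability and true label.
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        w -= learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
        b -= learning_rate * sum(residuals) / n
    return w, b


def classify(w, b, x):
    """Assign class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical labelled samples: feature value -> class.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(xs, labels)
print(classify(w, b, 1.0), classify(w, b, 3.5))
```

The learned boundary sits between the two labelled groups, so low values fall into class 0 and high values into class 1.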
Strengths and Limitations of Supervised Learning
Supervised learning has a number of strengths, one of the main ones being its ability to solve problems in which well-defined and structured outputs are available. Clarity in the target results facilitates the development and application of these models in areas where accuracy and precision are of paramount importance. Supervised learning demonstrates a wide range of applicability of the models, spanning a variety of industries and applications. Its versatility is evident in areas such as image recognition, natural language processing, recommendation systems, and medical diagnostics, to name a few. Moreover, some supervised learning models, such as decision trees and linear regression, provide interpretability, allowing for deeper understanding of the logic driving their predictions — an invaluable asset in fields where model transparency is a must.
However, supervised learning is not without its limitations. Chief among these is its dependence on carefully labelled structured data, which can be both resource-intensive and difficult to obtain, especially for specialised or niche areas. Although supervised learning is powerful, its capabilities are limited to the knowledge contained in the labelled training set, which typically represents a structured environment: for example, a large body of written text in English or Chinese, pictures of different breeds of dogs and cats, objects of flora and fauna, rule-based games like chess, and many others.
The inability to detect unknown patterns is one of the most significant challenges in Supervised ML. Models are trained on labelled data with known outcomes picked up from the data set representing an inherently structured (non-stochastic) environment, making them reliant on historical patterns that are already present in the training dataset. Consequently, when faced with completely new or unknown patterns in the data, supervised models will fail to provide meaningful predictions or classifications, as they lack the ability to recognise and adapt to novel or emerging trends. What is even worse, when increasingly stochastic data is included into the model, the quality of predictions rapidly deteriorates. This weakness could be exploited by malicious agents to deliberately ‘poison’ the derived models.
Furthermore, supervised models will have difficulty working with data that differs significantly from the training set, hampering their ability to generalise to new situations. Another clear limitation is the potential for bias to propagate from the training data. When training data contains biases or inaccuracies, supervised models can inadvertently learn and perpetuate these biases, which can lead to unfair or erroneous predictions.
3.2 Unsupervised Learning (UML)
Unlike supervised learning, which relies on labelled data, unsupervised ML (UML) operates on unlabelled data, extracting valuable insights and patterns without explicit guidance. This makes it particularly suitable for environments where data can be vast, unstructured, and ever-changing (stochastic). With UML, the algorithm is given a set of hyper-parameters and uses them to start discovering structure within the data. The parameters it works with are characteristics of the data that can be represented by a numerical value (for instance, for a 2D metric plotted over time, the amplitude can be taken as a parameter). Clustering and association are two fundamental unsupervised techniques for analysing and extracting patterns from unlabelled data, aiding in tasks such as pattern recognition, segmentation, recommendation systems, and market analysis.
Clustering, as one of the primary techniques within unsupervised learning, involves grouping similar data points together based on their inherent similarities or distances in a multi-dimensional feature space. Clustering algorithms aim to uncover the underlying structure within the data without any prior knowledge of the groups. The primary goal of clustering is to partition the data into clusters or groups in such a way that data points within the same cluster are more similar to each other compared to those in other clusters. The aim is to find meaningful groupings that can help in data analysis, pattern recognition, and decision-making.
Several clustering algorithms are employed in unsupervised learning. Some of the more popular ones are K-Means Clustering, HDBSCAN, and Hierarchical Clustering. K-Means Clustering partitions data into k distinct clusters by repeatedly assigning each data point to its nearest cluster centre and recomputing the centres until the assignments stabilise; a suitable number of clusters is commonly chosen with the ‘elbow’-curve method. Hierarchical clustering builds a tree-like structure of clusters, starting with individual data points and gradually merging them into larger clusters based on their similarity. It allows for both fine-grained and coarse-grained clusterings.
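The core loop of K-Means is short enough to write out. The sketch below is a bare-bones version under simplifying assumptions: the number of clusters and the starting centres are picked by hand, the iteration count is fixed, and the two-dimensional points (read them as latency in ms and loss in %) are invented. Choosing k, for instance via the elbow curve, is not shown.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(coords) for coords in zip(*cluster))


def kmeans(points, centroids, iterations=10):
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical samples: (latency_ms, loss_pct)
points = [(10.0, 0.1), (12.0, 0.2), (11.0, 0.1),
          (80.0, 2.0), (85.0, 2.5), (90.0, 2.2)]

centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels were supplied, yet the samples separate into a low-latency group and a high-latency group. Deciding what those groups mean is left to the operator.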
Association, on the other hand, is a technique used to identify relationships, associations, or correlations between items or variables in a dataset. Association rules reveal which items tend to occur together in datasets, which can be valuable for making recommendations or optimising business strategies. A classic example is the Apriori algorithm, used extensively in market basket analysis. It identifies frequent itemsets, revealing items that are commonly purchased together. This has wide-ranging applications in recommendation systems, optimising inventory management, and market basket analysis in retail.
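The idea behind association rules can be shown by counting which items appear together. The sketch below is a simplified pair-counting illustration, not a full Apriori implementation, which would extend frequent itemsets level by level and prune infrequent candidates. The "baskets" are hypothetical sets of alarm types seen in the same time window, an assumption chosen to fit the telemetry setting.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of transactions containing both items
        if support >= min_support:
            # Confidence: how often the consequent appears given the antecedent.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules


# Hypothetical alarm types observed together in four time windows.
windows = [
    {"link_flap", "bgp_reset"},
    {"link_flap", "bgp_reset", "high_cpu"},
    {"high_cpu"},
    {"link_flap", "bgp_reset"},
]

rules = pair_rules(windows)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Only the pair that co-occurs in most windows survives the support threshold. The rule says the two alarms tend to appear together; it does not say which one causes the other.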
Strengths and Limitations of Unsupervised Learning
When dealing with situations where patterns are unknown, the data is unstructured, and the environment is frequently changing or stochastic by nature, or when access to sizeable labelled datasets is limited, unsupervised learning emerges as a valuable solution. Unsupervised learning offers versatility and adaptability in data analysis and exploration, but it also comes with inherent limitations related to evaluation, interpretation, data quality, and scalability.
Among the strengths of Unsupervised learning, firstly, is that it allows us to find hidden patterns in data, without knowing in particular what we are looking for. This gives it the flexibility to still provide meaningful insights in environments where we may not know exactly what we are looking for. This makes it a strong addition to use-cases where cause and effect relationships are not fully known and need to be studied for diagnostic purposes.
Secondly, it offers flexibility by not mandating labelled training data, which makes it adaptable to scenarios where obtaining labelled datasets is unfeasible or expensive. This flexibility enhances its scalability and versatility across diverse applications. It can massively lower the amount of manual work that needs to be invested to “train” a model, which in turn lowers the barriers to entry. With UML, customers are also reassured that their data is not being used to train a model which will then be sold to their competitors. A big plus for the corporate world!
Thirdly, it is highly adaptable to diverse data types and structures, enabling its application across a wide range of datatypes, and allowing it to make sense of highly dynamic and stochastic environments which are ever changing.
Despite its strengths, unsupervised learning also presents several inherent limitations. One primary challenge lies in the evaluation and validation of results. For instance, if a cluster is found, it implies a similarity between events, but it does not necessarily imply some cause and effect relationship like with Supervised ML. The interpretation of results can be complex, requiring expert domain knowledge to extrapolate the results into actionable intelligence.
3.3 Other Notable Approaches
Semi-supervised Learning
Semi-Supervised Learning serves as a powerful bridge between Supervised and Unsupervised Learning, utilising both labelled and unlabelled data. This approach combines the strengths of Supervised Learning for model training with the exploratory nature of Unsupervised Learning to uncover hidden patterns. Classification and clustering are fundamental tasks, where classification assigns labels to data points using both types of data, and clustering groups similar data points for enhanced exploration. Semi-Supervised Learning offers substantial benefits, particularly in scenarios with limited labelled data, enhancing model accuracy and adaptability for real-time monitoring. However, it relies heavily on the quality of initial labelled data and may introduce complexities in model selection and tuning.
Reinforcement Learning
Reinforcement Learning (RL) is a machine learning field focused on training intelligent agents to make sequential decisions in dynamic and structured environments, and it is particularly applicable in telemetry analytics. These agents learn through interactions with telemetry data streams, navigating and adapting to the environment based on feedback in the form of rewards or penalties tied to the envisioned desirable outcome. One example is minimising antenna power output while steering radio beams and tuning frequencies to provide optimal throughput to mobile terminals. Other examples include producing the strongest and most lightweight mechanical support structure within predefined weight and structural-strength constraints, a bipedal robot learning to ‘walk’, or even playing a computer game on the “insane” difficulty mode.
RL encompasses classification, assigning data to categories, and control, optimising actions in response to incoming data. In telecommunications, RL shines in resource allocation, routing, and quality of service optimisation. For example, Q-learning aids efficient data packet routing. RL's strengths include adaptability, learning through exploration, and tackling complex problems, but it requires substantial data and can be sensitive to hyperparameters and reward functions. In telecommunications, RL enhances network performance, fault detection, and resource allocation based on the structured data inputs, ushering in data-driven decision-making in telemetry analytics.
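The Q-learning routing example can be sketched on a toy topology. Everything in the block is an assumption made for illustration: the four nodes, the link latencies, the reward (negative latency), and the training schedule. To keep the outcome reproducible, the sketch sweeps every state-action pair in turn instead of exploring randomly as an online agent would.

```python
# Hypothetical topology: node -> {next_hop: link latency}
LINKS = {
    "A": {"B": 1, "C": 5},
    "B": {"C": 1, "D": 10},
    "C": {"D": 1},
    "D": {},
}
GOAL = "D"


def learn_q(links, goal, alpha=0.5, gamma=1.0, sweeps=200):
    """Apply the Q-learning update to every (node, next_hop) pair repeatedly."""
    q = {(s, a): 0.0 for s, hops in links.items() for a in hops}
    for _ in range(sweeps):
        for (state, nxt), old in list(q.items()):
            reward = -links[state][nxt]  # lower latency means higher reward
            future = 0.0 if nxt == goal else max(q[(nxt, a)] for a in links[nxt])
            # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max Q(s', a') - Q(s, a))
            q[(state, nxt)] = old + alpha * (reward + gamma * future - old)
    return q


def greedy_route(q, links, start, goal):
    """Follow the highest-valued next hop from start until the goal is reached."""
    route = [start]
    while route[-1] != goal and len(route) <= len(links):
        here = route[-1]
        route.append(max(links[here], key=lambda a: q[(here, a)]))
    return route


q = learn_q(LINKS, GOAL)
print(greedy_route(q, LINKS, "A", GOAL))
```

The learned values steer traffic along A, B, C, D (total latency 3) rather than the shorter-looking A, C, D (latency 6) or A, B, D (latency 11), because each value reflects the whole remaining path and not just the next link.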
Looking to dive deeper into the world of Machine Learning and its real-world applications? Check out the insightful article "Machine Learning: Algorithms, Real-World Applications and Research Directions" written by Iqbal H. Sarker. It explores the diverse landscape of machine learning algorithms, and how they can elevate various application domains like cybersecurity, healthcare, and smart cities. Furthermore, discover the key principles and potential research directions that are shaping the future of AI and machine learning.
3.4 Supervised vs. Unsupervised ML
The choice between supervised and unsupervised machine learning (ML) approaches is pivotal for extracting valuable insights from data. One is not inherently better than the other, and it’s all about choosing the correct selection or mix of appropriate methods for the task at hand. But what do these technologies mean for Telemetry analytics and Network monitoring?
Supervised ML is characterised by its reliance on labelled training data, where models learn to make predictions or classifications based on historical examples. This approach excels in scenarios requiring precision and well-defined objectives. Unsupervised ML, on the other hand, explores data without labelled examples, making it particularly suitable for situations with evolving dynamic data and stochastic tendencies.
The choice between supervised and unsupervised learning depends on the task at hand:
Supervised learning offers precision and interpretability, making it ideal for pinpoint solutions which respond to a particular repetitive problem, such as intrusion detection and QoS monitoring. SML has also been applied by vendors to “predict” that an outage will occur based on some cause and effect relationship that can be seen in network telemetry. But evidence shows that the accuracy of these models is in fact low, mainly due to the aforementioned characteristics of network telemetry: its stochastic nature, dynamic nature, and variability.
Conversely, Unsupervised Learning excels at anomaly detection (clustering network telemetry into different groups to isolate “strange” looking activity), data exploration, and adaptability to changing network conditions. For example, for telecommunications networks, unsupervised models can spot anomalous behaviour, categorise it, analyse trends of the discovered anomalies, and help operators identify issues that they can then mitigate.
4. OptOSS AI for Telemetry Analytics
In the telecommunications industry, with the emergence of AI, network operators are striving for “autonomous network” operations: a self-healing, self-managed, intelligent network that requires human intervention only when something needs to be physically repaired. Network telemetry is the “fuel” that the AI “engines” need. To achieve this, a system must be in place which can:
Monitor the network holistically, end-to-end,
Spot brewing issues in the network telemetry autonomously,
Understand the significance of the issues,
Execute the appropriate remediation efforts.
This all needs to happen in real time, so that issues are solved before they impact the service. We still have some way to go before we reach this “holy grail”.
From a theoretical standpoint, it would make sense to use UML for steps 1 & 2, where its adaptability to the data and its initial clustering capabilities are a strong fit. That said, clustering across a massive network that spans several technologies (for instance mobile RAN, fixed TRANSPORT and CORE networks) to obtain a holistic view of the network’s status can produce rather generic clusters that are not specific enough to be actionable.
On the other hand, using SML on the whole dataset here would be akin to “boiling the ocean”. In addition, the model would quickly become outdated due to configuration and device changes and the evolution of the monitored network. It would also be susceptible to the stochastic nature of the data, yielding low-accuracy predictions. Ultimately, the high rates of false positives and false negatives would create more work for the operational teams than they save.
Currently, no vendor is capable of executing steps 3 & 4 with the accuracy of a trained, expert human operator. The best option currently available is to leave these steps to the L2 & L3 operators and vendors’ support engineers, who can draw on years of domain expertise in providing their interpretation. That said, if these steps are to be automated, SML (or a variant of it) is the only viable option, because they require predictive or classification capabilities. Unfortunately, SML achieves high accuracy only in highly structured environments. Therefore, something needs to precede the SML process that can convert the chaos of the overall network into a systematic problem that SML can solve.
Enter OptOSS AI: following this theory, OPT/NET has built OptOSS AI to automate functions 1, 2 & 4 and to support human experts with the interpretation (function 3).
OptOSS AI leverages UML to detect anomalous sequential patterns within the network telemetry and groups these into homogeneous clusters. Next, SML is used for deeper analysis of the clusters on a case-by-case basis, to provide insights into what the pattern captured by each cluster signifies. Domain experts (L2/L3) can then study the clusters and “educate” the system on their significance. One cluster might document the sequence of an optical power failure, and is probably worth reporting for inspection. Another cluster could be redundant, and may look anomalous only because the word “failure” appears in the syslog sequence; it might describe an expected failure, or reflect an unsuitable use of the term “failure” by the software developer who coded the message.
This interpretation needs to happen only once for each cluster, after which the system is aware of the anomaly cluster. Operators can then define action scripts that the AI executes whenever a new anomaly belonging to that cluster appears in the future. The result is a stepwise approach to rectifying issues end-to-end: the process runs automatically, yet still includes a human expert in the decision-making loop (during loop initiation).
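The "interpret once, act on every recurrence" workflow described above can be sketched as a small registry. This is purely illustrative and is not OptOSS AI's implementation: the class names, the cluster identifiers, and the `open_optical_ticket` action are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ClusterKnowledge:
    """What a domain expert taught the system about one anomaly cluster."""
    label: str
    report: bool
    action: Optional[Callable[[str], str]] = None


@dataclass
class AnomalyRegistry:
    known: Dict[str, ClusterKnowledge] = field(default_factory=dict)
    pending_review: List[str] = field(default_factory=list)

    def teach(self, cluster_id: str, knowledge: ClusterKnowledge) -> None:
        # The expert interprets a cluster once; it then leaves the review queue.
        self.known[cluster_id] = knowledge
        if cluster_id in self.pending_review:
            self.pending_review.remove(cluster_id)

    def handle(self, cluster_id: str, device: str) -> str:
        knowledge = self.known.get(cluster_id)
        if knowledge is None:
            # Unknown cluster: a human expert has to look at it first.
            if cluster_id not in self.pending_review:
                self.pending_review.append(cluster_id)
            return "queued for expert review"
        if not knowledge.report:
            return "ignored: " + knowledge.label
        if knowledge.action is None:
            return "reported: " + knowledge.label
        return knowledge.action(device)


def open_optical_ticket(device: str) -> str:
    # Hypothetical action script; a real one might call a ticketing system.
    return f"ticket opened for optical inspection on {device}"


registry = AnomalyRegistry()
first_seen = registry.handle("cluster-17", "router-a")
registry.teach("cluster-17", ClusterKnowledge("optical power failure", True, open_optical_ticket))
registry.teach("cluster-42", ClusterKnowledge("benign 'failure' wording", False))
recurrence = registry.handle("cluster-17", "router-b")
benign = registry.handle("cluster-42", "router-a")
```

In this sketch the first sighting of `cluster-17` is queued for review, while its recurrence on another device triggers the taught action without further human involvement.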
To counteract the risk of generic clustering of very large stochastic datasets, OptOSS AI is deployed in a hierarchical fashion, mimicking the organisational setup of the network operator. AI instances (which act as independent “brains”) are deployed per “technology” area (network domain), keeping the UML clustering targeted and ensuring that the insights produced are specific to the respective organisational L2/L3 expertise which handles the interpretation of the anomaly clusters. Next, only the processed anomaly data with impact attributions is fed to an OptOSS “Manager” instance, which provides a holistic view of the entire telecom network and allows users to cross-correlate anomalous activity across network domains. The architecture behind our product is an extensive subject, and we might cover it in future articles. But if you’re interested in how exactly the system detects anomalies for your organisation, let us know!
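As a rough illustration of the hierarchical idea, the sketch below assumes that per-domain instances emit already-processed anomalies and that a manager pairs up anomalies from different domains when they occur close together in time. The data model, the 60-second window, and the pairing rule are assumptions made for this example and do not describe the actual product architecture.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DomainAnomaly:
    domain: str       # e.g. "RAN", "TRANSPORT", "CORE"
    timestamp: float  # seconds since an arbitrary reference point
    cluster_id: str   # anomaly cluster assigned by the per-domain instance


def cross_correlate(
    anomalies: List[DomainAnomaly], window_s: float = 60.0
) -> List[Tuple[DomainAnomaly, DomainAnomaly]]:
    """Pair anomalies from different domains that occur within window_s seconds."""
    ordered = sorted(anomalies, key=lambda a: a.timestamp)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.timestamp - first.timestamp > window_s:
                break  # later anomalies are even further away in time
            if second.domain != first.domain:
                pairs.append((first, second))
    return pairs


# Hypothetical feed arriving at the manager from three per-domain instances.
feed = [
    DomainAnomaly("TRANSPORT", 100.0, "optical-power-drop"),
    DomainAnomaly("RAN", 130.0, "cell-throughput-dip"),
    DomainAnomaly("CORE", 900.0, "session-setup-errors"),
    DomainAnomaly("TRANSPORT", 120.0, "link-flap"),
]
correlated = cross_correlate(feed)
for first, second in correlated:
    print(f"{first.domain}:{first.cluster_id} <-> {second.domain}:{second.cluster_id}")
```

Because the manager only sees compact anomaly records rather than raw telemetry, the per-domain clustering stays specific while the cross-domain view remains cheap to compute.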
Our overall holistic approach ensures that Telecoms have reliable and high-performance networks while staying agile in the ever-evolving Communications Services Provider (CSP) industry. In the dynamic realm of telecom networks, where patterns are constantly evolving, the process OPT/NET created is a game-changer proven over many years of in-production operation.
Leverage OptOSS AI for early problem detection, ensuring that potential issues are identified and addressed promptly (MTTR reduction for incidents by 10x), reducing the risk of service disruptions (reduction in risk of outages by 70%) and security breaches, and automating your repetitive operational workflows (automate 50% of NOC operations).
Are you ready to step into the future of network operations powered by Unsupervised ML now?
Contact us to take the step into the future with OPT/NET’s hybrid-ML approach & proven and mature OptOSS AI solution for your telecom network telemetry needs!