For Network Operators, swift incident resolution is a necessity!
The ability to resolve network incidents lightning-fast can be a key differentiator for Network Operators in a market where it is hard to achieve brand loyalty from customers. This effort is driven by the need to improve reliability and provide a better user experience (UX) of digital services. In this article, we aim to cover the basics of incident resolution, discuss the different approaches, and chart a course towards proactive data-driven network operations.
Understanding MTTR in Telecom Networks
When milliseconds matter and downtime must be avoided, “MTTR” is of utmost importance. MTTR can mean several things: it can be understood as “Mean Time to Repair” or “Mean Time to Resolve”.
“Mean Time to Repair” covers only the repair process after a diagnosis has been made, measuring the average time from the moment the cause of a network issue is identified until it is fully resolved.
It should not be confused with “Mean Time to Resolve”, which also includes the diagnosis stage: the average time it takes to resolve incidents, from failure until correct behaviour is restored. In the context of this article, we will use MTTR primarily as Mean Time to Resolve, as correctly diagnosing issues goes hand in hand with repairing them.
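To make the distinction concrete, here is a minimal Python sketch (with hypothetical incident records) that computes both flavours of MTTR from detection, diagnosis and restoration timestamps:

```python
from datetime import datetime

# Hypothetical incident records: when the failure was detected, when the
# diagnosis was completed, and when correct behaviour was restored.
incidents = [
    {"detected": datetime(2024, 5, 1, 9, 0),  "diagnosed": datetime(2024, 5, 1, 10, 30), "restored": datetime(2024, 5, 1, 11, 0)},
    {"detected": datetime(2024, 5, 3, 14, 0), "diagnosed": datetime(2024, 5, 3, 17, 0),  "restored": datetime(2024, 5, 3, 18, 15)},
]

# Mean Time to Repair: average of (restored - diagnosed), i.e. only the repair work.
mttr_repair = sum((i["restored"] - i["diagnosed"]).total_seconds() for i in incidents) / len(incidents)

# Mean Time to Resolve: average of (restored - detected), i.e. diagnosis + repair.
mttr_resolve = sum((i["restored"] - i["detected"]).total_seconds() for i in incidents) / len(incidents)

print(f"Mean Time to Repair:  {mttr_repair / 3600:.1f} h")
print(f"Mean Time to Resolve: {mttr_resolve / 3600:.1f} h")
```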
Furthermore, once issues are understood, repair workflows can be automated with Closed Loop Automation, leaving automated diagnosis as the most challenging hurdle. In the world of stochastic network behaviour, correct diagnosis and root cause determination require intuition, experience and time.
As the modern world relies increasingly on seamless connectivity, reducing MTTR transforms from a metric into a strategic imperative. Whether it's a dropped call, a lagging internet connection, or a network outage, each disruption carries tangible repercussions: it burdens Customer Service desks, hits revenue streams (penalties), and lowers Net Promoter Scores.
By minimising MTTR, telecom operators can achieve a trio of benefits. Firstly, service disruptions are quickly contained, preventing them from cascading into larger “Priority 1” incidents and reducing the costs of operational departments via proactive monitoring and assurance. Secondly, improved incident response times translate directly into enhanced customer satisfaction, resulting in an improved NPS and bolstering trust and loyalty among users who rely on uninterrupted connectivity for their daily operations. Lastly, if MTTR is quick enough, Operators can avoid a portion of their customer complaints altogether by fixing issues before customers notice. This saves costs on Customer Service and keeps customers happy.
Challenges in MTTR Reduction
Reducing Mean Time to Resolve (MTTR) poses unique challenges that demand strategic solutions. Network engineers, tasked with the responsibility of ensuring seamless connectivity, encounter several hurdles that can impede the swift resolution of incidents.
Network Complexity:
Telecom networks often boast elaborate architectures, characterised by a labyrinth of routers, switches, servers, and myriad other interconnected elements, with several services running over combinations thereof. Within this intricate framework, quickly identifying and addressing issues can be like finding a “needle in a haystack”. The sheer complexity of these infrastructures complicates the troubleshooting process, requiring network engineers to navigate through layers of technology to pinpoint the root cause of an incident. Without a holistic “single pane of glass” view of the entire network, including information on how the service operates on it, identifying anomalies and mitigating service disruptions becomes a daunting task, extending the MTTR to hours or even days.
Diversity of Incidents:
The diversity of incidents encountered in telecom networks poses a significant challenge for network operators. These incidents can range from hardware failures and software glitches to cyber-attacks and energy failures, each presenting unique challenges in terms of diagnosis, mitigation, and resolution. The varied nature and combination of these incidents makes it challenging to implement standardised response procedures across all scenarios. Each incident requires a tailored response based on factors such as the nature of the incident, its severity, and its impact on the service. Furthermore, when a configuration change is made or a new device is added to the network, a whole new combination of incidents can occur, and the lessons learned from previous incidents may no longer apply. Rule-based approaches to incident prevention therefore struggle, as it is a never-ending race to keep track of every possible combination of alarms.
Data Overload:
The exponential growth of data generated by modern telecom networks poses a significant challenge to traditional incident response mechanisms. The volume of data from network monitoring systems can overwhelm engineers, limiting their ability to quickly identify and respond to incidents. Sorting through massive siloed datasets to extract relevant information quickly becomes tedious, prolonging the MTTR. Operators sometimes adopt the strategy of only looking at a subset of data considered more “important”, such as alarm data, and forgo data like Syslog altogether. But this results in a reactive network operations approach where fires are extinguished only after they have raised alarms and impacted the service. There is little context on why alarms are happening, meaning the contextual data is only used forensically. An effective method to proactively analyse all network telemetry in real-time, enriched with additional data such as CMDB and BSS data, is necessary.
Diverse Technologies:
The coexistence of diverse technologies within modern telecom networks introduces an additional layer of complexity to the incident resolution process. From traditional legacy systems to cutting-edge services such as 5G, the spectrum of technologies integrated into telecom infrastructures necessitates a multifaceted approach to troubleshooting. Each technology presents its own set of challenges and quirks, requiring specialised expertise for efficient resolution. The convergence of these technologies further complicates the troubleshooting process, as engineers must navigate through siloed systems.
Throwing all of the data into one data lake to “apply AI onto” also fails, as the diversity of devices and technologies creates a “non-homogeneous” data space. This is one of the main reasons purely predictive ML approaches for Network Operations rarely reach an accuracy above 50%, rendering them ineffective for production use.
The architecture and design of AI Network Operations products must follow and comply with Telco frameworks like eTOM. This enables an end-to-end view while maintaining data homogeneity via hierarchical deployments of AI.
Different Kinds of Troubleshooting for MTTR Reduction in Telecom Networks
Now let’s explore some practical approaches for troubleshooting to reduce MTTR. Effective troubleshooting requires a multi-faceted approach to identify, isolate, and resolve issues promptly:
1. Analysis of Alarm Behaviour (to See if There Is a Spike in Alarms of a Certain Type)
One of the basic methods is the analysis of alarm behaviour to identify spikes in activity that may indicate underlying issues. The primary purpose of analysing alarm behaviour is to detect spikes that could signal potential problems or security threats within the network. By identifying spikes in alarm activity, network engineers can pinpoint areas of concern, investigate their causes, and implement corrective measures before these issues escalate into major outages or performance degradations.
Step 1: Alarm Data Collection
The first step is collecting alarm data from a network management system (NMS). An NMS continuously monitors network devices and generates alarms when it detects irregularities or faults. This data includes various alarms related to hardware failures, configuration errors, security breaches, and performance issues. Gathering comprehensive alarm data provides the raw information needed for subsequent analysis. Tools like SolarWinds NMS, PRTG Network Monitor, and Nagios can be used to efficiently collect and manage this data.
Step 2: Categorisation
Once the alarm data is collected, the next step is to categorise these alarms by type (e.g., critical, major, minor), severity, source (e.g., specific devices or network segments), and other relevant criteria. Classification helps in organising the data, making it easier to analyse and interpret. Types of alarms might include hardware failures, security alerts, and performance issues, while severity levels can range from minor warnings to critical failures. Categorising by source involves identifying the specific devices or network segments that generated the alarms. This approach provides an understanding of where and what types of issues are occurring.
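As an illustration of this step, the sketch below (in Python with pandas) categorises a hypothetical alarm export by type, severity and source; real column names will depend on your NMS:

```python
import pandas as pd

# Hypothetical alarm export; actual fields depend on the NMS in use.
alarms = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-05-01 09:00", "2024-05-01 09:05", "2024-05-01 09:07"]),
    "severity":  ["critical", "minor", "major"],
    "type":      ["hardware", "performance", "security"],
    "source":    ["router-01", "switch-07", "fw-02"],
})

# Categorise: count alarms per type and severity, and per originating device.
by_type_severity = alarms.groupby(["type", "severity"]).size().rename("count")
by_source = alarms["source"].value_counts()

print(by_type_severity)
print(by_source)
```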
Step 3: Trend Analysis
Trend analysis involves using statistical tools or machine learning (ML) algorithms to analyse the frequency and distribution of different types of alarms over time. By examining historical data, network engineers can identify patterns and trends that might indicate underlying issues, facilitating proactive interventions. For instance, a consistent increase in performance-related alarms could suggest growing network congestion. ML frameworks like TensorFlow or Scikit-learn can be employed to conduct robust trend analysis, providing deeper insights into the network's behaviour.
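A minimal trend-analysis sketch along these lines, again over hypothetical alarm data, could resample alarm counts per type into hourly buckets and smooth them with a moving average:

```python
import pandas as pd

# Hypothetical alarm log: one row per alarm, with a timestamp and a type.
alarms = pd.DataFrame({
    "timestamp": pd.date_range("2024-05-01", periods=500, freq="17min"),
    "type": ["performance"] * 400 + ["hardware"] * 100,
})

# Trend analysis: hourly alarm counts per type over the observation window.
trend = (alarms.set_index("timestamp")
               .groupby("type")
               .resample("1h")
               .size()
               .unstack(level=0, fill_value=0))

# A steadily rising hourly count of performance alarms may point to congestion.
print(trend.tail())
print(trend.rolling(24).mean().tail())   # 24-hour moving average per type
```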
Step 4: Spike Detection
Trend analysis is followed by spike detection, where sudden increases (spikes) in the number of alarms of a certain type are identified. Detecting spikes is crucial as they often signal acute issues that require immediate attention. This can be achieved using threshold-based methods, where alarms are flagged if they exceed predefined limits. More advanced techniques like moving averages or anomaly detection algorithms can also be utilised to detect less obvious spikes.
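For example, a simple spike detector over hourly alarm counts (hypothetical values) can combine a moving average with a fixed threshold:

```python
import pandas as pd

# Hourly counts of one alarm type, e.g. taken from the trend-analysis step above.
counts = pd.Series(
    [4, 5, 3, 6, 4, 5, 4, 40, 38, 5, 4],
    index=pd.date_range("2024-05-01", periods=11, freq="1h"),
)

window = 6                          # look-back window in hours
mean = counts.rolling(window).mean()
std = counts.rolling(window).std()

# Flag a spike when the count exceeds the moving average by more than 3 sigma,
# with a simple absolute threshold as a fallback during the warm-up period.
spikes = (counts > mean + 3 * std) | (counts > 30)
print(counts[spikes])
```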
Step 5: Root Cause Analysis
Root cause analysis is conducted to understand the context and underlying reasons for the sudden increase in alarms. This step involves investigating various factors such as recent configuration changes, hardware performance, and potential security breaches. This is a manual, time-intensive process. The goal is to determine what caused the spike, for example a configuration change, hardware failure, network attack, or another issue.
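As an illustrative sketch of one small part of this investigation, spike timestamps can be cross-checked against a (hypothetical) change log to shortlist candidate causes:

```python
import pandas as pd

# Spike timestamps from the previous step, plus a hypothetical change log.
spikes = pd.to_datetime(["2024-05-01 07:00", "2024-05-01 08:00"])
changes = pd.DataFrame({
    "applied_at": pd.to_datetime(["2024-04-30 22:00", "2024-05-01 06:45"]),
    "change":     ["firmware upgrade fw-02", "BGP policy update router-01"],
})

# List changes applied within the two hours preceding each spike as candidate causes.
for spike in spikes:
    recent = changes[(changes["applied_at"] <= spike) &
                     (changes["applied_at"] >= spike - pd.Timedelta(hours=2))]
    print(spike, "candidate causes:", recent["change"].tolist() or "none in window")
```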
Step 6: Response
The final step involves developing and implementing a response plan to address the identified issues. Depending on the root cause, this may include patching vulnerabilities, replacing faulty hardware, modifying configurations, or enhancing security measures.
2. Analysis of Network Traffic (to Spot Where the Breakages Happen)
The primary purpose of this analysis is to identify specific points in the network where disruptions or performance degradations occur. By monitoring and analysing traffic, network administrators can detect where breakages in their service are occurring.
Step 1: Traffic Monitoring
The first step in analysing network traffic is collecting real-time data. This can be achieved using packet sniffers or flow monitoring tools like Wireshark, NetFlow, or sFlow. Wireshark captures and analyses packets traversing the network, providing deep insights into the data being transmitted. NetFlow collects IP traffic information, giving a broad overview of network traffic patterns. sFlow provides scalable monitoring by sampling packets at specified intervals.
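As a simple illustration of live capture, the sketch below uses Scapy to summarise a handful of packets; in practice you would rely on the dedicated tools above rather than ad-hoc scripts, and packet capture typically requires elevated privileges:

```python
# Minimal packet-capture sketch using Scapy (run with root/administrator rights).
from scapy.all import sniff

def summarise(packet):
    # Print a one-line summary of each captured packet.
    print(packet.summary())

# Capture 20 packets on the default interface and summarise them without storing them.
sniff(count=20, prn=summarise, store=False)
```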
Step 2: Baseline Establishment
Once traffic data is collected, the next step is to establish a baseline for normal traffic patterns and performance metrics. This involves analysing historical data to determine what constitutes normal network behaviour under various conditions. Key performance indicators (KPIs) such as latency, throughput, and packet loss rates are measured and documented. Establishing a baseline is critical because it serves as a reference point against which future traffic patterns can be compared.
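A minimal baseline-establishment sketch, using hypothetical per-link latency and loss samples, could compute reference percentiles like this:

```python
import pandas as pd

# Hypothetical historical KPI samples per link (latency in ms, packet loss in %).
history = pd.DataFrame({
    "link":       ["core-1"] * 4 + ["core-2"] * 4,
    "latency_ms": [12, 14, 13, 15, 40, 42, 39, 41],
    "loss_pct":   [0.0, 0.1, 0.0, 0.1, 0.2, 0.1, 0.3, 0.2],
})

# Baseline: median and 95th-percentile values per link, used later as reference points.
baseline = history.groupby("link").agg(
    latency_p50=("latency_ms", "median"),
    latency_p95=("latency_ms", lambda s: s.quantile(0.95)),
    loss_p95=("loss_pct", lambda s: s.quantile(0.95)),
)
print(baseline)
```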
Step 3: Traffic Analysis
With a baseline in place, the collected traffic data is then analysed to identify deviations. This analysis focuses on spotting unusual traffic patterns, such as increased latency, packet loss, or drops in throughput. Tools like Wireshark can be used to drill down into specific packets and flows, while NetFlow and sFlow provide a broader overview of traffic trends. By studying these patterns, network engineers can detect signs of congestion, potential security breaches, or other abnormalities that might signal the onset of network problems.
Step 4: Breakage Detection
After identifying unusual traffic patterns, the next step is to pinpoint specific points in the network where these anomalies occur. This may involve tracing the path of affected traffic using diagnostic tools such as traceroute, which map the journey of packets across the network, highlighting each hop, the time taken, and any packet loss along the way.
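As a rough sketch of this step, traceroute output towards a hypothetical probe address can be parsed to flag hops that time out or exceed a latency budget (traceroute output formats vary by platform, so treat the parsing as illustrative only):

```python
import subprocess

TARGET = "203.0.113.10"          # hypothetical probe address
LATENCY_BUDGET_MS = 100.0

# Run traceroute with numeric output and inspect each hop line.
result = subprocess.run(["traceroute", "-n", TARGET],
                        capture_output=True, text=True, timeout=120)

for line in result.stdout.splitlines()[1:]:
    fields = line.split()
    if "*" in fields:
        print(f"Possible breakage (timeouts) at hop: {line.strip()}")
    else:
        # On most platforms, each latency value is the field preceding an 'ms' token.
        latencies = [float(fields[i - 1]) for i, f in enumerate(fields) if f == "ms"]
        if latencies and max(latencies) > LATENCY_BUDGET_MS:
            print(f"High latency ({max(latencies):.0f} ms) at hop: {line.strip()}")
```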
Step 5: Correlation
Once breakages are detected, they can be correlated with the network topology, configuration changes, or other events. This manual process involves examining the network's configuration history and recent changes. For example, a recent firmware update on a router or a configuration change on a switch could be linked to the detected breakage.
Step 6: Resolution
The final step in the analysis is to pinpoint and rectify the issues. This may involve rerouting traffic, updating firmware, or adjusting network configurations. If a particular link is found to be congested, traffic can be rerouted through alternative paths. Firmware updates might be required to fix bugs or security vulnerabilities that are causing disruptions.
3. Analysis of Customer Service Data (to “Triangulate” Likely Cause of Impact Based on Network Topology)
The analysis of customer service data can identify and address network issues from an end-user perspective. This approach leverages customer feedback and service tickets to pinpoint areas in the network that may be causing problems for users on the basis of the network topology.
Step 1: Data Collection
The first step in this process is gathering customer service data, which includes complaints, service tickets, and feedback. This data can be collected from various sources such as customer support emails, call centre logs, outage-reporting sites like DownDetector, and more. Comprehensive data collection ensures that all potential issues are captured, providing a broad view of customer experiences. Tools like Zendesk, Salesforce Service Cloud, and JIRA Service Management can be used to streamline the collection and management of this data. These platforms offer robust features for tracking, organising, and analysing customer interactions, making it easier to gather valuable insights.
Step 2: Categorisation
Once the data is collected, the next step is to categorise the issues based on type, location, service affected, and time of occurrence. This classification helps in organising the data, making it easier to identify patterns and trends. Issues can be categorised into types such as connectivity problems, slow speeds, service interruptions, and billing errors. Additionally, categorising by location and service affected can highlight geographic or service-specific issues. Time-based categorisation can help identify recurring issues or trends over specific periods.
Step 3: Correlation
The next step is to correlate these customer-reported problems with network topology and infrastructure data. This involves mapping the complaints to specific network segments, devices, or services. By correlating this data, network engineers can identify which parts of the network are most frequently associated with reported issues. This correlation helps in understanding the relationship between customer experiences and network infrastructure, highlighting potential problem areas.
Step 4: Triangulation
The triangulation step involves using the correlated data to identify common points of failure or performance bottlenecks. This is done by cross-referencing customer data with network telemetry and traffic data. By triangulating these data sets, network administrators can manually estimate which network domain the issue is most likely originating from.
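A minimal triangulation sketch, assuming tickets have already been mapped to the access node serving each customer and a simple (hypothetical) topology table is available, could count complaints per upstream aggregation node:

```python
import pandas as pd

# Hypothetical tickets mapped to the access node serving each complaining customer.
tickets = pd.DataFrame({
    "ticket_id":   [101, 102, 103, 104, 105],
    "access_node": ["acc-3", "acc-3", "acc-4", "acc-4", "acc-9"],
})
# Hypothetical topology: which aggregation node sits upstream of each access node.
topology = pd.DataFrame({
    "access_node":      ["acc-3", "acc-4", "acc-9"],
    "aggregation_node": ["agg-1", "agg-1", "agg-2"],
})

# Triangulate: if complaints cluster behind one aggregation node, that domain is
# the most likely location of the fault.
suspects = (tickets.merge(topology, on="access_node")
                   .groupby("aggregation_node")["ticket_id"].count()
                   .sort_values(ascending=False))
print(suspects)   # agg-1 carries 4 of 5 complaints, so investigate that domain first
```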
Step 5: Impact Analysis
Once the common points of failure are identified, the next step is to assess the impact of these issues on different customer segments and services. This involves analysing how the identified problems affect various customer groups and service offerings. Understanding the impact helps prioritise the issues based on their severity and the number of affected customers.
Step 6: Action Plan
The final step is to develop and implement an action plan to address the root causes of the identified issues. This may include network upgrades, changes in routing, or improved customer communication strategies. The action plan should be based on the insights gained from the previous steps and should aim to eliminate or mitigate the identified problems.
4. Analysis of Network Telemetry (to Spot Anomalies on Network Devices)
The purpose of analysing network telemetry is to detect abnormal behaviour or performance issues in individual network devices. By examining telemetry data, which includes for example logs, performance metrics, and SNMP data, network administrators can identify and address potential problems before they impact the service. This proactive approach ensures the smooth operation of network infrastructure and helps in maintaining high levels of performance and reliability.
Step 1: Telemetry Data Collection
The first step in detecting anomalies is collecting streaming telemetry data from network devices. This data includes logs, performance metrics, and Simple Network Management Protocol (SNMP) data, which provide insights into the operational state of the devices. For high effectiveness, all raw network telemetry should be available for further analysis, which typically results in vast quantities of data. The telemetry data can be stored in a data lake and accessed by tools, or the devices can be queried directly.
Step 2: Baseline Establishment
Once telemetry data is collected, the next step is typically to establish a baseline of normal operational behaviour for each device. This involves analysing historical data to define what constitutes normal behaviour under various conditions. Key performance indicators (KPIs) such as CPU usage, memory usage, interface statistics, and error rates are measured and documented. Establishing a baseline is critical because it provides a reference point against which future data can be compared. Any deviations from this baseline can indicate potential issues. Tools like Grafana can be used to visualise and establish these baselines effectively.
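A minimal per-device baseline sketch over hypothetical CPU and memory samples could look like this:

```python
import pandas as pd

# Hypothetical telemetry samples (e.g. 5-minute polls) for two devices.
telemetry = pd.DataFrame({
    "device":  ["rtr-1"] * 5 + ["rtr-2"] * 5,
    "cpu_pct": [21, 23, 22, 25, 24, 61, 63, 60, 65, 64],
    "mem_pct": [40, 41, 40, 42, 41, 70, 71, 72, 70, 71],
})

# Baseline per device: mean and standard deviation of each KPI.
baseline = telemetry.groupby("device").agg(["mean", "std"])
print(baseline)
# A later sample of cpu_pct = 45 on rtr-1 would sit far outside its baseline
# (mean ~23, std ~1.6) while being unremarkable for rtr-2.
```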
Step 3: Anomaly Detection
With a baseline in place, the collected telemetry data is then analysed to identify deviations. Anomaly detection techniques can include threshold-based alerts, statistical analysis, or more advanced ML models. Threshold-based alerts trigger when metrics exceed predefined limits, while statistical methods analyse variations from the norm. Machine learning frameworks like TensorFlow or Scikit-learn can be employed to build models that detect anomalies based on patterns and trends in the data.
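For instance, a short sketch using scikit-learn's IsolationForest over hypothetical per-interval telemetry features (CPU, memory, error rate) could flag outlying intervals like this:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical feature matrix per polling interval: [cpu_pct, mem_pct, error_rate].
rng = np.random.default_rng(42)
normal = rng.normal(loc=[25, 40, 0.1], scale=[3, 4, 0.05], size=(500, 3))
anomalous = np.array([[90, 85, 5.0], [28, 41, 4.2]])   # injected anomalies
samples = np.vstack([normal, anomalous])

# Fit an Isolation Forest on the telemetry and flag outliers (label -1 = anomaly).
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(samples)
print("Flagged sample indices:", np.where(labels == -1)[0])
```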
Step 4: Root Cause Analysis
Upon detecting anomalies, the next step is to investigate their root causes. This involves correlating the detected anomalies with recent configuration changes, firmware updates, or other network events. Understanding the context in which the anomalies occurred is crucial for determining their underlying causes. Root cause analysis helps in pinpointing whether the anomalies are due to configuration errors, hardware malfunctions, or other factors, enabling targeted troubleshooting. This is typically a manual process where L2 engineers study the detected anomalies.
Step 5: Proactive Measures
After identifying the root causes of the anomalies, proactive measures need to be taken to address these issues so they do not end up impacting the service. This may involve reconfiguring devices, updating firmware, or replacing faulty hardware. Proactive measures are essential for preventing the recurrence of the identified issues and ensuring the long-term health of the network devices.
Step 6: Ongoing Monitoring and Adjustment
Continuous monitoring and adjustment are vital to maintaining network health and performance. After implementing proactive measures, it's important to keep an eye on the devices to ensure the issues have been resolved and no new anomalies arise. This involves regular analysis of telemetry data and adjustments to the baseline as network conditions evolve.
Step 7: Documentation and Reporting
The final step involves documenting the findings and actions taken during the anomaly detection and resolution process. Detailed documentation helps in creating a knowledge base for future reference and training purposes. It also aids in reporting to stakeholders about the network's health and the steps taken to address any issues.
“Single Pane of Glass” with OptOSS AI
Traditional troubleshooting methods often involve the use of multiple tools and manual processes to collect and analyse data, identify root causes, and implement remediations. These methods, and each step within them, although effective, can be time-consuming and resource-intensive. These processes are also undertaken in silos, meaning teams are often blind to the findings of other approaches. OptOSS AI emerges as a solution that transforms incident response management by offering advanced AI-driven capabilities to unify and automate these troubleshooting approaches, reducing MTTR from days or hours to minutes.
Anomaly Detection at Element level: OptOSS AI excels in using a patented AI approach for streaming anomaly detection and clustering within network telemetry, without the prerequisite of model training. Anomalies are causal patterns in time consisting of several data points. OptOSS AI does this for all available network telemetry in real-time, so no sampling is required, and it is deployed hierarchically to monitor each network domain in the service.
Service topology and attribution of Impact: OptOSS AI has the ability to discover the topology of massive networks on each configuration change and build the Service Chains across the network domains so you can track how an element level anomaly impacts the Service. This can also be achieved by integrating with Configuration Management data.
Correlation with Customer complaints: Element-level anomalies can be automatically correlated with customer complaints by leveraging the Service Chain information. OptOSS AI helps you identify which element-level anomalies are the most customer-impacting, so you can prioritise your NOC efforts.
OptOSS AI combines all of the aforementioned approaches so your Operations teams can reduce MTTR and ensure optimal network performance. By identifying patterns, detecting anomalies, and pinpointing root causes, attributing impact and more, network administrators can proactively address issues before they escalate.
Data Ingestion: Network Discovery, Self-Updating Topology, Activity Monitoring
OptOSS AI can ingest data-at-rest (historical data) and data-in-motion (real-time, streaming data): any structured (metrics, indicators) or unstructured (logs, strings) time-series data relating to your IT infrastructure.
Data Processing: Anomaly Detection, Cross Correlation, Anomaly Clustering
Any time-series data (structured and unstructured) can be analysed by OptOSS AI, where irregular patterns are detected within two seconds of the data being received. OptOSS AI applies its patented process to detect, cluster, label and recognise known and unknown irregularities on critical infrastructures, enabling your Operations Centers to proactively monitor your networks.
Analytics & Automation: Closed Loop Automation, Data Mining, Root Cause Determination, Service Impact Attribution, Generative AI
OptOSS AI can completely automate complex processes. Once OptOSS AI is monitoring data streams produced by a network, it can autonomously cross correlate across multiple domains and fully execute complex troubleshooting processes. With the automation features of OptOSS AI, users can automate low-risk and medium-risk recurring occurrences with Closed Loop Automation.
Dashboards & Visualisations: Streaming Alarm View, Network "CT Scan" View, Sankey Chart View, Historical Trend Analysis Graphs, Geo-Mapping
OptOSS AI acts as a “Single Pane of Glass” and offers a range of highly customisable widgets, E2E overview dashboards, and more granular visualisations to provide Ops teams with a multitude of useful and pragmatic insights. Furthermore, custom widgets and visualisations can be built for teams that want to closely track the peskiest of problems!
Shift from a reactive to a proactive approach with OptOSS AI, and ensure seamless connectivity and reliability for your telecom networks!