Network outages significantly impact the operations and reputation of companies. Uptime Institute research shows an upward trend in the prevalence of major outages. One in five organisations reports a “serious” or “severe” outage, which involves not only reputational damage and impact on users, but also significant financial losses. Recent research shows that the share of telecom operators reporting outages has grown by 68%, from 19% in 2016 to 32% in 2023.
Networking issues are a growing cause of IT outages, driven by the complexity of modern, dynamically configured environments. The 2023 Uptime Resiliency Survey highlights that configuration failures (45%) and third-party provider issues (39%) are the leading causes. Unlike the static networks of the past, today's flexible and software-defined systems frequently undergo updates, making small errors inevitable and capable of cascading into widespread failures.
Trend data shows that the proportion of single major outages costing over $100,000 is increasing, from 15% in 2021 to 25% in 2022.
Why is the cost of outages skyrocketing?
While factors like inflation, SLA breaches, regulatory fines, and labor costs play a role, the primary driver is the increasing reliance on digital services for economic activity. When critical IT systems fail, businesses face immediate operational disruptions and lost revenue. In recent years, certain outages have cost over $150 million, with fines, compensation, and lost business compounding the impact (Uptime Institute). Let's take a look at some of the biggest outages of recent times to understand their causes, impact, and consequences, and how they could have been prevented.
BT outage (June 2023)
On June 25, 2023, BT, the sole emergency call handling provider in the UK, suffered a significant technical failure that disrupted its Emergency Call Handling Service (ECHS). The outage, lasting approximately 10.5 hours, severely affected emergency calls to 999 and 112. During this period, there were nearly 14,000 unsuccessful call attempts, affecting over 12,000 unique callers. In light of BT's shortcomings outlined in the investigation report, Ofcom imposed a financial penalty of £17.5m. This penalty, reduced by 30% from an initial £25 million due to BT’s cooperation, was deemed appropriate given the critical nature of emergency communication services (Ofcom).
The outage began at 06:24 on 25 June, affecting BT's ECHS availability until the service was fully restored at 16:56. The first hour, between 06:24 and 07:33, had the worst impact, with 64% of calls being dropped due to various technical issues. These included system restarts that logged out Call Handling Agents (CHAs) and call disconnections. The root cause was traced to a configuration error within the media server of BT’s Next Generation X (NGX) platform. The NGX platform is designed to manage high volumes of VoIP traffic by coordinating call routing, media processing, and signalling functions across a distributed network. This sophisticated infrastructure supports handling of emergency call traffic, ensuring redundancy and load balancing across nodes. The configuration error disrupted these coordinated functions, directly affecting emergency call processing (Investigation report).
An erroneous configuration change triggered a feedback loop that overloaded the media server, leading to repeated unplanned restarts. The media server, designed to handle high volumes of VoIP traffic, requires precise configuration to manage SIP sessions, audio transcoding, and RTP packet routing. The issue stemmed from a misconfigured resource allocation parameter related to session handling limits. The change, which bypassed standard version control and verification protocols, introduced an unforeseen state where the media server attempted to process calls beyond its fail-safe threshold.
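The published report does not disclose the exact parameter values involved, but the failure mode points to the kind of guard rail that was missing. Below is a minimal, hypothetical sketch (the names and thresholds are assumptions, not BT's actual tooling) of a pre-deployment check that rejects a session-handling limit outside the media server's fail-safe envelope:

```python
# Hypothetical pre-deployment guard: reject a session-handling limit that would
# let the media server accept more load than its fail-safe design allows.

FAILSAFE_MAX_SESSIONS = 50_000   # assumed capacity ceiling of one media server
FAILSAFE_MIN_SESSIONS = 1_000    # assumed sanity floor

def validate_session_limit(proposed_limit: int) -> None:
    """Raise if the proposed limit falls outside the fail-safe envelope."""
    if not (FAILSAFE_MIN_SESSIONS <= proposed_limit <= FAILSAFE_MAX_SESSIONS):
        raise ValueError(
            f"Session limit {proposed_limit} is outside the fail-safe range "
            f"[{FAILSAFE_MIN_SESSIONS}, {FAILSAFE_MAX_SESSIONS}]; "
            "the change must go back through version control and review."
        )

try:
    validate_session_limit(500_000)   # a value mistakenly set an order of magnitude too high
except ValueError as err:
    print(err)                        # caught at review time instead of in production
```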
This misconfiguration led to cascading failures:
Resource Starvation: The media server's CPU and memory resources were rapidly exhausted, resulting in forced terminations of active processes.
Session Table Overflows: Persistent session data exceeded buffer capacities, causing dropped and incomplete calls.
Unintended Failover Activation: The system's automated failover mechanism was triggered repeatedly but failed to stabilise as the standby systems inherited the corrupted state, leading to propagation of the issue.
The deployment of the configuration change lacked robust pre-deployment testing. A comprehensive load simulation that could have replicated peak traffic conditions was not performed, leaving the system vulnerable to untested states. Additionally, BT’s monitoring suite, integrated with the NGX platform, failed to provide timely alerts indicating the severity of the situation. The alert thresholds were configured for typical operational variances and did not account for the rapid degradation triggered by the configuration error. Network telemetry and log analysis tools were unable to correlate the sequence of events quickly enough to assist the initial diagnosis.
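As a rough illustration of that monitoring gap (the metrics and thresholds below are assumptions, not BT's actual alerting suite), an alert that tracks the rate of change of a health metric fires on rapid degradation long before a static limit tuned for typical operational variance is breached:

```python
# Minimal sketch: alert on the rate of change of a health metric, not only on a
# static threshold, so rapid degradation is flagged within a few samples.

from collections import deque

STATIC_LIMIT = 0.90          # assumed "typical variance" threshold (90% CPU)
RATE_LIMIT = 0.10            # assumed maximum tolerated rise per sample interval
WINDOW = 5                   # samples kept for the rate calculation

samples: deque[float] = deque(maxlen=WINDOW)

def check(cpu_utilisation: float) -> list[str]:
    """Return the alerts raised by a new CPU utilisation sample (0.0-1.0)."""
    alerts = []
    if cpu_utilisation > STATIC_LIMIT:
        alerts.append("static threshold breached")
    if samples and cpu_utilisation - samples[-1] > RATE_LIMIT:
        alerts.append("rapid degradation: rise per interval exceeds rate limit")
    samples.append(cpu_utilisation)
    return alerts

# A jump from 40% to 75% CPU in one interval is still well below the static
# limit, but the rate-based check fires immediately.
for value in (0.38, 0.40, 0.75):
    print(value, check(value))
```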
In response to the incident, BT implemented a series of measures to strengthen its network resilience and disaster recovery processes. These include addressing the root cause by fixing the configuration error across all three nodes and introducing alarms to flag similar issues in the future. BT enhanced fault monitoring at specific sites and nodes and established well-documented, tested failover and failback procedures for the Disaster Recovery (DR) platform, reinforced through training and simulations. Automation was prioritised with an automated failover process to minimise reliance on human intervention. Upgrades to the DR platform significantly increased its call queue capacity, resolved defects in caller location information handover, and integrated capabilities for Emergency Relay Calls. BT also strengthened its collaboration with the UK government, Ofcom, and Emergency Authorities to ensure better information sharing during critical events.
Optus Outage (November 2023)
On 8 November 2023, Australia experienced one of its worst telecommunications outages. The nationwide Optus (the country's second-largest telecoms provider) network outage affected more than 10 million customers and thousands of businesses (BBC). The outage disrupted mobile, internet, and NBN services for up to 14 hours, significantly affecting public services, emergency response capabilities, and essential infrastructure. Hospitals faced interruptions that compromised patient care, businesses were unable to trade, and transport networks experienced delays.
The outage began around 4:05 AM AEDT, when approximately 90 of Optus's PE (provider edge) routers automatically self-isolated to protect themselves from an overload of IP routing information (Parliament of Australia, infrastructure.gov.au). These safety limits were default settings supplied by a global equipment vendor (infrastructure.gov.au). The trigger for this self-isolation was a change in routing information that originated from an alternate Singtel peering router following a routine software upgrade (The Guardian). The propagation of these changes overwhelmed ‘multiple layers of the IP Core network, the pre-set safety limits on a significant number of Optus network routers were exceeded and connectivity with the core network was lost’, effectively paralysing the network (infrastructure.gov.au).
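To make the mechanism concrete, the sketch below simulates a pre-set safety limit of the kind described (the numbers and names are illustrative assumptions, not Optus's or the vendor's actual settings): when a flood of routing updates arrives from an upstream peer, every router that hits the limit withdraws from the core at the same time.

```python
# Illustrative simulation of routers self-isolating when a pre-set safety limit
# on received routing information is exceeded. All values are assumptions.

MAX_PREFIXES = 500_000  # assumed vendor-default safety limit per router

class PERouter:
    def __init__(self, name: str):
        self.name = name
        self.prefixes: set[str] = set()
        self.isolated = False

    def receive_routes(self, routes: list[str]) -> None:
        """Install received routes; self-isolate if the safety limit is exceeded."""
        self.prefixes.update(routes)
        if len(self.prefixes) > MAX_PREFIXES and not self.isolated:
            self.isolated = True   # the router drops out of the IP core
            print(f"{self.name}: prefix limit exceeded, self-isolating")

# A routine upgrade on a peering router leaks far more routes than expected,
# and every PE router that receives the flood isolates itself simultaneously.
leaked_routes = [f"route-{i}" for i in range(600_000)]   # placeholder identifiers
for router in (PERouter(f"pe-{i:02d}") for i in range(3)):
    router.receive_routes(leaked_routes)
```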
One of the most concerning consequences was the impact on emergency services. Optus customers, with the exception of those connected through the Campbellfield exchange, were unable to make Triple Zero (000) calls. User devices are typically designed to camp on to alternate networks for emergency calls when their home network fails, but this mechanism faltered. The 4G and 5G base stations ‘wilted’ (shut down automatically) as a failsafe to encourage devices to switch networks. However, the 3G base stations continued radiating signals without providing service, causing some devices to attempt emergency calls through those non-functional units and receive error signals.
The Australian Communications and Media Authority (ACMA) found that Optus’s failure to provide access to emergency call services affected 2,145 people. Moreover, 369 welfare checks on customers who had attempted emergency calls during the outage were not conducted. ACMA Chair Nerida O’Loughlin emphasised that these failures had profound implications, stating, “Triple Zero availability is the most fundamental service telcos must provide to the public. When an emergency call fails to connect, there can be devastating consequences for public health and safety.”
In the aftermath, Optus identified two primary deficiencies: the inability to access its core infrastructure through its Operations and Maintenance (OAM) network and the inability to remotely shut down 3G base stations. To address these issues, Optus outlined a plan to bolster its capabilities by March 28, 2024. This plan includes enhancing remote access to its 3G core infrastructure and improving the recovery process for its IP core routers to reduce restoration timeframes.
Optus’s commitment to preventing future outages also involves investments in network resiliency. Following the incident, the company introduced changes to prevent similar failures and pledged to continue enhancing its services. Optus has incurred penalties exceeding $12 million for breaches of emergency call rules after an ACMA investigation (ACMA). Also, as part of its immediate response, the company offered affected customers an extra 200GB of data as compensation (ABC). Additionally, Optus later provided cash compensation to customers, responding to criticism of its initial offer (ABC).
AT&T Mobility LLC (February 2024)
On Thursday, February 22, 2024, at 2:45 AM CST, AT&T Mobility LLC experienced a nationwide wireless service outage that lasted over twelve hours. It affected customers across the United States, causing widespread disruptions to mobile phone, internet, and home phone services. Beyond AT&T's own customers, it disrupted services for mobile virtual network operators (MVNOs) and wireless providers reliant on AT&T’s infrastructure. Over 125 million devices were disconnected, cutting off vital communication channels for individuals, businesses, and public safety personnel.
The incident underscored vulnerabilities in emergency response systems. With over 25,000 911 calls blocked, many communities faced heightened risks during the outage. The outage also disrupted service to devices operated by public safety users of the First Responder Network Authority (FirstNet). Although AT&T prioritised its restoration, notifications to FirstNet users were delayed by three hours, raising questions about preparedness and real-time communication.
At the heart of the outage was a network configuration error introduced during a routine maintenance update. A misconfigured network element propagated errors, triggering AT&T’s Protection Mode, which disconnected devices to prevent cascading failures. Although the safety mechanism successfully mitigated broader network damage, a lack of peer review allowed the misconfiguration to bypass established procedures that require design reviews. Furthermore, the network's limited system resilience meant there were no sufficient safeguards to address such misconfigurations, leading to an over-reliance on the drastic step of entering Protection Mode.
When AT&T Mobility’s network entered Protection Mode, all connected devices were dropped, requiring re-registration once the mode was lifted. As the misconfigured network element was corrected, all user devices attempted to re-register and reconnect simultaneously, overwhelming the network management systems and causing severe congestion. This congestion delayed device registrations, extending the outage. AT&T employees worked to mitigate the delays and manage the flood of registration attempts, restoring most services by midday. However, congestion from mass re-registrations persisted into the afternoon, leading to failures in certain 911 calls, highlighting the cascading impact of such incidents on critical services.
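One widely used way to blunt this kind of re-registration storm is randomised (jittered) exponential backoff on the device side, so retries spread out over time instead of arriving at the same instant. The sketch below illustrates the general idea only; it is not a description of AT&T's registration procedures.

```python
# Full-jitter exponential backoff: each device waits a random delay drawn from an
# exponentially growing window before retrying registration, so attempts spread
# out over minutes rather than arriving simultaneously.

import random

def reregistration_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Delay in seconds before re-registration attempt `attempt` (0-based)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# First three retry delays for one device; every device draws different values.
for attempt in range(3):
    print(f"attempt {attempt}: wait {reregistration_delay(attempt):.1f}s")
```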
Following the outage, AT&T introduced a series of measures to prevent a similar incident from occurring again. According to the FCC report, within 48 hours of the outage, the company deployed new technical controls across its network, conducting thorough scans to identify and address any vulnerabilities in network elements. Additionally, AT&T revised its operational protocols, mandating stricter peer review processes and ensuring that no maintenance work proceeds without proper confirmation that these reviews have been completed.
AT&T had already faced a 911 outage in August 2023, impacting parts of Illinois, Kansas, Texas, and Wisconsin. That disruption occurred during routine testing of its 911 network, when a contractor's technician unintentionally disabled a portion of the system. Unfortunately, AT&T’s network failed to automatically adjust to this disabled segment, triggering the outage (FCC). Following the FCC Enforcement Bureau's investigation into the August 22, 2023 911 outage, AT&T agreed to pay a $950,000 civil penalty (FCC). Just a day after AT&T reached this settlement with the FCC (August 28, 2024), the company experienced another outage. While specific details on the number of affected customers and the duration were not disclosed, reports on Down Detector indicated a sharp increase in issues starting at 5 p.m. ET, peaking around 7 p.m., and tapering off by 10 p.m. (CNN).
How OptOSS AI helps prevent Telecom Outages and mitigate Service Disruptions
Recent high-profile telecom outages, including those experienced by BT (UK), Optus (Australia), and AT&T (US), revealed serious vulnerabilities in modern telecom networks, as well as outdated ways of working that are not ready to address issues in complex, continuously changing telecom environments.
Misconfigurations, faulty failovers, human errors, and complexity have led to prolonged service disruptions, affecting millions of customers and impacting essential services like emergency calls and business processes. With OptOSS AI, however, these issues could have been proactively avoided before end users noticed any service impact, or the Mean Time To Resolve (MTTR) could have been dramatically reduced.
OptOSS AI’s real-time anomaly detection, coupled with OptOSS MANAGER’s end-to-end service impact estimation and cross-domain correlation, could have enabled immediate identification and resolution of problems, ensuring business continuity and minimising service disruption. Let’s explore how OptOSS AI could address the recent failures we reviewed above.
In the BT outage case, detection with OptOSS AI could have been possible within the very first seconds of the incident. The configuration error would be immediately visible as a spike in the cumulative anomaly severity, starting from Patient Zero. Visualised in the CT-Scan, this would look as follows:
OptOSS AI would detect the ‘Patient Zero’ and extract (as sketched after this list):
the exact time,
the device on which the issue arose,
the chain of events the anomaly consists of,
related attribution, including the affected interfaces.
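Conceptually, identifying Patient Zero amounts to finding the earliest device whose anomaly severity spiked and collecting the correlated events that follow it. The simplified sketch below illustrates that idea with hypothetical data structures; it does not use OptOSS AI's actual interfaces.

```python
# Simplified, hypothetical illustration of Patient Zero identification from a
# stream of per-device anomaly records.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Anomaly:
    timestamp: datetime
    device: str
    interface: str
    severity: float
    message: str

def find_patient_zero(anomalies: list[Anomaly], spike_threshold: float = 5.0):
    """Return the earliest record whose severity crossed the spike threshold,
    plus the chain of subsequent correlated events."""
    spikes = sorted(
        (a for a in anomalies if a.severity >= spike_threshold),
        key=lambda a: a.timestamp,
    )
    if not spikes:
        return None, []
    patient_zero = spikes[0]
    chain = [a for a in anomalies if a.timestamp >= patient_zero.timestamp]
    return patient_zero, chain
```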
Integration with maintenance records and configuration management data allows OptOSS AI to correlate whether any maintenance-related activity was conducted or whether the issue is unrelated to maintenance. Remediation would depend on the preferred implementation, considering our clients' evolution model as they progress through iterations toward fully autonomous networks. Here are possible options:
1. Basic - with process injection:
An incident ticket would automatically be forwarded to the proper team with the root cause and all related details, so the response team has all the required information about what happened. Using our LLM assistant function, the anomaly would also be explained in layman's terms with possible fixes, so the user understands how to address the issue.
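As a rough sketch of such process injection (the field names and routing target are hypothetical, not a specific ticketing system's API), the detected anomaly data can be packaged into a ticket automatically:

```python
# Hypothetical ticket assembly from detected anomaly data; field names are
# illustrative, not tied to any particular ticketing system.

import json
from datetime import datetime, timezone

def build_incident_ticket(patient_zero: dict, chain: list[dict], summary: str) -> str:
    """Assemble an incident ticket payload from detected anomaly data."""
    ticket = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "severity": "critical",
        "root_cause_device": patient_zero["device"],
        "root_cause_time": patient_zero["timestamp"],
        "event_chain": chain,                         # the sequence the anomaly consists of
        "summary_for_responders": summary,            # e.g. produced by the LLM assistant
        "assignee_team": "voice-core-operations",     # hypothetical routing target
    }
    return json.dumps(ticket, indent=2)

print(build_incident_ticket(
    {"device": "media-server-01", "timestamp": "2023-06-25T06:24:00Z"},
    [{"event": "unplanned restart"}],
    "Media server overloaded after a configuration change; revert and verify.",
))
```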
2. Semi-Automated - with OptOSS AI advanced knowledge pack:
With the OptOSS AI advanced knowledge pack implemented, in addition to warning the proper team, it would be possible to link scripts so that when such an anomaly arises they store the current configuration, revert to the previous stable one, and check whether the issue persists and, via OptOSS MANAGER, whether the service is still impacted (a minimal sketch of this rollback workflow is shown below). On-premise generative AI could help support technical teams by explaining the context and proposing resolutions with step-by-step instructions.
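A minimal sketch of that rollback workflow, assuming placeholder helpers for device access, configuration archiving, and the service-impact check (none of these are real OptOSS or vendor APIs):

```python
# Placeholder helpers standing in for real integrations (NETCONF/SSH access,
# configuration archive, OptOSS MANAGER service-impact query).

def fetch_running_config(device: str) -> str:
    return f"<running config of {device}>"         # would pull the live config

def archive_config(device: str, config: str) -> None:
    print(f"archived faulty config of {device} for the RCA")

def apply_config(device: str, config: str) -> None:
    print(f"applied config to {device}")           # would push the stable config

def service_still_impacted(device: str) -> bool:
    return False                                   # would query the impact estimation

def rollback_on_anomaly(device: str, last_stable_config: str) -> bool:
    """Snapshot the current configuration, revert to the last stable one,
    and report whether the service impact has cleared."""
    archive_config(device, fetch_running_config(device))
    apply_config(device, last_stable_config)
    return not service_still_impacted(device)

print(rollback_on_anomaly("media-server-01", "<last stable config>"))
```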
3. OptOSS Autonomous Networks:
A change like this, one that disrupted call handling, would not pass acceptance, and the configuration would be reverted (with a human in the loop, of course). With OptOSS AI, the aforementioned configuration change would trigger a warning during the maintenance acceptance testing stage by indicating anomaly spikes on the NGX platform’s media server. However, if the maintenance team forced acceptance of the changes, the troubleshooting team would have full clarity on the incident, including the RCA and Patient Zero, immediately after the issue appeared.
This could save most of the time otherwise wasted acting in a rush without clarity, minimise human error, and potentially even avoid the incident altogether.
In the 21st century, telecom companies faced the challenge of addressing the growing complexity of networks while still using outdated rule-based toolsets and experiencing a shortage of resources. The toolsets of the 20th century could not address the issues raised by the networks of the 21st century. This is why a new toolset was required, and OptOSS was invented to address the networks of the future, using AI to manage the scale of complexity and make humans superhuman: the next-generation OSS.