
AT&T failed to test disastrous update that kicked all devices off network

A government investigation has revealed more detail on the impact and causes of a recent AT&T outage that happened immediately after a botched network update. The nationwide outage on February 22, 2024, blocked over 92 million phone calls, including over 25,000 attempts to reach 911.

As described in more detail later in this article, the FCC criticized AT&T for not following best practices, which dictate “that network changes must be thoroughly tested, reviewed, and approved” before implementation. It took over 12 hours for AT&T to fully restore service.

“All voice and 5G data services for AT&T wireless customers were unavailable, affecting more than 125 million devices, blocking more than 92 million voice calls, and preventing more than 25,000 calls to 911 call centers,” the Federal Communications Commission said yesterday. The outage affected all 50 states as well as Washington, DC, Puerto Rico, and the US Virgin Islands.

The outage also cut off service to public safety users on the First Responder Network Authority (FirstNet), the FCC report said. “Voice and 5G data services were also unavailable to users from mobile virtual network operators (MVNOs) and other wireless customers who were roaming on AT&T Mobility’s network,” the FCC said.

An incorrect process

AT&T previously acknowledged that the mobile outage was caused by a botched update related to a network expansion. The “outage was caused by the application and execution of an incorrect process used as we were expanding our network, not a cyber attack,” AT&T said.

The FCC report said the nationwide outage began three minutes after “AT&T Mobility implemented a network change with an equipment configuration error.” This configuration error caused the AT&T network “to enter ‘protect mode’ to prevent impact to other services, disconnecting all devices from the network, and prompting a loss of voice and 5G data service for all wireless users.”

While the network change was rolled back within two hours, full service restoration “took at least 12 hours because AT&T Mobility’s device registration systems were overwhelmed with the high volume of requests for re-registration onto the network,” the FCC found.

Outage reveals deeper problems at AT&T

Although a configuration error was the immediate cause of the outage, the FCC investigation revealed various problems in AT&T’s processes that increased the likelihood of an outage and made recovery more difficult than it should have been. The FCC Public Safety and Homeland Security Bureau analyzed network outage reports and written responses submitted by AT&T and interviewed AT&T employees. The bureau’s report said:

The Bureau finds that the extensive scope and duration of this outage was the result of several factors, all attributable to AT&T Mobility, including a configuration error, a lack of adherence to AT&T Mobility’s internal procedures, a lack of peer review, a failure to adequately test after installation, inadequate laboratory testing, insufficient safeguards and controls to ensure approval of changes affecting the core network, a lack of controls to mitigate the effects of the outage once it began, and a variety of system issues that prolonged the outage once the configuration error had been remedied.

At 2:42 am CST on February 22, an AT&T “employee placed a new network element into its production network during a routine night maintenance window in order to expand network functionality and capacity,” the FCC said. The configuration “did not conform to AT&T’s established network element design and installation procedures, which require peer review.”

An adequate peer review should have prevented the network change from being approved and from being loaded onto the network, but this peer review did not take place, the FCC said. The configuration error was made by one employee, and the misconfigured network element was loaded onto the network by a second employee.

“The fact that the network change was loaded onto the AT&T Mobility network indicates that AT&T Mobility had insufficient oversight and controls in place to ensure that approval had occurred prior to loading,” the FCC said.
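To make the kind of control the FCC describes concrete, here is a minimal Python sketch of a pre-load gate that refuses to load a change unless someone other than its author has reviewed it and a lab test is on record. It is an illustration only, not AT&T's tooling or anything described in the report: the `ChangeRequest` fields, the `ApprovalError` exception, and the `load_change` function are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed network change. All field names are hypothetical."""
    change_id: str
    author: str
    reviewers: set[str] = field(default_factory=set)
    lab_tested: bool = False


class ApprovalError(Exception):
    """Raised when a change has not been cleared for loading."""


def check_ready_to_load(change: ChangeRequest) -> None:
    # Peer review only counts if someone other than the author signed off.
    independent_reviewers = change.reviewers - {change.author}
    if not independent_reviewers:
        raise ApprovalError(f"{change.change_id}: no independent peer review")
    if not change.lab_tested:
        raise ApprovalError(f"{change.change_id}: no lab test on record")


def load_change(change: ChangeRequest, loader: str) -> str:
    # The gate runs at load time, so a second person loading the change
    # cannot bypass a review that never happened.
    check_ready_to_load(change)
    return f"{change.change_id} loaded by {loader}"
```

In this sketch the check sits inside the load step itself rather than relying on the person loading the change to confirm approval separately, which is the gap the FCC's finding points to.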

AT&T faces possible punishment

AT&T issued a statement saying it has “implemented changes to prevent what happened in February from occurring again. We fell short of the standards that we hold ourselves to, and we regret that we failed to meet the expectations of our customers and the public safety community.”

AT&T could still face penalties. The Public Safety and Homeland Security Bureau referred the matter to the FCC Enforcement Bureau for potential violations of FCC rules.

Verizon Wireless last month agreed to pay a $1,050,000 fine and implement a compliance plan because of a December 2022 outage in six states that lasted one hour and 44 minutes. The Verizon outage was similarly caused by a botched update, and the FCC investigation revealed systemic problems that made the company prone to such outages.
