Computerworld

Telstra’s broadband outage: How a minor software change became a big problem

Minor update causes major problems

What should have been a straightforward software update to a DNS server last week snowballed into a major broadband outage for Telstra from which the telco is still recovering.

“We are incredibly apologetic to our customers for the inconvenience that’s been caused to them by this recent service interruption,” Telstra’s chief operations officer Kate McKenzie said today.

The update that triggered the widespread problems for Telstra broadband customers was rolled out on Thursday night last week.

“A software update to one of our domain name servers caused that server to go down,” McKenzie said.

“It had a flow-on effect to our customers’ modems and they couldn’t undertake the regular check that they normally do in that environment. We actually fixed the software bug overnight on Thursday night and on Friday morning we had thought that the problem was resolved, but then what happened was that we didn’t anticipate the flow-on effect.”

“We do hundreds and hundreds of changes every night of the week and history shows that we get those correct,” said Telstra’s head of networks, Mike Wright.

“This particular change really just caused a short outage in a DNS and should have had no consequential impact,” Wright said.

However, the outage exposed previously unknown behaviour in Telstra modems after their heartbeat system failed to detect the network.

“A few of the modems had a residual problem in software that caused them to continually reboot and that’s really been what we’ve been managing and recovering from for the most of this time,” Wright said.
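Telstra has not published details of the modem bug, so the following is only an illustrative sketch of the general idea: a device whose heartbeat check fails can wait progressively longer between retries instead of rebooting each time. The function names, parameters and timings here are hypothetical and are not taken from any Telstra or vendor firmware.

```python
import random


def backoff_delays(base=1.0, cap=300.0, attempts=8, jitter=0.0, rng=None):
    """Return the wait in seconds before each retry: doubling, up to a cap.

    All values are illustrative assumptions, not real modem settings.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        # Optional jitter spreads retries out so that many devices
        # do not all retry at the same moment after an outage.
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


def run_heartbeat(check, delays, sleep=lambda seconds: None):
    """Call `check` until it succeeds or the delays run out.

    Returns the 1-based attempt number that succeeded, or None.
    A failed check leads to a wait, never to a reboot.
    """
    for attempt, delay in enumerate(delays, start=1):
        if check():
            return attempt
        sleep(delay)
    return None


if __name__ == "__main__":
    # Simulate a network that comes back on the third check.
    results = iter([False, False, True])
    delays = backoff_delays(base=1, cap=10, attempts=5)
    print(delays)
    print(run_heartbeat(lambda: next(results), delays))
```

In this sketch the `sleep` argument is a stand-in so the loop can be exercised without real waiting; a real device would pass an actual timer.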

In many cases the issues could be fixed by power cycling the modem or performing a factory reset, but Telstra has begun shipping out free modems to a small number of customers who have continued to suffer problems.

“We’re working with our partners and vendors; we’ll get to the bottom of the actual issue and [be] able to send out a software update once we understand it fully and we’re comfortable that it’s been tested properly. That should protect us against that particular bug,” Wright said.

“We’ll do a very thorough look at precisely what happened and if we need to make any adjustments to procedures or talk to our vendors about different ways of doing these things,” McKenzie said.

The problems affecting the telco’s fixed-line broadband services follow a series of high-profile disruptions to Telstra’s mobile services earlier this year.

In response to those mobile disruptions, Telstra commissioned a network review and has since said it will spend $50 million on measures to prevent a repeat.

Telstra suffered a major mobile outage in February, which the telco blamed on human error. In March, Telstra’s network suffered two bouts of problems.

“While no network operator in the world can guarantee that disruptions will not occur from time to time, what we can do is reduce the likelihood and the impact of those disruptions,” Telstra CEO Andrew Penn said in a speech earlier this month at an event hosted by the American Chamber of Commerce in Australia.

“We are acutely aware of the impact the outages had on our customers and we are committed to rebuilding their trust in us by meeting, if not exceeding, those expectations every day.

“That is the experience we are aiming to offer our customers: The right offer, the right services, the latest product and technology, at the right time on the best networks.”