An analysis of log data from the HP Enterprise 3PAR SAN blamed for the dramatic outage of the Australian Taxation Office’s core systems in December reveals a number of warning signs in the lead-up, according to a report made public today by the ATO.
Investigations into that outage and a subsequent outage in February are continuing, but the agency today released a report outlining its current understanding of the root causes.
The SAN blamed for the outage was operated and maintained by HP Enterprise in a Sydney data centre.
Early on 12 December, the SAN was hit by multiple component failures, and its attempts to recover automatically were unsuccessful. Compounding the problem, control and management systems relied on the same data pathways as the production systems supporting the affected ATO services, the report reveals.
The outage began at 12.40am, and by 3.35am 455 of the SAN's 3063 drives were inaccessible, with firmware preventing the affected drives from rebooting.
“Despite having met ATO specified conditions for categorisation as a Priority 1 incident at this time (3.35am) service provider logs indicated the incident was not escalated to this level until around 7.00am that morning,” the report states.
“The fact that system management, configuration, monitoring, and data recovery systems that were relying on the SAN also experienced outage extended the recovery process for some applications.”
The report singles out stressed fibre optic cabling as one of the major physical problems with the SAN.
“At this stage of the investigation, we consider that stressed fibre optic cabling issues were a major contributor to this outage,” the report states.
Logs from the six-month period before the outage reveal potential problems with the SAN similar to those experienced during the December outage.
“While HPE had taken some actions in response to these indicators – including the replacement of specific cables – alerts continued to be reported, indicating these actions did not resolve the potential SAN stability risk,” the report states.
The outage in February this year followed remedial work on the cables. While a cable was being replaced, a data card attached to the SAN was dislodged, which reproduced the series of problems encountered in December.
The report reveals that in designing the Sydney SAN “there was a relative focus on performance”: the design did not cater for more than a single drive failure or a single cage failure, and some built-in resilience and monitoring features were not enabled.
The report makes 14 recommendations, including boosting the ATO’s “IT capability pertaining to infrastructure design and implementation planning (particularly relating to resilience and availability)”.
“We are committed to addressing each of these areas and have already made improvements or have work underway,” a statement from the ATO said.
“We are also well progressed in our work to ensure the stability, security and reliability of our systems ahead of Tax Time 2017.
“We are mindful of the disruption that the outages caused the community and our key stakeholders – practitioners, the superannuation industry and digital service providers – and are considering what further measures we can put in place to minimise the risk of the ATO and the community being exposed to this type of incident in the future.”
The full report (PDF) is available online.