Down for 4 days is a disgrace

September 28th, 2011 at 11:03 am by David Farrar

Hamish Fletcher at NZ Herald reports:

Internet failures that have forced the Companies Office and 14 other Government websites offline for four days have been described as an international embarrassment.

Other websites caught up in the outage include the Personal Property Securities Register, Intellectual Property Office and Ministry of Consumer Affairs sites. …

The MED blamed the outage on upgrade work to the servers hosting the sites.

Unscheduled outages should be measured in minutes, not days.

The Companies Office site in particular is a high-demand site.

I have some experience with such a site. I was a director from 2002 to 2010 of the NZ Domain Name Registry Ltd, which is known as .nz Registry Services (NZRS). They operate the registry for .nz domain names.

The service level agreement with the Domain Name Commission Ltd (which I now serve on) specifies that the registry must be available 99.9% of the time (excluding scheduled outages notified in advance) on a monthly basis. This means that any unscheduled outages must last no longer than 43 minutes over a month, or the company would be in breach of the SLA.
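The 43-minute figure is simple arithmetic: 0.1% of the minutes in a month. Here is a minimal sketch, assuming a 30-day month and ignoring scheduled outages; the function name is purely illustrative and not anything taken from the SLA itself.

```python
def allowed_downtime_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Return the unscheduled downtime budget, in minutes, for one month.

    Assumes a 30-day month by default; scheduled outages are not modelled.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Monthly budget under a 99.9% availability target.
budget = allowed_downtime_minutes(99.9)

# A four-day outage expressed in minutes, for comparison.
four_days = 4 * 24 * 60

print(f"Monthly budget: {budget:.1f} minutes")
print(f"Four-day outage: {four_days} minutes, about {four_days / budget:.0f} times the budget")
```

On those assumptions, a four-day outage would use up more than a hundred months' worth of allowed downtime in one go.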

To help achieve that, a lot of redundancy is built into the system. In fact there are parallel systems in Wellington and Auckland, so if one city is unavailable, the other system can kick in.

If NZRS had an unscheduled outage of four days, I imagine there would have been resignations from both the board and senior management, unless it was for the most exceptional and unavoidable reason. I certainly would have offered my resignation as a Director to the shareholder (InternetNZ).

The Companies Office website is excellent. I use it often, and it is one of the reasons we score highly on ease-of-doing-business surveys. You can establish a company in under an hour, all online. But the more vital a service becomes, the more important it is that you ensure it remains up.

MED should at a minimum commission an independent report into what went wrong, and what they need to do to prevent such an outage in future.

21 Responses to “Down for 4 days is a disgrace”

  1. davidp (3,540 comments) says:

    Many of the Ministry of Health’s systems were out of service (or effectively out of service) for several weeks in early 2009 after they were comprehensively trashed by a worm infection. They hadn’t patched anything in years. A four day outage looks good by comparison.

  2. hmmokrightitis (1,508 comments) says:

    Not unusual, davidp. I know of a couple of banks here in NZ that have a Conficker problem they still haven’t managed to resolve – ring-fenced is about as good as they can get it. Hell, a DHB I consulted at had mission-critical apps running on servers 7 years old that had never been taken down – and no one on staff who knew the database they were perched on.

    Hence why I make a living :)

  3. alex Masterley (1,490 comments) says:

    The outage is starting to annoy me.
    I need the PPSR and the Companies Office sites for my day to day work.
    You would think that the MED would organise this better.
    At least LINZ is still up.

  4. Johnboy (14,961 comments) says:

    LWNJ conspiracy theorists should be asking at this point why Whaleoil is down as well! :)

  5. berend (1,632 comments) says:

    Why doesn’t John Key borrow some more billions? I mean, look what a great job the MED is doing with their hardware, you can trust these guys to handle the money of your children real well.

  6. Nick R (497 comments) says:

    I’m sorry, but this is hilarious. We have a Government which has been cutting the public service and “reprioritising” spending from back office to front line delivery. Then we have a complete meltdown of back office functions. Quelle surprise. Who could possibly have seen that one coming?

    I wonder which poor bloody public servant will be left to carry the can for this? I can guarantee you won’t find a minister who will accept any responsibility.

    [DPF: The companies office website operates on user-pays. It is a commercial service. But nice try at blaming National for this. You are getting desperate]

  7. m@tt (587 comments) says:

    I wonder if the upgrade work was outsourced or handled in house. If in house I wonder if they have lost any of their ‘back office bloat’ of late…

  8. PaulL (5,872 comments) says:

    Yeah right. The problem is budget cuts, Nick. Not just rank incompetence. I’ll be waiting for the report to see what the real answer is. I very much doubt that the government reduced back office budgets and that a reasonable response was “hey, let’s get rid of all the redundancy on our servers.” And if it was, then someone should be fired for that.

  9. rouppe (915 comments) says:

    This is purely the responsibility of the staff and management of the servers – both ministry and outsourced.

    I cannot believe that:
    * There was no backout plan
    * They didn’t upgrade one of the systems (Akl or Wgtn) first, then the other
    * There was no test environment in which they did the upgrade first to flush out these problems

  10. slightlyrighty (2,496 comments) says:

    Whaleoil is back up now, and looking a little better.

    Perhaps he could advise MED?

  11. Nick R (497 comments) says:

    PaulL – Here’s another guarantee. If there is an inquiry into this, the report will say exactly what the Minister wants it to say. That is why you have independent inquiries.

  12. Roflcopter (423 comments) says:

    This is interesting, in light of a big Notice of Intent released about a week ago…

    The Department of Internal Affairs (DIA) and a number of other agencies are collaborating on common approaches to web services, or Common Web Services (CWS).

    DIA is the lead agency.

    The goal of CWS is to reduce duplication of effort and streamline procurement by allowing agencies to cluster around a small number of web publishing platforms or content management systems, possibly accompanied by panel procurement arrangements.

    This Notice to Prospective Suppliers (Notice) seeks information from prospective suppliers to help the CWS Project to:

    • create options for government CWS solutions;
    • understand market capabilities;
    • provide inputs that will help identify whether or not there is a viable business case for government CWS solutions; and
    • inform the construction of a business case, if one is viable.

    The documentation alludes to centralisation, where appropriate, of all Govt departments.

  13. PaulL (5,872 comments) says:

    Here’s another guarantee. Irrespective what the report says, people will continue to believe whatever they want to believe. So you’ll continue to think the evil govt caused it by asking for efficiency. I’ll continue to believe that incompetence caused it.

    I’ve worked in government, I have a fair idea what goes on in their back office. It is full of bloat, but in general they’re useless at dealing with it – the reality is that you could fire a good 30% of the people in the back office and never notice they were gone. But the actual process govt uses when they seek cuts doesn’t involve getting rid of the dead wood. It generally involves offering voluntary redundancies (otherwise known as paying your best staff to go away), or putting on a hiring freeze and waiting for natural attrition (otherwise known as preventing any new blood coming into your organisation). It may be true that the pressure on budgets therefore impacted this – but not because it had to be that way.

  14. Kimble (4,378 comments) says:

    Anonymous?

  15. insider (1,000 comments) says:

    Remember how the Govt treated telecom after its XT failures? Funny how quiet they are…

  16. slijmbal (1,210 comments) says:

    The day-to-day management of the vast majority of these servers is outsourced nowadays, especially by government entities, which as a generalisation struggle to retain skilled IT staff. The commercial entities to whom they outsource specialise in this.

    I would be looking to see if that is the case here. If not, that may well be the problem. If so, the outsourced company would be finding its penalty clauses kicking in.

  17. wreck1080 (3,726 comments) says:

    I work on large corporate IT systems. Let’s just say that if I were responsible for their systems going down for 4 days, I’d be sacked.

    I bet the problem relates to short cuts on testing. Ultimately it is a budgetary thing — I used to have a project manager who’d ask team leaders for an estimate for a change, then pretty much cut them in half. Ultimately, the final product would be full of bugs precisely because there is insufficient time to properly develop the software. And, guess who got the blame for this. The poor programmers working on unrealistic timeframes.

  18. Chris2 (754 comments) says:

    Why do we have to put up with this PC word “outage”?

    I’m sure it’s used to imply no one is to blame, just an act of God. When things work are they therefore called an “inage”?

    The correct English word, rather than “outage”, is “failure”. But that would mean someone was responsible.

  19. peterwn (3,150 comments) says:

    What would people think if Transpower said the National Grid was down for four days? Even John Key’s head would be at risk.

  20. scrubone (3,048 comments) says:

    Why do we have to put up with this PC word “outage”?

    Because the system (think lights) is out. This is caused by some sort of failure – in the case of a light this might be a blown bulb.

    Anyho…

    I recall talking to an IT guy about an outage in 1999 with the Otago uni student login system which lasted days. It was going, but it took about 10 minutes to do anything that would usually take seconds. The problem with a situation like that is that the issue may be quite simple to resolve, but the small tasks needed to fix it can add up to hours. You can’t necessarily restore from backup, because the backup copy may have 99% of the conditions that cause the issue.

  21. Steve (4,496 comments) says:

    “Why do we have to put up with this PC word “outage”?”

    Well in the Engineering trade at the dirty end (hammer and tongs) it is called a Fuckup.
    There is no other name, it is just a Fuckup. The person who fucked up is the person responsible, nobody else.
    So who fucked up? It is taxpayers’ money being wasted on this fuckup, so own up.
