Thursday, October 15, 2009

Sidekick Recovery News

The latest word on the aftermath of the Sidekick data-loss incident seems to have come out early this morning on the Hardware 2.0 blog, maintained by Adrian Kingsley-Hughes for ZDNet. Given that the last report I cited, provided by Ina Fried for CNET News, concerned a response from T-Mobile that left more than a little to be desired, the important thing about the Kingsley-Hughes post is that the broken pottery now appears to be in Microsoft hands:

Roz Ho, corporate vice president for Microsoft’s Premium Mobile Experiences division apologizes for the recent upheaval experienced by Sidekick users and also offers the good news that “most, if not all” of the lost Sidekick data has been recovered.

We are pleased to report that we have recovered most, if not all, customer data for those Sidekick customers whose data was affected by the recent outage. We plan to begin restoring users’ personal data as soon as possible, starting with personal contacts, after we have validated the data and our restoration plan. We will then continue to work around the clock to restore data to all affected users, including calendar, notes, tasks, photographs and high scores, as quickly as possible.

In other words, Microsoft acknowledges that they broke it and that they are now fixing it. Note that the language is pretty vague when it comes to how long the repair will take, but at least they have a plan for the process. Personally, I would have thought that the data for the calendar and tasks would take priority over personal contacts; so, at the very least, this raises the question as to whether or not those responsible for maintaining this technology actually use it. As I observed on Tuesday, the T-Mobile approach to customer relations in this matter bordered on the insulting, if it did not actually cross the line. The idea that the Microsoft people responsible for this technology do not eat their own dog food should be as much of a red flag to Sidekick customers as the outage itself.

More interesting, however, is Kingsley-Hughes' effort to give an account of how this problem arose in the first place:

We also get the first indication as to what went wrong, and it points to something being terribly wrong with the way Danger/Microsoft was handling Sidekick data:

We have determined that the outage was caused by a system failure that created data loss in the core database and the back-up. We rebuilt the system component by component, recovering data along the way. This careful process has taken a significant amount of time, but was necessary to preserve the integrity of the data. [emphasis added]

So a single system failure took out the main database and the backup. Seriously, what bone-headed backup system had to be in place to allow that to happen? Was the data just copied to another folder on the same drive? Likely not that simple, but something equally grossly incompetent had to have happened.

On the basis of his photograph, I would guess that Kingsley-Hughes is too young to have experienced the Northeast Blackout of 1965 and is probably part of the prevailing culture that no longer sees the value of history in a strategy for crisis management. In this case, as the Wikipedia entry explains, the cause was human error:

The cause of the failure was human error that happened days before the blackout, when maintenance personnel incorrectly set a protective relay on one of the transmission lines between the Niagara generating station Sir Adam Beck Station No. 2 in Queenston, Ontario. The safety relay, which is set to trip if the current exceeds the capacity of the transmission line, was set too low.

On the other hand, the ability of this one action to trigger the blackout can be attributed to a single fault in the design of the power system:

As was common on a cold November evening, power for heating, lighting and cooking was pushing the electrical system to near its peak capacity, and the transmission lines heading into Southern Ontario were heavily loaded. At 5:16 p.m. Eastern Time a small surge of power coming from Lewiston, New York's Robert Moses generating plant caused the misset relay to trip at far below the line's rated capacity, disabling a main power line heading into Southern Ontario. Instantly, the power that was flowing on the tripped line transferred to the other lines, causing them to become overloaded. Their protective relays, which are designed to protect the line if it became overloaded, tripped, isolating Adam Beck from all of Southern Ontario.

With no place else to go, the excess power from Beck then switched direction and headed east over the interconnected lines into New York State, overloading them as well and isolating the power generated in the Niagara region from the rest of the interconnected grid. The Beck and Moses generators, with no outlet for their power, were automatically shut down to prevent damage. Within five minutes the power distribution system in the northeast was in chaos as the effects of overloads and loss of generating capacity cascaded through the network, breaking it up into "islands". Plant after plant experienced load imbalances and automatically shut down. The affected power areas were the Ontario Hydro System, St Lawrence-Oswego, Western New York, Upstate New York, New England, and Maine. With only limited electrical connection southwards, power was not affected to the Southern States. The only part of the Ontario Hydro System not affected was the Fort Erie area next to Buffalo which was still powered by the old 25 Hz generators. Residents in Fort Erie were able to pick up a TV broadcast from New York where a local backup generator was being used for transmission purposes.

Note the use of the word "trigger." The underlying problem is a tendency to examine individual actions in isolation, rather than in terms of the consequences that may ensue. This is the broader question of engineering system design and implementation that I raised when the Sidekick story first broke. At the time I attributed this problem to an unhealthy relationship between engineering and marketing, but this may have its own underlying cause. A culture in which entrepreneurism has become firmly established as part of the engineering curriculum can easily become a culture in which marketing does too much talking without listening. In such a culture we may expect to read many more stories about highly promoted gadgets that first cultivate our dependence and then leave us stranded with the first (and inevitable) system failure.

1 comment:

edwin sanchez said...

Wow...I wasnt really expecting this...