Experienced a network outage? I feel your pain.
By now, all of us in networking have read about the four-day Blackberry outage or experienced our own personal Crackberry withdrawal. We also know that the blame falls on a core switch failure compounded by a backup switch that went bad. (According to RIM co-CEO Mike Lazaridis, the outage occurred “when a dual-redundant, dual capacity core switch failed and its backup switch failed to activate.”)
I feel for those who were working in the networking department at RIM. I can’t think of a worse position to be in. Wait, actually I can: it would be worse to be the support engineer who has spent the last 24 hours on the phone, from whichever core switch vendor they used. For that person or persons, it’s probably going to be another long night, sleeping bag next to the cubicle included. Live production networking is a tough gig, too tough for me, which is why I moved out of it and into network testing. In fact, I don’t think I could handle live production networks anymore, especially not with all the outages lately.
In case you missed it with all the Blackberry excitement, Target also had an outage a few weeks ago (see “Target’s New Missoni Collection Brings Down Site”). Last week, Apple, AT&T and Sprint had an issue with a massive wave of iPhone 4S orders. I won’t even get into the Bank of America outage that spanned four days right at the first of the month, when paychecks and mortgage payments were due. So, for those keeping count, in the last two weeks we’ve had the following outages (some major), which threaten to tear apart our faith in the fixed and mobile internet fabric that keeps our business, and sometimes personal, lives interconnected:
- Research in Motion
- Apple, AT&T, Sprint
- Bank of America
And that’s just the last two weeks.
Unlike some of the others (big bank, I’m looking at you), which never gave details about their outages, RIM has at least taken the high road from a PR standpoint and offered both a sincere apology and a YouTube video from the CEO.
So to all the network engineers who have been experiencing outages: I feel for you. I really do. We can’t undo what happened, so my only suggestion, if you are reading this, is to consider doing more stress testing. That may seem a trite suggestion at this stressful time, because I’m sure you did test, but consider further testing with line-rate load generators such as the ones Spirent makes, to create true stress and trigger outages in the lab, so you understand the full impact before it happens for real. Move it up your supply chain: urge your vendors to recreate this scenario in their labs, and suggest they use Spirent tools (Spirent TestCenter, Spirent Landslide, and Spirent Avalanche) to generate the loads you experienced when the failover event occurred. We have other tools as well, network impairment emulators, for instance (Spirent GEM), to help create those environments.
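To make the idea of lab stress testing concrete, here is a minimal sketch in plain Python. This is not Spirent's API and nowhere near line rate; it simply illustrates the principle: stand up a service, then hammer it with many concurrent clients and measure sustained throughput, so you see where things buckle before production does. All names and parameters here (`run_stress_test`, client and request counts) are illustrative choices, not anything from the original post.

```python
# Illustrative stress-test sketch (NOT Spirent's actual tooling or API):
# start a local TCP echo server, then open many concurrent client
# connections and count completed request/response round trips.
import socket
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def echo_server(server_sock):
    """Accept connections serially and echo one message back on each."""
    while True:
        try:
            conn, _ = server_sock.accept()
        except OSError:
            return  # server socket closed; shut down
        data = conn.recv(1024)
        conn.sendall(data)
        conn.close()


def run_stress_test(clients=50, requests_per_client=20):
    """Flood the local echo server with concurrent clients.

    Returns (completed_requests, elapsed_seconds).
    """
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    server.listen(512)
    port = server.getsockname()[1]
    threading.Thread(target=echo_server, args=(server,), daemon=True).start()

    def client_loop(_):
        ok = 0
        for _ in range(requests_per_client):
            with socket.create_connection(("127.0.0.1", port)) as s:
                s.sendall(b"ping")
                if s.recv(1024) == b"ping":
                    ok += 1
        return ok

    start = time.time()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        completed = sum(pool.map(client_loop, range(clients)))
    elapsed = time.time() - start
    server.close()
    return completed, elapsed


if __name__ == "__main__":
    done, secs = run_stress_test()
    print(f"{done} requests in {secs:.2f}s ({done / secs:.0f} req/s)")
```

Dedicated hardware load generators do this at millions of flows and full line rate, and can inject impairments (latency, loss, link flaps) to rehearse exactly the kind of failover event described above; a script like this only sketches the shape of the exercise.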
This should henceforth be a mandatory test case that every major core switch vendor runs its equipment through. I’m sure we don’t want to see these kinds of things happening again, and neither do any of the millions of customers. And of course, neither do the poor support engineers who have to deal with this.
Good luck folks, I hope you get it under control soon.