Simson Garfinkel writes about the worst software bugs in history, including:
- July 28, 1962 — Mariner I space probe
- 1982 — Soviet gas pipeline (with a Canadian connection).
- 1985-1987 — Therac-25 medical accelerator.
- 1988 — Buffer overflow in Berkeley Unix finger daemon.
- 1988-1996 — Kerberos Random Number Generator.
- January 15, 1990 — AT&T Network Outage.
- 1993 — Intel Pentium floating point divide
- 1995/1996 — The Ping of Death
- June 4, 1996 — Ariane 5 Flight 501
- November 2000 — National Cancer Institute, Panama City.
MSNBC has a great article about software failures being more related to poor decision making and people management than to poor code. This has long been known in engineering, medicine and aviation organizations, where poor designs and poor decisions often have very real, life-threatening consequences. For many of us, software, in particular web software, is harmless. If your application server is offline, it is an inconvenience or it carries a cost (though usually not in lives). We are starting to see a rise in the number of software engineering failures: August 2003 blackout, Royal Bank of Canada software glitch, CIBC computer glitch, Westpac web bank glitch in Australia, Akamai software glitch causes web brownouts, etc.
One of the causes not discussed in the article on software failure is the emergence of new behaviours and the unintended changes that new technologies bring. Edward Tenner’s Why Things Bite Back: Technology and the Revenge of Unintended Consequences (Amazon.com) is a good read on this, along with To Engineer is Human: The Role of Failure in Successful Design (Amazon) by Henry Petroski and Set Phasers on Stun: And Other True Tales of Design, Technology and Human Error by Steven Casey, both about the risks related to technology. (I still need to read The Human Factor: Revolutionizing the Way People Live with Technology by Kim Vicente.) Many professions improve by looking at their failures and changing either the professional practice or the education of professionals to incorporate what they have learned. Why Things Break: Understanding the World by the Way it Comes Apart is a look at the limitations of the physical world and materials science.
I wonder if we need to build a detailed set of case studies and stories about the failure of software projects, like Petroski’s Design Paradigms: Case Histories of Error and Judgment in Engineering.
- Learning from Failure: Engineering Disasters
- Failure Watch: iCivilEngineer
- Challenger Accident: Federation of American Scientists
- Don Norman offers some valuable commentary about human error in the design and operation of complex computer systems.
- Human Factors Lab at University of Texas