Some of the worst safety disasters in modern industrial history occurred in organizations that previously had a perfect safety record, argued Sidney Dekker, professor at Griffith University, at DevOps Enterprise Summit 2017 in San Francisco.
When software bugs are a good thing
Defect-free software is a noble goal, but that ambition can discourage an organization from honest communication about bigger systemic defects. “Anything that puts downwards pressure in your organization on honesty, disclosure and openness is bad for your organization and is bad for your business,” Dekker noted. By contrast, if teams openly discuss bugs that would be inconvenient to fix, they have a better chance of catching big defects and problems.
There is a sweet spot when it comes to rules and standardization — both create conditions for better software. However, if you use the wrong metrics for what quality software looks like, you incentivize bad behavior across an enterprise.
“This fascination with counting and tabulating negative events — as if they are predictive of a big event over the horizon — is an illusion,” Dekker said. “We should do something different if we want to understand how complex systems will collapse and fail.”
Companies in physical industries often tabulate the number of days without an accident. In enterprise software development, managers routinely count the number of days of error-free orders. Dekker believes both are bad ideas. “This is an invitation for a big blowup down the horizon,” Dekker said.
Besides, a less-than-perfect safety record isn’t necessarily a sign of a flawed system. “I cannot keep my house injury- and incident-free for a week,” Dekker joked. Organizations should not overemphasize smaller negative incidents at the expense of flaws in the larger system itself.
Foster a culture of responsibility
In the medical industry, management looks at the underlying factors in reported incidents, a strategy software teams also follow in post-mortems. Dekker worked with one hospital where 7% of patients walked out the door sicker than when they arrived. The hospital focused all of its resources on what went wrong: communication failures, misconstrued guidelines, procedural violations and human error. Yet when it investigated the patients who fared well, it found the same four types of mistakes.
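The hospital’s finding can be sketched as a toy tally in Python. All numbers and category counts below are hypothetical, made up purely to illustrate the point: the same mistake types appear whether the outcome was good or bad, so merely counting them tells you little about which outcome is coming.

```python
from collections import Counter

# Hypothetical tallies for illustration only, not the hospital's real data.
errors_in_bad_outcomes = Counter({
    "communication failure": 14,
    "misconstrued guideline": 9,
    "procedural violation": 7,
    "human error": 21,
})
errors_in_good_outcomes = Counter({
    "communication failure": 16,
    "misconstrued guideline": 8,
    "procedural violation": 10,
    "human error": 25,
})

# The same four mistake types show up in both groups, so the mere
# presence of these errors does not discriminate good from bad outcomes.
print(set(errors_in_bad_outcomes) == set(errors_in_good_outcomes))  # True
```

The point of the sketch is Dekker’s: if the error profile looks the same on good days and bad days, tabulating those errors is not a useful early-warning signal.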
Enterprise software developers tend to make similar missteps. You should focus on which errors still occur — or are masked — when the results are positive. This requires having:
- the ability for anyone to say “stop;”
- a recognition that past success is not a guarantee of future success;
- a culture that allows for a diversity of opinion and dissent; and
- a culture that keeps discussion on risk alive.
Getting enterprise software developers — not to mention everyone in the organization — to share their insights into disasters in the making, rather than disasters that have already happened, is not easy. The reward for not speaking up tends to be more immediate and certain than any incentive to call out problems.
Dekker noted how individuals on a team are often in circumstances where they don’t have to own a problem — even if they see it coming or are a part of the issue. “If you don’t have to own the problem, then the reward for speaking up is not there,” Dekker explained.
Learn from what goes right
Organizations can reduce the likelihood of big disasters in any complex enterprise system by creating conditions that foster honest feedback across the organization. “If we want to understand in complex systems how things are going to go wrong badly, we should not try and glean predictive capacities from little bugs and error counts,” Dekker explained. “We need to understand how success is created.”
Much more goes right than wrong, suggested Erik Hollnagel, professor at University of Southern Denmark and a leading safety expert on resilience engineering. Organizations tend to do post-mortems most often when things go wrong.
Dekker agreed that teams should do post-mortems, but on one condition. “For us to know how things really go wrong, we need to understand how they go right,” he said. In other words, post-mortems should also determine what was behind any successful aspects.
Abraham Wald, whom many consider a founder of operations research, exemplified during World War II how to learn from what goes right in the midst of a predicament. Wald was tasked with figuring out the best places to put armor on airplanes being shot at over Germany. Armor is dead weight on an airplane, and pilots wanted the bare minimum required to keep these complex systems flying.
After they measured and counted the holes on returning planes, Wald’s colleagues suggested putting more armor where the holes were. Wald’s colleagues were akin to today’s enterprise software developers who focus their attention on where software bugs occur. Instead, Wald realized they should put armor where there were no holes. The planes that made it back had survived their damage; the sections without holes marked the places where the downed planes must have been hit.
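Wald’s survivorship-bias reasoning can be captured in a few lines of Python. The bullet-hole counts below are invented for illustration; the two functions contrast the colleagues’ naive plan (armor the most-hit section of the survivors) with Wald’s plan (armor the least-hit section, since that is where the planes that never came back were hit).

```python
from collections import Counter

# Hypothetical bullet-hole counts tallied from planes that returned
# (illustrative numbers, not Wald's actual data).
holes_on_survivors = Counter({
    "fuselage": 112,
    "wings": 94,
    "tail": 37,
    "engines": 8,
    "cockpit": 5,
})

def naive_armor_plan(holes):
    # Colleagues' reasoning: reinforce where returning planes show damage.
    return max(holes, key=holes.get)

def wald_armor_plan(holes):
    # Wald's reasoning: survivors show where a plane can take a hit and
    # still fly home; the sparsely hit sections are where the lost planes
    # must have been hit.
    return min(holes, key=holes.get)

print(naive_armor_plan(holes_on_survivors))  # fuselage
print(wald_armor_plan(holes_on_survivors))   # cockpit
```

The two plans disagree precisely because the data set contains only survivors, which is the same trap as studying only the builds and incidents where bugs were visible.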
“When I want to understand where the next fatality is coming from, you might think I look at the incident errors and bugs. I will look at the place where there are no bugs or holes,” Dekker said. “I want to understand how we create success, because that is where the failures will hide.”