This recap includes the following episodes and guests.
Guest title reflects the guest's position at the time of the episode.
Recipes:
Retiring the Term "RCA"
Blameless Post Incident Reviews
Capture and Share Operational Learnings
Design for Learning, Not Perfection
Embrace Surprise as a Signal
Foster Curiosity and Responsibility
Welcome to a special chapter of the SRE Omelette Cookbook. This isn’t just about uptime or dashboards—it’s about the stories behind the systems, the surprises that shake our assumptions, and the people who turn chaos into clarity. In this story, Robert Barron and David Leigh served up six recipes for building a culture of learning from incidents.
Recipe: Retiring the Term "RCA"
Goal: Move beyond linear thinking to embrace system complexity
In the world of complex systems, the term “Root Cause Analysis” is not just outdated—it’s misleading. David Leigh made a compelling case for why we need to leave it behind.
“I strongly discourage the root cause analysis term and the approach that goes along with it like five whys,” David said. The problem? RCA assumes that systems fail due to a single, traceable cause. But in reality, incidents are rarely that simple.
“Our systems don’t fail because of a single root cause. They instead fail due to multiple causes, each of which was necessary but none uniquely sufficient.” This insight reframes how we think about failure. It’s not a linear chain—it’s a web of contributing factors, interactions, and overwhelmed expertise.
RCA also tends to focus on fixing components rather than understanding the system. That means we often miss the deeper lessons. As David put it, “If the organization is obsessed with finding and fixing a component that failed, then you’re actually leaving most of the learning value of the event on the table.”
Instead of RCA, David advocates for “post-incident learning”—a practice that prioritizes understanding over blame, storytelling over checklists, and systems thinking over simplistic answers.
So let’s retire RCA—not because it’s wrong, but because it’s not enough. Our systems deserve better. Our teams deserve better. And our learning depends on it.
Incident learning is what we used to call RCA, a term we will no longer use. Other terms you will often hear in its place are trigger events and contributing factors.
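To make the contrast concrete, here is a minimal, hypothetical sketch (the field names and the example incident are invented for illustration, not something from the episode) of a post-incident learning record that captures a trigger event and several contributing factors instead of a single root cause:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContributingFactor:
    description: str   # necessary for the incident, but not sufficient on its own
    area: str          # e.g. "deploy process", "alerting", "capacity planning"

@dataclass
class IncidentLearningRecord:
    title: str
    trigger_event: str                                      # what set the incident in motion
    contributing_factors: List[ContributingFactor] = field(default_factory=list)
    surprises: List[str] = field(default_factory=list)      # where our mental model was wrong
    narrative: str = ""                                      # the story: what people saw, thought, and did

# Example: no single "root cause", just interacting factors
record = IncidentLearningRecord(
    title="Checkout latency spike",
    trigger_event="Traffic surge after a marketing email",
    contributing_factors=[
        ContributingFactor("Cache warm-up was skipped during the morning deploy", "deploy process"),
        ContributingFactor("Autoscaling thresholds tuned for last year's traffic shape", "capacity planning"),
        ContributingFactor("Latency alert routed to an unmonitored channel", "alerting"),
    ],
    surprises=["We believed the cache would refill in seconds; it took 20 minutes"],
)
```

Fixing any one of those factors alone would not have prevented the incident, which is exactly why the record holds a list rather than a single cause.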
Recipe: Blameless Post Incident Reviews
Goal: Shift from blame to systems thinking
Imagine this: you’re 30 seconds from landing on the moon. Suddenly, your computer starts rebooting. Error 1201. Then again. And again. In Houston, fingers hover over the abort button. But the engineers—NASA’s proto-SREs—don’t panic. They look at the golden signals: altitude, descent rate, trajectory. Everything checks out. So they make the call: “We don’t know why it’s rebooting… but functionally, it’s doing the right thing.”
Robert Barron brought this moment to life to show how systems thinking overcomes knee-jerk reactions. It's not about what broke; it's about what the system is still doing right. David Leigh echoed the point with the same critique of traditional incident reviews: an organization obsessed with finding and fixing the component that failed leaves most of the learning value of the event on the table.
Both guests remind us: incidents are not puzzles to solve—they’re stories to understand. And the best reviews don’t end with a culprit. They end with a clearer picture of how the system really works.
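To make that "what the system is still doing right" judgment concrete, here is a minimal sketch in Python; the signal names, thresholds, and the should_roll_back helper are all invented for illustration, not something from the episode:

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_ms_p99: float
    error_rate: float      # fraction of failed requests
    saturation: float      # 0.0 - 1.0 utilization of the scarcest resource
    traffic_rps: float

def should_roll_back(signals: GoldenSignals, restarts_observed: int) -> bool:
    """Decide on a rollback from what the system is doing, not from how alarming it looks.

    restarts_observed is intentionally not part of the decision: component
    restarts (the '1201 alarms' of our world) are a surprise worth investigating
    afterwards, but by themselves they are not evidence the mission has failed.
    """
    healthy = (
        signals.latency_ms_p99 < 500
        and signals.error_rate < 0.01
        and signals.saturation < 0.9
    )
    return not healthy

# "We don't know why it's rebooting... but functionally, it's doing the right thing."
signals = GoldenSignals(latency_ms_p99=220.0, error_rate=0.002, saturation=0.55, traffic_rps=1400.0)
print(should_roll_back(signals, restarts_observed=3))   # False: keep going
```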
Listen to Robert share more of his stories about the parallels between space exploration and SRE on the podcast.
Recipe: Capture and Share Operational Learnings
Goal: Normalize storytelling as a learning tool
David didn’t just talk about storytelling—he built a movement around it. At IBM, he replaced dry retrospectives with monthly “Learning from Incidents” sessions. These aren’t just meetings—they’re events. Hundreds attend. Stories are told. Reports are read. And the goal? “To turn things that are written to be filed and never read into something that people would actually read.”
He shared how even near misses—incidents that never made the news—are gold mines for learning. “Any type of surprise is a candidate for learning,” he said. Because if something surprised you, it means your mental model was wrong. And that’s where the real learning begins.
Robert added his own flavor with the Apollo 12 lightning strike. The spacecraft lost all telemetry. But one engineer, John Aaron, remembered a similar glitch from a simulation a year earlier. He gave the now-famous command: "SCE to AUX." It worked. Why? Because, when everyone else went to lunch, he had been curious enough to investigate a weird signal during mundane testing, and that curiosity paid dividends during a production power outage.
Recipe: Design for Learning, Not Perfection
Goal: Build systems that support continuous improvement
The James Webb Space Telescope is a marvel: twenty years of development, $10 billion, and, during deployment alone, over 300 single points of failure. And yet, as Robert noted, "You don't need to achieve perfection… most systems don't need five nines."
He contrasted this with everyday SRE work. Most systems don’t need to unfold like origami in space. They need to be understandable, maintainable, and recoverable. David agreed, saying that learning itself is a form of system improvement: “We need to recognize that SREs are part of a complex socio-technical system. That learning in that context is itself a system improvement.”
So instead of chasing perfection, chase understanding. Build systems that teach you something when they fail. That’s how you get better—not by avoiding failure, but by designing for it.
Recipe: Embrace Surprise as a Signal
Goal: Use surprises to recalibrate mental models
"Turning incident learning as something to be filed - to be something people want to read and find valuable - and learn."
David lit up when he talked about surprise. Not as a threat—but as a gift. “Surprise is the drama… and that’s what incidents and near misses are.” They’re the plot twists that reveal what we didn’t know. And if we treat them right, they become the most memorable lessons.
He explained how IBM’s incident stories focus on what it felt like in the moment. What people saw. What they thought. What they missed. Because that’s where the learning lives—not in the timeline, but in the tension.
Robert’s stories were full of these moments. The Apollo engineers didn’t just monitor systems—they interpreted them. They knew that a rebooting computer wasn’t necessarily a failed mission. They knew that telemetry loss didn’t mean disaster. They trusted their understanding of the system—and adjusted it when needed.
Recipe: Foster Curiosity and Responsibility
Goal: Encourage proactive learning and ownership
Robert believes that curiosity is the engine of great SREs. “Curiosity is what keeps you with the technical vitality that you need.” But curiosity without direction can lead to rabbit holes. That’s why it must be paired with responsibility. “You balance that with responsibility to make sure that you are driving your curiosity in the right directions.”
David shared practical ways to nurture this mindset. He recommended reading Dr. Richard Cook’s “How Complex Systems Fail”—a short, provocative paper that reshapes how you see incidents. He also encouraged SREs to learn tools like Pandas and Matplotlib to explore data more deeply and independently.
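As a small illustration of that tooling tip (the incidents.csv file and its columns are assumptions for this example, not anything David described), a few lines of Pandas and Matplotlib are enough to start asking your own questions of incident data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of incident records; column names are assumed for this sketch.
incidents = pd.read_csv("incidents.csv", parse_dates=["opened_at", "resolved_at"])

# How long did each incident take, and how does that vary by service?
incidents["duration_hours"] = (
    (incidents["resolved_at"] - incidents["opened_at"]).dt.total_seconds() / 3600
)
summary = (
    incidents.groupby("service")["duration_hours"]
    .agg(["count", "median", "max"])
    .sort_values("count", ascending=False)
)
print(summary)

# A quick look at whether incidents cluster in particular months.
incidents["month"] = incidents["opened_at"].dt.to_period("M")
incidents.groupby("month").size().plot(kind="bar", title="Incidents per month")
plt.tight_layout()
plt.show()
```

The point is less the specific chart than the habit: being able to pull the data yourself and follow your curiosity without waiting for someone else's dashboard.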
Together, they paint a picture of the modern SRE: not just a technician, but a systems thinker, a storyteller, a learner, and a leader.
Questions to reflect on:
What’s one incident or near miss your team could turn into a story worth sharing?
How might your team’s learning change if you focused less on action items and more on understanding system behavior?
What signals (not metrics) do you use to know if learning is actually happening in your org?