Delivering Resilience in Complex Systems

This recap includes the following episodes and guests.
Guest title reflects the guest's position at the time of the episode.

Recipes:

Map Critical Dependencies
Conduct Resilience Assessments
Improving Resilience with Dependencies
Implement Release Readiness Gates
Measure Release Impact

Marshall Lamb, IBM Distinguished Engineer, Master Inventor and CTO, IBM Sterling

Episode URL

Ron Baker, IBM Distinguished Engineer of SRE Operations

Episode URL

📚 Chapter Recap: Delivering Resilience in Complex Systems

Recipes for Building Scalable Reliability in Complex Systems

🗺️ Map Critical Dependencies

Goal: Identify internal and external system touchpoints that influence service reliability.

Marshall Lamb kicked off his episode with a story that perfectly captured the theme of this chapter. While traveling during the pandemic, he stopped at a fast food restaurant and saw a sign that read: “Due to supply chain issues, we are temporarily unable to serve chicken sandwiches.” He wasn’t planning to order one, but the sign stuck with him. It was a clear signal that supply chain disruptions had become so visible that they were now part of everyday life. As he put it, “Suddenly, supply chain has become a household word... and I'm not sure that's a good thing...”

"At the end of the day, we only control what we control. And we can complain when someone else fails at their job. We can complain all we want, but we can't prevent them from failing at what they do. All we can do is be resilient to when they fail."

Marshall also described how expectations for speed and transparency are rising in the supply chain world. “We’re moving rapidly away from batch-oriented data transfers to more nimble API-based transactions,” he said, noting that customers now expect near real-time updates. He also emphasized the importance of understanding customer impact: “Understand how your disruptions impact customers: that’s how you design better systems.” For both guests, the goal is not just to ship more, but to ship better, with fewer surprises and more confidence.

The chicken sandwich story set the stage for a deeper conversation about dependencies. In Marshall’s words, “A supply chain is a stitched together series of systems and human-based processes.” No one owns the whole thing, and that’s what makes it fragile. He reminded us that “I only own the bit that I own… I can’t control what my partners do.” That’s why visibility and dependency mapping, especially for external dependencies, are essential to resilience.
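One way to make that mapping concrete is to record, for each service, which dependencies it has, who owns them, and whether an outage would reach the customer. The sketch below is illustrative only: the service names, the `owner` labels, and the `critical` flag are assumptions made for the example, not something described in the episodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    owner: str       # "internal" or "external" (hypothetical labels)
    critical: bool   # would an outage here reach the customer?


# Hypothetical service-to-dependency map; every name is illustrative.
DEPENDENCY_MAP = {
    "order-service": [
        Dependency("inventory-db", "internal", True),
        Dependency("carrier-api", "external", True),
        Dependency("email-provider", "external", False),
    ],
}


def uncontrolled_critical(service: str) -> list[str]:
    """Return the critical dependencies of a service that another party owns."""
    return [
        d.name
        for d in DEPENDENCY_MAP.get(service, [])
        if d.critical and d.owner == "external"
    ]


print(uncontrolled_critical("order-service"))
```

The result is the short list of dependencies that matter most and that the team cannot fix itself, which is where resilience work has to compensate.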

Ron Baker brought this same mindset to software systems. He emphasized that “you’ve got to win the priority war, and it is a war,” especially when reliability features compete with new product ideas. His approach includes educating stakeholders on the risks of fragile dependencies and showing how they impact customer experience. “We focus on three areas: education of risk, margin, and vision,” he explained. And to influence change, “you have to be ahead of the curve or you’ll miss the window to influence.”

🧪 Conduct Resilience Assessments

Goal: Simulate failure scenarios to test business continuity assumptions.

Marshall’s perspective on resilience is rooted in realism: things will break, and the question is how fast you can recover. He introduced the concept of Time to Survive (TTS) alongside Time to Recover (TTR), explaining that “if my time to recover exceeds my time to survive, I have a big problem.” He encouraged teams to model disaster scenarios and understand how long they can operate under stress before customer impact becomes unavoidable. This isn’t just about technology; it’s about people too. “We practice business continuity by having our people work from home,” he shared, highlighting the importance of workforce readiness.
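The comparison Marshall describes can be written down as a simple check per failure scenario. This is a minimal sketch, assuming each scenario has an estimated time to survive and time to recover; the scenario names and durations are made up for illustration.

```python
from datetime import timedelta


def recovery_gap(time_to_survive: timedelta, time_to_recover: timedelta) -> timedelta:
    """Return how long recovery outlasts the survival window (zero if it does not)."""
    return max(time_to_recover - time_to_survive, timedelta(0))


# Hypothetical scenarios: (time to survive, time to recover).
scenarios = {
    "primary region outage": (timedelta(hours=4), timedelta(hours=2)),
    "key supplier offline": (timedelta(hours=8), timedelta(hours=24)),
}

for name, (tts, ttr) in scenarios.items():
    gap = recovery_gap(tts, ttr)
    status = "at risk" if gap > timedelta(0) else "ok"
    print(f"{name}: {status} (gap {gap})")
```

Any scenario with a non-zero gap is one where customers feel the failure before recovery completes, so either the survival window has to grow or recovery has to get faster.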

EMPATHY - Taking the perspective of the users

Ron echoed this need for preparation, but from a product development lens. He emphasized that “success is not a single release—it’s measured over a two-year period.” To make the risk real for stakeholders, he uses storyboards to show what operators go through when things break: “You’ve got to hit their emotions, not their intellect.” These storyboards help teams visualize the human and customer cost of fragile systems, making the case for proactive investment in resilience. As he put it, “Create a storyboard that shows what your operators are going through.”

🔄 Improving Resilience with Dependencies

Goal: Strengthen your system’s ability to absorb and adapt to failures in external and internal dependencies.

In complex systems, dependencies are unavoidable—but fragility doesn’t have to be. Both Marshall and Ron emphasized that resilience isn’t about eliminating dependencies, but about designing systems that can adapt when those dependencies fail.

Marshall explained that in supply chains, “I only own the bit that I own… I can’t control what my partners do.” That’s why he focuses on building flexibility into the system. He shared that “having multiple suppliers and rotating between them” is a key strategy to reduce risk. But it’s not just about redundancy—it’s about readiness. “Digitized processes make data portable and visible,” he said, which allows teams to respond quickly when something breaks down.
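In software terms, "multiple suppliers and rotating between them" resembles round-robin selection with fallback. The sketch below is one possible reading of that idea, not a description of how Marshall's teams implement it; the class, the supplier functions, and the exception name are all hypothetical.

```python
from itertools import cycle
from typing import Callable


class AllSuppliersFailed(Exception):
    """Raised when every supplier in the rotation has failed for one request."""


class RotatingSuppliers:
    """Round-robin over suppliers, falling through to the next one on failure."""

    def __init__(self, suppliers: dict[str, Callable[[str], str]]) -> None:
        if not suppliers:
            raise ValueError("at least one supplier is required")
        self._suppliers = suppliers
        self._rotation = cycle(suppliers)  # endless iterator over supplier names

    def fulfil(self, item: str) -> tuple[str, str]:
        errors: dict[str, Exception] = {}
        # Try each supplier at most once per request, starting where the
        # previous request left off so that work rotates between them.
        for _ in range(len(self._suppliers)):
            name = next(self._rotation)
            try:
                return name, self._suppliers[name](item)
            except Exception as exc:  # a real system would catch narrower errors
                errors[name] = exc
        raise AllSuppliersFailed(errors)


def supplier_a(item: str) -> str:
    return f"{item} from A"


def supplier_b(item: str) -> str:
    raise ConnectionError("supplier B is unavailable")


suppliers = RotatingSuppliers({"a": supplier_a, "b": supplier_b})
print(suppliers.fulfil("widget"))  # served by A
print(suppliers.fulfil("widget"))  # B fails, so the request falls through to A
```

Rotating in normal operation, rather than keeping a backup idle, means each supplier path is exercised regularly, so a failover is not the first time the alternative is used.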

Ron brought this same thinking into the software world. He pointed out that “you’ve got to win the priority war, and it is a war,” especially when trying to get reliability features prioritized alongside product features. His approach includes making the risk of dependency failure visible to decision-makers. “You’ve got to hit their emotions, not their intellect,” he said, describing how storyboards and real-world operator pain points help shift priorities. He also emphasized the importance of proactive planning: “If you’re not ahead of that curve… you’ll miss the opportunity to influence.”

Both guests agreed that dependency resilience is not just about technical architecture—it’s about culture, visibility, and shared responsibility. Marshall summed it up well: “We can’t prevent failure, but we can prepare for it.”

✅ Implement Release Readiness Gates

Goal: Standardize reliability and performance checks before launch.

SRE Feature Prioritization with Stakeholders:

Education of the risk, Focus on margin, Create a vision of the goal

Ron emphasized that timing is everything when it comes to influencing product development. “If you’re not ahead of that curve… you’ll miss the opportunity to influence,” he warned. That’s why he advocates for shifting reliability left—embedding it early in the planning and development process. He also shared a practical lesson learned: “Don’t show what’s promised—show what’s tested and verified.” This builds trust and ensures that reliability isn’t just a checkbox, but a real outcome. His use of an SRE scorecard helps teams prioritize and track maturity over time.

Marshall connected this to the operational side of supply chains. He explained that digitization is key to readiness: “Digitized processes make data portable and visible.” This not only supports remote work and continuity, but also enables earlier detection of issues. He also emphasized the need for engineers to understand their systems deeply: “You need to understand everywhere in your code that could break.” That awareness is what allows teams to build in the right checks before launch.
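Ron's lesson about showing what is tested and verified, not what is promised, suggests a gate that only counts verified evidence. The following is a minimal sketch under that assumption; the check names and the two-level `Evidence` enum are invented for the example and are not Ron's actual scorecard.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    PROMISED = "promised"   # on the roadmap or in a plan, not yet exercised
    VERIFIED = "verified"   # tested, with a result someone has checked


@dataclass(frozen=True)
class ReadinessCheck:
    name: str
    evidence: Evidence


def release_gate(checks: list[ReadinessCheck]) -> tuple[bool, list[str]]:
    """Pass only if every check is verified; also return the names still blocking."""
    blocking = [c.name for c in checks if c.evidence is not Evidence.VERIFIED]
    return (not blocking, blocking)


# Hypothetical checks for one release.
checks = [
    ReadinessCheck("load test at expected peak", Evidence.VERIFIED),
    ReadinessCheck("rollback rehearsed", Evidence.VERIFIED),
    ReadinessCheck("dependency timeout handling", Evidence.PROMISED),
]
passed, blocking = release_gate(checks)
print("ready" if passed else f"blocked by: {', '.join(blocking)}")
```

Returning the blocking items alongside the pass/fail result keeps the gate useful as a conversation starter: it says what is missing, not just that the release is blocked.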

📏 Measure Release Impact, Not Just Volume

Goal: Focus metrics on customer experience and incident trends, not just delivery speed.

Metrics are only useful if they lead to better decisions. Ron cautioned against relying too heavily on dashboards: “Create a simple metric with green, yellow, red—but use it to start a conversation.” He’s learned that leadership often assumes something is done just because it’s on a roadmap. That’s why he insists on showing tested outcomes, not just intentions: “Use KPIs as tools, not checkmarks.”
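A green, yellow, red metric of the kind Ron mentions can be as small as a threshold function. This sketch assumes a "higher is worse" measure, such as customer-impacting incidents in the 30 days after a release; the thresholds, release labels, and counts are hypothetical.

```python
def rag_status(value: float, yellow_at: float, red_at: float) -> str:
    """Map a 'higher is worse' metric to green, yellow, or red."""
    if yellow_at > red_at:
        raise ValueError("yellow_at must not exceed red_at")
    if value >= red_at:
        return "red"
    if value >= yellow_at:
        return "yellow"
    return "green"


# Hypothetical data: customer-impacting incidents in the 30 days after each release.
incidents_by_release = {"2024.1": 0, "2024.2": 2, "2024.3": 5}

for release, incidents in incidents_by_release.items():
    print(release, rag_status(incidents, yellow_at=1, red_at=4))
```

The color is deliberately coarse. In line with Ron's advice, it is meant to prompt a discussion about what changed between releases, not to replace it.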

💬 Listener Questions

🔍 How do you identify and track your most critical dependencies—especially the ones you don’t control?
🧪 What’s your process for simulating failure scenarios, and how often do you revisit your assumptions?
📈 Are your release metrics focused on volume or impact—and how do you measure customer experience post-launch?
