This recap includes the following episodes and guests.
Guest title reflects the guest's position at the time of the episode.
Recipes:
Tailor and Adapt - Don't Copy-Paste
Balance Theory with Practical Adoption
Start Where You Are
Google's SRE Evolution
Recognition and Teaming
In this chapter of Making of the SRE Omelette, we dig into what it really means to learn from the best without blindly copying them. I had the pleasure of speaking with MP English from Google and Stacy Joines from IBM, two leaders who’ve lived through the evolution of SRE in very different environments. Their stories are filled with lessons on how to tailor, adapt, and evolve SRE practices to fit your own organization’s context.
"Building a practice that help us anticipate, to be in a constant state of readiness... whatever comes our way, it is something we have practiced, something we are prepared for... it is part of our DNA, how we do business."
Stacy Joines
Tailor and Adapt - Don't Copy-Paste
Goal: Understand why SRE practices must be customized to your organization’s scale and culture.
One of the biggest takeaways from both conversations was this: don’t lift and shift someone else’s SRE model into your organization. MP reminded us that “SRE is a feature of scale… when you get to globally distributed systems, you get emergent phenomena that you can’t predict.” That means what works at Google might not work for a smaller or differently structured team.
Stacy echoed this with a story from IBM’s retail roots in WebSphere Commerce:
“We were doing CI/CD and fail-fast before it had a name. It came from needing to deliver in high-pressure retail environments.”
Her point? Your SRE journey should reflect your business context, not someone else’s playbook.
MP also emphasized that SREs and developers often have different but complementary skill sets:
“They build the light bulb, we figure out how to change the light bulb while the light bulb is still on.”
That kind of specialization only works when the system is designed with your team’s strengths and constraints in mind.
Stacy summed it up well: “It’s not just learning how a particular tool fits an IT process, but how that IT process fits into this culture.”
So whether you’re starting from scratch or modernizing legacy systems, the key is to adapt thoughtfully—not copy blindly.
Balance Theory with Practical Adoption
Goal: Learn when and how to introduce structured practices like SLIs and SLOs.
We talked a lot about the tension between theory and practice. MP shared how automation and guardrails evolved at Google:
“You should be able to submit bad changes and not have them be catastrophic.”
This is only possible when you’ve built the right systems and culture. Stacy brought this home with a story about real-world disruptions: “We had deep freeze weather in Texas… and systems that weren’t completely prepared.”
The takeaway? Don’t wait for a crisis to test your readiness, and introduce practices like SLIs and SLOs only when your teams are ready to use them meaningfully.
MP emphasized that at Google, the evolution of SRE practices was gradual and grounded in real operational needs. The systems were already complex, and the teams had developed a culture of observability and safe experimentation. In that context, SLIs and SLOs weren’t just metrics—they were decision-making tools.
Stacy brought a complementary perspective from the enterprise world. She pointed out that many organizations are still building the cultural muscle for reliability. If a team isn’t yet in the habit of measuring what matters, or doesn’t have the operational maturity to act on those measurements, then SLIs and SLOs can become just another checkbox, adding overhead without delivering value.
The key is to start with the problems your team is already facing. Are you struggling with unclear definitions of “good enough”? Are incidents recurring without clear patterns? Are you unsure when to ship or roll back? If so, SLIs and SLOs can help, but only if they’re tied to real decisions and real consequences.
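To make "tied to real decisions" concrete, here is a minimal Python sketch of an availability SLI feeding a ship-or-hold call through an error budget. It is an illustration under stated assumptions, not something MP or Stacy described: the `SloWindow` class, the `release_decision` function, the 99.9% target, and the request counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (hypothetical names, for illustration)."""
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def availability(self) -> float:
        """The SLI: fraction of requests that succeeded in this window."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed_failures = (1 - self.slo_target) * self.total_requests
        if allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / allowed_failures


def release_decision(window: SloWindow, freeze_below: float = 0.0) -> str:
    """Assumed policy: ship while budget remains above the threshold, else hold."""
    return "ship" if window.budget_remaining() > freeze_below else "hold"


# Example: 400 failures out of 1,000,000 requests against a 99.9% target.
# About 1,000 failures are allowed, so roughly 60% of the budget is left.
window = SloWindow(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(f"SLI={window.availability():.4%}, decision={release_decision(window)}")
```

The point of the sketch is the last function: the measurement only earns its keep once a team agrees in advance what happens when the budget runs out.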
Start Where You Are
Goal: Embrace your current state as the foundation for building reliability.
Stacy emphasized that making SRE real means working with what you’ve got:
“It may require more upfront investment, but if it becomes just the way we do business, we’re constantly ensuring our systems are ready.”
MP added that success often lies in the invisible:
“It’s hard to see the value of the outages that don’t happen.”
Start with your current systems, your current people, and your current risks. Build from there.
Google's SRE Evolution
Goal: Learn how SRE practices evolve through automation, experimentation, and scale.
MP gave us a behind-the-scenes look at how Google’s SRE practice has matured over time. One of the biggest shifts? Reducing human interaction with production systems.
“There are two big trends… one is getting humans out of production, and the other is moving toward shared solutions.”
This evolution wasn’t just about automation—it was about building systems that are resilient by design, where human error is a signal that the system needs improvement.
MP also highlighted how Google’s SREs are empowered to make changes safely:
“How do I canary this change? How do I make sure I’m not just throwing this change at production and hoping it works?”
These practices reflect a culture of safe experimentation and continuous learning.
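As a rough illustration of the canary question MP raises, the Python sketch below compares a canary's error rate with the baseline's before deciding whether to continue a rollout. It is not Google's tooling: the `canary_verdict` function, the 2x ratio, and the 500-request minimum are assumptions chosen for the example.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_ratio: float = 2.0,   # assumed tolerance: canary may be up to 2x worse
    min_requests: int = 500,  # assumed minimum traffic before judging
) -> str:
    """Return "wait", "rollback", or "promote" for a canary rollout step."""
    if canary_requests < min_requests or baseline_requests < min_requests:
        return "wait"  # not enough traffic on one side to compare fairly

    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests

    # Floor the baseline at one error per window so that a zero-error
    # baseline does not turn a single canary error into a rollback.
    floor = 1 / baseline_requests
    if canary_rate > max_ratio * max(baseline_rate, floor):
        return "rollback"
    return "promote"


# Example: baseline at 0.1% errors, canary at 3% errors.
print(canary_verdict(10, 10_000, 30, 1_000))
```

A real canary analysis would look at more than one signal (latency, saturation, business metrics) and use a statistical test rather than a fixed ratio. The sketch only shows the shape of the decision: observe a small slice of traffic first, then promote or roll back.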
Recognition and Teaming
Goal: Recognize the importance of collaboration, trust, and celebrating invisible wins.
One of the most powerful themes from both conversations was the importance of recognizing people—not just for fixing problems, but for preventing them. MP shared how Google uses peer bonuses and kudos to highlight contributions that might otherwise go unnoticed:
“It’s the almost fires and the non-fires that are hard to count.”
Stacy emphasized the importance of building trust and readiness into the team culture. She noted that “Success is a big aspect in demonstrating the value of things,” especially when trying to shift mindsets around reliability. Her story about helping clients realize the value of investing in stability before peak events like Black Friday showed that recognition often comes after a successful outcome, but the work starts long before.
MP also emphasized the importance of collaboration and humility in complex systems:
“The systems are far too complicated for any one human to really grok how they work.”
At Google, SREs are encouraged to find their niche and lean on teammates when needed. MP’s own journey—from astrophysics to monitoring and alerting—highlights how diverse backgrounds can bring unique strengths to the team.
Stacy added a broader perspective on teaming across roles and disciplines:
“It’s not just reliability engineers doing SRE, but everyone involved with building the solution.”
She pointed out that as IT becomes more distributed and embedded in everyday life, SRE principles need to be accessible and consumable by more people—not just specialists.
"SREs are often introduced when something has gone wrong... introduced on people's worst day."
Stacy compares an organization's journey toward reliable solutions to making a good omelette: both depend on fresh ingredients.
Make SRE Consumable at the Edge: As Stacy put it, “SRE needs to move closer to the edge in a consumable way.”
Celebrate the Non-Fires: MP reminded us, “It’s the almost fires and the non-fires that are hard to count.”
Fresh Ingredients, Not All New: Stacy’s omelette metaphor says it best: “Take a fresh look at everything. Not all new, but fresh.”
🔍 What’s one area in your org where you think you’re “ready”—but haven’t tested that assumption recently?
🧪 How do you recognize your SREs' invisible work?
🧠 What’s a legacy system or process that might just need a “fresh ingredient” instead of a full rewrite?