This recap includes the following episodes and guests.
Guest title reflects the guest's position at the time of the episode.
Recipes:
Automate Manual Toil with Context Awareness
Incremental Innovation Loops
Shift Left with Predictive Insights
Integrate Innovation into Delivery Pipelines
Embrace Failure as a Path to Innovation
In this chapter of the SRE Omelette Cookbook, we dive into the essential ingredients behind two of the most powerful forces shaping our industry: Automation and Innovation. I sat down with two leaders who’ve helped shape IBM’s approach to both: Jerry Cuomo, IBM Fellow and CTO of Automation, and Steven Astorino, VP of Development for IBM Data & AI. Together, we explored how teams can move from manual toil to scalable systems, and from isolated ideas to integrated innovation. Here’s what we learned.
Goal: Identify and automate high-friction, repetitive tasks where human effort is stretched thin.
Jerry reminded us that automation is not just about efficiency—it’s about trust and context. He shared how IBM’s mainframe automation took over a decade to gain adoption. The breakthrough came when IBM introduced a “training mode” that showed what automation would do before it acted.
“The best way to gain trust is to fully understand how it’s going to react in that situation.”
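To make the “training mode” idea concrete, here is a minimal sketch of a runbook that can preview its steps before performing them. It is not IBM’s implementation; the `Runbook` and `Action` names, the `training_mode` flag, and the in-memory `state` dictionary are all hypothetical, chosen only to illustrate the pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """One automated step: a human-readable description plus the code that performs it."""
    description: str
    run: Callable[[], None]


@dataclass
class Runbook:
    """A named sequence of actions (hypothetical structure, for illustration only)."""
    name: str
    actions: List[Action] = field(default_factory=list)

    def execute(self, training_mode: bool = True) -> List[str]:
        """Return a log of the steps; only perform them when training_mode is False."""
        log = []
        for action in self.actions:
            if training_mode:
                # Preview only: describe the step without touching the system.
                log.append(f"WOULD RUN: {action.description}")
            else:
                action.run()
                log.append(f"RAN: {action.description}")
        return log


# Assumption: a plain dict stands in for a real system under management.
state = {"service": "degraded", "cache": "stale"}


def clear_cache() -> None:
    state["cache"] = "empty"


def restart_service() -> None:
    state["service"] = "healthy"


runbook = Runbook("recover-service", [
    Action("clear the cache", clear_cache),
    Action("restart the service", restart_service),
])

# Training mode is the default, so nothing changes until someone opts in.
preview = runbook.execute()
```

Defaulting to preview means an operator has to opt in to real changes, which is one simple way to let people see how the automation will react before they trust it to act.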
He also reflected on the early days of operations, where runbooks were sticky notes and tribal knowledge was passed by word of mouth.
“You don’t start with 50%. You start with close to 0% and a lot of experience. And you build up from there with code.”
This is where automation becomes sustainable—when it’s built on real-world experience and codified into repeatable, reliable systems.
Jerry also stressed that automation needs to be grounded in repeatability: “Repeatability drives scalability, drives high reliability, and is embodied in code.” If you’re still relying on tribal knowledge, that’s a signal to start capturing it in code.
Goal: Create structured, time-boxed environments where teams can experiment and deliver focused breakthroughs.
Steven shared how IBM’s Area 631 program created space for focused innovation. The model is simple: “Six people, three months, one breakthrough.” It’s a structured way to explore new ideas without the distractions of day-to-day work.
It wasn’t just the cool space or the snacks (though those helped). It was the freedom to break rules and focus. “We eliminate all the blockers. There are really no rules. Break it or make it in three months.” The freedom, paired with a clear timeline, helped teams move fast and stay focused. That’s how you create space for real breakthroughs.
And the results weren’t just prototypes; they were production-ready. “Once they graduate, we decide where they fit in… typically, they end up improving one of our technologies or features.” That’s a great example of how to turn innovation into impact.
Goal: Use data and AI to anticipate issues before they impact users or systems.
Jerry painted a vision of proactive operations. Imagine a system that alerts you before a code change causes an outage:
“Hey Kevin, if you push this, you’re not going to that hockey game—there’s a 94% chance it’ll cause an outage.”
This kind of insight requires data from across the lifecycle—from GitHub commits to incident logs—and the ability to connect it meaningfully.
“Great SRE practices are fueled by innovations that lower costs, reduce time to fix issues, and ultimately improve customer sentiment.”
It’s not just about automation—it’s about contextual automation that understands business impact and helps teams act before problems escalate.
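One very rough way to picture connecting commit data to incident logs is a historical incident rate per file. The sketch below is an assumption-laden illustration, not the system Jerry described: the `history` records are made up, the function names are hypothetical, and a real predictor would need far more signal than file paths.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Each past change: (files it touched, whether an incident was later linked to it).
Change = Tuple[List[str], bool]


def incident_rates(history: Iterable[Change]) -> Dict[str, float]:
    """Fraction of past changes touching each file that were linked to an incident."""
    touched = defaultdict(int)
    failed = defaultdict(int)
    for files, caused_incident in history:
        for path in set(files):  # count each file once per change
            touched[path] += 1
            if caused_incident:
                failed[path] += 1
    return {path: failed[path] / touched[path] for path in touched}


def change_risk(files: List[str], rates: Dict[str, float]) -> float:
    """Score a proposed change by the riskiest file it touches.

    Files with no history count as 0.0, which is a simplifying assumption.
    """
    return max((rates.get(path, 0.0) for path in files), default=0.0)


# Made-up history, standing in for joined commit and incident records.
history = [
    (["billing/api.py"], True),
    (["billing/api.py", "docs/readme.md"], False),
    (["docs/readme.md"], False),
    (["billing/api.py"], True),
]

rates = incident_rates(history)
risk = change_risk(["billing/api.py", "docs/readme.md"], rates)
print(f"Estimated risk for this change: {risk:.0%}")
```

Even a crude score like this only works once commits and incidents are linked in the first place, which is the data-connection problem the conversation highlights.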
Goal: Ensure that innovation doesn’t stay in a sandbox—it becomes part of the product.
Steven emphasized that innovation must be actionable. Every Area 631 project is designed to “graduate” into IBM’s core product lines.
“Once they graduate, we decide where they fit in… typically, they end up improving one of our technologies or features.”
He shared how one project—a sales configurator for Cloud Paks—started as a prototype and is now used across IBM to simplify complex licensing and pricing.
“The value is pretty obvious… it’s working, it’s needed, and it’s improving how we operate.”
Goal: Normalize failure as a necessary part of learning and innovation.
One of the most powerful moments came when Steven talked about failure.
“We can’t be too busy to innovate, so we have to create that environment.”
But creating space isn’t enough—you also have to make it safe to fail.
“It’s okay to fail. We forget that sometimes. Everyone wants to try something and expect it to just work or be successful all the time. In fact, that’s far from reality.”
By giving teams a time-boxed, low-risk environment, Area 631 encourages experimentation without fear. It’s not about failing fast—it’s about failing smart, learning quickly, and applying those lessons to build something better.
Both episodes reminded me that automation and innovation are two sides of the same omelette. Automation gives us the time and space to innovate. Innovation gives us the tools to automate smarter. And both require trust, collaboration, and a willingness to fail forward and learn fast.
They are also not one-time efforts. They’re iterative, and they require cultural support. Whether it’s automating log analysis or building a new sales configurator, the key is to start small, measure impact, and build from there.
Whether it’s Jerry’s vision of autopilot for IT or Steven’s Shark Tank-style Hyper Blue program, the message is clear: we need to make room for creativity, invest in people, and build systems that learn and adapt.
These conversations reminded me that the best strategies are the ones that fit into the way teams actually work. Focus automation where toil creates bottlenecks. Create loops where innovation can be tested and refined. And always keep the feedback flowing.
🧠 What’s one piece of tribal knowledge in your team that should be codified today?
🛠️ If you had three months and no blockers, what breakthrough would you chase?
🤝 How do you build trust in automation within your team—what’s your explainability strategy?
🔄 What’s your team’s process for measuring the impact of automation over time?