This recap includes the following episodes and guests.
Guest title reflects the guest's position at the time of the episode.
Recipes:
Signal Filtering and Triage
Establish Feedback Loops for Model Accuracy
Build Trust through Transparency
Start Small, Scale Smart
Keep the Human in the Loop
Rama Akkiraju, IBM Fellow, CTO AIOps, Forbes 'Top 20 women in AI' 2017, Fortune 'A-team for AI' 2018, Advisor CompTIA AI Council
In this chapter of The Making of the SRE Omelet, we explore how AIOps is moving from theory to practice. I sat down with Rama Akkiraju and Isabell Sippli to unpack what AIOps really looks like in the field—beyond the buzzwords. We talked about how AI can help reduce operational noise, improve incident response, and build smarter systems over time. But we also got real about the challenges: trust, data quality, and the human side of automation.
Goal: Show how AI can reduce noise and help SREs focus on what matters.
We started with the daily grind of SREs: alerts, dashboards, and endless incident calls. Rama shared that in many organizations, SREs spend 80% of their time firefighting. She described a cycle where engineers jump from one incident to the next, with no time to reflect or improve. “By the time they fix one issue, another one has already popped up,” she said. The goal of AIOps? Flip that ratio—free up time for proactive work.
Isabell walked us through the incident lifecycle—detect, isolate, diagnose, fix, and verify—and how AI can support each phase. She explained how anomaly detection can evolve from static thresholds to adaptive baselines: “You can apply an algorithm that automatically detects your baseline, learns that over time, and finds out when there is actually an anomaly.” She also shared how AI can help isolate root causes by understanding service relationships: “You can do it smarter—automatically associate operational data with the location in your service tree.”
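The adaptive-baseline idea Isabell describes can be sketched in a few lines. This is an illustrative toy, not any product's algorithm: it learns a rolling mean and spread for a metric and flags points that deviate sharply from that learned baseline, instead of comparing against a hand-set static threshold. The window size and deviation threshold are assumed defaults.

```python
from collections import deque
import math

def make_detector(window=60, threshold=3.0):
    """Return a closure that flags values deviating from a learned baseline.

    Rather than a fixed static threshold, the baseline (mean) and spread
    (standard deviation) are recomputed over a sliding window, so the
    detector adapts as the metric's normal behavior drifts over time.
    """
    history = deque(maxlen=window)

    def observe(value):
        if len(history) >= 10:  # need a minimal baseline before judging
            mean = sum(history) / len(history)
            var = sum((x - mean) ** 2 for x in history) / len(history)
            std = math.sqrt(var)
            anomalous = std > 0 and abs(value - mean) > threshold * std
        else:
            anomalous = False
        # Deliberate simplification: anomalous points still enter the
        # window, so a sustained shift eventually becomes the new normal.
        history.append(value)
        return anomalous

    return observe
```

Feeding steady latency samples keeps the detector quiet; a sudden spike trips it, with no threshold ever configured by hand.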
These aren’t just technical upgrades—they’re quality-of-life improvements for engineers.
Goal: Emphasize the importance of learning from incidents to improve AI performance.
Next, we explored the power of feedback loops. Rama emphasized that capturing what happened, how it was fixed, and what should be done next time is critical. But in many organizations, that loop is broken. “Even though documentation might exist, it’s often not enforced or structured well enough to be useful,” she said. That leads to a vicious cycle—teams keep solving the same problems without learning from them.
Isabell shared a relatable challenge: “Most monitoring tools can detect if a metric goes back to normal... but can you relate that with the action that you took?” That connection is what allows AI to learn and eventually recommend or even automate the fix. She described this as the “holy grail” of AIOps: probable cause detection plus resolution recommendation.
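The loop Isabell describes can be sketched minimally: when a metric returns to normal, credit the most recent remediation action taken on it, so future anomalies on the same metric can surface that fix as a recommendation. All names here are illustrative assumptions, not a real product API.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackLoop:
    """Tie remediation actions to the anomalies they resolved."""
    open_actions: dict = field(default_factory=dict)   # metric -> last action taken
    known_fixes: dict = field(default_factory=dict)    # metric -> actions that worked

    def record_action(self, metric, action):
        # An engineer (or runbook) logs what was done for this metric.
        self.open_actions[metric] = action

    def metric_recovered(self, metric):
        # Close the loop: credit the last recorded action with the recovery.
        action = self.open_actions.pop(metric, None)
        if action:
            self.known_fixes.setdefault(metric, []).append(action)

    def recommend(self, metric):
        # On a new anomaly, surface what worked before (if anything).
        return self.known_fixes.get(metric, [])
```

Real systems need far more care (was the action actually causal? did the metric recover on its own?), but even this naive association is the step most monitoring stacks skip.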
She also reflected on her own team’s experience: “We don’t always close the feedback loop. If we had something that helped us apply learnings from previous bugs, that would be brilliant.” It’s a reminder that even the experts are still working on this—and that’s okay.
Isabell walks us through the typical incident lifecycle, starting with detect, isolate, and diagnose.
Isabell shares the main challenges customers face and how to overcome them.
Goal: Address the cultural and psychological barriers to adopting AIOps.
Of course, none of this works without trust. Isabell was candid: “Establishing trust in a system that does some things for you automatically is not exactly easy.” She’s seen firsthand how skepticism can stall adoption—especially when people feel their expertise is being replaced.
Rama added that culture plays a huge role. In many organizations, incident calls are still dominated by finger-pointing. Everyone—from app owners to infrastructure teams—joins the call, and the blame game begins. “Every problem gets pointed to the network,” she joked. But behind the humor is a real issue: without a blameless culture, teams won’t trust automation—or each other.
Isabell emphasized the need to bring people along: “You have to take them with you. No operational approach can be imposed.” That means showing how AI reaches its conclusions and giving humans the final say—at least for now.
Goal: Provide a practical entry point for teams looking to adopt AIOps.
So how do you actually get started with AIOps? Isabell’s advice was simple but powerful: “Look at where you have your biggest pain points... Then do you have access to data that might help you overcome that trouble?” She stressed that it’s not about boiling the ocean. Start with a focused use case, validate the value, and build from there.
Rama expanded on this with a thoughtful roadmap. She emphasized that embracing AIOps is a journey, not a one-time implementation. Companies need to begin by assessing their current maturity level—whether they’re still operating in traditional IT roles or have already started adopting SRE practices. “It’s a journey for many of them,” she said. “What would really help is to have a roadmap to get there.”
She recommended starting with a maturity assessment, then building a structured plan that includes:
Defining clear roles and responsibilities for SREs
Establishing career paths to support long-term growth
Implementing tools that support observability and automation
Measuring key metrics like mean time to detect, repair, and resolve
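The metrics in Rama's roadmap are straightforward to compute once incident timestamps are captured consistently. A minimal sketch, assuming each incident record carries `start`, `detected`, and `resolved` timestamps (field names are illustrative, not from a specific tool):

```python
from datetime import datetime, timedelta

def incident_metrics(incidents):
    """Compute mean time to detect (MTTD) and mean time to resolve (MTTR),
    in seconds, from a list of incident records.

    Each record is a dict with 'start' (when the problem began),
    'detected' (when it was noticed), and 'resolved' timestamps.
    """
    n = len(incidents)
    mttd = sum((i["detected"] - i["start"]).total_seconds() for i in incidents) / n
    mttr = sum((i["resolved"] - i["start"]).total_seconds() for i in incidents) / n
    return mttd, mttr
```

Tracking these numbers before and after introducing AIOps tooling is what turns "trust the AI" into a leadership conversation backed by data.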
She also stressed tying the journey to measurable outcomes: “If you can show how AIOps improves mean time to detect, repair, and resolve, it becomes easier to get buy-in from leadership,” she said. It’s not just about tools; it’s about people, process, and progress.
The best approach? Start with a focused use case, validate the value, and scale from there.
Goal: Reinforce that AIOps is about augmentation, not replacement.
Finally, we looked ahead. While the dream of fully autonomous operations is compelling, both Rama and Isabell agreed it’s not realistic—at least not anytime soon. Isabell said it best: “I’m 100% confident we will never run fully autonomously because the space is just too complex.”
Rama described a more gradual evolution: “Initially, AI operates in the human loop. Then it starts to take the lead. Eventually, humans play a role only in critical situations.” That’s the vision—AI as a partner, not a replacement.
Isabell also reminded us that many of the most exciting AI breakthroughs—like pattern recognition in images—can be applied to operations. But even with all that tech, “there will still be a certain amount of rigor and discipline needed to run businesses,” she said. The future is a blend of solid engineering and smart automation.
What’s one area in your operations where AI could reduce noise or false positives today?
How do you currently capture and reuse learnings from past incidents?
What would it take for your team to trust an AI-generated incident learning or resolution?