This recap includes the following episodes and guests.
Guest title reflects the guest's position at the time of the episode.
Recipes:
Signal Filtering and Triage
Establish Feedback Loops for Model Accuracy
Build Trust through Transparency
Start Small, Scale Smart
Keep the Human in the Loop
Rama Akkiraju, IBM Fellow, CTO AIOps, Forbes 'Top 20 women in AI' 2017, Fortune 'A-team for AI' 2018, Advisor CompTIA AI Council
In this chapter of The Making of the SRE Omelet, we explore how AIOps is moving from theory to practice. I sat down with Rama Akkiraju and Isabell Sippli to unpack what AIOps really looks like in the field, beyond the buzzwords. We talked about how AI can help reduce operational noise, improve incident response, and build smarter systems over time. But we also got real about the challenges: trust, data quality, and the human side of automation.
Goal: Show how AI can reduce noise and help SREs focus on what matters.
We started with the daily grind of SREs: alerts, dashboards, and endless incident calls. Rama shared that in many organizations, SREs spend 80% of their time firefighting. She described a cycle where engineers jump from one incident to the next, with no time to reflect or improve. "By the time they fix one issue, another one has already popped up," she said. The goal of AIOps? Flip that ratio: free up time for proactive work.
Isabell walked us through the incident lifecycle (detect, isolate, diagnose, fix, and verify) and how AI can support each phase. She explained how anomaly detection can evolve from static thresholds to adaptive baselines: "You can apply an algorithm that automatically detects your baseline, learns that over time, and finds out when there is actually an anomaly." She also shared how AI can help isolate root causes by understanding service relationships: "You can do it smarter: automatically associate operational data with the location in your service tree."
These aren't just technical upgrades; they're quality-of-life improvements for engineers.
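To make the move from static thresholds to adaptive baselines concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the algorithm Isabell's team uses: the class name `AdaptiveBaseline`, the exponentially weighted mean and variance, and the default parameters are all hypothetical choices.

```python
import math
from dataclasses import dataclass


@dataclass
class AdaptiveBaseline:
    """Hypothetical adaptive baseline using an exponentially weighted mean and variance."""

    alpha: float = 0.1      # smoothing factor: higher values adapt faster (assumed default)
    threshold: float = 3.0  # flag points more than this many standard deviations away
    warmup: int = 10        # number of samples to learn from before flagging anything
    mean: float = 0.0
    var: float = 0.0
    count: int = 0

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous against the learned baseline."""
        if self.count == 0:
            # The first sample seeds the baseline.
            self.mean = value
            self.count = 1
            return False

        deviation = value - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count >= self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )

        if not is_anomaly:
            # Learn only from normal points so outliers do not drag the baseline.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)

        self.count += 1
        return is_anomaly


if __name__ == "__main__":
    baseline = AdaptiveBaseline()
    latencies_ms = [100 + (i % 5) - 2 for i in range(50)] + [500]
    flagged = [v for v in latencies_ms if baseline.observe(v)]
    print(flagged)
```

One design trade-off in this sketch: because anomalous points are excluded from learning, a genuine and permanent level shift would keep being flagged until someone resets the baseline. A production system would need a policy for that case.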
Goal: Emphasize the importance of learning from incidents to improve AI performance.
We explored the power of feedback loops. Rama emphasized that capturing what happened, how it was fixed, and what should be done next time is critical. But in many organizations, that loop is broken. "Even though documentation might exist, it's often not enforced or structured well enough to be useful," she said. That leads to a vicious cycle: teams keep solving the same problems without learning from them.
Isabell shared a relatable challenge: "Most monitoring tools can detect if a metric goes back to normal... but can you relate that with the action that you took?" That connection is what allows AI to learn and eventually recommend or even automate the fix. She described this as the "holy grail" of AIOps: probable cause detection plus resolution recommendation.
She also reflected on her own team's experience: "We don't always close the feedback loop. If we had something that helped us apply learnings from previous bugs, that would be brilliant." It's a reminder that even the experts are still working on this, and that's okay.
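The link Isabell describes, between a metric returning to normal and the action that was taken, can be sketched as a simple time-window join. This is only an illustration: the `Action` and `Recovery` records, the 15-minute window, and the "most recent action wins" rule are assumptions made for the example, and proximity in time suggests a cause without proving one.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Action:
    """A remediation step someone took (hypothetical record shape)."""
    service: str
    name: str
    at: datetime


@dataclass(frozen=True)
class Recovery:
    """A moment when a symptom on a service went back to normal (hypothetical)."""
    service: str
    symptom: str
    at: datetime


def link_actions_to_recoveries(actions, recoveries, window=timedelta(minutes=15)):
    """Count, per symptom, which action most recently preceded each recovery."""
    outcomes = defaultdict(Counter)
    for rec in recoveries:
        candidates = [
            a for a in actions
            if a.service == rec.service
            and timedelta(0) <= rec.at - a.at <= window
        ]
        if candidates:
            latest = max(candidates, key=lambda a: a.at)
            outcomes[rec.symptom][latest.name] += 1
    return outcomes


def recommend(outcomes, symptom):
    """Suggest the action most often seen before recovery, or None if unknown."""
    counts = outcomes.get(symptom)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    actions = [
        Action("checkout", "restart_pod", t0),
        Action("checkout", "scale_out", t0 + timedelta(minutes=5)),
    ]
    recoveries = [Recovery("checkout", "high_latency", t0 + timedelta(minutes=8))]
    outcomes = link_actions_to_recoveries(actions, recoveries)
    print(recommend(outcomes, "high_latency"))
```

The tally is what closes the loop: each recorded recovery adds evidence, so recommendations improve only if actions are captured consistently in the first place.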
Isabell walks us through the typical incident lifecycle, starting with detect, isolate, and diagnose.
Isabell shares the main challenges customers face and how to overcome them.
Goal: Address the cultural and psychological barriers to adopting AIOps.
Of course, none of this works without trust. Isabell was candid: "Establishing trust in a system that does some things for you automatically is not exactly easy." She's seen firsthand how skepticism can stall adoption, especially when people feel their expertise is being replaced.
Rama added that culture plays a huge role. In many organizations, incident calls are still dominated by finger-pointing. Everyone, from app owners to infrastructure teams, joins the call, and the blame game begins. "Every problem gets pointed to the network," she joked. But behind the humor is a real issue: without a blameless culture, teams won't trust automation, or each other.
Isabell emphasized the need to bring people along: "You have to take them with you. No operational approach can be imposed." That means showing how AI reaches its conclusions and giving humans the final say, at least for now.
Goal: Provide a practical entry point for teams looking to adopt AIOps.
So how do you actually get started with AIOps? Isabell's advice was simple but powerful: "Look at where you have your biggest pain points... Then do you have access to data that might help you overcome that trouble?" She stressed that it's not about boiling the ocean. Start with a focused use case, validate the value, and build from there.
Rama expanded on this with a thoughtful roadmap. She emphasized that embracing AIOps is a journey, not a one-time implementation. Companies need to begin by assessing their current maturity level, whether they're still operating in traditional IT roles or have already started adopting SRE practices. "It's a journey for many of them," she said. "What would really help is to have a roadmap to get there."
She recommended starting with a maturity assessment, then building a structured plan that includes:
Defining clear roles and responsibilities for SREs
Establishing career paths to support long-term growth
Implementing tools that support observability and automation
Measuring key metrics like mean time to detect, repair, and resolve
She also highlighted the importance of career paths for SREs. "If you can show how AIOps improves mean time to detect, repair, and resolve, it becomes easier to get buy-in from leadership," she said. It's not just about tools; it's about people, process, and progress.
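For teams that want to track the metrics Rama mentions, the following Python sketch computes them from incident timestamps. Definitions vary between organizations, so the ones used here are assumptions: time to detect runs from start to detection, time to repair from detection to the fix, and time to resolve from start to full resolution. The `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Timestamps for one incident (hypothetical record shape)."""
    started: datetime
    detected: datetime
    repaired: datetime
    resolved: datetime


def mean_minutes(durations):
    """Average a sequence of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def incident_metrics(incidents):
    """Return mean time to detect, repair, and resolve, in minutes.

    Raises statistics.StatisticsError if `incidents` is empty.
    """
    incidents = list(incidents)
    return {
        "mean_time_to_detect_min": mean_minutes(i.detected - i.started for i in incidents),
        "mean_time_to_repair_min": mean_minutes(i.repaired - i.detected for i in incidents),
        "mean_time_to_resolve_min": mean_minutes(i.resolved - i.started for i in incidents),
    }


if __name__ == "__main__":
    history = [
        Incident(
            started=datetime(2024, 1, 1, 10, 0),
            detected=datetime(2024, 1, 1, 10, 10),
            repaired=datetime(2024, 1, 1, 10, 40),
            resolved=datetime(2024, 1, 1, 11, 0),
        ),
        Incident(
            started=datetime(2024, 1, 2, 12, 0),
            detected=datetime(2024, 1, 2, 12, 20),
            repaired=datetime(2024, 1, 2, 12, 30),
            resolved=datetime(2024, 1, 2, 13, 0),
        ),
    ]
    print(incident_metrics(history))
```

Comparing these numbers for the period before and after an AIOps pilot gives a simple baseline for the leadership conversation, provided the definitions stay fixed across both periods.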
The best approach? Start with a focused use case, validate the value, and scale from there.
Goal: Reinforce that AIOps is about augmentation, not replacement.
Finally, we looked ahead. While the dream of fully autonomous operations is compelling, both Rama and Isabell agreed it's not realistic, at least not anytime soon. Isabell said it best: "I'm 100% confident we will never run fully autonomously because the space is just too complex."
Rama described a more gradual evolution: "Initially, AI operates in the human loop. Then it starts to take the lead. Eventually, humans play a role only in critical situations." That's the vision: AI as a partner, not a replacement.
Isabell also reminded us that many of the most exciting AI breakthroughs, like pattern recognition in images, can be applied to operations. But even with all that tech, "there will still be a certain amount of rigor and discipline needed to run businesses," she said. The future is a blend of solid engineering and smart automation.
What's one area in your operations where AI could reduce noise or false positives today?
How do you currently capture and reuse learnings from past incidents?
What would it take for your team to trust an AI-generated incident learning or resolution?