🎧 Embracing Explainability in SRE Automation:
In this enlightening episode of The Making of the SRE Omelette Podcast, we sat down with Jerry Cuomo, IBM Fellow, VP, and CTO of Apply Hybrid Cloud and AI, to delve into the world of automation within Site Reliability Engineering (SRE). Here’s a snapshot of the key insights shared:
Automation Unleashed: Jerry kicked off by defining automation as the core essence of building bots and capturing operational processes. "Code is the essence of building a bot... It's the essence of automation and the repeatability that you get in automation," he explained, emphasizing its role in creating scalable, reliable software systems.
Trust Through Transparency: One of the major hurdles in adopting automation is the lack of trust among teams. Jerry highlighted the importance of explainability in overcoming this barrier. "Explainability is really you're not going to get away with, 'Hey, my job is on the line. I'm not going to let this piece of unknown black box software automate my service and reliability,'" he stressed.
Evolution of Trust: Drawing from history, Jerry noted that even with groundbreaking technologies like IBM's autonomous workload management on the mainframe, earning trust took time—often 10 years or more. He underscored the significance of a ‘training mode’ in automation tools, allowing users to preview actions without immediate execution, thus gradually building confidence.
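To make the 'training mode' idea concrete, here is a minimal Python sketch of a preview-first automation wrapper. It is an illustration only, not how any IBM tool works: the `Action` and `Automation` classes, the `training_mode` flag, and the log format are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """One automated step plus a human-readable reason for taking it (hypothetical)."""
    name: str
    reason: str
    run: Callable[[], None]


@dataclass
class Automation:
    """Hypothetical wrapper that previews actions until the team opts in."""
    training_mode: bool = True  # preview only by default
    log: List[str] = field(default_factory=list)

    def apply(self, action: Action) -> bool:
        """Record what would happen and why; execute only outside training mode."""
        prefix = "WOULD RUN" if self.training_mode else "RUNNING"
        self.log.append(f"{prefix}: {action.name} (because {action.reason})")
        if self.training_mode:
            return False  # nothing was executed
        action.run()
        return True
```

With `training_mode` left on, every call to `apply` only adds an explanation to `log`, so a team can review what the bot would have done and why. Flipping the flag to `False` makes the same code path execute the action for real.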
Capturing Tribal Knowledge: Discussing the challenges of scaling SRE practices, Jerry pointed out the pitfalls of relying solely on tribal knowledge. "What does that mean, tribal? It means word of mouth... Maybe in a modern data center, computers that have lots of yellow stickies as a way to express a runbook or a procedure or a process." He advocated for codifying such knowledge to ensure consistency and scalability.
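As a rough illustration of what codifying a sticky-note runbook could look like, the Python sketch below turns a procedure into an ordered list of steps that can be run and audited. The `Step` class, the `run_runbook` function, and the "disk full" procedure are assumptions invented for this example, not something described in the episode.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One runbook step (hypothetical structure)."""
    description: str                 # what the sticky note used to say
    action: Callable[[Dict], None]   # how to do it, expressed as code


def run_runbook(steps: List[Step], context: Dict) -> List[str]:
    """Run each step in order and return an audit trail of what was done."""
    trail = []
    for number, step in enumerate(steps, start=1):
        step.action(context)
        trail.append(f"{number}. {step.description}")
    return trail


# Hypothetical "disk full" procedure; the lambdas only record state
# in the context dict and stand in for real operational commands.
DISK_FULL_RUNBOOK = [
    Step("Rotate logs", lambda ctx: ctx.update(logs_rotated=True)),
    Step("Clear temp files", lambda ctx: ctx.update(tmp_cleared=True)),
]
```

Because the procedure now lives in code, it runs the same way every time, can be reviewed like any other change, and leaves a trail that explains what was done.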
Achieving Work-Life Balance: Addressing the Google SRE principle of capping operational toil at 50% of an engineer's time, so that the rest goes to engineering work such as automating repetitive tasks, Jerry acknowledged that this ideal isn't easily attainable without prior experience. "You can't get to that 50% unless you've lived through that 50% and you have experience..." he noted, stressing that experience, both hard-earned and represented in code, is crucial for achieving this balance.
Engagement Questions:
How do you ensure transparency and explainability in the automation tools you use? Share your experiences!
Can you recall instances where explaining the logic behind automated decisions significantly improved team acceptance and trust?
In your organization, how effectively is tribal knowledge captured and integrated into automated workflows?