Intro to Resilience Engineering with Michelle Casey (Episode 101)
This week I'm joined once more by SRE leader Michelle Casey, who gives a broad and shallow introduction to resilience engineering. We cover...
🏋️‍♀️ Reliability VS Robustness VS Resilience
🧩 What is a complex system?
🔢 Safety one/safety two
🧠 Mental models
😩 Human error
...and so much more.

Resources from this episode:
Four concepts for resilience (paper) by Dr. David Woods https://www.researchgate.net/publication/276139783_Four_concepts_for_resilience_and_the_implications_for_the_future_of_resilience_engineering
Building and revising adaptive capacity sharing for technical incident response (paper) by Dr Richard Cook and Dr Beth Long https://www.researchgate.net/publication/344259449_Building_and_revising_adaptive_capacity_sharing_for_technical_incident_response_A_case_of_resilience_engineering
Systems Thinking for Incident Analysis (talk) by Laura Nolan from LFI Conf 23 https://www.youtube.com/watch?v=-uXGg3g2yps
How Complex Systems Fail (website) by Dr. Richard Cook https://how.complexsystems.fail/
A Tale of Two Safeties (book) by Erik Hollnagel https://erikhollnagel.com/A Tale of Two Safeties.pdf
From Safety One to Safety Two (book) by Erik Hollnagel https://www.england.nhs.uk/signuptosafety/wp-content/uploads/sites/16/2015/10/safety-1-safety-2-whte-papr.pdf
Resilience: It's not you, it's the System (talk) by Dr Carl Horsley https://www.youtube.com/watch?v=ugC3GTKt23U
Above the line / Below the line (paper) by Dr Richard Cook (not original link) https://www.researchgate.net/figure/Above-the-Line-Below-the-Line-framework-adapted-with-permission-Cook-Woods-2016_fig3_333091997
How Your Systems Keep Running Day After Day (talk) by John Allspaw https://www.youtube.com/watch?v=xA5U85LSk0M
Behind Human Error (book) https://www.amazon.com.au/Behind-Human-Error-David-Woods/dp/0754678342
The Field Guide to Human Error Investigations (book) by Sidney Dekker https://www.humanfactors.lth.se/fileadmin/lusa/Sidney_Dekker/books/DekkersFieldGuide.pdf
The Howie Guide (paper) by Dr Laura Maguire, Nora Jones and Vanessa Granda https://howie-guide.pagerduty.com/
Resilience Engineering: Where do I start? (website) by Lorin Hochstein https://www.resilience-engineering-association.org/resources/where-do-i-start/
The STELLA report (paper) https://snafucatchers.github.io/
DORA Community Discussion - Resilience Engineering (discussion) https://www.youtube.com/watch?v=g3cEJ7njJbc
This Is Fine! (podcast) by Colette Alexander and Clint Byrum https://www.thisisfinepod.com/the-pod
--------
39:36
Learning with John Allspaw (Episode 100)
This week on the 100th episode I'm joined by DevOps and Resilience Engineering legend John Allspaw to talk about learning (especially from incidents). We discuss...
📒 Classroom VS situated learning
🤝 The myth of the perfect handover
ITIL as a coping strategy to try and make sense of the organic, wild, and messy
🥕 How you cannot incentivise to avoid incidents (it doesn't work that way)
❤️‍🩹 You can't understand how something is broken unless you know how it's supposed to work in the first place
...and much more.

Resources from this episode:
Pre-Accident Investigations by Todd Conklin https://www.amazon.com.au/Pre-Accident-Investigations-Introduction-Organizational-Safety/dp/1409447820
Working at the Center of the Cyclone by Dr. Richard Cook https://itrevolution.com/articles/center-of-the-cyclone-dr-richard-cook/

To join the Resilience in Software Foundation head over to: https://resilienceinsoftware.org/

You can find John on:
Website: https://www.kitchensoap.com/
LinkedIn: https://www.linkedin.com/in/jallspaw/

You can find Adaptive Capacity Labs here: https://www.adaptivecapacitylabs.com/

You can find Stephen on:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Bluesky: https://bsky.app/profile/slightreliability.bsky.social
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre
--------
48:17
Focusing on What Matters with Trent Hornibrook (Episode 99)
This week I'm joined by SRE leader Trent Hornibrook who shares a story about how he improved on-call early in his career, and then we explore the broader theme of focusing on the things that matter in observability, incident response, on-call, and beyond. We discuss...
🔌 Empowering engineers to implement change in your org
🧑‍🍼 Focusing on what matters (customer & business > technology)
👀 Not just adding more monitoring as the output of each PIR
😎 How autonomy can lead to accountability
🌳 How to influence change in an organisation
...and much more.

You can find Trent on:
LinkedIn: https://www.linkedin.com/in/trenthornibrook/

You can find Stephen on:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Bluesky: https://bsky.app/profile/slightreliability.bsky.social
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre
--------
29:28
The Root Cause Fallacy with Andrew Hatch (Episode 98)
This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore...
🌌 Is the root cause of every incident the big bang?
🦖 How the value of root cause degrades as complexity increases
🫣 That if the culture is not blameless, people will hide things
🌳 Alternative approaches to root cause analysis such as branching timelines
🙋 Getting someone without skin in the game to facilitate your blameless post-mortems
...and much more.

You can find Andrew on:
LinkedIn: https://www.linkedin.com/in/hatchman76/

Check out Andrew's SREcon21 talk 'Learning from Complex Systems' which covers many of the topics introduced in this episode: https://www.youtube.com/watch?v=5pKGW61Ryvo

You can find Stephen on:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Bluesky: https://bsky.app/profile/slightreliability.bsky.social
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre
--------
32:22
Synthetic Monitoring with David Dick (Episode 97)
This week I'm joined by David Dick from 2 Steps to (finally!) discuss synthetic monitoring. We cover...
🤖 What is synthetic monitoring?
🦾 What are the benefits and drawbacks to using it?
☢️ Non-web based synthetics (the tough stuff)
🍹 Combining RUM and synthetics
🫢 Does synthetics need an OTEL-like framework?
...and much more.

You can find David on:
LinkedIn: https://www.linkedin.com/in/david-dick/

You can find more about 2 Steps at https://2steps.io/#

You can find Stephen on:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Bluesky: https://bsky.app/profile/slightreliability.bsky.social
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre