CrowdStrike: a root cause analysis
Tomorrow is the opening ceremony of the 33rd Olympic Games. Paris last hosted the Olympics 100 years ago; it’s the second city after London to host three Olympic games (1900, 1924, 2024).
Issue #127: how to communicate across continents & how to be non-judgmental
By Harris Sockel
At Medium, like most tech companies, we do something called a Root Cause Analysis (RCA) whenever a bug takes down our platform. “Takes down” is kind of drastic; we do it whenever a critical service, like this email that landed in your inbox today, doesn’t work. It’s essentially a way for us to ask: Why? And also: How can we make sure it doesn’t happen again?
The best RCAs I’ve seen keep asking why until they reach the deepest possible root cause. Often, this is something cultural. It’s a habit or process, not simply a technical bug.
By now, we all know what happened on Friday:
Here’s a quick root cause analysis, using Medium posts.
Why did this happen?
CrowdStrike, security software used by businesses to safeguard Windows from cyberattacks, pushed out a faulty update.
Why did the update fail?
Here’s a great analogy from writer and editor Dinah Davis. Imagine you’ve installed a Nest thermostat in your house. Then, the makers of Nest drop an update that’s supposed to help cool or heat your house better, but it doesn’t play nice with your home’s infrastructure. Instead, your boiler or A/C goes berserk. CrowdStrike updated one layer of Windows in a way that caused Windows’ deeper layers to crash. (Want more detail? Head here.)
How did that mismatch happen?
Kevin Beaumont, a CrowdStrike customer and cybersecurity professional, digs deep into this question on Medium. Many companies buy cybersecurity software to comply with federal regulations. They’re doing it to check a box. They don’t always understand the software they buy and they don’t check on it.
“This is going to sound controversial,” he writes, but “I think we put way too much trust in cybersecurity nerds like me, and there’s a lack of transparency and accountability.” Cybersecurity software needs god-mode privileges in Windows to function. This is fine-ish, most of the time, yet these companies are “pushing out updates constantly, often many times a day, with zero customer visibility, zero accountability and zero regulatory scrutiny.”
How can we avoid another incident like this?
Build redundant systems: If you run an airline, get a backup operating system. Also, test any update before it deploys. And lastly, to all the VPs of Ops at airlines, hotels, and hospitals out there: Ask questions whenever you find yourself doing anything simply to check a box.
One more story: about pen pals
I am obsessed with English teacher Evan Purcell’s tell-all about how he tried and failed to set up cross-continental pen pal programs while teaching in China and Kazakhstan. The best part of this post? Real letters written by tweens. I mean: “I’m so happy to be your pen pal!!!!! This is very cool! Tell me everything about yourself!!! I’m sure you’re the smartest, coolest boy.”
Thinking back to his first failed pen pal experiment, Purcell realizes he’d given his students too much freedom, especially when they were writing their first letters to new friends overseas. An uncurious, un-genuine introduction changes the tenor of an entire conversation. Another lesson he learned: American sarcasm doesn’t translate well.
Your daily dose of practical wisdom: on (not) passing judgment
Think back to your most regretful 15 minutes before rushing to judge any stranger.
Learn something new every day with the Medium Newsletter. Sign up here.
Edited and produced by Scott Lamb & Carly Rose Gillis
Questions, feedback, or story suggestions? Email us: tips@medium.com