CrowdStrike: a root cause analysis

The Medium Newsletter
Published in The Medium Blog · Sent as a Newsletter

3 min read · Jul 25, 2024

🥇 Tomorrow is the opening ceremony of the 33rd Olympic Games. Paris last hosted the Olympics 100 years ago; it’s the second city after London to host three Olympic Games (1900, 1924, 2024).
Issue #127: how to communicate across continents & how to be non-judgmental
By Harris Sockel

At Medium, like most tech companies, we do something called a Root Cause Analysis (RCA) whenever a bug takes down our platform. “Takes down” is kind of drastic: we do it whenever a critical service, like this email that landed in your inbox today, doesn’t work. It’s essentially a way for us to ask: Why? And also: How can we make sure it doesn’t happen again?

The best RCAs I’ve seen keep asking why until they reach the deepest possible root cause. Often, this is something cultural. It’s a habit or process, not simply a technical bug.

By now, we all know what happened on Friday:

via Kevin Beaumont: “What I learned from the ‘Microsoft global IT outage’”

Here’s a quick root cause analysis, using Medium posts.

Why did this happen?

CrowdStrike, a company whose security software businesses use to safeguard Windows from cyberattacks, pushed out a faulty update.

Why did the update fail?

Here’s a great analogy from writer and editor Dinah Davis. Imagine you’ve installed a Nest thermostat in your house. Then, the makers of Nest drop an update that’s supposed to help cool or heat your house better, but it doesn’t play nice with your home’s infrastructure. Instead, your boiler or A/C goes berserk. CrowdStrike updated one layer of Windows in a way that caused Windows’ deeper layers to crash. (Want more detail? Head here.)

How did that mismatch happen?

Kevin Beaumont, a CrowdStrike customer and cybersecurity professional, digs deep into this question on Medium. Many companies buy cybersecurity software to comply with federal regulations. They’re doing it to check a box. They don’t always understand the software they buy and they don’t check on it.

“This is going to sound controversial,” he writes, but “I think we put way too much trust in cybersecurity nerds like me, and there’s a lack of transparency and accountability.” Cybersecurity software needs god-mode privileges in Windows to function. This is fine-ish, most of the time, yet these companies are “pushing out updates constantly, often many times a day, with zero customer visibility, zero accountability and zero regulatory scrutiny.”

How can we avoid another incident like this? Build redundant systems: If you run an airline, get a backup operating system. Also, test any update before it deploys. And lastly, to all the VPs of Ops at airlines, hotels, and hospitals out there: Ask questions whenever you find yourself doing anything simply to check a box.

💌 One more story: about pen pals

I am obsessed with English teacher Evan Purcell’s tell-all about how he tried and failed to set up cross-continental pen pal programs while teaching in China and Kazakhstan. The best part of this post? Real letters written by tweens. I mean: “I’m so happy to be your pen pal!!!!! This is very cool! Tell me everything about yourself!!! I’m sure you’re the smartest, coolest boy.”

Thinking back to his first failed pen pal experiment, Purcell realizes he’d given his students too much freedom, especially when they were writing their first letters to new friends overseas. An uncurious, un-genuine introduction changes the tenor of an entire conversation. Another lesson he learned: American sarcasm doesn’t translate well.

Your daily dose of practical wisdom: on (not) passing judgment

Think back to your most regretful 15 minutes before rushing to judge any stranger.

Learn something new every day with the Medium Newsletter. Sign up here.

Edited and produced by Scott Lamb & Carly Rose Gillis

Questions, feedback, or story suggestions? Email us: tips@medium.com
