Great computer meltdown: Say hello to CrowdStrike

The TLDR

On Friday (IST), around 8.5 million computers around the world crashed at almost the same time. The irony: It was caused by an update to a piece of security software. We look at what happened, and what it teaches us about our interconnected world.

Splainer is making changes: Last week, we ran a two-part series on the dismal state of the news industry—and how minnows like splainer are finding novel strategies to survive. Starting today, we are putting some of those new ideas to work:

The first big change is that we dropped everything in one big edition: Big Story, the quiz, good reads and curious facts. Much of this used to be spread across the week.
Headlines That Matter was sent in a separate email today—so you can read it in your inbox or on the app/site.
Yes, it’s annoying but it’s only on Monday. A tech fix requires moving way too much furniture on the back end.
For the rest of the week—Tuesday through Friday—you will only receive the headlines. We have a new, expanded format to make sure you stay updated through the week.

Please send questions, complaints and advice to me at lakshmi@splainer.in. Ok, I’m done. On with the Big Story:)

Ok, tell me what happened…

The gist of it: A cybersecurity company called CrowdStrike sent a software update to about 8.5 million computers around the world. It contained some bad code. Users of Microsoft Windows were soon confronted with the ‘Blue Screen of Death’, which looks like this:

Behold the chaos: Yup, millions of computers crashed at over half of Fortune 500 companies and at US government agencies, including the top US cybersecurity agency. Disruptions were spread across train systems, banks, hotels, television stations, and more. But they were most visible at airports, where thousands of flights were cancelled around the world. This is what the airspace over the United States looked like:

Point to note: The impact in India was relatively mild, except at IndiGo, which cancelled 300 flights. There were some disruptions at banks, hospitals and airports, but most of them coped by switching to manual processes. Example: Handwritten boarding cards.

Why this matters: Measured by the number of machines knocked out, this is the single worst cyber event in history. The closest comparison is the 2017 WannaCry cyber-attack, which hit around 300,000 computers in 150 countries.

All this because of a bit of code?

Yes—a bit of code in software that is supposed to keep your computer secure.

Say hello to CrowdStrike: The cybersecurity company controls about 18% of the $8.6 billion global market for something called “endpoint detection and response software.” In the old days, security software would simply hunt for malicious software, which wasn’t enough as hackers grew more sophisticated. So today we have this:

Now, products known as “endpoint detection and response” software that CrowdStrike develops do far more. They continually scan machines for any signs of suspicious activity and automate a response. But to do this, these programs have to be given access to inspect the very core of a computer’s operating system for security defects. This access gives them the ability to disrupt the very systems they are trying to protect.

CrowdStrike's antivirus platform Falcon has deep system access on “endpoints” like laptops, servers, and routers. It also updates itself “automatically and regularly to defend against new and evolving threats.” That combination is why one bad update could bring all those computers down.

FYI: Falcon is expensive at $50 per computer. That’s why companies install it only on their most important machines.

The disastrous update: CrowdStrike sent out one of its frequent updates to Falcon. There was a defect in the code for Windows, but not for Mac or Linux systems (the company blog has the nerdy details). CrowdStrike has owned up to the mistake but hasn’t shared how it happened.

Experts say that since the company sends out so many updates, it was easy for one to slip through unchecked: “What it looks like is, potentially, the vetting or the sandboxing they do when they look at code, maybe somehow this file was not included in that or slipped through.” In other words, it was likely human error:

“An engineer at CrowdStrike is having a really bad day,” [cybersecurity researcher Mikko Hyppönen] says. Hyppönen suggests that CrowdStrike could have shipped software different to what they had been testing or mixed up files, or there could’ve been a combination of different factors. “Software like this has to go through extensive testing,” Hyppönen says. “That's what we do. That's what CrowdStrike, of course, does. You have to be really careful about what you ship, which is tough to do because security software is updated very frequently.”

But, but, but: This is the second time that a company with George Kurtz (now CrowdStrike's CEO) in a top job has caused this kind of catastrophic meltdown:

On April 21, 2010, the antivirus company McAfee released an update to its software used by its corporate customers. The update deleted a key Windows file, causing millions of computers around the world to crash and repeatedly reboot. Much like the CrowdStrike mistake, the McAfee problem required a manual fix. Kurtz was McAfee's chief technology officer at the time. Months later, Intel acquired McAfee. And several months after that Kurtz left the company. He founded CrowdStrike in 2012 and has been its CEO ever since.

Ah Silicon Valley tech bros—always failing upwards!

And how did they fix it…

You’re gonna laugh. Apparently, the immediate hack that worked was turning your computer “off and on again” as many as 15 times! CrowdStrike’s advice:

The company’s initial “workaround” guidance for dealing with the incident says Windows machines should be booted in a safe mode, a specific file should be deleted, and then rebooted. “The fixes we’ve seen so far mean that you have to physically go to every machine, which will take days, because it’s millions of machines around the world which are having the problem right now,” says Hyppönen from WithSecure.

In other words, there is no magical tech fix that can make the nasty code go away.
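
For the nerds: the "delete a specific file" step in that workaround boils down to something like the sketch below. It is only an illustration, not CrowdStrike's actual tooling. The function name `remove_faulty_files`, the folder and the file pattern are all hypothetical placeholders, since the guidance quoted above doesn't name the file. It defaults to a dry run so that nothing gets deleted by accident.

```python
from pathlib import Path
from typing import List


def remove_faulty_files(driver_dir: Path, pattern: str, dry_run: bool = True) -> List[str]:
    """List (and optionally delete) files in driver_dir whose names match pattern.

    Both arguments are placeholders for illustration: the real folder and
    file name would come from the vendor's official guidance.
    """
    # Collect only regular files that match the glob pattern, in a stable order.
    matched = sorted(p for p in driver_dir.glob(pattern) if p.is_file())

    # In a dry run we only report what would be removed.
    if not dry_run:
        for path in matched:
            path.unlink()

    return [p.name for p in matched]
```

The catch is that a crashed machine can't run anything like this until someone has booted it into safe mode by hand, which is why the clean-up was expected to take days.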

What’s the moral of this story? CrowdStrike bad?

Well, Microsoft certainly wants you to think so. This is the analogy its execs rolled out to cover the company’s behind:

If you have an automobile, and you take that automobile to the fuel station and you get fuel that is not quality fuel or corrupted fuel, your automobile is not going to work properly. The fuel is traversing throughout the entire system of your engine, and it will impact performance. It may impact the vehicle on a whole.

But, but, but: Others see this as the inevitable fallout of Microsoft’s monopoly:

The outage “is the result of a software monopoly that has become a single point of failure for too much of the global economy,” said [antitrust activist] George Rakis… He accused Microsoft of squelching competition by locking in customers and called for it to be “broken up.”

Sounds extreme but many industry experts say Microsoft has been complacent about its many, many security problems—which have been painfully visible in recent years:

Security issues have long been Microsoft’s Achilles’ heel, as computers and servers running its software have been the target of repeated hacks by criminal groups, as well as state-sponsored actors in Russia and China… a report by the Department of Homeland Security’s Cyber Safety Review Board found that, “Microsoft’s security culture was inadequate and requires an overhaul, particularly in light of the company’s centrality in the technology ecosystem.”

Critics say the company has been too busy focusing on its lucrative cloud business to care about older products like Windows. This, in turn, has forced companies to rely on software like Falcon to protect themselves: “If they have a security-first culture, it would either be safer for products like these to exist or these products wouldn’t be needed at all.”

The bottom line: We leave the last word to cybersecurity expert Javad Abed:

The CrowdStrike incident is a stark reminder that relying on a single cybersecurity tool, regardless of a vendor's reputation, creates a dangerous single point of failure. And implementing multiple layers with multiple vendors is crucial for business continuity and protecting critical operations.

In other words, stop scrimping on Plan B.

Reading List

Wired (paywalled) offers the most detailed coverage of the outage issue and what caused it. Bloomberg (paywalled) provides useful insights on Falcon, and the relationship between Microsoft and CrowdStrike. Wall Street Journal (splainer gift link) has more on the problems within Microsoft. USA Today looks at possible solutions to prevent such a situation from happening again. For more on how airports in India were affected by the outage, check out Economic Times.
