A Different Kind of "One Bad Day"
In cybersecurity, the concept of "one bad day" emphasizes how a single, well-executed cyberattack can cause significant disruption, financial loss, reputational damage and long-term consequences for an organization. This underscores the importance of robust cybersecurity measures, continuous monitoring and proactive defense strategies. High-profile examples like the WannaCry ransomware attack, the SolarWinds supply chain attack, and the Colonial Pipeline incident illustrate how one significant event can disrupt operations and cause widespread damage.
The WannaCry ransomware attack, also known as WannaCrypt, WCry or WanaCrypt0r 2.0, was a widespread and devastating cyberattack that occurred in May 2017. It targeted computers running the Microsoft Windows operating system by encrypting data and demanding ransom payments in Bitcoin. The attack began on May 12, 2017, and quickly spread across the globe, affecting hundreds of thousands of computers in over 150 countries.
The SolarWinds supply chain attack, discovered in December 2020, was a highly sophisticated cyber espionage campaign that targeted SolarWinds, a prominent IT management and monitoring company. This attack is considered one of the most significant and complex cyber incidents in recent history, affecting numerous high-profile organizations and government agencies. The attackers compromised SolarWinds' Orion software platform, which is used for network management and IT monitoring by thousands of organizations globally.
The Colonial Pipeline incident, which occurred in May 2021, was a significant cyberattack that disrupted one of the largest fuel pipelines in the United States, underscoring the vulnerabilities in critical infrastructure and the far-reaching impact of ransomware. On May 7, 2021, Colonial Pipeline, which operates a pipeline system spanning over 5,500 miles and supplying nearly half of the East Coast's fuel, was hit by a ransomware attack.
CrowdStrike Event: A Different Kind of "One Bad Day"
A faulty software update issued by security giant CrowdStrike on July 18, 2024, resulted in a massive overnight outage that affected Windows computers around the world, disrupting businesses, airports, train stations, banks, broadcasters and the healthcare sector. CrowdStrike said the outage was not caused by a cyberattack but was the result of a "defect" in a software update for its flagship security product, Falcon Sensor. The defect caused Windows computers with Falcon installed to crash before fully loading. An estimated 8.5 million Windows devices were impacted.
In one day, we experienced the worst tech outage to date, one that caused more outages and disruptions than any recorded cyberattack. Obviously, these were all bad days, but how should we categorize the CrowdStrike event? Is it a "one bad day" event? The reason I ask is that we often get caught up with adversaries and intent and lose focus on probability and consequence. If someone maliciously does something like WannaCry, SolarWinds or Colonial Pipeline, then that's a cyberattack, and everyone jumps in to discuss how their technology or service would have prevented such events. But what about when the event is caused by a mistake or unintentional action?
The Real Risk: Routine Actions by Well-Intentioned Personnel
I believe our greatest risk to critical infrastructure is not a malicious event but rather routine actions by well-intentioned personnel. It’s simple: the engineer or technician doing their job but making a mistake, like a typo, taking something out of service without fully understanding the impact, or lifting the wrong wire, is far more likely to cause problems. This is much more probable than someone successfully bypassing numerous security layers, or an employee becoming so upset that they intentionally cause harm, especially at the operational technology (OT) layer.
From a probability standpoint, an engineer making a change that negatively impacts a process is far more likely than an external threat or malicious activity. Few cyberattacks successfully penetrate down to the OT layer, and there have been few documented cases of disgruntled employees taking actions against the industrial control system (ICS) environment. However, there are numerous instances every year of well-intentioned individuals changing ICS configurations and causing process upsets or unit trips. Perhaps assigning supervillain names to these operational upsets and shutdowns could capture more attention from vendors, media and analysts. Here are just a few real-world examples:
Example 1: The Clashmaster
They’re known for creating chaos by pitting systems against each other, causing conflicts and breakdowns in industrial processes.
The control system was inadvertently configured so that two applications were writing to the output on the same flow controller. At a certain time, both applications became active and began "fighting" each other, moving the controller up and down frequently. This led to a failure in the valve packing, which caused a unit shutdown.
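This is exactly the kind of conflict that better configuration management can surface before it reaches the plant. As a rough sketch only (the application names, tag names and data layout below are invented, not taken from any real DCS), a check for output tags written by more than one application might look something like this:

```python
from collections import defaultdict

# Hypothetical configuration data: which output tags each control application
# writes to. In a real system this would be extracted from the DCS project
# files rather than hard-coded.
app_outputs = {
    "advanced_control_app": ["FIC-101.OUT", "TIC-204.OUT"],
    "startup_sequence_app": ["FIC-101.OUT", "PIC-310.OUT"],
}

def find_conflicting_writers(app_outputs):
    """Return output tags that more than one application writes to."""
    writers = defaultdict(set)
    for app, outputs in app_outputs.items():
        for tag in outputs:
            writers[tag].add(app)
    return {tag: apps for tag, apps in writers.items() if len(apps) > 1}

for tag, apps in find_conflicting_writers(app_outputs).items():
    print(f"WARNING: {tag} is written by more than one application: {sorted(apps)}")
```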
Example 2: The Faultline Phantom
This villain exploits gaps in documentation and testing procedures to create system-wide failures and shutdowns, blending into the background while causing significant disruptions.
A major North American petrochemical facility extended the periods between turnarounds, which forced it to perform online interlock testing.
- The procedure called for bypassing a safety instrumented system (SIS) output and ramping the transmitter value to test the interlock.
- As expected, the interlock in the SIS tripped but did not trip the shutdown valve.
- Due to inadequate documentation, the testers were unaware of a configured link to operator start-up assistance logic in the distributed control system (DCS).
- The DCS logic sensed the interlock trip, placed all controllers in manual, and set all valve outputs to the fail-safe position (shutdown).
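An undocumented link like this can be found ahead of time if the interlock's downstream references are enumerated before the test. Here is a minimal, purely hypothetical sketch (block and tag names are invented) of cross-referencing an SIS interlock tag against the DCS logic that reads it:

```python
# Example data only; in practice these references would come from an export of
# the DCS configuration, not from a hand-written dictionary.
dcs_logic_references = {
    "STARTUP_ASSIST_U200": ["SIS.INTERLOCK_I-4502", "FIC-210.PV"],
    "COMPRESSOR_PERMISSIVE": ["SIS.INTERLOCK_I-4410"],
}

def blocks_referencing(interlock_tag, references):
    """Return the DCS logic blocks that read the given SIS interlock tag."""
    return [block for block, tags in references.items() if interlock_tag in tags]

affected = blocks_referencing("SIS.INTERLOCK_I-4502", dcs_logic_references)
if affected:
    print("Review before testing - DCS logic linked to this interlock:")
    for block in affected:
        print(f"  {block}")
```

Had the testers seen the start-up assistance block in a list like this, the bypass plan could have accounted for it.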
Example 3: Alarmatrix
This villain manipulates alarm systems to trigger unintended shutdowns and system failures, using misconfigured settings to wreak havoc on critical infrastructure.
An alarm annunciation was used as an input to a boiler logic program. The alarm's trip point was changed to a value that was not consistent with when the boiler logic should activate. When the alarm activated, it triggered the logic, which led to the shutdown of the boilers.
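A simple consistency check between alarm trip points and the logic that consumes them could have caught this change before it was approved. The sketch below is illustrative only; the tag name, setpoints and data structures are hypothetical:

```python
# Current alarm trip points versus the values the boiler logic was designed
# around. All tags and numbers are invented for illustration.
alarm_setpoints = {"BOILER_DRUM_LEVEL_LO": 25.0}
logic_assumptions = {"BOILER_DRUM_LEVEL_LO": 15.0}

def check_alarm_consistency(alarms, assumptions, tolerance=0.0):
    """Flag alarms whose trip point differs from the value the logic expects."""
    issues = []
    for tag, expected in assumptions.items():
        actual = alarms.get(tag)
        if actual is None or abs(actual - expected) > tolerance:
            issues.append((tag, expected, actual))
    return issues

for tag, expected, actual in check_alarm_consistency(alarm_setpoints, logic_assumptions):
    print(f"MISMATCH: {tag} logic expects a trip at {expected}, alarm is set to {actual}")
```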
Example 4: Range Reaver
This villain thrives on miscommunication and oversight in calibration processes, causing discrepancies and errors that lead to critical failures and shutdowns.
The instrument tech was scheduled to change the range on an instrument, a change that also required the DCS engineer to re-range the corresponding tag. The instrument tech was called to another task and never re-ranged the instrument. Not knowing this, the DCS engineer proceeded with changing the range on the tag. The operator then saw a value that was off by a factor of 10 and took action to correct the apparent problem. This led to ruptured tubing and a unit shutdown.
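The factor-of-10 error falls straight out of the signal scaling. As a simplified illustration (the ranges and values below are made up), here is what happens when the transmitter and the DCS tag are scaled over different ranges:

```python
def to_milliamps(value, range_low, range_high):
    """Convert a process value to a 4-20 mA signal using the transmitter's range."""
    return 4.0 + 16.0 * (value - range_low) / (range_high - range_low)

def to_engineering_units(milliamps, range_low, range_high):
    """Convert a 4-20 mA signal to engineering units using the DCS tag's range."""
    return range_low + (milliamps - 4.0) / 16.0 * (range_high - range_low)

true_flow = 50.0                                        # actual process value
signal = to_milliamps(true_flow, 0.0, 100.0)            # transmitter still ranged 0-100
displayed = to_engineering_units(signal, 0.0, 1000.0)   # DCS tag re-ranged to 0-1000

print(f"Actual value: {true_flow}, value shown to the operator: {displayed}")  # 50 vs 500
```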
Even though I've tried to inject some humor by giving these events villain names, they actually stem from routine actions by well-intentioned personnel (essentially, the real superheroes) making changes to ICS configurations. Better management of the automation system configuration could have prevented each situation.
Conclusion: Focus on Likelihood and Consequence
If we want to start taking OT risk reduction seriously, we can’t get fixated on intent. We must focus on likelihood and consequence. Going back to a simple risk equation - risk = likelihood x consequence - if the consequence of an inadvertent change is the same as that of a cyberattack (the process goes down), and the likelihood of an inadvertent change is orders of magnitude greater than that of a successful external attack, then the answer is clear on where we should focus. It may not be the most popular or widely discussed topic, but it's one we all need to be talking about.
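As a back-of-the-envelope illustration of that equation (the likelihoods and cost below are assumptions, not measured statistics), even rough numbers make the comparison lopsided:

```python
# Illustrative assumptions only - these are not real probabilities or costs.
consequence = 1_000_000      # cost of an unplanned unit shutdown, in dollars
p_cyberattack = 0.01         # assumed annual likelihood of a successful OT cyberattack
p_inadvertent_change = 0.50  # assumed annual likelihood of a harmful inadvertent change

risk_cyberattack = p_cyberattack * consequence
risk_inadvertent_change = p_inadvertent_change * consequence

print(f"Cyberattack risk:        ${risk_cyberattack:,.0f} per year")
print(f"Inadvertent-change risk: ${risk_inadvertent_change:,.0f} per year")
```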