How Did We Get in This Mess?
You’ve seen it in movies. Probably over-dramatized, but not by much. A big problem has come up in some control room (it doesn’t matter what kind). It could be because of a malfunction, terrorists, or even aliens. The computer control screens light up and start flashing. Horns are blaring and rotating beacons activate. Everyone is shocked and confused — it’s a chaotic scene.
Sound far-fetched? It’s not. In the lead-up to actual major process accidents, operators have often faced hundreds to thousands of alarms within just a few minutes. They experience alarm listings scrolling too fast to read, process computer graphics covered in bright flashing symbols, and loud horns that recur as soon as they are silenced. The alarm system becomes a nuisance distraction to the operator instead of a useful tool to help deal with abnormal conditions. The likelihood of a successful outcome to the situation diminishes, and process outages or even damage can result. So, how did our alarm systems get this bad?
A Short History
Like many problems, this one began with the best of intentions. In the good old days (say, before the 1990s), a typical control room had a wall full of individual process indicators, lights, switches and moving-pen charts. These items took up a lot of space, which was always in short supply. The alarm system was simple – a rectangular array of a few dozen (at most) labeled windows that individually lit up and flashed based on their process connection. This lightbox also incorporated a horn that would sound and an acknowledge button to silence the horn and change the flashing light to a steady light. (This acknowledge button was often equipped by the operators with a wedge of paper or coin to hold it in and keep the infernal noise from happening in the first place. This user modification would almost certainly be in place on the night shift, but it might get removed during the day.)
The control wall concept had many positive things going for it. Considerable thought went into instrument placement and grouping. Normal ranges on the instruments were marked. Trends were always visible if the paper and ink were replaced. The overall health of the process could usually be determined at a glance. The alarm display would often produce repeatable patterns depending on the type of upset.
An early 1990s control wall with an alarm lightbox at the top
These systems also had many disadvantages. Inter-controller connectivity was almost non-existent. Creating complex control schemes was difficult and expensive. The introduction of new controls involved either a costly relocation of adjacent elements or sacrificing their logical placement. Communication of control system information to other systems was generally impractical. And data analysis? Forget about it.
Adding a new alarm to the lightbox was expensive, and the total number of alarms was limited by space availability and cost. Therefore, each one was individually evaluated and justified (which was a good thing!).
This was the situation prior to the digital revolution and the introduction of modern controls, such as Distributed Control Systems (DCSs) and Supervisory Control and Data Acquisition (SCADA) systems. These systems provide substantial operations and business advantages, including expandability, ease of reconfiguring control strategies, and process data history/analysis. Almost everything in the control system becomes changeable without much trouble. (However, all these attributes bring with them their own problems.) For these advantages, older-style control systems such as the one pictured were converted to DCSs and SCADA systems beginning in the 1990s.
The situation for alarms is far different in a DCS than in an older system. Since alarms are displayed in computerized scrolling lists and in graphics, they have unlimited space. And since every “point” in the DCS is essentially a software construct, alarms became free! Most point types in a DCS have several pre-programmed alarms just waiting for the control engineer or other user to configure and activate them by touching a few keys. No justifying, no wiring, no tubing, no plastic engraving – just click, click, click, and you have a new alarm.
And create them we did! With no consistent guidelines available, massive over-configuring of DCS alarms became common. After all, if the manufacturer supplied the functionality of a High, High-High, and even HHH alarm, well then, they must be there for a good reason, so let’s use them all!
With no guidelines or cost for creating alarms, poor practices arose – such as all alarms being enabled by default, or settings chosen by inconsistent rules of thumb or by an individual’s preference. Consistency was low; similar process systems implemented by different teams would have significantly different alarm configurations and behavior. (Engineers love to be creative when we aren’t given any guidelines!) Alarms were often used as an easy method to indicate status (something is on or off) rather than to indicate an actual abnormal situation (something is off, but it is supposed to be on).
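The difference between a status indication and a genuine alarm can be sketched in a few lines of Python. This is only an illustration, not any DCS vendor's logic; the function names and the pump example are hypothetical.

```python
def status_alarm(is_running: bool) -> bool:
    """Poor practice: annunciate whenever the pump is off, expected or not."""
    return not is_running


def abnormal_alarm(is_running: bool, should_be_running: bool) -> bool:
    """Better: annunciate only when the pump is off but is supposed to be on."""
    return should_be_running and not is_running


# A spare pump that is intentionally idle.
print(status_alarm(is_running=False))                             # True: a nuisance
print(abnormal_alarm(is_running=False, should_be_running=False))  # False: nothing abnormal

# A duty pump that has tripped.
print(abnormal_alarm(is_running=False, should_be_running=True))   # True: operator must act
```

The second function needs one more piece of information (what the equipment is supposed to be doing), and that is exactly what a pure status "alarm" leaves out.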
The result? For the operator, this meant that while their former “control wall” likely had fewer than 100 possible alarms, their new DCS console likely had 2,000 to 4,000 configured alarms producing hundreds to thousands of alarm annunciations every day! Even in steady-state process operation, the alarm system activates almost constantly, creating far more alarm occurrences than the operator can possibly understand and act upon. During a process upset, there is an order-of-magnitude increase in the number and speed of alarm occurrences, which renders the alarm system useless and makes it an active hindrance to the operator’s ability to deal with the situation. Time and time again, investigative reports after major industrial accidents have shown that overloaded, bypassed or ignored alarm systems have played a significant role in making a situation worse.
An overloaded alarm summary page, which could be one of many pages during an abnormal situation.
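One rough way to see this overload in your own data is to count annunciations per fixed time window from an alarm log. The sketch below assumes the log is simply a list of timestamps. The 10-minute window, the threshold of 10, and the sample log are illustrative choices for this example, not figures taken from this article.

```python
from collections import Counter
from datetime import datetime, timedelta


def alarms_per_window(timestamps, window=timedelta(minutes=10)):
    """Count annunciations in consecutive fixed windows, starting at the first alarm."""
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    start = ordered[0]
    # Integer division of two timedeltas gives the window index.
    counts = Counter((t - start) // window for t in ordered)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def flood_windows(counts, threshold):
    """Return the indexes of windows whose count exceeds the chosen threshold."""
    return [i for i, n in enumerate(counts) if n > threshold]


# Hypothetical log: 3 alarms in a quiet period, then a burst of 40 twenty minutes later.
t0 = datetime(2024, 1, 1, 8, 0)
log = [t0 + timedelta(minutes=m) for m in (0, 4, 9)]
log += [t0 + timedelta(minutes=20, seconds=5 * s) for s in range(40)]

counts = alarms_per_window(log)
print(counts)                     # [3, 0, 40]
print(flood_windows(counts, 10))  # [2]
```

Fixed windows keep the sketch short. A sliding window would catch bursts that straddle a boundary, at the cost of a little more code.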
Major accidents are only the beginning. It is well known that an ineffective alarm system can make an ordinary process upset either worse or last longer. Such upsets can cost companies a lot of money.
This bad situation was made even worse by the ease of modifying alarms in a DCS. Not only could engineers modify the alarm configuration, but so could operators, maintenance technicians, contractors, managers, and even college interns. Alarm changes are easily made from a console keyboard, and at many installations such changes had little security or oversight for years.
Since the 1990s, manufacturing sites have had rigorous Management of Change (MOC) policies to address almost any physical change in the facility itself, but these often did not apply to changes in the alarms. For decades, many alarm systems have had settings that change from day to day at the individual whim of various people. This is crazy. Imagine if pilots boarding an airliner had no idea where the previous pilots left the settings for the aircraft alarms! For many years, the configuration, alteration, and bypassing of alarms in a DCS have often been ineffectively covered by MOC policies and practices.
The result? Widespread cases of overloaded and ineffective alarm systems.
Where Are We Now?
The alarm problem began to be identified and written about in the early 1990s. Investigations of some major accidents began to cite DCS alarm systems as significant contributing factors.
An example from the UK’s Health and Safety Executive (HSE) report on a major 1994 refinery accident:
• There were too many alarms and they were poorly prioritized
• The control room displays did not help the operators understand what was happening
• In the last 11 minutes before the explosion, the two operators had to recognize, acknowledge, and act on 275 alarms
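To put that last figure in perspective, the arithmetic can be worked out as below. The only inputs are the numbers quoted from the report; splitting the load evenly between the two operators is a simplifying assumption made here.

```python
alarms = 275    # alarms in the final period, per the HSE report
minutes = 11    # length of that period
operators = 2   # operators on duty

rate_per_minute = alarms / minutes            # alarms per minute in the control room
seconds_per_alarm = 60 * minutes / alarms     # average gap between alarms
per_operator_per_minute = rate_per_minute / operators  # assumes an even split

print(f"{rate_per_minute:.1f} alarms per minute")                       # 25.0
print(f"one alarm every {seconds_per_alarm:.1f} seconds")               # 2.4
print(f"{per_operator_per_minute:.1f} alarms per operator per minute")  # 12.5
```

Even split evenly, each operator would have had under five seconds per alarm to recognize it, acknowledge it, and act.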
A variety of articles were written on alarm management and several companies began to offer various products and services to address the issue, including software designed to analyze alarm occurrences. The concept of alarm rationalization was developed to improve existing systems, and dynamic and real-time alarm management software was introduced.
The Abnormal Situation Management (ASM®) Consortium was formed in 1994. It began studying aspects of the problem and greatly increased awareness of it. In 1999, the UK's Engineering Equipment and Materials Users Association (EEMUA) produced a seminal reference document (their Publication 191) on the topic.
In 2006, PAS (now a part of Hexagon) published the first edition of The Alarm Management Handbook, based on hundreds of successful alarm improvement projects. That book has been widely regarded as having the best and most practical knowledge on making alarm systems effective and is now in a second edition.
That book was followed in 2008 by The High Performance HMI Handbook, which discussed ways to make process graphics effective. Among many other topics, the HMI book thoroughly details how to accomplish the effective display of alarms in process graphics.
In 2008, PAS co-authored the Electric Power Research Institute’s recommended practice for alarm management and participated in the American Petroleum Institute’s creation of a similar recommended practice for the pipeline industry. And in 2009, PAS helped write the ANSI/ISA-18.2 Alarm Management standard, which was a major development with regulatory implications. Alarm management is now a thoroughly documented topic, and the knowledge for fixing an alarm system is widely available.
Regulatory Agencies and Alarm Management: YOU WILL COMPLY
The regulatory environment concerning alarm management is complex. The mandatory statements in standards such as ISA-18.2 (and IEC 62682, the international version of ISA-18.2) can be, and are, enforced by regulatory agencies worldwide. This is generally via a “general duty” clause in the regulations, such as “The employer SHALL document that equipment complies with recognized and generally accepted good engineering practices,” a.k.a. “RAGAGEP.”
Standards (such as ISA-18.2 and IEC 62682) are developed by a rigorous consensus process. They are “recognized and generally accepted good engineering practices.” As such, they become enforceable because of these general duty clauses. Fines and penalties have been assessed for failure to comply with standards. You can volunteer to be on an ISA standard development or review committee and help shape the direction of these standards. Please contact us for more information or if you have questions.
In this Taming the Wild Alarm System blog series, we will cover several practical ways to improve your alarm system. We will start with identifying and fixing your worst nuisance alarms with some straightforward methods to achieve a lot of improvement with little effort. In addition to this series, I also recommend checking out the white paper, Making a Big Dent In Nuisance Alarms.
Review other Taming the Wild Alarm System topics in this blog series:
- How Did We Get in This Mess?
- The Most Important Alarm Improvement Technique in Existence
- SHUT UP! Fixing Chattering and Fleeting Alarms
- Just How Bad is Your Alarm System?
- Horrible Things We Find During Alarm Rationalization
- Why did they have to call it “Philosophy?”
- Beyond Alarm Management – Doing More with a Powerful Tool