Taming the Wild Alarm System, Part 2
The Most Important Alarm Improvement Technique in Existence
There is a single method that has more effect, at lower cost and with lower effort, than any other technique for improving an existing, poorly performing alarm system. But what do we mean by “poorly performing”? Here are examples from some of the worst-performing alarm systems we have encountered, all of which were solvable:
- Many different control systems with individual alarms that occur over 100,000 times per month
- An alarm system with over 70% of all alarm occurrences (about a thousand a day) caused by instruments that were not working and needed maintenance
- A system so dominated by a few nuisance alarms that 98% of all alarm occurrences came from just seven alarms – averaging over 600 a day
- A system without good management of change, where uncontrolled and untracked manual alarm suppression eliminated 98% of all alarm occurrences (about 18,000 a day) from the operator view. This included suppression of some very important alarms
- Many systems with over 25,000 alarms per day on average, with some exceeding 100,000 (that is, from roughly 1 alarm every 3 seconds to more than 1 alarm per second)
- A system that was in continuous alarm flood, averaging almost 40 alarms per minute for over four days
- A single alarm that occurred over 200,000 times in ONE DAY
- A large networked multi-site facility that generated over a BILLION alarms per year – 2.7 million a day
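To put daily counts like these in operator terms, it helps to convert them into the average spacing between alarms. The following is a minimal sketch of that arithmetic; the function name is ours, chosen for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def seconds_between_alarms(alarms_per_day: float) -> float:
    """Average spacing between alarms, in seconds, for a given daily count."""
    return SECONDS_PER_DAY / alarms_per_day


# Daily counts taken from the examples above.
for per_day in (25_000, 100_000, 2_700_000):
    gap = seconds_between_alarms(per_day)
    print(f"{per_day:>9,} alarms/day -> one alarm every {gap:.2f} s")
```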
At first glance, problems like these seem overwhelming. How do you possibly deal with 50,000 alarms a day? Hah – that’s an easy one! We assure you, with some smartly applied effort, cases like these can be vastly improved in just a few days to a few weeks.
There is a seven-step process for improving existing alarm systems. It is simple and proven effective in over a thousand alarm improvement projects.
1. Develop an Alarm Philosophy document. This is how to do alarms right!
2. Analyze your existing alarm data to establish a baseline and identify your problem areas.
3. Perform “bad actor” alarm resolution.
4. Perform alarm documentation and rationalization (D&R) and create a master alarm database.
5. Implement alarm audit and enforcement technology for management of change.
6. Implement real-time alarm management techniques, such as state-based alarming.
7. Control and maintain your improved system, with ongoing analyses and work processes.
The first three steps are often initiated simultaneously. These three steps are easy, fast, cheap and do not require a lot of internal resources. They are also powerful, which is why they are placed at the start.
The alarm philosophy is important but is not a “prerequisite” for finding and fixing your most frequent alarms. The step of alarm analysis also involves setting up to monitor alarm system performance going forward. Both of these are mandatory requirements of the ISA 18.2 alarm management standard. But even the initial baseline by itself can point you to the crucial Step 3 – finding and fixing your most frequent and nuisance alarms – the “bad actors!” We will cover all the other steps in future blogs.
This bad actor resolution step can cut your alarm rates by 60% to 80% or more. It can go a long way to solving problems like those already mentioned. It can be accomplished in as little as a few days or weeks of part-time effort. It does not have to involve consultants. While there are plenty of problems that it does not solve (such as poor alarm priority selection), it is a great way to make an impressive start that will lend credibility to the entire alarm improvement effort. It will help you get buy-in and develop momentum.
There are several categories of bad actor (nuisance) alarms and several methods for dealing with them. With enough bad actors, an alarm system is rendered useless. This may lead to hazardous plant conditions, since important or critical alarms are lost in the “sea” of bad actor alarms.
Experience shows that a comparatively few configured alarms cause most of the alarm occurrences, which feed all the high alarm rate issues. “Few” means 20 to 50 individual configured alarms. No one ever intentionally designed an alarm to occur 20,000+ times a month, but they exist and they can be fixed!
The top 20 most frequent alarms usually comprise anywhere from 25% to 95% of the entire system load. If those alarms are dealt with successfully, then major system improvement will occur. It is amazing that such high numbers of nuisance alarms exist, because it is doubtful the best control engineer in a company could intentionally design alarms to behave in the ways we will discuss. Yet they do exist; all varieties are in almost every system we analyze.
Figure 1: “Top 10” Most Frequent Alarms on a Single System – 8 Weeks of Data
In Figure 1, only 10 alarms comprise 96% of the total alarm load. The chart is based on only eight weeks of data and several of the alarms went off over 100,000 times. This performance was never intentional and fixing only these 10 alarms would reduce system load by 96%! Interestingly, five of the 10 (the “BADPV” alarms) are indicating specific instruments that are malfunctioning. Fixing 5 instruments should not be difficult.
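The ranking behind a chart like Figure 1 can be sketched in a few lines. This is only an illustration, assuming the alarm journal has been exported as a list of (tag, alarm type) occurrences. The tag names and counts below are invented for the example and are not the data from the figure.

```python
from collections import Counter


def top_alarms(occurrences, n=10):
    """Rank configured alarms by how often they occurred.

    Returns rows of (alarm, count, percent_of_total, cumulative_percent).
    """
    counts = Counter(occurrences)
    total = sum(counts.values())
    rows, cumulative = [], 0
    for alarm, count in counts.most_common(n):
        cumulative += count
        rows.append((alarm, count, 100 * count / total, 100 * cumulative / total))
    return rows


# Hypothetical journal: each entry is one alarm occurrence.
journal = (
    [("FI-101", "BADPV")] * 600
    + [("LI-205", "PVHI")] * 300
    + [("PI-310", "PVLO")] * 60
    + [("TI-415", "PVHI")] * 40
)

for (tag, kind), count, pct, cum in top_alarms(journal, n=3):
    print(f"{tag:8s} {kind:6s} {count:6d} {pct:5.1f}%  cumulative {cum:5.1f}%")
```

The cumulative column is what tells you how few alarms you need to fix to remove most of the load.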
Here are some Step 3 before-and-after examples from fifteen different control systems:
Figure 2: Improvement Amounts from Alarm Bad Actor Resolution
In the above systems, fewer than 50 alarms each were analyzed by the techniques we will cover. The average percent reduction achieved was over 65%. This is a substantial gain for a little bit of work! Wouldn’t you be pleased if you analyzed about 30 alarms and cut your alarm rate by more than half? Here’s how.
Here are the major types of nuisance alarms:
- Chattering alarms (quickly clear, then immediately repeat)
- Fleeting alarms (last only a few seconds before clearing, and might repeat later)
- Stale alarms (are in effect continuously for days, weeks, or months)
- Suppressed alarms (the operator does not see them when they occur, and their suppression is not controlled and tracked)
- Duplicate alarms (dynamic, where one condition causes multiple but different points to alarm)
- Duplicate alarms (configured, where multiple linked points all alarm if any of them has an alarm)
- Nuisance instrument diagnostic alarms (such as “bad measurement” types)
The first two, chattering and fleeting alarms, are the worst! They are the largest contributors to high alarm rates. But fixing them often requires a calculation technique that takes longer to describe than the remainder of this blog space allows. So, they will be fully addressed in the next blog in this series. (If you can’t wait, see the references at the end.)
Stale (Long-Standing) Alarms
Stale alarms come in and remain in alarm for extended periods. Looking for continuous alarms in effect for more than 24 hours is a good starting point. We have found alarms that have been in effect for months and even years. (It is amazing what people will put up with.) They clutter the alarm screens and devalue the perceived importance of all alarms.
Are there truly many abnormal conditions requiring operator action to avoid a consequence that last more than a day? Or for months? Such alarms are often reflecting stable unit conditions, such as equipment that is intentionally shut down. They generally indicate alarms that were not configured in accordance with the principles contained in The Alarm Management Handbook.
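The 24-hour starting point mentioned above is easy to check if you can export the currently active alarms with their activation times. Here is a minimal sketch under that assumption; the alarm names, timestamps, and function name are hypothetical.

```python
from datetime import datetime, timedelta


def find_stale_alarms(active_alarms, now, threshold=timedelta(hours=24)):
    """Return (alarm, age) pairs for alarms active longer than threshold, oldest first.

    active_alarms maps an alarm name to the time it went into alarm.
    """
    stale = [
        (name, now - since)
        for name, since in active_alarms.items()
        if now - since > threshold
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)


# Hypothetical snapshot of the active alarm list.
now = datetime(2024, 3, 15, 8, 0)
active = {
    "P-101 STOPPED": datetime(2023, 11, 2, 14, 30),
    "TI-415 PVHI": datetime(2024, 3, 15, 7, 10),
    "LI-205 PVLO": datetime(2024, 3, 12, 22, 0),
}

for name, age in find_stale_alarms(active, now):
    print(f"{name}: in alarm for {age.days} days")
```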
Stale alarms are dealt with by understanding the process states and hardware involved. They are usually eliminated by reconfiguring them, so they comply with the very definition of an alarm. Alarms that go stale are often not alarms at all – they are merely status indications. They often simply indicate if an item of some sort is “on” or “off.” One should almost NEVER create an alarm that is based on some item just being “on” or “off.” There are always valid circumstances that an item should be off. Instead, the alarm should indicate that “this item is SUPPOSED to be on but is off” (or vice versa). Such a situation is abnormal and requires operator action. Design of such an alarm may require some imagination, or implementation of some logic or a simple state-based alarm method. There will be more on state-based alarming in a future blog.
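The difference between a status indication and a “supposed to be on but is off” alarm can be shown with a tiny truth table. This is a sketch of the idea only, not any vendor's state-based alarming feature. The two function names and the notion of an “expected” state input are our assumptions; in practice that expected state might come from a run command, a unit operating mode, or similar logic.

```python
def naive_off_alarm(actually_on: bool) -> bool:
    """Status-style 'alarm': active whenever the item is off, wanted or not."""
    return not actually_on


def state_mismatch_alarm(expected_on: bool, actually_on: bool) -> bool:
    """Alarm only when the actual state differs from the expected state."""
    return expected_on != actually_on


cases = [
    ("running as intended", True, True),
    ("intentionally shut down", False, False),
    ("stopped while needed", True, False),
    ("running when it should be off", False, True),
]
for label, expected_on, actually_on in cases:
    naive = naive_off_alarm(actually_on)
    mismatch = state_mismatch_alarm(expected_on, actually_on)
    print(f"{label:30s} naive={naive!s:5s} mismatch={mismatch!s:5s}")
```

The naive version goes stale for as long as the equipment is intentionally shut down. The mismatch version stays quiet in that case and annunciates only the two abnormal situations.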
Suppressed Alarms
The initial analysis used to build the bad actor resolution list must also identify any configured alarms that are suppressed. This means that the alarm is still configured, but some sort of override has been selected to eliminate its annunciation to the operator. Almost all control systems have this capability, and it is often abused. Alarm suppression is often uncontrolled. We have found very important alarms that were suppressed for months with no one being aware of it. Alarms are often suppressed because of nuisance behavior, such as chattering, which can be fixed. At the end of the bad actor resolution step, there should be no suppressed alarms left. Any suppression must be rigorously controlled, visible, and tracked; this is the technique called “alarm shelving.”
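To make “controlled, visible, and tracked” concrete, here is a minimal sketch of what a shelving record might carry. The field names are hypothetical, and the time limit on each entry is our assumption rather than something stated above. Real control systems implement shelving in their own ways.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ShelfEntry:
    alarm: str
    shelved_by: str        # who suppressed it (tracked)
    reason: str            # why it was suppressed (tracked)
    shelved_at: datetime   # when it was suppressed (tracked)
    duration: timedelta    # assumed time limit, for illustration only

    def expires_at(self) -> datetime:
        return self.shelved_at + self.duration


def overdue(shelf, now):
    """Entries whose time limit has passed and that need review."""
    return [entry for entry in shelf if now >= entry.expires_at()]


# Hypothetical shelf contents, visible to everyone rather than hidden.
shelf = [
    ShelfEntry("FI-101 BADPV", "operator A", "transmitter work order open",
               datetime(2024, 3, 14, 6, 0), timedelta(hours=12)),
    ShelfEntry("LI-205 PVHI", "operator B", "chattering, fix pending",
               datetime(2024, 3, 15, 5, 0), timedelta(hours=8)),
]
now = datetime(2024, 3, 15, 8, 0)

for entry in overdue(shelf, now):
    print(f"REVIEW: {entry.alarm} shelved by {entry.shelved_by} ({entry.reason})")
```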
Duplicate Alarms
Naturally, there are two types of duplicate alarms.
1. Dynamic Duplicate Alarms
These are alarms that consistently occur within a short time period of other specific alarms. If you use your alarm analysis software to list the alarms always occurring within, for example, one second of each other, you will likely find a good list to work on. Such alarms are highly likely to be multiple annunciations, in different ways, of the same process event. For example, if a pump stops, one might immediately get low discharge pressure, low flow, and low amps alarms. Those others could be valid alarms when the pump is running, but not when it is intentionally stopped and those values are expected.
The individual situation will determine which alarms are kept and which are not, or what logic adjustments must be made.
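If your analysis software does not offer such a report, the one-second search described above can be approximated from an exported event list. The sketch below assumes each activation is a (timestamp in seconds, alarm name) pair. The tags and times are invented, and a real analysis would also compare each pair count against how often each alarm occurs on its own.

```python
from collections import Counter


def co_occurring_pairs(events, window_s=1.0):
    """Count how often two different alarms activate within window_s seconds.

    events: iterable of (timestamp_seconds, alarm_name) activations.
    Returns a Counter keyed by the alphabetically sorted pair of alarm names.
    """
    events = sorted(events)
    pairs = Counter()
    start = 0
    for i, (t, name) in enumerate(events):
        # Advance the window start so events[start:i] are within window_s of t.
        while t - events[start][0] > window_s:
            start += 1
        for _, other in events[start:i]:
            if other != name:
                pairs[tuple(sorted((name, other)))] += 1
    return pairs


# Hypothetical journal: a pump stops three times, and each stop produces
# low pressure, low flow, and low amps alarms in quick succession.
events = []
for t in (100.0, 500.0, 900.0):
    events += [(t, "PI-310 PVLO"), (t + 0.2, "FI-101 PVLO"), (t + 0.4, "II-120 PVLO")]
events.append((300.0, "TI-415 PVHI"))  # unrelated alarm

for pair, count in co_occurring_pairs(events).most_common():
    print(pair, count)
```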
2. Configured Duplicate Alarms
Interconnections between points in a DCS can create cases of duplicate alarm configuration. For example, a process measurement sensor point may be connected to a selector point, to a totalizer point, to a logic point, to a controller point, and so forth. Often a “bad measurement” type of alarm is configured on each point (usually by default), and thus if the sensor point goes into that condition, several simultaneous alarms will result. These distract the operator by annunciating multiple alarms caused by a single event (the one bad sensor). There should only be one such alarm, configured on the point where the operator is most likely to take the action. If the sensor point feeds a separate controller point, the controller would be the proper point to alarm on the bad measurement. This is because the operator action in response to a bad reading is likely to be putting the controller in manual mode and adjusting the output manually. The controller point itself will show that the input measurement has gone bad.
Nuisance Instrument Diagnostic Alarms
It is quite common but still surprising to see large numbers of alarm occurrences indicating a bad measurement or similar instrument problem. These are often in the hundreds or thousands!
Figure 3: Alarm System Dominated by Instrument Diagnostic Alarms
When a loop was designed, did someone tell the control engineer the following? “Oh, and by the way, I want this sensor to go into ‘Bad Measurement’ frequently, and I want at least 650 ‘Bad Measurement’ alarms per week at a minimum.” And, if that had been told to the best control engineer in the company, could they have done it? Probably not! Yet, we find these on almost every system we look at.
Since no instrument was designed to be in such a state, every one of these situations can be fixed, and they should not be tolerated. Either the instrument is misconfigured in range or in “measurement clamping,” or there is an installation problem (e.g., impulse leads filling up). The original justification for installing a flow meter probably did not include a specification that it was OK if it didn’t work half of the time! But people put up with it. We wouldn’t put up with a broken speedometer on our car.
These situations must be addressed. An instrument malfunction removes a process indicator from the operator’s view. The time operators spend confirming the instrument problem reduces their attention to other operator duties. If a non-working instrument is not needed, it should be removed, following a management-of-change (MOC) procedure. An indefinitely broken instrument could be considered an MOC violation.
Decades ago, the available analog instrument sensors had a significant tradeoff between accuracy (significant digits) and range; you could obtain high accuracy only over a small range, probably less than the possible variation of the process. Control engineers were well aware of this tradeoff and were accustomed to designing within those constraints. But when sensors with such constrained ranges are implemented in a DCS, “bad measurement” alarms occur frequently and do not represent an abnormal condition.
The digital electronic revolution that gave us the DCS also gave us much-improved measurement sensors. Modern sensors can generally provide all of the accuracy needed over the entire range across which the process is likely to vary. But some installations continue to follow the older configuration practices and do not consider the consequences of generating lots of bad measurement alarms during conditions such as startup and shutdown. Controller points will usually have “shed modes.” These are predetermined actions taken when an input measurement goes bad, such as going to full output, going to zero output, or maintaining the last output. These should be chosen with care, but it is better still to minimize the possibility of the measurement going bad in the first place!
The default should now be to configure the instrument range for the entire range of possible values the process can have (including shutdown or ambient conditions), and then see if the accuracy you get is enough. If not (rarely, with modern transmitters), buy a better transmitter! But don’t configure the range where you know you will get a bad measurement state at expected conditions.
Differential pressure flow measurements are often the worst offenders. If, at zero flow, there is a slight imbalance in the leads, the meter attempts to report a slight backwards or negative flow. The flow range might not be configured for a slight negative, so the bad measurement condition and alarm occur. Such points should be configured to handle the zero case. A low-flow cutoff can be configured to clamp the reading at zero, so that a small negative flow number, which could also upset downstream calculations, is never produced.
Most DCSs have the ability to clamp an analog value at the ends of the range rather than go into a bad measurement state. This ability should be fully understood and used properly. (This means more reading of the documentation!)
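The effect of clamping and a low-flow cutoff can be illustrated with a small sketch. This is not how any particular DCS implements it. The function, its parameters, the 0 to 500 range, and the cutoff value are all assumptions for the example, and a real configuration would still need a way to flag a genuinely failed transmitter.

```python
def condition_flow(raw, high=500.0, clamp=True, cutoff=1.0):
    """Return (value, status) for a raw flow reading on a 0..high range.

    clamp=False: any reading outside the range is a bad measurement.
    clamp=True: readings below the low-flow cutoff are reported as exactly
    zero, and readings above the range are held at the upper limit.
    """
    if not clamp:
        if raw < 0.0 or raw > high:
            return None, "BAD"
        return raw, "OK"
    if raw < cutoff:
        return 0.0, "OK"
    return min(raw, high), "OK"


# A slight lead imbalance at zero flow reads as a small negative number.
print(condition_flow(-0.3, clamp=False))  # unclamped: bad measurement and alarm
print(condition_flow(-0.3))               # clamped: reported as zero flow
print(condition_flow(250.0))              # normal reading is passed through
```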
Ongoing Work Process
A work process must be in place to identify and resolve new nuisance alarms. The process will change or be modified, sensors will age or develop problems, and new nuisance alarms will appear. Ongoing alarm analyses can spot and report these, but it must be someone’s job to take action and correct the situation. We have seen that once nuisance alarms are initially resolved, the operators will notice that, realize it can be done, and not be very tolerant of new nuisance alarms! This is a good thing.
For much more detail, we recommend this free white paper: Making a Big Dent In Nuisance Alarms
And, of course, The Alarm Management Handbook, Second Edition
Review other Taming the Wild Alarm System topics in this blog series:
- How Did We Get In This Mess?
- The Most Important Alarm Improvement Technique in Existence
- SHUT UP! Fixing Chattering and Fleeting Alarms
- Just How Bad is Your Alarm System?
- Horrible Things We Find During Alarm Rationalization
- Why did they have to call it “Philosophy?”
- Beyond Alarm Management – Doing More with a Powerful Tool