Recently, I was in a meeting dealing with fieldbuses and their application to Safety Instrumented Systems (SIS). It was a good discussion overall and I learned a few things, which is always good. But during the exchange, we got sidetracked for a while on a reliability / availability issue, triggered by questions around redundancy.
To start with, I guess old habits die hard: neither reliability nor availability is actually “defined” as a key term in subsection 3 (Abbreviations and Definitions) of the ANSI/ISA-84.00.01-2004 Part 1 (IEC 61511-1 Mod) safety standard, yet both words keep appearing in safety conversations. For those of you who may be purists, reliability seems to be used in many safety discussions as a substitute for Probability of Failure on Demand (PFD) or possibly for Safety Availability (1-PFD). It is not the “reliability” associated with Mean Time Between Failure (MTBF), which is actually related to hardware “availability.”
But rather than maintaining this distinction, where reliability refers to the probability that the safety action will occur when required and availability refers to the fraction of uptime for the process, I often find that people use the two words interchangeably, crossing definitions in both directions.
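To make the distinction concrete, here is a minimal sketch of the two quantities side by side. It is illustrative only: the function names and the input numbers are made up for this example, MTTR (mean time to repair) is a term I am introducing that the discussion above does not use, and the uptime formula assumes the common convention that MTBF means the mean operating time between failures.

```python
def safety_availability(pfd: float) -> float:
    """Chance that the safety action occurs when demanded: 1 - PFD."""
    if not 0.0 <= pfd <= 1.0:
        raise ValueError("PFD must be between 0 and 1")
    return 1.0 - pfd


def process_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the process is up: MTBF / (MTBF + MTTR).

    Assumes MTBF is the mean operating time between failures and
    MTTR is the mean time to repair (an assumption of this sketch).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical inputs, chosen only to show that the two numbers are
# computed from different data and answer different questions.
example_pfd = 0.001          # probability of failure on demand
example_mtbf_hours = 50_000  # mean operating hours between failures
example_mttr_hours = 8       # mean hours to restore after a failure

print(f"Safety availability:  {safety_availability(example_pfd):.5f}")
print(f"Process availability: "
      f"{process_availability(example_mtbf_hours, example_mttr_hours):.5f}")
```

The first number depends only on how likely the safety function is to fail when it is called on. The second depends on how often the hardware stops and how long it takes to bring it back. Improving one does not automatically improve the other.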
One reason for the confusion is a lack of familiarity with the safety standards. This is not something I want to address here, but I certainly encourage anyone who touches safety automation in any way to take the time to get some basic education in this area (the EC50 ISA course or other resources).
But as today’s title suggests, redundancy also plays a key part in creating the confusion. In most everyday uses of the words reliability and availability, they mean that things keep working or keep running, and it is easy to equate having redundancy with achieving those conditions. In recent years, I have met many colleagues who are of the opinion that the only way to achieve SIL 3 reliability is through redundancy. Undoubtedly this comes from a history of marketing safety automation system architectures, where the key discussion has been around Dual Modular Redundancy (DMR), Triple Modular Redundancy (TMR), and Quadruple Modular Redundancy (QMR) and the fault tolerance of each of these architectures. It has been interesting to watch how fault tolerance, or availability, has become more important in the messaging than safety, or reliability. I know that both are important, but when we are talking about safety, I want to suggest that we keep our focus on safety and how it is achieved in order to make safe products and safe solutions. An interesting whitepaper on how the architecture of a system does not matter can be found on Control Global’s website.
Check back tomorrow for the rest of the discussion and leave comments with your thoughts and views.