What Is a Common Mode Failure and Why Is It Problematic?
Common mode failures occur when multiple components or systems fail in the same way at the same time. As a practical example, consider a multi-rotor electric vertical take-off and landing vehicle (eVTOL). If we use identical software and hardware for all motor controllers (electronic speed controllers, or ESCs), any latent defect will be present in the control system of every motor.
Imagine that during vertical mode (take-off, landing, and hovering), a latent software defect abruptly causes the electric motors to stop. Perhaps the common software misinterpreted an input or sensor signal, stopping not one motor but all of them. The result is an immediate loss of thrust, and therefore of altitude, and, if not recovered, a crash. Because the same input is distributed to every motor controller, and each controller runs the same software on the same hardware, they all behave identically.
What Are These Types of Latent Defects?
Latent defects are not necessarily code bugs. While software bugs can be a culprit, a latent defect may also be a software math error, such as a divide-by-zero, a floating-point NaN, or another untrapped fault. Sometimes these originate from inputs or signals that should not normally be present (negative airspeed, negative altitude, pitch greater than 90 degrees, etc.). Often latent defects are simply conditions that were missed during design. Hardware components may also be the culprit. Any complex electronic system may contain some latent defect or operating condition that causes unintended behavior. These defects only show themselves when just the right circumstances are present.
Common mode faults are especially troublesome because all systems that share the same (common) software or hardware may fail simultaneously. For today’s eVTOLs, this could be perilous if, for instance, all electric motors suddenly stopped working or all flight surfaces suddenly changed position and locked.
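To make the failure pattern concrete, here is a minimal Python sketch of a latent divide-by-zero shared by every motor controller. The function names (throttle_command, run_controllers) and the air-density scaling are hypothetical, invented for illustration only; they are not taken from any real ESC firmware.

```python
import math


def throttle_command(thrust_demand_n: float, air_density: float) -> float:
    """Hypothetical control law that scales thrust demand by air density.

    Latent defect (deliberate, for illustration): nothing guards against a
    zero or NaN density arriving from a faulty sensor.
    """
    return thrust_demand_n / air_density


def run_controllers(n_motors: int, thrust_demand_n: float, air_density: float) -> list:
    """Feed the same input to n identical controllers; report which stay healthy."""
    healthy = []
    for _ in range(n_motors):
        try:
            cmd = throttle_command(thrust_demand_n, air_density)
            healthy.append(math.isfinite(cmd))
        except ZeroDivisionError:
            # The controller faults and its motor stops.
            healthy.append(False)
    return healthy


print(run_controllers(4, 100.0, 1.225))  # [True, True, True, True]
print(run_controllers(4, 100.0, 0.0))    # [False, False, False, False]
```

Because every controller runs the same code on the same input, the abnormal reading takes out all four at once rather than just one.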
How Do We Mitigate Common Mode Failures?
We mitigate common mode failures through several means. The first is rigorous testing of every individual component of the system. We test regular operation and also out-of-bounds conditions such as over-temperature and over-voltage, along with abnormal software inputs and conditions such as divide-by-zero and null pointer dereferences. The minimum testing for each component type is defined in the Test Case Selection Criteria Standard, a shared document produced during development and verification and validation (V&V).
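As a rough illustration of out-of-bounds testing, the Python sketch below sweeps abnormal inputs through a guarded throttle calculation and checks that the output always stays in range. Everything named here (guarded_throttle, the constants, the hold-last-good fallback) is an assumption made for the example, not a requirement from any standard.

```python
import math

NOMINAL_DENSITY = 1.225  # kg/m^3, sea-level standard air density
MAX_THRUST_N = 500.0     # assumed per-motor thrust limit (hypothetical)


def guarded_throttle(thrust_demand_n, air_density, last_good):
    """Return (throttle in [0, 1], fault_flag).

    Implausible inputs are rejected up front: the last good command is held
    and a fault is flagged instead of dividing by a bad value.
    """
    inputs_ok = (
        math.isfinite(thrust_demand_n)
        and math.isfinite(air_density)
        and air_density > 0.0
    )
    if not inputs_ok:
        return last_good, True
    raw = thrust_demand_n * NOMINAL_DENSITY / (air_density * MAX_THRUST_N)
    return min(max(raw, 0.0), 1.0), False


def sweep_abnormal_inputs():
    """Exercise normal and abnormal inputs; every output must stay in range."""
    demands = [100.0, -50.0, 1e9, float("nan"), float("inf")]
    densities = [1.225, 0.0, -1.0, float("nan"), float("inf")]
    for demand in demands:
        for density in densities:
            cmd, _fault = guarded_throttle(demand, density, last_good=0.4)
            assert 0.0 <= cmd <= 1.0, (demand, density, cmd)
    return len(demands) * len(densities)


print(sweep_abnormal_inputs(), "input combinations checked")
```

A real test campaign would also cover hardware conditions such as over-temperature and over-voltage, which a software-only sketch like this cannot exercise.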
Second, we mitigate through isolated, dissimilarly redundant hardware and software. Level A avionics, the highest level of safety criticality, which includes the systems responsible for primary flight controls and propulsion, must take into account isolation, redundancy, and dissimilarity.
Dissimilarity is the use of functionally equivalent yet different software and hardware. Engineers will choose different architectures and vendors for the hardware design, such as Motorola, TI, Xilinx, or Infineon-based solutions. With different hardware, the software is forced to differ as well, which reduces commonality. Viewed at the PCB card level, all of these solutions perform the same function. Isolating the functionality at the PCB card level makes it less likely that a common mode fault will propagate and cause a complete system failure.
Putting it All Together
Returning to our original example, the motor controllers should have isolated, redundant systems that “vote” on the commands sent to each electric motor. These isolated, redundant systems should have not only dissimilar hardware but also dissimilar software. Dissimilar, isolated systems can be combined in several ways so that a common mode failure is contained and a simultaneous loss of all propulsion becomes extremely unlikely.
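One way to picture the voting arrangement is the Python sketch below. It assumes three channels and a mid-value-select voter, which is one common choice rather than something this article prescribes. The three channel functions are hypothetical stand-ins for independently developed implementations, and channel_a deliberately carries a latent divide-by-zero.

```python
import math

RHO0 = 1.225          # kg/m^3, nominal air density
MAX_THRUST_N = 500.0  # assumed per-motor thrust limit (hypothetical)


def clamp01(x):
    return min(max(x, 0.0), 1.0)


def channel_a(thrust_n, rho):
    # Floating-point law with a latent defect: no guard on rho.
    return clamp01(thrust_n * RHO0 / (rho * MAX_THRUST_N))


def channel_b(thrust_n, rho):
    # Floating-point law that substitutes nominal density for bad readings.
    if not rho > 0.0:
        rho = RHO0
    return clamp01(thrust_n * RHO0 / (rho * MAX_THRUST_N))


def channel_c(thrust_n, rho):
    # Dissimilar integer (fixed-point style) law working in milli-units.
    thrust_mn = round(thrust_n * 1000)
    rho_milli = round(rho * 1000)
    if rho_milli <= 0:
        rho_milli = 1225
    permille = thrust_mn * 1225 // (rho_milli * 500)
    return min(max(permille, 0), 1000) / 1000.0


def run_channel(fn, thrust_n, rho):
    """Isolate a channel: a fault yields None instead of propagating."""
    try:
        return fn(thrust_n, rho)
    except (ArithmeticError, ValueError):
        return None


def vote(outputs, tolerance=0.02):
    """Mid-value select over healthy channels; two must agree if one is lost."""
    valid = sorted(x for x in outputs if x is not None and math.isfinite(x))
    if len(valid) >= 3:
        return valid[len(valid) // 2]
    if len(valid) == 2 and valid[1] - valid[0] <= tolerance:
        return (valid[0] + valid[1]) / 2.0
    raise RuntimeError("voter: insufficient agreeing channels")


def commanded_throttle(thrust_n, rho):
    channels = (channel_a, channel_b, channel_c)
    return vote([run_channel(ch, thrust_n, rho) for ch in channels])


print(commanded_throttle(250.0, 1.225))  # all three channels healthy
print(commanded_throttle(250.0, 0.0))    # channel_a faults; the other two still agree
```

With a zero density reading, channel_a faults, but the defect is not shared by the other two channels, so the voter still produces a usable command. If all three ran channel_a's code, the same input would defeat the voter entirely.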