(原创翻译哟)
Redundancy and Backup Model - Engineering
冗余备份模型--工程学
In engineering, redundancy is the duplication of critical components or functions of a system with the intention of increasing the reliability of the system, usually in the form of a backup or fail-safe.
工程学中,冗余是指复制关键的部件或者系统的主要功能,意图提高系统的可靠性,通常使用备份或者自动防故障装置。
In many safety-critical systems, such as fly-by-wire and hydraulic systems in aircraft, some parts of the control system may be triplicated, which is formally termed triple modular redundancy (TMR). An error in one component may then be out-voted by the other two. In a triply redundant system, the system has three subcomponents, all three of which must fail before the system fails. Since each one rarely fails, and the subcomponents are expected to fail independently, the probability of all three failing is calculated to be extremely small; it is often outweighed by other risk factors, such as human error. Redundancy may also be known by the terms "majority voting systems" or "voting logic".
在许多安全导向的系统上,比如飞机上的全数字电传操作和液压系统,控制系统上的某些部件也许会被一式三份,专业术语:三重模块冗余(TMR)。一个部件出错将会被另外两个备份部件所取代。在三重冗余系统中,系统拥有三个替补部件,在三个替补部件全部发生故障前,系统能够一直保持正常运转。因为每一个部件发生故障的概率都很小,而且部件之间互不影响,所以三个部件全部出现故障的几率是非常小的。常常低于其他风险因素,例如...人为错误。冗余也常被称为“多数表决系统”或“表决逻辑系统”。
Forms of redundancy
冗余的构成
There are four major forms of redundancy:
冗余有四种主要形式,分别是:
Hardware redundancy, such as DMR and TMR
·硬件冗余,如:DMR和TMR
Information redundancy, such as Error detection and correction methods
·信息冗余,如:错误检查和矫正法
Time redundancy, including transient fault detection methods such as Alternate Logic
·时间冗余,包括临时故障检查法 ,如候补逻辑
Software redundancy, such as N-version programming
·软件冗余,如:N版本编程
A modified form of software redundancy, applied to hardware, may be:
一个改良后的软件冗余,可能应用于硬件:
Distinct functional redundancy, such as both mechanical and hydraulic braking in a car. Applied to software, this means code that is written independently and is distinctly different, but that produces the same results for the same inputs.
不同功能的冗余,如:机械和液压都用于汽车制动。就软件方面的应用来说,代码独立编写并明显不同,但是却能产生的相同的结果和输入。
DMR: A machine which is dual modular redundant has duplicated elements which work in parallel to provide one form of redundancy. A typical example is a complex computer system which has duplicated nodes, so that should one node fail, another is ready to carry on its work. For instance, the Submarine Command System (SMCS) used on submarines of the Royal Navy employs duplicated central computing nodes, interconnected by a duplicated LAN.
DMR:双重模块冗余(Dual Modular Redundant )机器,通过复制元素、并行运作,来提供一种冗余。一个典型的例子是 复杂的电脑系统,它会复制很多节点,当一个节点发生故障,另一个节点就准备好接替它的工作。再举个例子:潜艇指挥系统 (SMCS :the Submarine Command System ),被用在皇家海军的潜艇上,采用复制中央计算节点,通过复制的局域网来互相连接。
A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed dual modular redundant (DMR). The voting circuit can then only detect a mismatch, and recovery relies on other methods. Examples include the 1ESS switch. A machine with three replications of each element is termed triple modular redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
容错同步机器能复制元素,并行运作。任何时候,每个复制元素的状态都是一样的。对每个复制的输入都是一样的,并且输出也跟预期的一样。使用表决电路来对复制元素的输出进行比较。每个元素有两个复制品的机器被称为 双重模块冗余(DMR)。 表决电路只能侦测不匹配的状况,而依靠其他方法来恢复。例子包括 1ESS(TheNumber One Electronic Switching System 第一电子交换系统)。每个元素有三个复制品的机器被称为三重模块冗余(TMR)。当表决电路观察到表决数为二比一时,就会决定那些复制品是故障的。在这种情况下,表决电路会输出正确的结果,并且抛弃错误的版本。在此之后,错误复制品的内部状态被假设为跟其他两个复制品不一样,同时表决电路会切换至DMR模式。该模型可用于任何存在大量复制品的情况。
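The difference between DMR (detect only) and TMR (detect and mask) can be sketched in a few lines of Python. This is a minimal illustration and not a model of any real voting circuit; the function names `dmr_compare` and `tmr_vote` are hypothetical, and replica outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def dmr_compare(a, b):
    """DMR: two replicas. A mismatch can be detected but not corrected."""
    if a == b:
        return a, True
    # The comparator cannot tell which replica is wrong;
    # recovery has to rely on some other method.
    return None, False


def tmr_vote(a, b, c):
    """TMR: three replicas. Return (majority value, index of dissenter or None)."""
    outputs = [a, b, c]
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None          # unanimous, no fault observed
    if count == 2:
        # Two-to-one vote: the odd one out is assumed to be in error.
        faulty = next(i for i, out in enumerate(outputs) if out != value)
        return value, faulty
    raise ValueError("no majority: all three replicas disagree")
```

After `tmr_vote` reports a dissenting replica, a caller could exclude that replica and continue with `dmr_compare` on the remaining two, which mirrors the switch to DMR mode described above.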
TMR: In computing, triple modular redundancy (TMR), sometimes called triple-mode redundancy,[1] is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and the results are processed by a voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault. If the voter fails, then the complete system will fail. However, in a good TMR system the voter is much more reliable than the other TMR components. Alternatively, if there is another stage of TMR logic following the current one (for example, in systems such as the Saturn Launch Vehicle Digital Computer), then three voters are used, one for each copy of the next stage of logic.
TMR:在电脑运算中,三重模块冗余有时被称为三重模式冗余
TMR是一种 多模块容错形式的冗余,三个系统的执行过程和执行结果是是通过表决系统处理来产出的一个单一的输入输出。三个系统中的任何一个发生故障,其他两个系统都能纠正错误并且修复这个错误。如果表决电路发生故障 ,那么整个系统都将会瘫痪。然而,在一个优秀的TMR系统中,表决电路是通常是系统中最可靠的部件。或者,若在当前TMR逻辑系统中存在另外一个阶段(例如,土星运载火箭上的数字计算机系统),那么把三个表决电路中的每一个都会被备份,为逻辑系统的下一阶段做准备。
The TMR concept can be applied to many forms of redundancy, such as software redundancy in the form of N-version programming.
TMR概念可以应用在许多冗余形式中,例如N版本编程的软件冗余形式。
Some ECC memory uses triple modular redundancy hardware (rather than the more common Hamming code), because triple modular redundancy hardware is faster than Hamming error correction hardware.
一些ECC内存(Error-correcting code memory:ECC memory寄存式内存,能够实现错误检查和纠正技术的内存条)使用三重模块冗余硬件(比常见的汉明码(Hamming code是一个错误校验码码集)要好),因为三重模块冗余硬件要比汉明码的错误纠正技术硬件更加迅速。
[2]
Space satellite systems often use TMR,[3][4][5] although satellite RAM usually uses Hamming error correction.[6]
航天卫星系统经常使用TMR,尽管卫星的RAM经常使用汉明码错误纠正技术。
To utilize triple modular redundancy, a ship must have at least three chronometers. At one time, the cost of three sufficiently accurate chronometers was more than the cost of a smaller merchant vessel.
使用三重模块冗余,船上必须最少有三个精密计时器。在以前,三个足备的、准确的精密计时器的成本要高于一艘小型商业轮船。
[7]
Some vessels carried more than three chronometers; for example, HMS Beagle carried 22 chronometers.
一些轮船 携带超过三个精密计时器---例如,皇家海军猎兔犬号携带了22个精密计时器.
[8]
Some communication systems use N-modular redundancy as a simple form of forward error correction. For example, 5-modular redundancy communication systems (such as FlexRay) use the majority of 5 samples: if any 2 of the 5 results are erroneous, the other 3 results can correct and mask the fault.
一些通讯系统使用多模块冗余 如简单形式的向前纠错技术。例如,五重模块冗余通讯系统(如:FlexRay车载网络标准)使用大多数的5个样本---如果5个中的任意2个结果是错误的,那么其他3个结果就会纠正并且修复这个错误。
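Majority voting over 5 samples behaves like a repetition code. The sketch below is an illustration only and does not reproduce how FlexRay or any other protocol is implemented; `encode`, `decode` and `majority_bit` are hypothetical helper names, and each bit is simply assumed to be sent `n` times.

```python
def majority_bit(samples):
    """Return the majority value of an odd number of 0/1 samples."""
    if len(samples) % 2 == 0:
        raise ValueError("need an odd number of samples")
    return 1 if sum(samples) > len(samples) // 2 else 0


def encode(bits, n=5):
    """Repeat every bit n times (n-modular redundancy in time)."""
    return [b for b in bits for _ in range(n)]


def decode(received, n=5):
    """Take a majority vote over each group of n received samples."""
    return [majority_bit(received[i:i + n]) for i in range(0, len(received), n)]


sent = encode([1, 0, 1])
received = list(sent)
for i in (0, 1, 5, 9, 12):      # corrupt at most 2 samples per group of 5
    received[i] ^= 1
recovered = decode(received)     # the 3 correct samples per group win the vote
```

With up to 2 corrupted samples in each group of 5, `recovered` equals the original bits; a third error in the same group would flip the vote.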
N-version programming (NVP), also known as multiversion programming, is a method or process in software engineering where multiple functionally equivalent programs are independently generated from the same initial specifications.
N版本编程(NVP),也被称为多版本编程,是软件工程中的一种方法或过程:相同初始规格独立生成的多功能等价程序。
[1]
The concept of N-version programming was introduced in 1977 by Liming Chen and Algirdas Avizienis with the central conjecture that the "independence of programming efforts will greatly reduce the probability of identical software faults occurring in two or more versions of the program".
N版本编程的概念是由陈立明与Algirdas Avizienis 在1977年的中心推测中提出的,独立编程的成果可以巨大的降低发生在两个或更多版本的相同软件中的故障几率。
[1][2]
The aim of NVP is to improve the reliability of software operation by building in fault tolerance or redundancy.
NVP的目的在于通过建立容错或冗余机制来提高软件使用的可靠性
[1]
Function of redundancy
功能冗余
The two functions of redundancy are passive redundancy and active redundancy. Both functions prevent performance decline from exceeding specification limits without human intervention using extra capacity.
双功能冗余属于被动冗余和主动冗余。两个功能可以阻止因超出规格限制而导致的性能下降 同时不需人为介入使用额外的能力。
Passive redundancy uses excess capacity to reduce the impact of component failures. One common form of passive redundancy is the extra strength of cabling and struts used in bridges. This extra strength allows some structural components to fail without bridge collapse. The extra strength used in the design is called the margin of safety.
被动冗余使用额外的能力来减少组件故障所造成的影响。一种常见形式的被动冗余 是在桥梁上使用超高强度的钢桁和支柱。这种高强度能够允许一些部件的老化但不至于使桥垮塌。
Eyes and ears provide working examples of passive redundancy. Vision loss in one eye does not cause blindness but depth perception is impaired. Hearing loss in one ear does not cause deafness but directionality is impaired. Performance decline is commonly associated with passive redundancy when a limited number of failures occur.
一个被动冗余的实例是眼睛和鼻子。视觉系统失去一只眼睛不至完全失明,但却会深深损害知觉。听力系统失去一只耳朵不至耳聋,但肯定会受到损害。性能下降是常常是跟发生少数失效的被动冗余关联在一起的。
Active redundancy eliminates performance decline by monitoring the performance of individual devices, and this monitoring is used in voting logic. The voting logic is linked to switching that automatically reconfigures components. Error detection and correction and the Global Positioning System (GPS) are two examples of active redundancy.
主动冗余通过监控个别设备的性能来消除性能下降,并且这种监控题使用的是表决逻辑系统。表决逻辑系统是跟开关连接在一起的,能够实现自动装配部件。错误检测与纠正技术和全球定位系统是两个主动冗余的例子。
Electrical power distribution provides an example of active redundancy. Several power lines connect each generation facility with customers. Each power line includes monitors that detect overload. Each power line also includes circuit breakers. The combination of power lines provides excess capacity. Circuit breakers disconnect a power line when monitors detect an overload. Power is redistributed across the remaining lines.
一个主动冗余的例子是电力分配系统。不同电线连接每代消费者的设备。每条电线都包含检测仪以侦察电量是否超荷。每条电线还包含着电路断接器。组合电线可以提供一个额外的能力。断电器会在检测仪侦察到电量超荷时切断电源
Voting logic
表决逻辑系统
Voting logic uses performance monitoring to determine how to reconfigure individual components so that operation continues without violating specification limitations of the overall system. Voting logic often involves computers, but systems composed of items other than computers may be reconfigured using voting logic. Circuit breakers are an example of a form of non-computer voting logic.
表决逻辑系统使用性能检测器来决定怎样在没有表决规格限制综合系统的情况下重新装配个别部件,来让系统持续运行。表决逻辑系统常常涉及电脑,但是系统由其他项目组成,除了计算机也许会用表决逻辑系统来重新装配。一个无电脑表决逻辑系统就是断电器
Electrical power systems use power scheduling to reconfigure active redundancy. Computing systems adjust the production output of each generating facility when other generating facilities are suddenly lost. This prevents blackout conditions during major events such as an earthquake.
电力系统使用电力调度来重新配置的主动冗余。运算系统会在其他发电设备突然瘫痪的时候调节每个发电设备的发电量。以防止在重要时期出现断电的情况,如地震。
The simplest voting logic in computing systems involves two components: primary and alternate. They both run similar software, but the output from the alternate remains inactive during normal operation. The primary monitors itself and periodically sends an activity message to the alternate as long as everything is OK. All outputs from the primary stop, including the activity message, when the primary detects a fault.
运算系统中最简单的表决逻辑系统涉及两个部件:主要部件和替补部件。它们运行的程序是一样的,但是在正常操作期间替补部件的输出是保持无影响的状态。主要部件能自我监控,只要一切正常它就会定期给替补部件发送一个活动消息。一旦主要部件侦察到故障,那么一切来自它的输出包括活动消息都会被停止。
The alternate activates its output and takes over from the primary after a brief delay when the activity message ceases. Errors in voting logic can cause both to have all outputs active at the same time, can cause both to have all outputs inactive at the same time, or outputs can flutter on and off.
当主要部件停止发送活动信息时,替补部件将激活它的输出,并在短暂的延迟后接管主要部件的工作。表决系统的错误可以导致两者在同一时间让所有的输出活动生效与失效,或者输出不稳定。
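The primary/alternate scheme can be sketched as a heartbeat with a timeout. This is a simplified, single-process illustration with hypothetical class names (`Primary`, `Alternate`) and an explicit `now` argument standing in for a clock; a real system would also have to guard against the failure modes listed above, such as both units being active at once.

```python
class Alternate:
    """Standby unit: its output stays inactive while activity messages arrive."""

    def __init__(self, timeout):
        self.timeout = timeout        # the brief delay before taking over
        self.last_heartbeat = 0.0
        self.active = False

    def on_heartbeat(self, now):
        # Record the time of the latest activity message from the primary.
        self.last_heartbeat = now

    def tick(self, now):
        # Take over once the activity message has been silent for too long.
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active


class Primary:
    """Active unit: monitors itself and stops all output when it detects a fault."""

    def __init__(self):
        self.healthy = True

    def step(self, now, alternate):
        # The activity message is sent only while the self-check passes.
        if self.healthy:
            alternate.on_heartbeat(now)
        return self.healthy           # True means the primary's output is active
```

For example, with `timeout=2.0` and a primary that sends its last message at time 2 and fails afterwards, the alternate stays inactive at times 3 and 4 and activates at time 5.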
A more reliable form of voting logic involves an odd number of devices, three or more. All perform identical functions, and the outputs are compared by the voting logic. The voting logic establishes a majority when there is a disagreement, and the majority will act to deactivate the output from any other device(s) that disagree. A single fault will not interrupt normal operation. This technique is used with avionics systems, such as those responsible for operation of the space shuttle.
一个更加可靠的表决逻辑系统包含 3个奇数或者更多的设备。所有运作相同的功能以及输出都会通过表决逻辑系统来进行比较。当它们发生分歧的时候,表决系统会确定一个优胜者,优胜者会撤销与其不一致的设备的输出。单个故障不会导致正常操作被中断。这种技术被航空电子设备系统所采用,例如那些航天飞机的操作就由它来负责。
Calculating the probability of system failure
计算系统故障的几率
Each duplicate component added to the system decreases the probability of system failure according to the formula:
公式得出系统每增加一个复制部件都能降低系统故障的几率:
$$p = \prod_{i=1}^{n} p_i$$
where:
其中:
$n$ - number of components
部件的数量
$p_i$ - probability of component $i$ failing
部件故障的几率
$p$ - the probability of all components failing (system failure)
所有部件故障的几率(系统故障)
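As a quick numerical check, the product of the individual failure probabilities can be computed directly. This is a minimal sketch that assumes independent failures and a system that fails only when every component fails; the function name `system_failure_probability` is hypothetical.

```python
import math


def system_failure_probability(component_failure_probs):
    """Probability that all components fail, assuming independent failures."""
    probs = list(component_failure_probs)
    if not probs:
        raise ValueError("at least one component is required")
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    # Independence lets the joint probability factor into a product.
    return math.prod(probs)


single = system_failure_probability([0.01])          # one component
triple = system_failure_probability([0.01] * 3)      # three duplicates
```

With a per-component failure probability of 0.01, going from one component to three lowers the system failure probability from about 1 in 100 to about 1 in 1,000,000.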
This formula assumes independence of failure events. That means that the probability of a component B failing given that a component A has already failed is the same as that of B failing when A has not failed. There are situations where this is unreasonable, such as using two power supplies connected to the same socket in such a way that if one power supply failed, the other would too.
这个公式假设:故障事件之间是各自独立的。这意味着 部件B故障的概率,跟在部件A故障的几率、部件A已经故障的几率和部件B已经故障的几率都是一样的。那种情况是不合理的,例如使用两个电源供应器连接同一的插座,由此,如果其中一个供应器坏了,那么另外一个也会坏。
It also assumes that only one component is needed to keep the system running. If $m$ components are needed for the system to survive, out of $n$, the probability of failure is
$$p = \sum_{k=n-m+1}^{n} \binom{n}{k} q^{k} (1-q)^{n-k}$$
[citation needed], assuming all components have an equal probability of failure $q$.
它还假设只需要一个部件就能保持系统运转。如果系统只有通过这个部件才能活下来,离开了它,系统就会出故障。
This model is probably unrealistic in that it assumes that components are not replaced in time when they fail.
假设所有部件的故障率相同,在这个假设中,部件不会在它们发生故障的时候被取代,这种模型很可能是不切实际的。
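The case where several components must survive can be computed as a binomial tail. The sketch below assumes independent failures and the same failure probability `q` for every component, and it treats the system as failed as soon as fewer than `m` of the `n` components are still working. The function name and the parameter names `n`, `m` and `q` are chosen here for illustration.

```python
from math import comb


def m_of_n_failure_probability(n, m, q):
    """Probability that fewer than m of n components survive.

    The system fails when at least n - m + 1 components fail, so the
    result is the upper tail of a binomial distribution with parameter q.
    """
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability")
    return sum(
        comb(n, k) * q**k * (1 - q) ** (n - k)
        for k in range(n - m + 1, n + 1)
    )


one_needed = m_of_n_failure_probability(3, 1, 0.1)   # reduces to q**3
two_needed = m_of_n_failure_probability(3, 2, 0.1)   # 2-out-of-3 system
```

With `m = 1` the sum has a single term and reduces to `q**n`, which matches the simple product formula when all components are equally reliable.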