三个贯穿本章的三个abstraction：
- process
- link
- failure-detector

2.1 Distributed Computation

process an message
- the set of process is static => process set 是不会变的，即使process crash, then recovery;
- messages are uniquely identified => 可以通过sequence number 或者timestamp来实现，例如<processId, timestamp>(timestamp是单调的)
Automata and Step
每个process的执行进行一个自动机，而且是相同的；（这部分其实不需要过分去解释）
Safety and Liveness
- Safety: 保证分布式计算永远不会出错（出错的概念本来就比较抽象）；或者说是保证永远按照预期的方式执行；
- Liveness: 保证某件事情（naive）最终会发生；比如message最终被deliver了；

分布式系统需要同时满足Safety and Liveness;

2.2 Abstracting Processes

Process Failures

occur whenever: 在任何时候都可能发生；
process is unit of failure: 一旦failure发生，就认为整个process就failure（一个process上执行的分布式算法一般由多个component组成）；

Crash

在该上下文中，一旦process crash，就不再recovery，也称为crash fault或者crash stop；有一点需要注意，in practice，并不意味者process禁止recovery，或者说process recovery会破坏该上线文下设计的算法。只是说不需要process recovery，即使recovery了，也不会被算法纳入process set中。如果说实现上，那就是一旦P认为Q已经crash了，那么P在未来不会与Q有任何交互，即使Q已经recovery了。

Omissions（可以关注一下，这种情况发生了该怎么处理）

not send a message which is supported to send.:造成的原因可能是内存溢出等，注意，不是message lost，因为lost可以retransmission，而是根本就没有send。（这个情况这本书今后不会再讨论）；

Crash with Recoveries

该上下文中，recovery的process仍然被认为是correct，这里并不限制recovery次数，也就是说process在未来的某一段时间仍然在正确得（correctly）处理事情。但是failure-recovery肯定会带来记忆丢失的问题，一般的解决方案就是在deliver时将message存储到持久性存储设备中。

Eavesdropping Faults

Untrusted和Byzantine暂时不去了解

Arbitrary Faults

2.4 Abstracting Communication

这节主要介绍了通信模型 => link

Link Failures

the probability for a message to reach its destination is nonzero.: 我们认为通信不是完全断块的，因为通过retransmission，只要网络在某一段时间是可用的，那么message最终会reach its destination。本节有如下几种模型：

for crash-stop: Fair-Loss，Stubborn，Perfect
for crash-recovery: Logged Perfect
for Byzantine: Authenticated Perfect （不介绍）

Fair-Loss Links

模型的具体定义这里不写了，后面需要的话可以翻书。

Properties:

Fair-loss: Fair-loss: If a correct process p infinitely often sends a message m to a correct process q, then q delivers m an infinite number of times.
Finite duplication: If a correct process p sends a message m a finite number
of times to process q, then m cannot be delivered an infinite number of times by q.
No creation: If some process q delivers a message m with sender p, then m
was previously sent to q by process p.
解释起来就是link不是完全不可用的，在某段时间是可用的，因为通过retransmission的方式，message一定可以送达目的地；message不会被link 无限retransmission；link不会凭空create message;

Stubborn Links

是对Fair-Loss的再一次抽象，隐藏底层的retransmission;

Properties:

Stubborn delivery: If a correct process p sends a message m once to a correct process q, then q delivers m an infinite number of times.
No creation: If some process q delivers a message m with sender p, then m was previously sent to q by process p.

Perfect Links

因为Fair-loss可以通过无限的retransmission来达到目的，但是无限总是太浪费了。通过检测message是否已经被deliver了，然后抑制message retransmission，就相对比较高效了。将上面两个概念（技术）附加到Fair-loss上就转变成了Stubborn。

Properties:

Reliable delivery: If a correct process p sends a message m to a correct
process q, then q eventually delivers m.
No duplication: No message is delivered by a process more than once.
No creation: If some process q delivers a message m with sender p, then m
was previously sent to q by process p.

Logged Perfect Links

在crash-recovery中，会存在记忆丢失的问题，所以需要持久化存储。文中介绍的方式是，在process进行deliver时，先将message写入storage，然后向上层component发送已经deliver的信息；然后上层在每次receive message后需要先根据storage校验该message是否已经被deliver了，如果没有deliver，再交给下层进行deliver;
其实上述的过程在Perfect Links中也应该是如此，只不过现在message在持久化存储中，而不是挥发性存储中（内存）；

Properties:

Reliable delivery: If a process that never crashes sends a message m to a
correct process q, then q eventually log-delivers m.
No duplication: No message is log-delivered by a process more than once.
No creation: If some process q log-delivers a message mwith sender p, then
m was previously sent to q by process p.

Authenticated Perfect Links

不介绍

2.5 Timing Assumptions

这一节主要想讲的是，Asynchronous System对process和link完全没有时间假设（assumption）。准确来说是异步系统对物理时间不需要假设。这里所谓的假设其实是对message transmission的时间和message procession的时间的假设。也是就是需要对上述行为话费的时间做upper的假设，这也就意味这，如果这个cost time的无限的话，那么系统就可能崩溃了。（对异步系统的概念其实我也不太了解）。举几个例子：

在两阶段提交中，如果由于link或者process的问题，在第一阶段协调者一直无法收到其中一个参与者的响应，那么协调者会无限次地发送预提交请求，而不是通过超时机制停止此次事物；
在共识算法中，如果leader在心跳过程中没有收到某个follower的响应，那么leader会无限次地发送心跳，而不是通过time out认为follower failure了。

Synchronous System对message transmission的时间和message procession的时间是有upper假设的。

Asynchronous System

对物理时间没有假设

Synchronous System

对message transmission的时间和message procession的时间是有upper假设的。

Partial Synchrony

在Synchronous System中，如果这个upper假设是合理的，那么皆大欢喜。但是在实际情况下，可能由于网络拥堵等因素，导致真是的cost大于假设的upper, 那么就会导致错误的判断。这时我们就需要一种弥补的手段，retransmission是可选的方式。所以Partial Synchrony希望通过retransmission来处理那种cost大于upper而导致处于问题的后续弥补手段；
还有设置的upper需要能够真实反映cost的情况，如果upper设置太小了，那么上述的错误判断会增加；如果upper设置太大了，虽然降低了错误判断情况，但是系统性能降低了；

2.6 Abstracting Time

Failure Detection

异步系统对时间没有假设，而同步和半同步系统对时间有upper假设。但是如果直接将对时间的假设添加到process和link上可能会造成复杂性，所以通过抽象failure-detection来达到上述目的，同时降低复杂性。
failure-detection用来检测process是否crash或correct，这个检测不是必须地正确的，也就是说检测允许错误（实际没有crash，但是检测出来crash了）；
可以通过向远程process周期性发送心跳的方式来进行检测。这里可能link的异常也会影响检测错误，但是在Fair-loss模型中，link不是完全不可用的，所以错误检测可以得到弥补。

Perfect Failure Detection（PFD）

Perfect Failure Detection是正对Synchronous System的，它强调PFD最终会检测到所有的crash process；而且不会出现detect fail（不会检测错误）。而且一旦process被检测出来crash，那么该process将被永久标示为crash。
个人觉得，PFD是一个理想模型，是建立在对同步系统的upper cost假设之上。而在现实环境中，该upper cost假设并不总是那么可靠，所以实际上会导致检测错误（检测方式一般都是通过timeout机制）。
但是也要注意，这并不是说PFD无法实现。就像crash-stop一样，并不否认recovery的存在，只是recovery的process将不再认为是系统的一部分。PFD也一样，即使存在检测错误，那么将错就错好了，即使该process实际是correct，或者crash后又recovery了，那么也不会再是系统的一部分。这样做主要可以简化算法的实现。

Properties:

Strong completeness: Eventually, every process that crashes is permanently
detected by every correct process.
Strong accuracy: If a process p is detected by any process, then p has
crashed.

Leader Election

Eventually Perfect Failure Detection（EPFD）

在Perfect Failure Detection中，由于现实中upper cost的假设不绝对正确，导致failure detect 失败。PDF采用将错就错的方式来简化算法，但是这导致了correct process被误杀，或者对后来recovery的process判了无期徒刑。EPFD则认为自己的每次failure detect都不一定是正确的（这种想法是正确的，因为网络系统是异步的，只能通过response知道remote process在过去的某个时间点是活跃的，而不能判断它是否不活跃），只能称为failure suspicion。当未来某一刻再次检测到remote process活跃了，那么该为它洗去嫌疑人身份。这个模型就现实很多了。

Properties:

Strong completeness: Eventually, every process that crashes is permanently
suspected by every correct process.
Eventual strong accuracy: Eventually, no correct process is suspected by
any correct process.

Eventually Leader Election

Byzantine Leader Election

2.7 Distribution-System Model（DSM）

DSM是process abstraction、link abstraction和time abstraction（failure detect）的组合产物。

Fail-Stop

cash-stop
perfect-link
perfect failure detector

Fail-Noisy

crash-stop
perfect-link
eventually perfect failure detector

Fail-silent

crash-stop
perfect-link
no failure-detector
同步系统和半同步系统都可以构建出failure-detector。甚至可以这么认为同步系统是perfect failure detector，半同步系统是eventually perfect failure detector，而no failure-detector 就属于异步系统了（异步系统确实无法检测远程process是否failure）。

Fail-recovery

crash-recovery
stubborn-link：是为了让recovery的process能够deliver crash期间错过的message?
eventually perfect failure detector：因为recovery后需要继续参与到算法中

Fail-arbitrary

跳过

Fail-Randomized

跳过

Basic Abstraction——Introduction to reliable and secure distributed programming

2.1 Distributed Computation

2.2 Abstracting Processes

Process Failures

Crash

Omissions（可以关注一下，这种情况发生了该怎么处理）

Crash with Recoveries

Eavesdropping Faults

Arbitrary Faults

2.4 Abstracting Communication

Link Failures

Fair-Loss Links

Properties:

Stubborn Links

Properties:

Perfect Links

Properties:

Logged Perfect Links

Properties:

Authenticated Perfect Links

2.5 Timing Assumptions

Asynchronous System

Synchronous System

Partial Synchrony

2.6 Abstracting Time

Failure Detection

Perfect Failure Detection（PFD）

Properties:

Leader Election

Eventually Perfect Failure Detection（EPFD）

Properties:

Eventually Leader Election

Byzantine Leader Election

2.7 Distribution-System Model（DSM）

Fail-Stop

Fail-Noisy

Fail-silent

Fail-recovery

Fail-arbitrary

Fail-Randomized

推荐阅读更多精彩内容