Consistency Guarantees
The main thing is to understand the stronger consistency models; once we have seen them, we will be in a better position to decide which one best fits our needs.
But while there is some overlap, they are mostly independent concerns: transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.
Note that the consistency model here is not the same thing as an isolation level: isolation is mainly about race conditions between concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.
In other words, distributed consistency arises from reconciling state across different replicas.
Linearizability
This is a consistency model, not an isolation level.
The exact definition of linearizability is quite subtle, and we will explore it in the rest of this section. But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic.
The definition of linearizability: make clients believe there is only one copy of the data, even though we have several, and that all operations on it are atomic.
![[DDIA-Figure-9-2.png]]
Does "register" in distributed systems mean the object that is being operated on?
Judging from the explanation in the figure above, it seems to mean the value stored under a primary key.
In this example, the register has two types of operations:
- read(x) ⇒ v means the client requested to read the value of register x, and the database returned the value v.
- write(x, v) ⇒ r means the client requested to set the register x to value v, and the database returned response r (which could be ok or error).
Arrow notation is used for the definitions here; programming languages presumably borrowed this mathematical way of expressing a function.
![[DDIA-Figure-9-3.png]]
This figure shows that once some read has returned the new value, all subsequent reads must return the new value as well.
Figure 9-4 adds a new operation.
In Figure 9-4 we add a third type of operation besides read and write:
- cas(x, v_old, v_new) ⇒ r means the client requested an atomic compare-and-set operation (see “Compare-and-set” on page 245). If the current value of the register x equals v_old, it should be atomically set to v_new. If x ≠ v_old then the operation should leave the register unchanged and return an error. r is the database’s response (ok or error).
![[DDIA-Figure-9-4.png]]
- First client B sent a request to read x, then client D sent a request to set x to 0, and then client A sent a request to set x to 1. Nevertheless, the value returned to B’s read is 1 (the value written by A). This is okay: it means that the database first processed D’s write, then A’s write, and finally B’s read. Although this is not the order in which the requests were sent, it’s an acceptable order, because the three requests are concurrent. Perhaps B’s read request was slightly delayed in the network, so it only reached the database after the two writes.
- Client B’s read returned 1 before client A received its response from the database, saying that the write of the value 1 was successful. This is also okay: it doesn’t mean the value was read before it was written, it just means the ok response from the database to client A was slightly delayed in the network.
- This model doesn’t assume any transaction isolation: another client may change a value at any time. For example, C first reads 1 and then reads 2, because the value was changed by B between the two reads. An atomic compare-and-set (cas) operation can be used to check the value hasn’t been concurrently changed by another client: B and C’s cas requests succeed, but D’s cas request fails (by the time the database processes it, the value of x is no longer 0).
- The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with C’s cas write, which updates x from 2 to 4. In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A. Again, it’s the same situation as with Alice and Bob in Figure 9-1.
These paragraphs deserve repeated reading, and keep reminding yourself: this is a consistency model, not an isolation level, so what is being discussed here is whether the values the DB returns are consistent.
So is cas (compare-and-set) just an operation for ensuring consistency, rather than a newly introduced operation?
It is a new operation; see my explanation below.
In other words, before the next read you perform a cas to make sure nobody updated the value in between? From Figure 9-3, client B performs two reads, while in Figure 9-4 a cas is performed before the second read, and that is what ensures consistency?
Oh! The third bullet point answers my question: after any operation you can issue a cas, which checks whether your operation is consistent with the DB's state.
The book's exact words:
An atomic compare-and-set (cas) operation can be used to check the value hasn’t been concurrently changed by another client: B and C’s cas requests succeed, but D’s cas request fails (by the time the database processes it, the value of x is no longer 0).
D's cas failed because someone else changed the value (client B's cas).
No: cas is a new operation, not merely a check that your previous operation succeeded; after all, client B's final read fails to be linearizable because of client C's cas operation.
New concepts must come with clear definitions!
[[2021-12-27#一定要了解新的概念的定义]]
Linearizability itself is an abstract model, i.e. a model defined for consistency; it is completely different from an isolation level, so don't mix them up.
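To make the register semantics concrete, here is a minimal sketch in Python (my own illustration, not code from the book): a single-copy register whose read, write, and cas operations are made atomic with a lock. This single-copy behavior is precisely the illusion that a linearizable replicated system must preserve.

```python
import threading

class Register:
    """A single-copy register with atomic read, write, and compare-and-set.

    A real linearizable system must preserve exactly these semantics even
    when the data is replicated; here a lock trivially serializes everything.
    """

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):                 # read(x) => v
        with self._lock:
            return self._value

    def write(self, v):             # write(x, v) => ok
        with self._lock:
            self._value = v
            return "ok"

    def cas(self, v_old, v_new):    # cas(x, v_old, v_new) => ok | error
        with self._lock:
            if self._value != v_old:
                return "error"      # value was concurrently changed
            self._value = v_new
            return "ok"

x = Register(0)
assert x.cas(0, 1) == "ok"     # succeeds: x was still 0
assert x.cas(0, 2) == "error"  # like client D in Figure 9-4: x is no longer 0
```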
Serializability
Serializability is an isolation property of transactions, where every transaction may read and write multiple objects (rows, documents, records)—see “Single-Object and Multi-Object Operations” on page 228. It guarantees that transactions behave the same as if they had executed in some serial order (each transaction running to completion before the next transaction starts). It is okay for that serial order to be different from the order in which transactions were actually run [12].
Serializability is a transaction-level guarantee: with multiple transactions running concurrently, it guarantees that they behave the same as if they had executed one after another.
Linearizability, on the other hand, is a consistency guarantee (a recency guarantee). It is a guarantee on read and write operations on a register (an individual object). It doesn't group operations together into transactions, so it can still suffer from write skew; in that sense linearizability is a weaker guarantee compared to serializability.
The biggest difference is that linearizability applies to a single object, while serializability can span multiple objects.
The text later mentions that linearizability is usually needed where coordination happens, i.e. multiple nodes need linearizable storage before they can reach consensus? That is why ZooKeeper must use linearizable storage for its coordination operations.
Serializability is the stronger guarantee; the book also mentions that serializable implementations are usually linearizable.
A database may provide both serializability and linearizability, and this combination is known as strict serializability or strong one-copy serializability (strong-1SR) [4, 13]. Implementations of serializability based on two-phase locking (see “Two-Phase Locking (2PL)” on page 257) or actual serial execution (see “Actual Serial Execution” on page 252) are typically linearizable.
Linearizability
Linearizability is a recency guarantee on reads and writes of a register (an individual object). It doesn’t group operations together into transactions, so it does not prevent problems such as write skew (see “Write Skew and Phantoms” on page 246), unless you take additional measures such as materializing conflicts (see “Materializing conflicts” on page 251).
Giving my own definitions really pays off: I can look up my earlier understanding at any time and correct it if something is wrong... While reading this passage I looked up my own earlier definition of 2PL (two-phase locking).
[[DDIA Ch7 Transactions#Two phase locking]]
The author explains this well in Chapter 7, so I built my own explanation on top of his: 2PL is like an object lock in Java, where every read and write must wait for the previous operation to finish before taking its turn, unlike snapshot isolation, where writers never block readers and readers never block writers.
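A toy sketch of that behavior (my own simplification: exclusive locks only, no deadlock handling): in the growing phase a transaction acquires a lock on each object it touches, and only in the shrinking phase, at commit, does it release them all, so concurrent readers and writers of the same object block each other.

```python
import threading

class Txn2PL:
    """Simplified two-phase locking with exclusive locks only,
    like a Java object monitor: readers block writers and vice versa."""

    def __init__(self, db, locks):
        self.db, self.locks = db, locks
        self.held = []

    def _acquire(self, key):
        lock = self.locks.setdefault(key, threading.Lock())
        if lock not in self.held:
            lock.acquire()              # growing phase: blocks if held
            self.held.append(lock)

    def read(self, key):
        self._acquire(key)
        return self.db.get(key)

    def write(self, key, value):
        self._acquire(key)
        self.db[key] = value

    def commit(self):
        for lock in reversed(self.held):  # shrinking phase: release all
            lock.release()
        self.held.clear()

db, locks = {"x": 1}, {}
t = Txn2PL(db, locks)
t.write("x", t.read("x") + 1)   # the lock on x is held until commit
t.commit()
```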
Use cases for Linearizability
ZooKeeper needs the linearizability guarantee to ensure that its distributed locks and leader election are correct.
Two Communication Channels
The Alice and Bob example in the text goes wrong precisely because there is an additional communication channel between them: Alice's mouth and Bob's ears.
Likewise, if your system has an extra communication channel, you need linearizable storage, otherwise problems arise; the figure below gives an example.
![[DDIA-Figure-9-5.png]]
If the storage in this system is not linearizable and the message queue is faster than the file storage itself, then when the image resizer fetches the image, the file storage may not have finished storing it yet; the resizer gets the old data (or nothing at all), writes back a resized version of the old data, and leaves the file storage permanently inconsistent.
This problem arises because there are two different communication channels between the web server and the resizer: the file storage and the message queue. Without the recency guarantee of linearizability, race conditions between these two channels are possible. This situation is analogous to Figure 9-1, where there was also a race condition between two communication channels: the database replication and the real-life audio channel between Alice’s mouth and Bob’s ears.
Linearizability is not the only solution; read-your-own-writes would also work, it just makes things more complicated (linearizability is guaranteed by the DB, whereas read-your-own-writes seems to need special configuration, e.g. a unique transaction ID).
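The race can be simulated in a few lines of Python (my own sketch; the sleep stands in for the storage/replication lag): because the message-queue channel outruns the storage channel, the resizer reads stale or missing data.

```python
import queue, threading, time

storage = {}            # stand-in for the file storage
jobs = queue.Queue()    # stand-in for the message queue

def web_server():
    def slow_write():   # channel 1: storing the full-size photo is slow
        time.sleep(0.1)
        storage["photo"] = "new image"
    threading.Thread(target=slow_write).start()
    jobs.put("photo")   # channel 2: the resize job arrives first

def resizer():
    key = jobs.get()
    # Without a recency guarantee, this read can see old data or nothing:
    print("resizing:", storage.get(key, "<nothing here yet!>"))

web_server()
resizer()   # almost certainly prints "<nothing here yet!>"
```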
Implementing Linearizable Systems
First, determine whether each replication model is linearizable:
![[DDIA-不同replication是否Linearizable.png]]
CAP theorem
Thus, a better way of phrasing CAP would be either Consistent or Available when Partitioned
![[DDIA-CAP-Definition.png]]
The CAP theorem as formally defined [30] is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions, or nodes that are alive but disconnected from each other). It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems [9, 40].
The root cause of concurrency problems is the violation of causality
This passage from the book sums it up really well, and explains that ordering, linearizability, and consensus all come up because concurrency can violate causality:
There are several reasons why ordering keeps coming up, and one of the reasons is that it helps preserve causality. We have already seen several examples over the course of this book where causality has been important:
- In “Consistent Prefix Reads” on page 165 (Figure 5-5) we saw an example where the observer of a conversation saw first the answer to a question, and then the question being answered. This is confusing because it violates our intuition of cause and effect: if a question is answered, then clearly the question had to be there first, because the person giving the answer must have seen the question (assuming they are not psychic and cannot see into the future). We say that there is a causal dependency between the question and the answer.
- A similar pattern appeared in Figure 5-9, where we looked at the replication between three leaders and noticed that some writes could “overtake” others due to network delays. From the perspective of one of the replicas it would look as though there was an update to a row that did not exist. Causality here means that a row must first be created before it can be updated.
- In “Detecting Concurrent Writes” on page 184 we observed that if you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. This happened before relationship is another expression of causality: if A happened before B, that means B might have known about A, or built upon A, or depended on A. If A and B are concurrent, there is no causal link between them; in other words, we are sure that neither knew about the other.
- In the context of snapshot isolation for transactions (“Snapshot Isolation and Repeatable Read” on page 237), we said that a transaction reads from a consistent snapshot. But what does “consistent” mean in this context? It means consistent with causality: if the snapshot contains an answer, it must also contain the question being answered [48]. Observing the entire database at a single point in time makes it consistent with causality: the effects of all operations that happened causally before that point in time are visible, but no operations that happened causally afterward can be seen. Read skew (non-repeatable reads, as illustrated in Figure 7-6) means reading data in a state that violates causality.
- Our examples of write skew between transactions (see “Write Skew and Phantoms” on page 246) also demonstrated causal dependencies: in Figure 7-8, Alice was allowed to go off call because the transaction thought that Bob was still on call, and vice versa. In this case, the action of going off call is causally dependent on the observation of who is currently on call. Serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)” on page 261) detects write skew by tracking the causal dependencies between transactions.
- In the example of Alice and Bob watching football (Figure 9-1), the fact that Bob got a stale result from the server after hearing Alice exclaim the result is a causality violation: Alice’s exclamation is causally dependent on the announcement of the score, so Bob should also be able to see the score after hearing Alice. The same pattern appeared again in “Cross-channel timing dependencies” on page 331 in the guise of an image resizing service.
Concurrency would mean that the timeline branches and merges again
Just like that figure in CSAPP!!
causal order is not a total order
![[DDIA-Causal Order is not Total Order.png]]
This figure shows clearly that causal order is not a total order.
Concurrency would mean that the timeline branches and merges again—and in this case, operations on different branches are incomparable (i.e., concurrent). We saw this phenomenon in Chapter 5: for example, Figure 5-14 is not a straight-line total order, but rather a jumble of different operations going on concurrently. The arrows in the diagram indicate causal dependencies—the partial ordering of operations.
So what is the relationship between the causal order and linearizability? The answer is that linearizability implies causality
So linearizability is a subset of causal order (the smaller circle); in other words, as long as you are linearizable, you are guaranteed to be inside the causal-order set!
Causal consistency is achievable
It is still a research area, but we can implement causal consistency without paying the price of linearizability.
The good news is that a middle ground is possible. Linearizability is not the only way of preserving causality—there are other ways too. A system can be causally consistent without incurring the performance hit of making it linearizable (in particular, the CAP theorem does not apply). In fact, causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures [2, 42].
In many cases, systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently. Based on this observation, researchers are exploring new kinds of databases that preserve causality, with performance and availability characteristics that are similar to those of eventually consistent systems [49, 50, 51].
The techniques for determining which operation happened before which other operation are similar to what we discussed in “Detecting Concurrent Writes” on page 184. That section discussed causality in a leaderless datastore, where we need to detect concurrent writes to the same key in order to prevent lost updates. Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key. Version vectors can be generalized to do this [54].
Version vectors can solve the causal consistency problem.
![[DDIA-Version-Vector.png]]
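A minimal sketch of the version-vector comparison (my own code): event a happened before event b iff a's vector is less than or equal to b's in every component; if neither dominates the other, the two events are concurrent, which is exactly the branching in the figure above.

```python
def compare(a, b):
    """Compare two version vectors (dicts mapping replica -> counter).

    Returns 'before', 'after', 'equal', or 'concurrent'.
    """
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither dominates: the timeline branched

print(compare({"n1": 1}, {"n1": 2}))                     # before
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))   # concurrent
```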
Lamport timestamp
The main idea of the Lamport timestamp: every time a node receives a request or response, the maximum counter value seen so far is attached, so all nodes keep updating their own counters to the maximum seen.
every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request.
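A minimal Lamport clock in Python (my own sketch): every timestamp is a (counter, node ID) pair, so equal counters are disambiguated by node ID, giving a total order that is consistent with causality.

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0    # maximum counter value seen so far

    def tick(self):
        """A local event or message send: increment, return the timestamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_counter):
        """On receiving a request/response, fast-forward to the maximum."""
        self.counter = max(self.counter, remote_counter)
        return self.tick()

a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()           # (1, 'A')
t2 = b.receive(t1[0])   # (2, 'B'): ordered after t1, as causality demands
assert t1 < t2
```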
Using Total Order Broadcast
Consensus services such as ZooKeeper and etcd implement total order broadcast.
Another way of looking at total order broadcast is that it is a way of creating a log (as in a replication log, transaction log, or write-ahead log): delivering a message is like appending to the log. Since all nodes must deliver the same messages in the same order, all nodes can read the log and see the same sequence of messages.
Isn't this a bit like the public ledger built by Bitcoin/blockchains?
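The log view can be made concrete with a minimal sketch (my own code; the single in-process list stands in for what in reality is a replicated, fault-tolerant log such as the one ZooKeeper maintains): broadcasting a message is appending it, and every node consumes the same entries in the same order.

```python
class TotalOrderLog:
    """Total order broadcast viewed as an append-only log:
    delivering a message == appending it to the log."""

    def __init__(self):
        self.entries = []

    def broadcast(self, msg):
        self.entries.append(msg)        # the single point where order is fixed
        return len(self.entries) - 1    # the message's sequence number

    def read_from(self, offset):
        return self.entries[offset:]    # every reader sees the same order

log = TotalOrderLog()
log.broadcast("m1")
log.broadcast("m2")
assert log.read_from(0) == ["m1", "m2"]   # identical on every node
```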
In general, if you think hard enough about linearizable sequence number generators, you inevitably end up with a consensus algorithm.
In the end everything has to be implemented through a consensus algorithm?
This is no coincidence: it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus [28, 67]. That is, if you can solve one of these problems, you can transform it into a solution for the others. This is quite a profound and surprising insight!
You may have heard about the FLP result [68]—named after the authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash. In a distributed system, we must assume that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms for achieving consensus. What is going on here?
Wow, that's impressive!
The answer is that the FLP result is proved in the asynchronous system model (see “System Model and Reality” on page 306), a very restrictive model that assumes a deterministic algorithm that cannot use any clocks or timeouts. If the algorithm is allowed to use timeouts, or some other way of identifying suspected crashed nodes (even if the suspicion is sometimes wrong), then consensus becomes solvable [67]. Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [69].
![[DDIA-Impossibility of Consensus.png]]
Atomic Commit
If a transaction executes on multiple nodes and they all need to commit, you need the atomic commit guarantee; otherwise some nodes succeeding while others fail would cause problems.
In these cases, it is not sufficient to simply send a commit request to all of the nodes and independently commit the transaction on each one. In doing so, it could easily happen that the commit succeeds on some nodes and fails on other nodes, which would violate the atomicity guarantee:
- Some nodes may detect a constraint violation or conflict, making an abort necessary, while other nodes are successfully able to commit.
- Some of the commit requests might be lost in the network, eventually aborting due to a timeout, while other commit requests get through.
- Some nodes may crash before the commit record is fully written and roll back on recovery, while others successfully commit.
Two-Phase Commit
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes
2PC is an algorithm for achieving atomic transaction commit across multiple nodes.
It is a classic algorithm in distributed systems.
References [13, 35, 75]: read these three when I find time.
References
[13] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at research.microsoft.com.
http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx
[35] Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “Notes on Distributed Databases,” IBM Research, Research Report RJ2571(33471), July 1979.
http://domino.research.ibm.com/library/cyberdig.nsf/papers/A776EC17FC2FCE73852579F100578964/$File/RJ2571.pdf
[75] C. Mohan, Bruce G. Lindsay, and Ron Obermarck: “Transaction Management in the R* Distributed Database Management System,” ACM Transactions on Database Systems, volume 11, number 4, pages 378–396, December 1986. doi: 10.1145/7239.7266
https://cs.brown.edu/courses/csci2270/archives/2012/papers/dtxn/p378-mohan.pdf
Basic flow
The basic flow of 2PC is illustrated in Figure 9-9. Instead of a single commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two phases (hence the name).
![[DDIA-Figure-9-9.png]]
The difference between 2PL and 2PC: 2PL is only about isolation, whereas 2PC is about getting different nodes to agree. Both have two phases: in 2PL they are acquiring locks and releasing locks, while in 2PC all nodes first report whether they can commit (phase one), and then the decision to commit or abort is made based on the responses (phase two).
2PC introduces a new component, the coordinator (transaction manager); how does it differ from the coordinator in a consensus algorithm?
A footnote mentions that atomic commit and consensus are reducible to each other (they are interchangeable):
xii. Atomic commit is formalized slightly differently from consensus: an atomic transaction can commit only if all participants vote to commit, and must abort if any participant needs to abort. Consensus is allowed to decide on any value that is proposed by one of the participants. However, atomic commit and consensus are reducible to each other [70, 71]. Nonblocking atomic commit is harder than consensus—see “Three-phase commit” on page 359.
There are many details in 2PC. The main point: once all nodes say yes, the coordinator writes the commit decision to disk and then sends out commit requests; if those time out, it retries forever, because at that point the decision to commit has already been made. No matter which node goes down, it must keep retrying.
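A bare-bones 2PC coordinator sketch (my own code; the decision log is just a list standing in for an fsync'd write-ahead log): phase one collects prepare votes, the decision is logged at the commit point, and phase two retries every commit/abort request until it is acknowledged.

```python
import time

def retry_until_ack(op, delay=0.1):
    """Once the decision is logged, the coordinator may never give up."""
    while True:
        try:
            op()
            return
        except IOError:         # network problem or node down: retry forever
            time.sleep(delay)

class Participant:
    """Stand-in participant; a real one durably logs its own 'yes' vote
    in prepare() and is then bound by it until told the outcome."""
    def prepare(self): return True          # promise to commit if asked
    def commit(self):  print("committed")
    def abort(self):   print("aborted")

def two_phase_commit(participants, log):
    # Phase 1: every participant must vote yes, otherwise abort.
    decision = "commit" if all(p.prepare() for p in participants) else "abort"
    log.append(decision)  # the commit point: durably logged (fsync in reality)
    # Phase 2: the decision is final, so retry each request until acked.
    for p in participants:
        retry_until_ack(p.commit if decision == "commit" else p.abort)
    return decision

print(two_phase_commit([Participant(), Participant()], log=[]))
```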
Coordinator failure
If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. But once the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant’s transaction in this state is called in doubt or uncertain.
![[DDIA-Figure-9-10.png]]
Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.
So once the coordinator has decided to commit and written it to the WAL, it becomes a single point of failure: if it crashes at that moment, all nodes have to wait for it to recover before they can learn whether to commit. Hence three-phase commit.
Three-phase commit
It is a nonblocking atomic commit protocol, but it assumes a network with bounded delay and nodes with bounded response times.
In general, nonblocking atomic commit requires a perfect failure detector [67, 71]—i.e., a reliable mechanism for telling whether a node has crashed or not. In a network with unbounded delay a timeout is not a reliable failure detector, because a request may time out due to a network problem even if no node has crashed. For this reason, 2PC continues to be used, despite the known problem with coordinator failure.
XA transactions
Because distributed systems usually integrate products from different vendors (different DBs, message queues, and so on), a standard is needed for them to implement 2PC across technologies.
X/Open XA (short for eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies [76, 77]. It was introduced in 1991 and has been widely implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, DB2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).
XA is not a network protocol, just a C API for interfacing with a transaction coordinator.
XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator. Bindings for this API exist in other languages; for example, in the world of Java EE applications, XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers using the Java Message Service (JMS) APIs.
Seen this way, Java's support is really strong.
So the coordinator is actually just a library in the application, not a separate service like ZooKeeper!
So it is something at the application level; we ship the XA coordinator along when we build the application?
The standard does not specify how it should be implemented, but in practice the coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service).
The server running the application loads this library; if the application process crashes, that server has to be restarted, and the coordinator reads its WAL records to recover.
If the application process crashes, or the machine on which the application is running dies, the coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck in doubt. Since the coordinator’s log is on the application server’s local disk, that server must be restarted, and the coordinator library must read the log to recover the commit/abort outcome of each transaction.
This sentence reminds me that it's not just frontends: most applications are stateless on the server side too... all state is fetched from the DB.
- Many server-side applications are developed in a stateless model (as favored by HTTP), with all persistent state stored in a database,
The consensus problem is normally formalized as follows: one or more nodes may propose values, and the consensus algorithm decides on one of those values. In the seat-booking example, when several customers are concurrently trying to buy the last seat, each node handling a customer request may propose the ID of the customer it is serving, and the decision indicates which one of those customers got the seat.
In effect, the consensus algorithm decides who gets to change the data.
In this formalism, a consensus algorithm must satisfy the following properties [25]
Uniform agreement
No two nodes decide differently.
Integrity
No node decides twice.
Validity
If a node decides value v, then v was proposed by some node.
Termination
Every node that does not crash eventually decides some value.
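The three safety properties can be checked on any finite execution trace; here is a small checker sketch (my own code) that makes the definitions concrete. Termination is a liveness property, so it cannot be verified this way.

```python
def check_consensus(proposals, decisions):
    """proposals: set of values proposed by some node.
    decisions: dict mapping node -> list of values that node decided."""
    decided = [v for vs in decisions.values() for v in vs]
    # Integrity: no node decides twice.
    assert all(len(vs) <= 1 for vs in decisions.values())
    # Uniform agreement: no two nodes decide differently.
    assert len(set(decided)) <= 1
    # Validity: any decided value was actually proposed by some node.
    assert set(decided) <= set(proposals)

# n3 has not decided yet; termination demands that it eventually does.
check_consensus({"alice", "bob"}, {"n1": ["alice"], "n2": ["alice"], "n3": []})
```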
Consensus algorithms
The best-known fault-tolerant consensus algorithms are Viewstamped Replication (VSR) [94, 95], Paxos [96, 97, 98, 99], Raft [22, 100, 101], and Zab [15, 21, 102].
Most of these algorithms actually don’t directly use the formal model described here (proposing and deciding on a single value, while satisfying the agreement, integrity, validity, and termination properties). Instead, they decide on a sequence of values, which makes them total order broadcast algorithms, as discussed previously in this chapter
There is a problem here: the single-leader setup resembles the coordinator in a consensus algorithm, but if the leader fails you need consensus to elect a new leader, and the consensus algorithm needs a leader... full circle.
How is this resolved?
Epoch numbering and quorums
All of the consensus protocols discussed so far internally use a leader in some form or another, but they don’t guarantee that the leader is unique. Instead, they can make a weaker guarantee: the protocols define an epoch number (called the ballot number in Paxos, view number in Viewstamped Replication, and term number in Raft) and guarantee that within each epoch, the leader is unique.
In other words, a new number is introduced, and each time this number is generated there is a unique leader within that epoch.
If there is a conflict, for example the leader of a previous epoch recovers and we get a split-brain situation, the leader with the higher epoch number wins.
Every time the current leader is thought to be dead, a vote is started among the nodes to elect a new leader. This election is given an incremented epoch number, and thus epoch numbers are totally ordered and monotonically increasing. If there is a conflict between two different leaders in two different epochs (perhaps because the previous leader actually wasn’t dead after all), then the leader with the higher epoch number prevails.
Before the leader decides anything, it must collect votes from a quorum of nodes; only when a majority of nodes agree can the leader's proposal go through.
This is a lot like politics.
Before a leader is allowed to decide anything, it must first check that there isn’t some other leader with a higher epoch number which might take a conflicting decision. How does a leader know that it hasn’t been ousted by another node? Recall “The Truth Is Defined by the Majority” on page 300: a node cannot necessarily trust its own judgment—just because a node thinks that it is the leader, that does not necessarily mean the other nodes accept it as their leader.
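A toy sketch of just this epoch-and-quorum safety check (my own code; not a full consensus algorithm, since there is no recovery phase and no replicated log): a node votes for a proposal only if it has not already promised itself to a higher epoch, and the leader may decide only after a majority votes yes.

```python
class Node:
    def __init__(self):
        self.promised_epoch = 0

    def vote(self, epoch):
        """Vote only if no leader with a higher epoch is known."""
        if epoch < self.promised_epoch:
            return False        # we already follow a newer leader
        self.promised_epoch = epoch
        return True

def leader_decide(nodes, epoch, value):
    """A leader needs a quorum (majority) of votes within its epoch."""
    votes = sum(node.vote(epoch) for node in nodes)
    if votes > len(nodes) // 2:
        return f"decided {value!r} in epoch {epoch}"
    return "ousted: a leader with a higher epoch exists"

cluster = [Node() for _ in range(5)]
print(leader_decide(cluster, epoch=1, value="x=1"))     # quorum reached
print(leader_decide(cluster[:3], epoch=2, value="y"))   # new leader, epoch 2
print(leader_decide(cluster, epoch=1, value="x=2"))     # stale leader: no quorum
```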
So compared to a consensus algorithm, 2PC needs every node to participate, while consensus only needs a majority and is therefore fault tolerant.
This voting process looks superficially similar to two-phase commit. The biggest differences are that in 2PC the coordinator is not elected, and that fault-tolerant consensus algorithms only require votes from a majority of nodes, whereas 2PC requires a “yes” vote from every participant. Moreover, consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met. These differences are key to the correctness and fault tolerance of a consensus algorithm.
To understand this, it is helpful to briefly explore how a service like ZooKeeper is used. As an application developer, you will rarely need to use ZooKeeper directly, because it is actually not well suited as a general-purpose database. It is more likely that you will end up relying on it indirectly via some other project: for example, HBase, Hadoop YARN, OpenStack Nova, and Kafka all rely on ZooKeeper running in the background.
ZooKeeper really is impressive.
One example in which the ZooKeeper/Chubby model works well is if you have several instances of a process or service, and one of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should take over. This is of course useful for single-leader databases, but it’s also useful for job schedulers and similar stateful systems.
Another example arises when you have some partitioned resource (database, message streams, file storage, distributed actor system, etc.) and need to decide which partition to assign to which node. As new nodes join the cluster, some of the partitions need to be moved from existing nodes to the new nodes in order to rebalance the load (see “Rebalancing Partitions” on page 209). As nodes are removed or fail, other nodes need to take over the failed nodes’ work.
These kinds of tasks can be achieved by judicious use of atomic operations, ephemeral nodes, and notifications in ZooKeeper. If done correctly, this approach allows the application to automatically recover from faults without human intervention. It’s not easy, despite the appearance of libraries such as Apache Curator [17] that have sprung up to provide higher-level tools on top of the ZooKeeper client API—but it is still much better than attempting to implement the necessary consensus algorithms from scratch, which has a poor success record.
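As a concrete example, here is a sketch of the classic leader-election recipe using the kazoo Python client (assumptions: a ZooKeeper server on localhost, and the kazoo API as I remember it; kazoo also ships a ready-made Election recipe that is preferable in practice): each candidate creates an ephemeral sequential znode, and the candidate with the lowest sequence number is the leader. Because the znode is ephemeral, a crashed leader's node disappears with its session and the followers can take over.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ZooKeeper
zk.start()

# Ephemeral: the znode vanishes if our session dies.
# Sequence: ZooKeeper appends a monotonically increasing suffix,
# which totally orders the candidates.
me = zk.create("/election/candidate-", b"node-a",
               ephemeral=True, sequence=True, makepath=True)

# The candidate with the lowest sequence number is the leader; real code
# would also set a watch on the preceding znode to react to failures.
children = sorted(zk.get_children("/election"))
is_leader = me.endswith(children[0])
print("I am the leader" if is_leader else "following " + children[0])
```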
Summary
This chapter looked at consistency and consensus from different angles and devoted a whole section to linearizability, a consistency model.
The main purpose of linearizability is to make replicated data appear as if there were only one copy, with all operations on it atomic.
It is a model!
We also looked at causality, which determines the ordering of events in the system. Unlike linearizability, causality allows multiple events to be mingled together.
![[DDIA-Causal Order is not Total Order.png]]
Linearizability, by contrast, fixes the order of all operations: it is a total order.
Unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model: some things can be concurrent, so the version history is like a timeline with branching and merging. Causal consistency does not have the coordination overhead of linearizability and is much less sensitive to network problems.
So causality is a weaker guarantee / weaker consistency model.
Linearizability comes with a cost.
But sometimes guaranteeing causality alone is not enough for the application logic to be correct. For example, when a user registers a username, a node doesn't know whether another process is registering the same username at the same time. Causality only gives the order of events, and even with Lamport timestamps the order is only known after the fact, whereas something like username registration must be decided right now. That is where consensus comes in.
Then we find that many problems are reducible to consensus, so once consensus is solved, many other problems are solved along with it.
These problems include:
Linearizable compare-and-set registers
The register needs to atomically decide whether to set its value, based on whether its current value equals the parameter given in the operation.
Atomic transaction commit
A database must decide whether to commit or abort a distributed transaction.
Total order broadcast
The messaging system must decide on the order in which to deliver messages.
Locks and leases
When several clients are racing to grab a lock or lease, the lock decides which one successfully acquired it.
Membership/coordination service
Given a failure detector (e.g., timeouts), the system must decide which nodes are alive, and which should be considered dead because their sessions timed out.
Uniqueness constraint
When several transactions concurrently try to create conflicting records with the same key, the constraint must decide which one to allow and which should fail with a constraint violation.
Consensus algorithms are usually realized as total order broadcast; I may need to study this part again.
And a consensus algorithm usually also needs a leader (much like single-leader replication).
But if that leader fails, the whole system can no longer make progress, which conflicts with our goal (the whole point of a distributed system is fault tolerance). If the leader goes down, there are three ways to handle it:
- Wait for the leader to recover, and accept that the system will be blocked in the meantime. Many XA/JTA transaction coordinators choose this option. This approach does not fully solve consensus because it does not satisfy the termination property: if the leader does not recover, the system can be blocked forever.
- Manually fail over by getting humans to choose a new leader node and reconfigure the system to use it. Many relational databases take this approach. It is a kind of consensus by “act of God”—the human operator, outside of the computer system, makes the decision. The speed of failover is limited by the speed at which humans can act, which is generally slower than computers.
- Use an algorithm to automatically choose a new leader. This approach requires a consensus algorithm, and it is advisable to use a proven algorithm that correctly handles adverse network conditions [107].
The third option is the consensus algorithm; the rest of the chapter is all about ZooKeeper.
Nevertheless, not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus.
So consensus is usually only used with single-leader architectures? Probably because global consensus in multi-leader or leaderless systems is too expensive, just as Bitcoin transactions take a long time.