forked from gakalaba/multidispatch-paper
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcorrectness.tex
193 lines (133 loc) · 16.8 KB
/
correctness.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
\section{Correctness Proof}
\label{sec:correctness}
\begin{lem}
\label{lemma1}
Consider a pair of operations $e_1, e_2$ such that $e_1 <_p e_2$, and further assume each operation is committed, coordinated, and been assigned timestamp $\varepsilon_1, \varepsilon_2$, at their respective shards. Then $\varepsilon_1 <= \varepsilon_2$.
\end{lem}
\begin{proof}
We case upon the presence of shard leader failures:
\noindentparagraph{Case 1.} \textit{No failures}.
Let $e_1$ and $e_2$ be two operations such that $e_1 <_p e_2$.
Consider the timestamp assignment process for $e_1$ and $e_2$. Let $\epsilon_1$ and $\epsilon_2$ be the timestamps assigned to $e_1$ and $e_2$, respectively.
Since $e_1 <_p e_2$, $e_2$ sends a coordination request to $e_1$. Upon receiving the coordination request, $e_1$ responds with its timestamp $\epsilon_1$, which it cannot send until it is committed and coordinated. Consequently, upon receiving $\epsilon_1$, $e_2$ sets its timestamp $\epsilon_2 = \max(\epsilon_1 + 1, \epsilon_{\text{shard}2})$.
This implies $\epsilon_2 >= \epsilon_1 + 1$, or $\epsilon_1 < \epsilon_2$
\noindentparagraph{Case 2.} \textit{With failures}.
We case upon which shard fails.
\noindentparagraph{Subcase A.} \textit{Shard with $e_2$ fails after $e_2$ is committed, before it is ordered}.
Assume shard with $e_2$ fails.
During failover, all committed-but-not-ordered entries, including $e_2$, are processed and assigned new timestamps. Specifically, each entry gets a predecessor and successor timestamp via calls made to \texttt{getPredecessor} and \texttt{getSuccessor}, respectively, and the new shard leader sorts all committed-but-not-ordered entries by successor timestamp. Let $\epsilon_1$ and $\epsilon_2$ be the timestamps assigned to $e_1$ and $e_2$, respectively, where $\epsilon_1$ was never recalculated since the shard never failed.
By the failover protocol, $\epsilon_2 = \max(\epsilon_1 + 1, \epsilon_{\text{shard}2}) \geq \epsilon_1 + 1$, ensuring $\epsilon_1 < \epsilon_2$.
\noindentparagraph{Subcase B.} \textit{Shard with $e_1$ fails after $e_1$ is committed, before it is ordered}.
Assume shard with $e_1$ fails.
By the failover protocol, the new shard leader will call \texttt{getPredecessor} and \texttt{getSuccessor} for all committed-but-not-ordered entries, including $e_1$, giving them a predecessor and successor timestamp, respectively, then it will sort them based on successor timestamp, and finally in sorted order assign them new timestamps.
We prove by induction that during failover, for all timestamps assigned to the sorted set of committed-but-not-ordered entries, $\epsilon_i < \epsilon_i\_\text{succ}$.
\textbf{Base Case:} Consider the timestamp, $\epsilon_0$, assigned to the first processed entry, $e_0$, in the sorted set of committed-but-not-ordered entries. $\epsilon_0 < \epsilon_0\_succ$
\\
When the new leader processes the 0th committed-but-not-ordered entry during failover, it will assign it a timestamp of
$$\epsilon_0 = \max(\epsilon_0\_\text{pred} + 1, shard\_ts_0)$$
If $\epsilon_0 = \epsilon_0\_pred + 1$, we know this is at most the value it used to coordinate its successor prior to failure, and that successor must have assigned itself a timestamp strictly larger than this sent value.
Else $\epsilon_0 = shard\_ts_0$.
Let's call the last ordered entry in the ordered log at the start of failover $e_k$. The initial value of the shard timestamp, $\text{shard\_ts}_0$, assigned at the beginning of recovery is the timestamp of the last ordered entry, $\epsilon_k + 1$.
It is possible that any of the committed-but-not-ordered had been assigned a timestamp prior to failover that was sent to a successor. We know all of these assigned timestamps must have been larger than $\epsilon_k$, or at least as large as $shard\_ts_0$. And since during coordination, successors assign themselves strictly increasing timestamps, we also know the timestamps of any successors to entries after $e_k$ had timestamps strictly greater than $shard\_ts_0$.
Therefore, $\epsilon_0 < \epsilon_{0\_\text{succ}}$.
\textbf{Inductive Step:} Assume that the claim holds for some positive integer $i$. That is, consider the timestamp, $\epsilon_i$, assigned to some committed-but-not-ordered entry processed in the sorted order during failover, $e_i$, then $\epsilon_i < \epsilon_i\_\text{succ}$.
Now, we want to show that the claim also holds for $i+1$, i.e., we want to show $\epsilon_{i+1} < \epsilon_{i+1}\_\text{succ}$.
When the new leader processes the i+1th committed-but-not-ordered entry during failover, it will assign it a timestamp of
$$\epsilon_{i+1} = \max(\epsilon_{i+1}\_\text{pred} + 1, shard\_ts_{i+1})$$
If $\epsilon_{i+1} = \epsilon_{i+1}\_\text{pred} + 1$, we know this is at most the value it used to coordinate its successor prior to failure, and that successor must have assigned itself a timestamp strictly larger than this sent value.
Else $\epsilon_{i+1} = shard\_ts_{i+1}$. Between assigning every committed-but-not-ordered entry, the shard updates its timestamp to be the last timestamp assigned to an entry plus one, or $shard\_ts_{i+1} = \epsilon_i + 1$. Moreover, we know that $\epsilon_{i}\_\text{succ} <= \epsilon_{i+1}\_\text{succ}$ due to sorted order by successor timestamps. Using our inductive hypothesis, we know $\epsilon_i < \epsilon_i\_\text{succ}$ or $\epsilon_i + 1 <= \epsilon_i\_\text{succ}$. From all of this we can show,
$$\epsilon_{i+1} = shard\_ts_{i+1} = \epsilon_i + 1 <= \epsilon_i\_\text{succ} <= \epsilon_{i+1}\_\text{succ}$$.
% All operations must send coordination requests to their predecessors, by line XXX. When $e_2$ sends a coordination request to $e_1$, $e_1$ cannot respond until it is committed and coordinated.
% When an operation is committed and coordinated, the shard assigns a timestamp for the entry with value $\varepsilon = \max(\varepsilon_{\textit{pred}}+1, \varepsilon_{\textit{shard}})$.
% When a shard leader sends a response to a coordination request, it sends $\varepsilon$, by line XXX.
% Thus $e_1$ will send $\varepsilon_1$ to $e_2$. When $e_2$ receives a response to its coordination request from $e_1$, it will set its timestamp $\varepsilon_2 = \max(\varepsilon_1+1, \varepsilon_{\textit{shard2}})$.
% Clearly this implies $\varepsilon_1 < \varepsilon_2$.
\end{proof}
\begin{lem}
\label{lemma2}
Consider a pair of operations on the same shard $e_1, e_2$ such that $e_1 <_x e_2$, and further assume each operation is committed, coordinated, and been assigned timestamp $\varepsilon_1, \varepsilon_2$ at their shard. Then $\varepsilon_1 < \varepsilon_2$.
\end{lem}
\begin{proof}
We case upon the presence of shard leader failures:
\noindentparagraph{Case 1.} \textit{No failures}.
A shard leader assigns a timestamp to a committed and coordinated entry as
$$\epsilon = \max(\epsilon_{\text{pred}} + 1, \epsilon_{\text{shard}})$$ and immediately afterwards updates its own value to be
$$\epsilon_{\text{shard}} = \epsilon + 1$$
Therefore, if $e_1$ is ordered in the shard log before $e_2$, and entries can only be added to the log once they are committed and coordinated,
$$\epsilon_2 >= \epsilon_{\text{shard}} = \epsilon_1 + 1$$ or
$$\epsilon_1 < \epsilon_2$$
\noindentparagraph{Case 2.} \textit{With failures}.
After we sort the committed-but-not-ordered entries in the log by their successor timestamps, we run the same process to assign new timestamps as described in case 1, which produces monotonically increasing values.
% By line XXX, a shard adds an operation to the log after it is committed and coordinated. By line XXX, when $e_1$ is committed and coordinated, the shard assigns a timestamp for the entry with value $\varepsilon_1$, adds $e_1$ to its ordered log, and update its shard timestamp to be $\varepsilon_{\textit{shard}} = \max(\varepsilon_1, \varepsilon_{\textit{shard}}) + 1$. Because of this, any operation ordered in the log after $e_1$ is assigned a timestamp
% $\varepsilon_2 = \max(\varepsilon_{\textit{pred}}+1, \varepsilon_{\textit{shard}})$.
% It follows that $\varepsilon_1 < \varepsilon_2$.
%Algorithms~\ref{clientprotocol}, ~\ref{shardprotocolmessages},and~\ref{shardprotocolmessages}
% All operations must send coordination requests to their predecessors, by line 11 of Algorithm~\ref{clientprotocol}. When $e_1$ receives the coordination request sent by the client for $e_2$, $e_1$ cannot respond until it is committed and coordinated, by line 9 of Algorithm~\ref{shardprotocolcoord}.
% By
% When a request is committed and coordinated, it sets its timestamp ε = max(εpred+1, εshard).
% When a request sends a response to a coordination request, it send ε, by line XXX. Thus e1 will send ε1 to e2, where ε1 = max(ε1pred+1, εshard1).
% When e2 receives a response to its coordination request from e1, it will set its timestamp ε2 = max(ε1+1, εshard2).
% Trivially, ε1 < ε2 = max(ε1+1, εshard2)
\end{proof}
\begin{lem}
\label{lemma3}
Consider a sequence of operations $e_1,...,e_n$ such that the following hold: (1) successive pairs alternate between belonging to $<_p$ and $<_x$; (2) the last pair belongs to $\rightarrow$; and (3) $e_1$ and $e_n$ are at the same shard, $X$. Further assume each operation is committed, coordinated, and been assigned timestamp $\varepsilon_1, ..., \varepsilon_n$ at their respective shards. Then $\varepsilon_1 < \varepsilon_n$.
\end{lem}
\begin{proof}
We case upon the presence of shard leader failures:
\noindentparagraph{Case 1.} \textit{No failures}.
By Lemmas \ref{lemma1} and \ref{lemma2}, we know that $\epsilon_1 < \epsilon_{n-1}$. We also know $e_{n-1}$ can't execute until it is committed and coordinated, which must happen after $e_1$ is committed and coordinated. If there is a real-time edge between $e_{n-1}$ and $e_n$, and $e_{n-1}$ cannot return to its client until it is committed and coordinated, then we know $e_{n-1}$ is committed and coordinated before $e_n$ arrives at shard $X$. Thus, if $e_n$ is at the same shard as $e_1$, $e_n$ must arrive at shard $X$ after $e_n$ is committed and coordinated.
Using similar reasoning shown in lemma \ref{lemma2}, if $e_1$ is committed and coordinated and ordered in the shard log before $e_n$, then $\epsilon_1 < \epsilon_n$.
\noindentparagraph{Case 2.} \textit{With failures}.
Consider (the only interesting case) where shard $X$ fails after $e_1$ and $e_n$ have been committed and coordinated but not ordered. We know that the new leader will contain both $e_1$ and $e_n$ in its set of committed-but-not-ordered entries since they were committed prior to failure and by the quorum intersection property, the new leader will have both of them.
By the failover protocol, the new leader will first get the predecessor and successor timestamps for all committed-but-not-ordered entries, including $e_1$ and $e_n$, via calls made to \texttt{getPredecessor} and \texttt{getSuccessor}.
As in Case 1 (\textit{No failures}), we know the old leader had assigned timestamps to $e_1$ and $e_n$ as
$$\epsilon_1 = \max(e_{1\text{pred}} + 1, \text{shard\_ts}) = \max(e_{1\text{pred}} + 1, \text{shard\_ts}_0)$$
$$\epsilon_n = \max(e_{1\text{pred}} + 1, \epsilon_1 + 1) \geq \epsilon_1 + 1 > \epsilon_1$$
The values for $\epsilon_1$ and $\epsilon_n$ are also the values $e_1$ and $e_n$ would have sent to their successors prior to failover (if any existed). So we know if \texttt{getSuccessor} returns a non-infinity value for $e_1$ and $e_n$, they will be at least
$$\epsilon_1\_\text{succ} > \epsilon_1 = \max(\epsilon_1\_\text{pred} + 1, \text{shard\_ts})$$
$$\epsilon_n\_\text{succ} \geq \max(\epsilon_n\_\text{pred} + 1, \epsilon_1 + 1) \geq \epsilon_1 + 1$$
We know $e_1$ had a successor since by construction it precedes another entry, $e_{n-1}$, which has executed and responded to a client. Moreover, since $e_{n-1}$ has executed and responded to its client, and this failover is happening after $e_{n-1}$ has executed and responded to its client, \texttt{getSuccessor}$(e_1)$ will return a non-infinity value $v_1 \geq \epsilon_1\_\text{succ}$.
We don't know if $e_n$ had a successor. If it did not, \texttt{getSuccessor}($e_n$) will return infinity during failure recovery. If it did, it will return a value $v_n = \epsilon_n\_\text{succ} > \epsilon_n \geq \epsilon_1 + 1$.
After obtaining the predecessor and successor timestamps, the new leader will sort all the committed-but-not-ordered entries by their successor timestamps, $\epsilon_i\_\text{succ}$. We know the following to be true about $e_1$ and $e_n$ after the sort completes:
If $\epsilon_n\_\text{succ} = \infty$, then it is clear $e_1$ will be sorted before $e_n$. Otherwise, we also know
$$\texttt{getSuccessor}(e_1) = v_1 \geq \epsilon_1\_\text{succ} > \epsilon_1$$
$$\texttt{getSuccessor}(e_n) = v_n \geq \epsilon_n\_\text{succ} > \epsilon_1 + 1$$
Thus, $e_1$ will still be sorted before $e_n$.
After the sort, we assign new monotonically increasing timestamps in sorted order, incrementing the shard timestamp between each assignment (as shown in previous lemmas), giving
$$\epsilon_n >= \max(\epsilon_n\_\text{pred} + 1, \epsilon_1 + 1)$$
or,
$$\epsilon_1 < \epsilon_n$$
\end{proof}
Let $<_\psi$ be a partial order defined over pairs of complete operations $e_1, e_2$, as follows:
\begin{enumerate}
\item $e_1 <_x e_2 \implies e_1 <_\psi e_2$
\item $e_1 <_p e_2 \implies e_1 <_\psi e_2$
\item $e_1 \rightarrow e_2 \implies e_1 <_\psi e_2$, and
\item $e_1 <_\psi e_2 \land e_2 <_\psi e_3 \implies e_1 <_\psi e_3$
\end{enumerate}
\begin{lem}
\label{lemmamain}
The partial order $<_\psi$ is acyclic.
\end{lem}
\begin{proof}
We prove the lemma by contradiction, assuming the partial order contains a cycle. Consider a minimum cycle of $n$ operations $e_1,e_2,...,e_n$. Observe that $<_\psi$ is irreflexive by definition, so $n \geq 2$. First, we prove three useful properties of the minimum cycle.
\noindentparagraph{Property 1.} \textit{The cycle contains no consecutive $<_p$ edges.} {Assume for the sake of contradiction there exist two consecutive $<_p$ edges $e_i <_p e_{i+1} <_p e_{i+2}$. By the transitivity of $<_p$, there exists a shorter edge $e_i <_p e_{i+2}$ which is part of the shorter cycle $..., e_i, e_{i+2}, ..., e_i$, a contradiction.}
\noindentparagraph{Property 2.} \textit{The cycle contains no consecutive $<_x$ edges.} {Using similar reasoning as in the above, $e_i <_x e_{i+1} <_x e_{i+2}$ would imply the existence of a shorter cycle that contains the edge $e_i <_x e_{i+2}$, a contradiction.}
\noindentparagraph{Property 3.} \textit{The cycle contains at most one $\rightarrow$ edge.} Assume for the sake of contradiction that the cycle contains two $\rightarrow$ edges, $e_i \rightarrow e_j .... e_k \rightarrow e_l$, where $i < j \leq k < l$, reindexing the cycle if necessary. From the definition of $\rightarrow$, we know $\resps(e_i) < \invs(e_j)$ and $\resps(e_k) < \invs(e_l)$. We also know $\invs(e_l) < \resps(e_i)$, otherwise a shorter cycle would exist containing the edge $e_i \rightarrow e_l$. Similarly, we also know $\invs(e_j) < \resps(e_k)$, otherwise a shorter cycle would exist containing the edge $e_k \rightarrow e_j$. We arrive at a contradiction, since $\resps(e_i) < \invs(e_j) < \resps(e_k) < \invs(e_l) < \resps(e_i)$, which gives $\resps(e_i) < \resps(e_i)$.
\begin{lem}
\label{lemma5}
Consider 3 operations, $e_1, e_2, e_3$, such that $e_1 \rightarrow e_2 <_p e_3$. Then $e_1 \rightarrow e_3$.
\end{lem}
\begin{proof}
If $e_1 \rightarrow e_2$, then by the definition of real time, $\text{resp}(e_1) < \text{inv}(e_2)$. Moreover, if $e_2 <_p e_3$, we also know by the definition of $<_p$ that $\text{inv}(e_2) < \text{inv}(e_3)$. Thus, we have $\text{resp}(e_1) < \text{inv}(e_3) < \text{inv}(e_3)$. But this gives us $e_1 \rightarrow e_3$.
\end{proof}
We carry on to prove $<_\psi$ is acyclic. By Property 3, we have two cases:
\noindentparagraph{Case 1.} \textit{There are zero $\rightarrow$ edges in the minimum cycle.} Since $n \geq 2$, and by Properties 1 and 2, we know at least one edge is a $<_x$ edge and at least one edge is a $<_p$ edge. Also by properties 1 and 2, we know the cycle must alternate between $<_{x_i}$ and $<_{p_j}$ edges.
Without loss of generality, re-index the cycle such that $e_1 <_{x_2} e_2 ... <_{p_1} e_1$.
By repeated application of Lemmas~\ref{lemma1} and ~\ref{lemma2}, we see $\varepsilon_1 < \varepsilon_n$. But by Lemma~\ref{lemma2}, $\varepsilon_n < \varepsilon_1$, a contradiction.
\noindentparagraph{Case 2.} \textit{There is one $\rightarrow$ edge in the cycle.}
Since $n \geq 2$, and by Properties 1 and 2, we know at least one edge is a $<_x$ edge and at least one edge is a $<_p$ edge. Also by properties 1 and 2, we know the cycle must alternate between $n-1$ $<_{x_i}$ and $<_{p_j}$ edges, and include a single $\rightarrow$ edge. By \ref{lemma5}, we also know that the edge following the $\rightarrow$ must be $<_x$.
Without loss of generality, re-index the cycle such that $e_1 ... e_{n-1} \rightarrow e_n <_x e_1$.
Since $n \geq 2$, and by lemmas \ref{lemma3}, we know $\epsilon_1 < \epsilon_n$. But by lemma \ref{lemma2}, $\epsilon_n < \epsilon_1$, a constradiction.
\end{proof}