discussion-notes.txt
1. Verify if we can run dbServer, dbClient, and a thread that invokes dbClient to make requests to dbServer periodically,
all in a single RM process - DONE
2. Understand how we can dynamically join a Raft cluster -
A. e.g., increase the size from 3 to 4 to 5 without changing config
B. Node IPs are fixed for the cluster, so we are okay without a join-cluster API
3. How do we increase the term period (election timeout) so that elections don't happen too often?
4. Verify if we can tell Ratis that this server is the leader now - TODO
5. We use ZK for leader election and to store shared config - does this use extra machines, or do they run on the same nodes?
6. If we use Ratis along with ZK, the two might choose different leaders for their clusters - which is a bad, bad thing.
7. In the current architecture of 2 nodes (primary & secondary), writes are accepted even if one of the nodes goes down,
BUT
with a Ratis cluster, writes won't be accepted into dbServer once a majority of peers is down (e.g., 1 of 2 peers,
or 2 of 3 - possibly tunable via config), unless another node joins back in.
8. It means dbClient should asynchronously upload to dbServer with periodic retries and keep track of
what it has already synced with dbServer (see the sketch after this list) - which defeats the purpose of dbServer itself?
9. How does dbClient know from where to continue syncing with dbServer after quorum is re-established?
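A rough Java sketch of what notes 8-9 could look like, assuming a hypothetical DbServerClient interface standing in for the real dbClient -> dbServer call: updates are queued, retried periodically, and the last acknowledged id is tracked so the client knows where to resume once quorum is back. None of these names are existing APIs.

import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AsyncDbUploader {

    /** Illustrative stand-in for the real dbClient call into dbServer. */
    public interface DbServerClient {
        void put(long sequenceId, byte[] payload) throws Exception;
    }

    private final DbServerClient client;
    private final BlockingQueue<Map.Entry<Long, byte[]>> pending = new LinkedBlockingQueue<>();
    private volatile long lastSyncedId = -1; // highest id dbServer has acknowledged so far

    public AsyncDbUploader(DbServerClient client, long retryIntervalMs) {
        this.client = client;
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Periodically drain the backlog; entries that fail stay queued for the next run.
        scheduler.scheduleWithFixedDelay(this::flush, retryIntervalMs, retryIntervalMs, TimeUnit.MILLISECONDS);
    }

    /** Called by the RM on every state change; never blocks the caller. */
    public void submit(long sequenceId, byte[] payload) {
        pending.add(new SimpleEntry<>(sequenceId, payload));
    }

    private void flush() {
        Map.Entry<Long, byte[]> head;
        while ((head = pending.peek()) != null) {
            try {
                client.put(head.getKey(), head.getValue());
                lastSyncedId = head.getKey(); // remember how far we have synced (note 9)
                pending.poll();
            } catch (Exception e) {
                return; // dbServer has no quorum yet; retry on the next tick
            }
        }
    }

    public long getLastSyncedId() {
        return lastSyncedId;
    }
}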
------------------
1. RaftRMStateStore
- backup file system as a safety net
2. RMActiveServer is started by the master
- createAndInitActiveServices(), transitionToActive(), StandByTransitionRunnable
- AdminService
- ActiveStandbyElectorBasedElectorService
3. Learn to use Ratis leader election - so that we can use it in our code.
4. submitClientRequestAsync - directly using the server (see the sketch below)
HDDS-858
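For item 4 (submitClientRequestAsync directly on the embedded server, the pattern Ozone follows in HDDS-858), a hedged sketch of submitting a write from inside the RM process without a RaftClient. The RaftClientRequest builder shown here matches recent Apache Ratis releases; older versions construct the request differently, so treat the exact construction as an assumption to verify against the Ratis version in use.

import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import org.apache.ratis.protocol.ClientId;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;
import org.apache.ratis.protocol.RaftClientRequest;
import org.apache.ratis.protocol.RaftGroupId;
import org.apache.ratis.server.RaftServer;

public class EmbeddedRaftSubmitter {
    private final RaftServer server;       // the RaftServer embedded in the RM process
    private final RaftGroupId groupId;
    private final ClientId clientId = ClientId.randomId();
    private long callId = 0;

    public EmbeddedRaftSubmitter(RaftServer server, RaftGroupId groupId) {
        this.server = server;
        this.groupId = groupId;
    }

    /** Replicate an opaque state-store update through the local RaftServer (no RaftClient). */
    public CompletableFuture<RaftClientReply> replicate(Message message) throws IOException {
        RaftClientRequest request = RaftClientRequest.newBuilder()
            .setClientId(clientId)
            .setServerId(server.getId())
            .setGroupId(groupId)
            .setCallId(callId++)
            .setMessage(message)
            .setType(RaftClientRequest.writeRequestType())
            .build();
        // Only succeeds end-to-end when this peer is the Raft leader; on a follower the
        // reply carries a NotLeaderException (see the TODO section further down).
        return server.submitClientRequestAsync(request);
    }
}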
RMprocess - Active
- RMstate map
- RaftServer {
    StateMachine {
      Map -> is our RMstate.
      applyTrx {
        - this means the log entry is already replicated
        - we have to manually apply the entry to RMstate.
      }
    }
  }
- updateAppState(appId, appState) {
    RaftServer.submitClientRequest(appId, appState)
  }
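A minimal Ratis StateMachine sketch matching the diagram above: an in-memory map is the RMstate, and applyTransaction() applies an already-replicated entry to it. Loosely modeled on Ratis' example state machines; the package locations vary slightly across Ratis versions, and the "appId=appState" wire format is a placeholder, not a real protocol.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.ratis.proto.RaftProtos.LogEntryProto;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.statemachine.TransactionContext;
import org.apache.ratis.statemachine.impl.BaseStateMachine;

public class RmStateMachine extends BaseStateMachine {
    // "Map -> is our RMstate" from the diagram above.
    private final Map<String, String> rmState = new ConcurrentHashMap<>();

    @Override
    public CompletableFuture<Message> applyTransaction(TransactionContext trx) {
        // By the time this is called, the entry is already committed/replicated;
        // our only job is to apply it to the local RM state.
        LogEntryProto entry = trx.getLogEntry();
        String payload = entry.getStateMachineLogEntry().getLogData().toStringUtf8();
        String[] kv = payload.split("=", 2);              // placeholder encoding: appId=appState
        rmState.put(kv[0], kv.length > 1 ? kv[1] : "");

        updateLastAppliedTermIndex(entry.getTerm(), entry.getIndex());
        return CompletableFuture.completedFuture(Message.valueOf("applied:" + kv[0]));
    }

    /** Local read-only lookup; not replicated. */
    public String getAppState(String appId) {
        return rmState.get(appId);
    }
}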
RMprocess - Standby
ResourceManager.java -> line 153
Use Raft leader election to understand the candidate-to-leader transition
CuratorBasedElectorService
MemoryRMStateStoreService
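For the CuratorBasedElectorService reference above, a bare-bones Curator LeaderLatch sketch of the ZK-backed election it is built around; the ZK path, latch id, and the transition hooks are illustrative, not YARN's actual code.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
            "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // One latch per RM; whoever acquires the latch under /rm-election becomes active.
        LeaderLatch latch = new LeaderLatch(zk, "/rm-election", "rm1");
        latch.addListener(new LeaderLatchListener() {
            @Override public void isLeader() {
                System.out.println("rm1: transitionToActive()");   // hook RM activation here
            }
            @Override public void notLeader() {
                System.out.println("rm1: transitionToStandby()");  // hook RM standby here
            }
        });
        latch.start();

        Thread.sleep(Long.MAX_VALUE); // keep the process (and the ZK session) alive
    }
}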
2-peer cluster: if one peer dies, it won't accept writes (quorum lost)
Raft1[1w,2w,3w], Raft2[1w,2w,3w]
Raft1 dies
Raft2[1w,2w,3w] - RMbacklog[4w, 5w]
Raft1 comes back up
Who reads from RMbacklog and then calls RaftServer.submitClientRequest([4w, 5w])?
In any case, if a write fails -> store the requests into zkFs
RMreq -> RaftPeer(req) -> zkWAL (need to chunk transactions into smaller ones; see the sketch below)
Pros: if the active dies, the standby will already have the latest state and doesn't need to read from zkWAL
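A sketch of the "zkWAL, chunk transactions into smaller ones" idea above, using Curator: one logical RM transaction becomes a run of sequential znodes, each chunk kept well below ZK's default ~1 MB znode limit. The path layout and chunk size are illustrative assumptions.

import java.util.Arrays;
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;

public class ZkWalAppender {
    private static final int CHUNK_BYTES = 512 * 1024; // stay well below jute.maxbuffer (~1 MB)
    private final CuratorFramework zk;
    private final String walRoot;                       // e.g. "/rm-wal" (illustrative)

    public ZkWalAppender(CuratorFramework zk, String walRoot) {
        this.zk = zk;
        this.walRoot = walRoot;
    }

    /** Append one logical transaction; ZK assigns a monotonically increasing suffix per chunk. */
    public void append(long txnId, byte[] payload) throws Exception {
        for (int offset = 0; offset < payload.length; offset += CHUNK_BYTES) {
            byte[] chunk = Arrays.copyOfRange(payload, offset,
                Math.min(offset + CHUNK_BYTES, payload.length));
            zk.create()
              .creatingParentsIfNeeded()
              .withMode(CreateMode.PERSISTENT_SEQUENTIAL)
              .forPath(walRoot + "/txn-" + txnId + "-chunk-", chunk);
        }
    }
}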
--------------------------------
TODO
* Recovery of a Raft peer that is missing the latest transactions? WAL or edit logs or snapshots etc.?
* RM is active, but its RaftServer can be in any state; RaftServer.submitClientRequest should still be replicated - without
using a client (see the sketch after this section).
- A follower node can't submit a client request, otherwise it will get a NotLeaderException.
- Only the leader node can do so, and that request gets replicated.
* Read Ozone code and compare WALs
- I read and compared OM, OMRatisServer, OMStateMachine
- our own distributed WAL with raft support
I'm lost again - I should read the Raft paper and compare it with the StateMachine implementation.
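A sketch for the leader-only submit TODO above, building on the EmbeddedRaftSubmitter sketch earlier: when the active RM submits directly to its local RaftServer, the reply (rather than a thrown exception) tells it whether this peer was actually the Raft leader. The reply accessors follow the pattern used in Ozone's OM Ratis server; treat exact names as version-dependent.

import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;

public class LeaderOnlySubmit {
    /** Returns true if the update was replicated; false if this peer is not the leader. */
    public static boolean tryReplicate(EmbeddedRaftSubmitter submitter, String appId, String appState)
            throws Exception {
        RaftClientReply reply = submitter
            .replicate(Message.valueOf(appId + "=" + appState))
            .get();                          // block until the commit outcome is known
        if (reply.isSuccess()) {
            return true;                     // entry is committed on a quorum
        }
        if (reply.getNotLeaderException() != null) {
            // We are a follower: redirect to the suggested leader, or park the
            // request in the backlog / zkWAL until this peer becomes leader.
            return false;
        }
        throw new IllegalStateException("replication failed: " + reply);
    }
}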
----------------------------------------
rm1 -> trxns in db; the state store should be trxn-based; understand whether levelDB, zk, fs use trxns or key-value overwrite.
rm2 -> reads from db
complete state + edit logs (NameNode model):
nn1 [cs], editlogs = [create, delete]
edit log entry: {cs}, [am,new] -> [am,running]
nn2 [cs]
LEADER [e4, e5] - wal (term-2)
FOLLOWER-1 []
F-LEADER [e6, e7] (term-3)
L-Follower []
hdfs, zk
1. 3 RMs
2. 2 RMs + hdfs editlogs
3. active will clean up delta logs
4. POC for fs: explore fsImage, editLog
use test cases to validate WAL functionality (see the test sketch below)
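A sketch of the kind of test case item 4 suggests: validate basic WAL behavior (append, then replay in order after a simulated restart). The Wal interface is hypothetical; whichever backend is chosen (fs, zk, leveldb) would be wired in behind it.

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class WalContractTest {

    /** Hypothetical minimal WAL contract that the fs/zk/leveldb backends would implement. */
    interface Wal {
        void append(long txnId, byte[] payload) throws Exception;
        List<byte[]> replayFrom(long txnId) throws Exception; // entries at/after txnId, in append order
    }

    /** Placeholder factory: wire up the backend under test (fs-based for the POC). */
    private Wal newWalUnderTest() {
        throw new UnsupportedOperationException("plug in the fs/zk/leveldb implementation");
    }

    @Test
    public void replayReturnsEntriesInAppendOrder() throws Exception {
        Wal wal = newWalUnderTest();
        wal.append(1, "e1".getBytes());
        wal.append(2, "e2".getBytes());
        wal.append(3, "e3".getBytes());

        // Simulate recovery after a checkpoint covering txn 1: only e2 and e3 come back.
        List<String> replayed = new ArrayList<>();
        for (byte[] entry : wal.replayFrom(2)) {
            replayed.add(new String(entry));
        }
        assertEquals(Arrays.asList("e2", "e3"), replayed);
    }
}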
------------------------------------------
TO_ASK:
Hi, good morning. I was reading about ZooKeeper and its implementation.
1. I understand how its ZAB replication algorithm works, and that there is no inherent leader election algorithm for RM nodes.
2. ZK just provides atomic primitives in a tree data structure and maintains snapshots & transaction logs for its cluster.
3. I experimented with ZK primitives in code.
The way I see it:
1. The RM that is able to create the leader ephemeral node (a write committed by the ZK quorum) becomes the leader.
2. The RM followers keep a watch on the above ephemeral node.
3. As soon as the primary RM dies, the ephemeral node is deleted and the followers get this event (see the sketch after these steps).
4. They will check the breadcrumb and kill the primary RM process (idk how).
5. They will perform step 1.
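A rough sketch of steps 1-3 above with the raw ZooKeeper client: try to create the ephemeral leader znode, and if another RM already holds it, watch it so the NodeDeleted event fires when that RM's session dies. The znode path and ids are illustrative; real YARN code goes through ActiveStandbyElector / Curator rather than this.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class RmLeaderElectionSketch {
    private static final String LEADER_ZNODE = "/rm-leader"; // illustrative path

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
        tryToBecomeLeader(zk, "rm1".getBytes());
        Thread.sleep(Long.MAX_VALUE); // keep the session (and the ephemeral node) alive
    }

    static void tryToBecomeLeader(ZooKeeper zk, byte[] myId) throws Exception {
        try {
            // Step 1: whoever creates the ephemeral node first is the leader.
            zk.create(LEADER_ZNODE, myId, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("became leader");
        } catch (KeeperException.NodeExistsException alreadyTaken) {
            // Step 2: a leader already exists; set a watch on its znode.
            Stat stat = zk.exists(LEADER_ZNODE, event -> {
                // Step 3: the leader's session died and the ephemeral node is gone; retry step 1.
                if (event.getType() == EventType.NodeDeleted) {
                    try {
                        tryToBecomeLeader(zk, myId);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            if (stat == null) {
                tryToBecomeLeader(zk, myId); // leader vanished in the meantime; try again
            } else {
                System.out.println("standing by, watching " + LEADER_ZNODE);
            }
        }
    }
}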
Now, the questions are:
1. How is our RM HA working? Is the secondary RM watching the nodes in ZK and applying the same commands to its local state?
2. How can we trigger an event so that the leader sends ZK cluster updates to our secondary RM?
3. How are we syncing the RM state store with the ZK cluster? If possible, find any YARN RM design doc.
4. Does the RM maintain a state store, or does it connect directly to ZK?
Kafka has released KRaft to replace ZooKeeper in its clusters:
1. https://developer.confluent.io/learn/kraft/
2. https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum