
Compute columns in post-PeerDAS checkpoint sync #6760

Merged
merged 3 commits into sigp:unstable from peerdas-checkpointsync on Jan 31, 2025

Conversation

@dapplion (Collaborator) commented Jan 7, 2025

Issue Addressed

Addresses #6026.

Post-PeerDAS the DB expects to have data columns for the finalized block.

Proposed Changes

Instead of forcing the user to submit the columns, this PR computes the columns from the blobs that we can already fetch from the checkpointz server or with the existing CLI options.

Note 1: (EDIT) Pruning concern addressed

Note 2: I have not tested this feature

Note 3: @michaelsproul an alternative I recall is to not require the blobs / columns at this point and expect backfill to populate the finalized block
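For illustration, a minimal sketch of the blobs-to-columns step, assuming the KZG cell extension of each blob has already been computed; the Cell alias and group_cells_into_columns helper are hypothetical, not the Lighthouse API:

// Hypothetical sketch, not Lighthouse code: each blob is erasure-extended
// into a row of cells, and data column c collects cell c from every row.
type Cell = Vec<u8>;

// rows[b][c] is cell c of extended blob b; the result groups the same cells
// column-wise, so columns[c][b] is cell c of blob b.
fn group_cells_into_columns(rows: &[Vec<Cell>], number_of_columns: usize) -> Vec<Vec<Cell>> {
    // Every extended blob is expected to contribute exactly one cell per column.
    debug_assert!(rows.iter().all(|row| row.len() == number_of_columns));
    (0..number_of_columns)
        .map(|c| rows.iter().map(|row| row[c].clone()).collect())
        .collect()
}

The sidecar construction itself (KZG proofs, inclusion proofs, signed block header) is left out of the sketch.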

@dapplion dapplion added the das Data Availability Sampling label Jan 7, 2025
@jimmygchen jimmygchen self-assigned this Jan 7, 2025
@jimmygchen jimmygchen added the ready-for-review The code is ready for review label Jan 7, 2025
@jimmygchen (Member) left a comment

Nice and simple! Should have done this ages ago 😅

Just a TODO comment to address, but we can probably start testing with this!

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review ready-for-merge This PR is ready to merge. labels Jan 9, 2025
@jimmygchen (Member)

@michaelsproul I think this approach is simple and maintains the same behaviour as Deneb - but keen to hear if you have any preference, or see a benefit in going with the alternatives discussed earlier (breaking the existing db invariant and populating the columns via backfill / p2p, etc.)?

Note 1: We should only store the columns that are meant to be in custody. Otherwise, later pruning won't delete the extras. However, accessing the NodeID and computing the custody indices is quite cumbersome as the logic is inside the network init. Ideas?

Yeah, it is a bit of a pain: this code is executed before the network is initialised, so we don't have the node ID yet, and I'm not sure it's worth doing a big refactor for this. Why wouldn't blob pruning delete the extras? It looks like it iterates through all columns for the root, unless I'm looking at the wrong thing?

let indices = self.get_data_column_keys(block_root)?;
if !indices.is_empty() {
    trace!(
        self.log,
        "Pruning data columns of block";
        "slot" => slot,
        "block_root" => ?block_root,
    );
    last_pruned_block_root = Some(block_root);
    ops.push(StoreOp::DeleteDataColumns(block_root, indices));
}

@dapplion I think this should work - would you mind doing a quick test of this locally before we merge?
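On Note 1, the filtering itself is simple once the custody indices are known; a minimal sketch, with a hypothetical retain_custody_columns helper and (index, sidecar) pairs standing in for the real sidecar type:

use std::collections::HashSet;

// Hypothetical sketch, not the Lighthouse API: keep only the sidecars whose
// column index is in the node's custody set before writing them to the store.
fn retain_custody_columns<T>(
    columns: Vec<(u64, T)>,        // (column_index, sidecar)
    custody_indices: &HashSet<u64>,
) -> Vec<(u64, T)> {
    columns
        .into_iter()
        .filter(|(index, _)| custody_indices.contains(index))
        .collect()
}

The hard part, as noted above, is obtaining the node ID and custody indices before the network is initialised.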

@jimmygchen jimmygchen added the under-review A reviewer has only partially completed a review. label Jan 9, 2025
@jimmygchen (Member)

@michaelsproul I think this approach is simple and maintains the same behaviour as Deneb - but keen to hear if you have any preference, or see a benefit in going with the alternatives discussed earlier (breaking the existing db invariant and populating the columns via backfill / p2p, etc.)?

Oh, I think the downside with this approach is that the checkpoint server would have to serve blobs, i.e. be a supernode. The latter approach above would allow syncing columns from peers instead, and the checkpointz server wouldn't have to store all columns (if we have 16 MB of blobs per block, that'd be ~2 TB of blob data that the checkpointz server would have to store).

Or, if we're OK with the checkpointz server serving blobs, we could potentially introduce a --checkpoint-server flag for servers to store checkpoint blobs only (so they can serve them) instead of having to store all columns for 18 days?
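Back-of-envelope for the ~2 TB figure, assuming ~7,200 slots per day and an 18-day retention window: 16 MB/block × 7,200 blocks/day × 18 days ≈ 2,073,600 MB ≈ 2 TB.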

@jimmygchen (Member)

This PR addresses #6026

@mrabino1

@jimmygchen forgive my ignorance on the specs here (and indeed the existing implementation).. quick question: instead of storing the blobs on a checkpoint server, wouldn't it be easier to store only the hash, such that clients on launch would query / gossip those newly requested blobs over p2p? From my understanding, the objective of the checkpoint server is to allow a newly launched node to get going with minimal effort... but just like the state, LH still has to backfill... wouldn't similar logic apply here? thx.

@jimmygchen (Member)

Hi @mrabino1

Yes, it is possible to download the blobs from peers, with the alternative @dapplion mentioned:

Note 3: @michaelsproul an alternative I recall is to not require the blobs / columns at this point and expect backfill to populate the finalized block

The proposed solution in this PR behaves similarly to Mainnet today - we download and store the checkpoint state, block, and blobs - so this approach requires minimal changes in Lighthouse and will likely work without additional effort. It may not be the final solution, but it could be useful for testing now.

We could consider the alternative of downloading from peers - it means we'd have to break the current invariant of storing the complete block and blobs in the database, and we'd need some extra logic to populate the finalized block when backfill completes.

@realbigsean realbigsean requested review from jimmygchen and realbigsean and removed request for jimmygchen and realbigsean January 12, 2025 22:38
@michaelsproul (Member)

I think breaking the invariant that we have all blobs/columns for blocks probably makes sense, seeing as there are multiple cases now where this is useful. The other main use case is importing partial blob history, e.g. all txns related to some L2. This is being worked on in:

I think it makes sense to restructure the DB storage as well so that we don't index by block_root -> list of blobs/data columns. For blobs it's too late (we need a schema upgrade), but we should change the data column schema before we ship PeerDAS on a permanent testnet (we can break schema versions between devnets without issue IMO).
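As a rough illustration of per-column indexing (a hypothetical key layout, not the actual Lighthouse schema), each column could be keyed as block_root || column_index, so partial column sets can be stored and pruned without rewriting a block-level list:

// Hypothetical sketch of a per-column key: 32-byte block root followed by the
// big-endian column index, so a block's columns share a common prefix.
fn data_column_key(block_root: &[u8; 32], column_index: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(32 + 8);
    key.extend_from_slice(block_root);
    key.extend_from_slice(&column_index.to_be_bytes());
    key
}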

@michaelsproul (Member)

I think this PR is fine for now until we get our ducks in a row.

@realbigsean realbigsean requested review from realbigsean and removed request for realbigsean January 13, 2025 00:16
mergify bot commented Jan 22, 2025

queue

🛑 The pull request has been removed from the queue default

Pull request #6760 has been dequeued by a dequeue command.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@jimmygchen jimmygchen added under-review A reviewer has only partially completed a review. and removed ready-for-merge This PR is ready to merge. labels Jan 22, 2025
mergify bot added a commit that referenced this pull request Jan 22, 2025
@jimmygchen (Member)

@mergify dequeue

mergify bot commented Jan 22, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #6760 has been dequeued by a dequeue command

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

mergify bot commented Jan 22, 2025

dequeue

✅ The pull request has been removed from the queue default

@jimmygchen (Member)

Hold off merging for now as this needs a retest.

@jimmygchen jimmygchen added do-not-merge and removed under-review A reviewer has only partially completed a review. labels Jan 22, 2025
jimmygchen added a commit to jimmygchen/lighthouse that referenced this pull request Jan 23, 2025
Squashed commit of the following:

commit 8a6e3bf
Author: dapplion <[email protected]>
Date:   Tue Jan 7 16:30:57 2025 +0800

    Compute columns in post-PeerDAS checkpoint sync
jimmygchen added a commit to jimmygchen/lighthouse that referenced this pull request Jan 30, 2025
Squashed commit of the following:

commit 8a6e3bf
Author: dapplion <[email protected]>
Date:   Tue Jan 7 16:30:57 2025 +0800

    Compute columns in post-PeerDAS checkpoint sync
@jimmygchen (Member) commented Jan 31, 2025

@dapplion so I've done a round of testing locally, and I was able to sync to head and start backfilling. I've found two potential bugs in head sync and backfill sync, but I think this PR is good and we can continue investigating the two sync issues separately, as they're not related to downloading checkpoint blocks/blobs.

Here's the breakdown of events:

Checkpoint block & blobs downloaded, beacon chain initialised:

Jan 31 02:12:39.886 INFO Loaded checkpoint block and state       block_root: 0xfc9fa8a47c14b86804a8c387717a59fa0f5f15a7aec2ddbe7df89ca5ace1266b, state_slot: 704, block_slot: 704, service: beacon
Jan 31 02:12:40.086 INFO Block production enabled                method: json rpc via http, endpoint: Auth { endpoint: "http://localhost:8551/", jwt_path: "./mock-el.jwt", jwt_id: None, jwt_version: None }
Jan 31 02:12:40.100 WARN Error connecting to eth1 node endpoint  endpoint: http://localhost:8551/, auth=true, service: deposit_contract_rpc
Jan 31 02:12:40.100 ERRO Error updating deposit contract cache   error: Invalid endpoint state: RequestFailed("eth_chainId call failed ServerMessage { code: -32601, message: \"The method eth_chainId does not exist/is not available\" }"), retry_millis: 60000, service: deposit_contract_rpc
Jan 31 02:12:40.114 INFO Beacon chain initialized                head_slot: 704, head_block: 0xfc9fa8a47c14b86804a8c387717a59fa0f5f15a7aec2ddbe7df89ca5ace1266b, head_state: 0x7483f0a75da9bf820be898538c53de8f94ef7abc85c033d27555a099a8b1d374, service: beacon

Syncing the head chain, I saw a few Waiting for peers to be available on custody column subnets messages and it never started (bug #1 here, will investigate and raise an issue):

Jan 31 02:12:40.289 DEBG New head chain started syncing, id: 1, from: 22, to: 24, end_root: 0x1aaf888bd2b3459e3e2be06bb0cecf1da71b3186b07864b2e783ceb79c1211f0, current_target: 22, batches: 0, peers: 1, state: Stopped, service: range_sync, service: sync, module: network::sync::range_sync::chain_collection:365
Jan 31 02:12:40.289 DEBG Waiting for peers to be available on custody column subnets, chain: 1, service: range_sync, service: sync, module: network::sync::range_sync::chain:1176
Jan 31 02:12:40.289 INFO Sync state updated, new_state: Syncing Head Chain, old_state: Stalled, service: sync, module: network::sync::manager:706

A few minutes later, it triggered a finalized chain sync as our sync distance reached the threshold for head sync:

Jan 31 02:17:40.213 DEBG Syncing new finalized chain, id: 2, from: 22, to: 25, end_root: 0x97afb75e642ce09cbbf7bba7a9359b307fd615e1873d481fda591007d2f1b037, current_target: 22, batches: 0, peers: 1, state: Stopped, service: range_sync, service: sync, module: network::sync::range_sync::chain_collection:300
Jan 31 02:17:40.213 DEBG Sending BlocksByRange request, id: 3, peer: 16Uiu2HAmFhbHqwjNp7dUj1zFteHtBvGr8Vj2Vwe2dPX53iakcUvP, epoch: 22, count: 32, method: BlocksByRange, service: sync, module: network::sync::network_context:369
Jan 31 02:17:40.214 DEBG Sending DataColumnsByRange requests, id: 3, peer: 16Uiu2HAmG8PTszKMacuF73p6G3D1uNGBMTEUeLJ6AFHnFQU2WEFd, columns: [51, 6, 47, 57, 73], epoch: 22, count: 32, method: DataColumnsByRange, service: sync, module: network::sync::network_context:438
Jan 31 02:17:40.214 DEBG Sending DataColumnsByRange requests, id: 3, peer: 16Uiu2HAmUtTsXUWxTpNmUihUed9Jxt7vCKsZyrujVKg5Tx8WYbYT, columns: [82, 87, 114], epoch: 22, count: 32, method: DataColumnsByRange, service: sync, module: network::sync::network_context:438
Jan 31 02:17:40.214 DEBG Requesting batch, batch_state: [d,E,E,E,E], start_slot: 704, end_slot: 735, downloaded: 0, processed: 0, processed_no_penalty: 0, state: Downloading(16Uiu2HAmFhbHqwjNp7dUj1zFteHtBvGr8Vj2Vwe2dPX53iakcUvP, 3), batch_ty: blocks_and_columns, epoch: 22, chain: 2, service: range_sync, service: sync, module: network::sync::range_sync::chain:984

Synced to head pretty quickly and started backfilling:

Jan 31 02:17:52.005 INFO Downloading historical blocks, est_time: --, distance: 704 slots (1 hr 10 mins), service: slot_notifier, module: client::notifier:212
Jan 31 02:17:52.005 INFO Synced, slot: 828, block: 0x62d6e8ca8fbaace6094883b062236f8fa25837a72ac67947a3cf1cd5aadb414b, epoch: 25, finalized_epoch: 23, finalized_root: 0x97afb75e642ce09cbbf7bba7a9359b307fd615e1873d481fda591007d2f1b037, exec_hash: 0x82308fec9b1c5276518db8f0c48e4b9ee286bf4e69ffcb4f74629c7bfb22ada4 (verified), peers: 4, service: slot_notifier, module: client::notifier:290

Backfill then failed:

Jan 31 02:17:58.902 DEBG Backfill advanced, processing_target: 18, validated_epoch: 18, service: backfill_sync, service: sync, module: network::sync::backfill_sync:822
Jan 31 02:17:58.902 ERRO Backfill sync failed, error: InvalidSyncState("Batch not found for current processing target 17"), service: backfill_sync, service: sync, module: network::sync::backfill_sync:474
Jan 31 02:17:58.902 ERRO Backfill sync failed, error: InvalidSyncState("Batch not found for current processing target 17"), service: sync, module: network::sync::manager:909
Jan 31 02:17:58.902 INFO Sync state updated, new_state: Synced, old_state: Syncing Historical Blocks, service: sync, module: network::sync::manager:706

The node thought it had completed backfill; we're likely not handling the InvalidSyncState scenario above correctly (bug #2, will look into this and raise an issue):

Jan 31 02:18:04.003 INFO Historical block download complete, service: slot_notifier, module: client::notifier:221

@jimmygchen (Member) left a comment

I've raised #6837 for the long-term solution of downloading checkpoint data columns from peers.

I'll raise two separate issues for the bugs I found during the test. The change in this PR works and I was able to sync to head, so I'm happy to merge this one, and investigate the other issues separately.

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed do-not-merge labels Jan 31, 2025
@jimmygchen (Member)

@mergify requeue

mergify bot commented Jan 31, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

mergify bot added a commit that referenced this pull request Jan 31, 2025
mergify bot commented Jan 31, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

@mergify mergify bot merged commit 027bb97 into sigp:unstable Jan 31, 2025
31 checks passed
@dapplion (Collaborator, Author)

@jimmygchen thanks for taking over the testing! Did you check that the finalized block had blobs?

@dapplion dapplion deleted the peerdas-checkpointsync branch January 31, 2025 20:37
@jimmygchen (Member)

@jimmygchen thanks for taking over the testing! Did you check that the finalized block had blobs?

Ah good point, I missed that - I'll retest.

@jimmygchen (Member) commented Feb 3, 2025

Yep, I can confirm this is working - we still have the issue described in #6895, but we're able to download the checkpoint blobs and sync to head.

Feb 03 07:05:25.090 DEBG Downloading finalized blobs, service: beacon, module: client::builder:482
Feb 03 07:05:25.592 DEBG Downloaded finalized blobs, service: beacon, module: client::builder:488

More logs:

Feb 03 07:05:24.538 INFO Starting checkpoint sync, remote_url: http://127.0.0.1:32834/, service: beacon, module: client::builder:386
Feb 03 07:05:24.538 DEBG Downloading deposit snapshot, service: beacon, module: client::builder:409
Feb 03 07:05:24.631 WARN Remote BN does not support EIP-4881 fast deposit sync, error: Error fetching deposit snapshot from remote: HttpClient(, kind: decode, detail: invalid type: null, expected struct DepositTreeSnapshot at line 1 column 12), service: beacon, module: client::builder:438
Feb 03 07:05:24.631 DEBG Downloading finalized state, service: beacon, module: client::builder:450
Feb 03 07:05:25.078 DEBG Downloaded finalized state, slot: Slot(256), service: beacon, module: client::builder:460
Feb 03 07:05:25.078 DEBG Downloading finalized block, block_slot: Slot(256), service: beacon, module: client::builder:464
Feb 03 07:05:25.090 DEBG Downloaded finalized block, service: beacon, module: client::builder:479
Feb 03 07:05:25.090 DEBG Downloading finalized blobs, service: beacon, module: client::builder:482
Feb 03 07:05:25.592 DEBG Downloaded finalized blobs, service: beacon, module: client::builder:488
Feb 03 07:05:25.663 INFO Loaded checkpoint block and state, block_root: 0x43105229a1f4c9ab932963fcc8ac4ff8af689d4907968d454f4227a83dbb7c18, state_slot: 256, block_slot: 256, service: beacon, module: client::builder:506
Feb 03 07:05:25.744 DEBG Storing cold state, slot: 0, strategy: snapshot, service: freezer_db, module: store::hot_cold_store:1660
Feb 03 07:05:26.274 INFO Block production enabled, method: json rpc via http, endpoint: Auth { endpoint: "http://localhost:8551/", jwt_path: "./mock-el.jwt", jwt_id: None, jwt_version: None }, module: beacon_node:151
Feb 03 07:05:26.278 DEBG eth1 endpoint error, error: eth_chainId call failed ServerMessage { code: -32601, message: "The method eth_chainId does not exist/is not available" }, endpoint: http://localhost:8551/, auth=true, service: deposit_contract_rpc, module: eth1::service:67
Feb 03 07:05:26.278 WARN Error connecting to eth1 node endpoint, endpoint: http://localhost:8551/, auth=true, service: deposit_contract_rpc, module: eth1::service:73
Feb 03 07:05:26.278 ERRO Error updating deposit contract cache, error: Invalid endpoint state: RequestFailed("eth_chainId call failed ServerMessage { code: -32601, message: \"The method eth_chainId does not exist/is not available\" }"), retry_millis: 60000, service: deposit_contract_rpc, module: eth1::service:729
Feb 03 07:05:26.307 INFO Beacon chain initialized, head_slot: 256, head_block: 0x43105229a1f4c9ab932963fcc8ac4ff8af689d4907968d454f4227a83dbb7c18, head_state: 0x719dcfec56029b345664033bf8b2fa0b7715977fa86dc0633a79cd7404758624, service: beacon, module: beacon_chain::builder:1053

Labels: das (Data Availability Sampling), peerdas-devnet-4, ready-for-merge