
admission: insufficient tokens under high write load and WAL failover #138655

Closed
sumeerbhola opened this issue Jan 8, 2025 · 0 comments
Assignees
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-perturbation Bugs found by the perturbation framework T-admission-control Admission Control

Comments

@sumeerbhola
Collaborator

sumeerbhola commented Jan 8, 2025

Observations from https://cockroachlabs.slack.com/archives/C06UFBJ743F/p1736348529278319?thread_ts=1736262662.853989&cid=C06UFBJ743F:

A single flush added a huge amount of data (~1.6GiB and 800 files) to L0. AC's ioLoadListener noticed this at the same time as Pebble compactions (out of L0) started ramping up. Within 30s the backlog was cleared. During this time period, AC tokens were insufficient for regular work. The memtable backlog did not cause a write stall, since WAL failover was active (the primary disk was not actually stalled, just slower because of high bandwidth usage -- this test was not using AC's bandwidth limiting).

The ioLoadListener can end up using a low value of compaction bandwidth when deciding on tokens, for several reasons:

  • [AC deficiency] Compactions haven't yet ramped up enough. The ioLoadListener samples at 15s intervals. The assumption has been that if it sees L0 overload, Pebble has been seeing it for a few seconds. This doesn't apply when all the L0 bytes were added via a single flush. In the above experiment, there probably wasn't a ramp-up issue, since 800 files result in an uncompensated score of 1.6, and the 2 sublevels (prior to the flush) result in a score of 2.0. The store was running with a compaction concurrency of 3, and L0 was competing with L3 and L4, which also had high scores.
  • [AC/Pebble inconsistency] Pebble uses L0CompactionThreshold=2, L0CompactionFileThreshold=500. 1 sublevel results in a score of 1, and 500 files result in a score of 1, so 1 sublevel == 500 files.
    AC uses 20 sublevels == 1000 files in its scoring, i.e. 1 sublevel == 50 files. This is wildly inconsistent. If the latter had a higher file count threshold, AC would give out more tokens (see the sketch after this list).
  • [AC deficiency] During WAL failover, the ioLoadListener stops changing its estimates of the compacted bytes. This means that if WAL failover is only due to slight slowness and the compactions out of L0 are successfully ramping up, the ioLoadListener will ignore this ramp-up and give out fewer tokens. It should update its estimates if the observed compaction bytes are increasing.
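
To make the inconsistency concrete, here is a minimal Go sketch of the arithmetic, using only the equivalences quoted above; the function name and the idea of converting a file count into a sub-level-equivalent are illustrative, not the actual Pebble or admission control code:

```go
// Illustrative arithmetic only, based on the equivalences stated in this issue.
package main

import "fmt"

// filesAsSubLevels converts an L0 file count into an equivalent sub-level
// count, given how many files a scorer treats as equal to one sub-level.
func filesAsSubLevels(files int, filesPerSubLevel float64) float64 {
	return float64(files) / filesPerSubLevel
}

func main() {
	const files = 800 // added to L0 by the single large flush in the incident

	// Pebble: L0CompactionThreshold=2, L0CompactionFileThreshold=500,
	// i.e. 1 sub-level == 500 files.
	fmt.Println(filesAsSubLevels(files, 500)) // 1.6: a modest backlog in Pebble's terms

	// AC: 20 sub-levels == 1000 files, i.e. 1 sub-level == 50 files.
	fmt.Println(filesAsSubLevels(files, 50)) // 16: close to AC's 20-sub-level overload point
}
```

Roughly, the same 800 files look like ~1.6 sub-levels' worth of backlog to Pebble but ~16 sub-levels' worth to AC, which is why AC throttles regular work hard for a backlog Pebble considers modest.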

Jira issue: CRDB-46185

@sumeerbhola sumeerbhola added C-enhancement, A-admission-control, and T-admission-control labels Jan 8, 2025
@sumeerbhola sumeerbhola self-assigned this Jan 8, 2025
@andrewbaptist andrewbaptist added the O-perturbation Bugs found by the perturbation framework label Jan 8, 2025
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jan 8, 2025
The current value of 1000 was too inconsistent with Pebble's compaction
scoring, in that compaction scoring had 1 sub-level == 500 L0 files, and
admission control had 1 sub-level == 50 L0 files. The value is increased
to 4000, so 1 sub-level == 200 L0 files. There is a long code comment
elaborating on this.

Informs cockroachdb#138655

Epic: none

Release note (ops change): The default value of cluster setting
admission.l0_file_count_overload_threshold is changed to 4000.
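
For reference, the equivalence implied by the new default, assuming the 20-sub-level normalization described above stays unchanged (my arithmetic, not code from the PR):

```go
package main

import "fmt"

func main() {
	const (
		subLevelThreshold = 20   // AC's sub-level overload threshold (unchanged)
		fileThreshold     = 4000 // new default for admission.l0_file_count_overload_threshold
	)
	filesPerSubLevel := fileThreshold / subLevelThreshold
	fmt.Println(filesPerSubLevel)       // 200: 1 sub-level == 200 L0 files
	fmt.Println(800 / filesPerSubLevel) // 4: the incident's 800-file flush, in sub-level terms
}
```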
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jan 9, 2025
WAL failover can happen even when the throughput sustainable by the primary
is high (since we tune WAL failover to happen at a 100ms latency threshold).
So if the workload increase is resulting in more compactions, we should
incorporate that into the tokens given out by the ioLoadListener.

Informs cockroachdb#138655

Epic: none

Release note: None
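
A minimal sketch of the idea, assuming the ioLoadListener keeps a smoothed estimate of bytes compacted out of L0 per 15s interval; the type, field names, and smoothing below are hypothetical, not the code from this PR:

```go
package main

import "fmt"

// compactionEstimator is a hypothetical stand-in for the part of the
// ioLoadListener that tracks how many bytes compactions move out of L0.
type compactionEstimator struct {
	smoothedCompactionBytes int64 // estimate used when computing L0 byte tokens
}

// update adjusts the estimate for one 15s interval. During WAL failover the
// estimate is not smoothed as usual, but an observed increase in compacted
// bytes is still incorporated, so a compaction ramp-up yields more tokens
// instead of being ignored.
func (e *compactionEstimator) update(observedBytes int64, walFailoverActive bool) {
	if walFailoverActive {
		if observedBytes > e.smoothedCompactionBytes {
			e.smoothedCompactionBytes = observedBytes
		}
		return
	}
	// Outside failover: simple exponential smoothing (illustrative only).
	e.smoothedCompactionBytes = (e.smoothedCompactionBytes + observedBytes) / 2
}

func main() {
	e := &compactionEstimator{smoothedCompactionBytes: 100 << 20}
	e.update(400<<20, true)                      // WAL failover active, compactions ramping up
	fmt.Println(e.smoothedCompactionBytes >> 20) // 400 (MiB): the ramp-up is not discarded
}
```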
craig bot pushed a commit that referenced this issue Jan 9, 2025
138699: admission: increase file count based overload threshold r=aadityasondhi a=sumeerbhola

The current value of 1000 was too inconsistent with Pebble's compaction scoring, in that compaction scoring had 1 sub-level == 500 L0 files, and admission control had 1 sub-level == 50 L0 files. The value is increased to 4000, so 1 sub-level == 200 L0 files. There is a long code comment elaborating on this.

Informs #138655

Epic: none

Release note (ops change): The default value of cluster setting admission.l0_file_count_overload_threshold is changed to 4000.

Co-authored-by: sumeerbhola <[email protected]>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jan 10, 2025
WAL failover can happen even when the throughput sustainable by the primary
is high (since we tune WAL failover to happen at a 100ms latency threshold).
So if the workload increase is resulting in more compactions, we should
incorporate that into the tokens given out by the ioLoadListener.

Informs cockroachdb#138655

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Jan 10, 2025
138708: admission: incorporate higher compaction rate during WAL failover r=aadityasondhi a=sumeerbhola

WAL failover can happen even when the throughput sustainable by the primary
is high (since we tune WAL failover to happen at a 100ms latency threshold).
So if the workload increase is resulting in more compactions, we should
incorporate that into the tokens given out by the ioLoadListener.

Informs #138655

Epic: none

Release note: None


Co-authored-by: sumeerbhola <[email protected]>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jan 10, 2025
WAL failover can happen even when the throughput sustainable by the primary
is high (since we tune WAL failover to happen at a 100ms latency threshold).
So if the workload increase is resulting in more compactions, we should
incorporate that into the tokens given out by the ioLoadListener.

Informs cockroachdb#138655

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Jan 10, 2025
138640: kvserver: enable leader leases for TestFlowControlCrashedNodeV2 r=miraradeva a=miraradeva

In 2c89915, TestFlowControlCrashedNode and TestFlowControlCrashedNodeV2 were set up to run with leader leases. TestFlowControlCrashedNodeV2 soon became flaky (#136292) because after one of the two nodes was stopped, the other node (leader) could step down due to `CheckQuorum`. TestFlowControlCrashedNode flaked similarly under race.

This commit adds a third node to the tests, so the leader can retain leadership in the presence of a single crash. It also re-enables leader leases for TestFlowControlCrashedNodeV2.

Part of: #136806

Release note: None

138708: admission: incorporate higher compaction rate during WAL failover r=aadityasondhi a=sumeerbhola

WAL failover can happen even when the throughput sustainable by the primary
is high (since we tune WAL failover to happen at a 100ms latency threshold).
So if the workload increase is resulting in more compactions, we should
incorporate that into the tokens given out by the ioLoadListener.

Informs #138655

Epic: none

Release note: None


Co-authored-by: Mira Radeva <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>