`Timeline::set_state` is still racy #10297

arpad-m · 2025-01-07T15:35:28Z

The set_state function of Timeline forbids some state transitions like from Stopping -> Active, or Broken -> any other state. However,
if the function gets called concurrently, it can happen that both threads call current_state before the other thread had the opportunity to write. Then we might have forbidden state transitions.

Say a timeline is Loading, and thread A wants to set it to Stopping. Thread B wants to set it to Active. Then, if both threads read current_state() at the same time, they might both call send_replace. It can happen that thread A gets through first, so the timeline is first Stopping. Then, thread B changes the state to Active, causing timeline state to go from Stopping to Active, something that we don't want to allow.

Earlier issue: #2727

The text was updated successfully, but these errors were encountered:

## Problem Auto-offloading as requested by the compaction task is racy with unarchival, in that the compaction task might attempt to offload an unarchived timeline. By that point it will already have set the timeline to the `Stopping` state however, which makes it unusable for any purpose. For example: 1. compaction task decides to offload timeline 2. timeline gets unarchived 3. `offload_timeline` gets called by compaction task * sets timeline's state to `Stopping` * realizes that the timeline can't be unarchived, errors out 6. endpoint can't be started as the timeline is `Stopping` and thus 'can't be found'. A future iteration of the compaction task can't "heal" this state either as the timeline will still not be archived, same goes for other automatic stuff. The only way to heal this is a tenant detach+attach, or alternatively a pageserver restart. Furthermore, the compaction task is especially amenable for such races as it first stores `can_offload` into a variable, figures out whether compaction is needed (which takes some time), and only then does it attempt an offload operation: the time difference between "check" and "use" is non-trivially small. To make it even worse, we start the compaction task right after attach of a tenant, and it is a common pattern by pageserver users to attach a tenant to then immediately unarchive a timeline, so that an endpoint can be started. ## Solutions not adopted The simplest solution is to move the `can_offload` check to right before attempting of the offload. But this is not a good solution, as no lock is held between that check and timeline shutdown. So races would still be possible, just become less likely. I explored using the timeline state for this, as in adding an additional enum variant. But `Timeline::set_state` is racy (#10297). ## Adopted solution We use the lock on the timeline's upload queue as an arbiter: either unarchival gets to it first and sours the state for auto-offloading, or auto-offloading shuts it down, which stops any parallel unarchival in its tracks. The key part is not releasing the upload queue's lock between the check whether the timeline is archived or not, and shutting it down (the actual implementation only sets `shutting_down` but it has the same effect on `initialized_mut()` as a full shutdown). The rest of the patch is stuff that follows from this. We also move the part where we set the state to `Stopping` to after that arbiter has decided the fate of the timeline. For deletions, we do keep it inside `DeleteTimelineFlow::prepare` however, so that it is called with all of the the timelines locks held that the function allocates (timelines lock most importantly). This is only a precautionary measure however, as I didn't want to analyze deletion related code for possible races. ## Future changes It might make sense to move `can_offload` to right before the offload attempt. Maybe some other properties might have changed as well. Although this will not be perfect either as no lock is held. I want to keep it out of this change to emphasize that this move wasn't the main reason we are race free now. Fixes #10220

arpad-m added the c/storage/pageserver Component: storage: pageserver label Jan 7, 2025

This was referenced Jan 7, 2025

pageserver: compaction racing with archival config can leave Timeline stuck in Stopping. #10220

Closed

Handle race between auto-offload and unarchival #10305

Merged

arpad-m added the t/bug Issue Type: Bug label Jan 9, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Timeline::set_state` is still racy #10297

`Timeline::set_state` is still racy #10297

arpad-m commented Jan 7, 2025

Timeline::set_state is still racy #10297

Timeline::set_state is still racy #10297

Comments

arpad-m commented Jan 7, 2025

`Timeline::set_state` is still racy #10297

`Timeline::set_state` is still racy #10297