Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Redefine the computation of segment replication metrics in Node Stats #16801

Open
vinaykpud opened this issue Dec 7, 2024 · 4 comments · May be fixed by #17055
Open

[Feature Request] Redefine the computation of segment replication metrics in Node Stats #16801

vinaykpud opened this issue Dec 7, 2024 · 4 comments · May be fixed by #17055
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance

Comments

@vinaykpud
Copy link
Contributor

vinaykpud commented Dec 7, 2024

Is your feature request related to a problem? Please describe

Currently, during segment replication, primary shards collect stats from each of their replicas. However, search replicas will not report stats to the primary shard.
Node stats return the following metrics, which are based on stats collected by the primary from its replicas:

segments.segment_replication.max_bytes_behind
segments.segment_replication.total_bytes_behind
segments.segment_replication.max_replication_lag

We want to

  1. Unify the way stats is calculated and reported for replicas and search replica
  2. Redefine how lag is computed

Describe the solution you'd like

The idea is to stop comparing replicas with the primary, as is done currently, and instead report stats for each replica based on its ongoing replication events, similar to cat recovery. This approach decouples the primary from stats calculation, and the primary shard will no longer need to maintain segment replication checkpoint information for all its replicas in the ReplicationTracker.

We can generate metrics to answer the following questions:

  1. How many bytes are left to be downloaded to the replica?
  2. How long is the ongoing replication taking?

This will decouple the behavior so that we can use point in time stats of each shard for the node stats calculation for segment replication.

Please note that we are currently modifying how search replicas compute and return stats in cat segrep. Details are discussed here: #15534.

Related component

Search:Performance

@vinaykpud vinaykpud added enhancement Enhancement or improvement to existing feature or request untriaged labels Dec 7, 2024
@sachinpkale
Copy link
Member

search replicas will not report stats to the primary shard.

I assume "search replicas" term is from Reader-Writer separation proposal #15237 and it would be using pull based replication. Is that correct?

How many bytes are left to be downloaded to the replica?

If replica is overloaded and download is slow, search replica would be downloading data from metadata that was polled, let's say 10 mins back. Unless it polls the latest metadata again, will we be able to understand the bytes left to be downloaded?

@gbbafna
Copy link
Collaborator

gbbafna commented Dec 10, 2024

This approach decouples the primary from stats calculation, and the primary shard will no longer need to maintain segment replication checkpoint information for all its replicas in the ReplicationTracker.

But they will still need this information for doing write rejections due to searchers lagging .

I like this method as it will help in pin pointing exact node/shard copy which is having issue, as opposed to a convoluted approach today.

@mch2
Copy link
Member

mch2 commented Dec 12, 2024

I think we can do something like this:

Each replica regardless of type tracks the latest received checkpoint that includes a timestamp of when the primary (writer) refreshed on it. Search replica would have to poll for the latest cp from the store while writer replicas get this already when checkpoints are published. @sachinpkale is correct, the current pull based replication skips the metadata fetch if the search replica is already running replication, so we'd have to add a separate process to fetch the cp or allow the checkpoint step to continue.

Node stats should report the state of the replicas on the node, not the state of the replicas of the primaries on the node (as is today). Each shard then compares its currently served checkpoint to the "latest received". If the sis version is different ie the shard is behind the store, lag becomes the diff between current time & the latest received cp's timestamp which is roughly the replication lag - upload time. Bytes behind is already computed by the diff of two checkpoints as metadata is included. Checkpoints behind could be the diff in sis version, although this metric is pretty useless imo.

Backpressure:
We update the existing async task to enforce pressure by instead fetching the checkpoints of each replica and compute these stats on the fly rather than precomputed metrics on the primary. Search replica state would not be factored in. We will need a separate mechanism to track search replica freshness & fail them accordingly.

Ultimately we are giving up some precision on lag computation for simplicity, which I think is well worth it.

@dblock
Copy link
Member

dblock commented Jan 6, 2025

[Catch All Triage - 1, 2, 3, 4, 5, 6]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance
Projects
Status: Todo
Development

Successfully merging a pull request may close this issue.

5 participants