Optimize replication latency by fix the misjudgment of the WAL iterator status(#13260) #13261

Open
wants to merge 2 commits into
base: main

@Paragrf Paragrf commented Dec 31, 2024

Description has been recorded in #13260

There are two scenarios in which the iterator incorrectly concludes that a WAL file has been fully traversed before reaching its actual end, and returns "TryAgain". When traversal is interrupted partway through a WAL file in this way, the subsequent call to SeekToStartSequence can incur significant delay. Our tracking indicates that in such cases SeekToStartSequence takes between 80 and 200 milliseconds, and RestrictedRead may be executed up to 100,000 times.

Case 1:

The condition current_last_seq_ == versions_->LastSequence() is checked twice. External writes that land between the two checks can increase LastSequence, so the first check succeeds and the second fails.

Figure 1: double check in NextImpl

Figure 2: first check in RestrictedRead
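The race in Case 1 can be modeled outside RocksDB. The sketch below is a simplified stand-in, not the PR's actual change: `TailStatus`, `RacyDecide`, `SnapshotDecide`, and the `on_between_checks` hook are hypothetical names introduced for illustration. It shows how loading the sequence twice lets a concurrent write turn a "caught up" result into "TryAgain", and how deciding from a single snapshot avoids that.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of the iterator's end-of-log decision; not RocksDB code.
enum class TailStatus { kMoreToRead, kCaughtUp, kTryAgain };

// Racy shape: the last sequence is loaded twice, so a write that lands
// between the two loads makes the checks disagree.
// `on_between_checks` is a hook that simulates a concurrent writer.
template <typename Hook>
TailStatus RacyDecide(uint64_t current_last_seq,
                      const std::atomic<uint64_t>& last_sequence,
                      Hook on_between_checks) {
  // First check: is there anything new to read?
  if (current_last_seq != last_sequence.load()) {
    return TailStatus::kMoreToRead;
  }
  on_between_checks();
  // Second check: re-loads the sequence, which may have moved.
  if (current_last_seq == last_sequence.load()) {
    return TailStatus::kCaughtUp;
  }
  return TailStatus::kTryAgain;
}

// Snapshot shape: load once and make both decisions from the same value.
template <typename Hook>
TailStatus SnapshotDecide(uint64_t current_last_seq,
                          const std::atomic<uint64_t>& last_sequence,
                          Hook on_between_checks) {
  const uint64_t snapshot = last_sequence.load();
  if (current_last_seq != snapshot) {
    return TailStatus::kMoreToRead;
  }
  on_between_checks();  // A write here no longer changes the outcome.
  return TailStatus::kCaughtUp;
}
```

With the snapshot shape, a write that arrives after the load is simply picked up on the next call, so the iterator keeps its position in the file instead of reporting TryAgain.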

After fixing this issue, replication delay dropped significantly, though occasional delay spikes still occur.

Figure 3: replication Pmax (red line: control group; orange line: experimental group)

Case 2:

current_log_reader_->ReadRecord(record, &scratch_) may return false via the kEof branch. In certain scenarios, hitting EOF does not mean that the file has truly reached its end. We observed this behavior through custom logging, and it also explains the spikes seen in the experimental group in Figure 3.

Although we have not yet pinpointed the specific scenarios that lead to this false EOF, we can prevent the misjudgment by verifying whether a new live WAL file has actually been generated. With this check added, the issue is completely resolved.
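As a rough illustration of the Case 2 check, the sketch below treats EOF as final only when a newer live WAL exists. It is a standalone model under stated assumptions, not the code in this PR: `EofAction`, `OnReaderEof`, and `live_wal_numbers` are hypothetical names, and the sketch assumes WAL file numbers increase monotonically so that "newer" means "larger number".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical decision helper; names are illustrative, not RocksDB APIs.
enum class EofAction { kAdvanceToNextFile, kStayOnCurrentFile };

// `live_wal_numbers` stands in for whatever list of live WAL file numbers
// the caller can obtain. Assumption: a larger number means a newer WAL.
inline EofAction OnReaderEof(uint64_t current_wal_number,
                             const std::vector<uint64_t>& live_wal_numbers) {
  const bool newer_exists = std::any_of(
      live_wal_numbers.begin(), live_wal_numbers.end(),
      [current_wal_number](uint64_t n) { return n > current_wal_number; });
  // EOF is trusted only if the writer has already moved on to a newer WAL.
  // Otherwise the current file may still grow, so the reader keeps its
  // position and retries instead of giving up on the file.
  return newer_exists ? EofAction::kAdvanceToNextFile
                      : EofAction::kStayOnCurrentFile;
}
```

Staying on the current file on a suspected false EOF avoids the expensive re-seek described at the top of this description.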

Figure 4: replication Pmax (red line: control group; orange line: experimental group)

@Paragrf changed the title from "Fix the misjudgment of the WAL iterator status(#13260)" to "Optimize replication latency by fix the misjudgment of the WAL iterator status(#13260)" on Jan 5, 2025