Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move clientCron onto a separate timer #1387

Merged
merged 18 commits into from
Feb 4, 2025
Merged

Conversation

JimB123
Copy link
Contributor

@JimB123 JimB123 commented Dec 3, 2024

The serverCron() function contains a variety of maintenance functions and is set up as a timer job, configured to run at a certain rate (hz). The default rate is 10hz (every 100ms).

One of the things that serverCron() does is to perform maintenance functions on connected clients. Since the number of clients is variable, and can be very large, this could cause latency spikes when the 100ms serverCron() task gets invoked. To combat those latency spikes, a feature called "dynamic-hz" was introduced. This feature will run serverCron() more often, if there are more clients. The clients get processed up to 200 at a time. The delay for serverCron() is shortened with the goal of processing all of the clients every second.

The result of this is that some of the other (non-client) maintenance functions also get (unnecessarily) run more often. Like cronUpdateMemoryStats() and databasesCron(). Logically, it doesn't make sense to run these functions more often, just because we happen to have more clients attached.

This PR separates client activities onto a separate, variable, timer. The "dynamic-hz" feature is eliminated. Now, serverCron will run at a standard configured rate. The separate clients cron will automatically adjust based on the number of clients. This has the added benefit that often, the 2 crons will fire during separate event loop invocations and will usually avoid the combined latency impact of doing both maintenance activities together.

The new timer follows the same rules which were established with the dynamic HZ feature.

  • The goal is to process all of the clients once per second
  • We never want to process more than 200 clients in a single invocation (MAX_CLIENTS_PER_CLOCK_TICK)
  • We always process at least 5 clients at a time (CLIENTS_CRON_MIN_ITERATIONS)
  • The minimum rate is determined by HZ

The delay (ms) for the new timer is also more precise, computing the number of milliseconds needed to achieve the goal of reaching all of the clients every second. The old dynamic-hz feature just performs a doubling of the HZ until the clients processing rate is achieved (i.e. delays of 100ms, 50ms, 25ms, 12ms...)

Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems right to me. Are there any possible unintended behavior change due to decoupling this from serverCron? Would be good to callout.

src/server.c Outdated Show resolved Hide resolved
Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me. The code looks correct to me

src/server.c Outdated Show resolved Hide resolved
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Dec 5, 2024
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
Copy link

codecov bot commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 90.62500% with 3 lines in your changes missing coverage. Please review.

Project coverage is 70.87%. Comparing base (9f8b174) to head (94e6c58).
Report is 141 commits behind head on unstable.

Files with missing lines Patch % Lines
src/server.c 89.65% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1387      +/-   ##
============================================
+ Coverage     70.68%   70.87%   +0.19%     
============================================
  Files           118      121       +3     
  Lines         63550    65180    +1630     
============================================
+ Hits          44919    46197    +1278     
- Misses        18631    18983     +352     
Files with missing lines Coverage Δ
src/config.c 78.38% <100.00%> (+0.74%) ⬆️
src/server.h 100.00% <ø> (ø)
src/server.c 87.66% <89.65%> (+0.27%) ⬆️

... and 61 files with indirect coverage changes

src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.h Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
JimB123 and others added 11 commits January 24, 2025 17:54
Co-authored-by: Binbin <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I'm dubious about the new info field, but not sure I feel that strongly about it.

@madolson madolson added the needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. label Feb 3, 2025
@madolson madolson merged commit c75e866 into valkey-io:unstable Feb 4, 2025
50 checks passed
madolson pushed a commit to valkey-io/valkey-doc that referenced this pull request Feb 4, 2025
Add info field created by valkey-io/valkey#1387

Signed-off-by: Jim Brunner <[email protected]>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Feb 5, 2025
In valkey-io#1387, we move clientCron onto a separate timer, in this new timer,
we should check if pause_cron is on to skip the cron.

It causes the following tests to fail, they rely on debug pause-cron.
```
[err]: query buffer resized correctly in tests/unit/querybuf.tcl
[err]: query buffer resized correctly when not idle in tests/unit/querybuf.tcl
```

Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit that referenced this pull request Feb 5, 2025
In #1387, we move clientCron onto a separate timer, in this new timer,
we should check if pause_cron is on to skip the cron.

It causes the following tests to fail, they rely on debug pause-cron.
```
[err]: query buffer resized correctly in tests/unit/querybuf.tcl
[err]: query buffer resized correctly when not idle in tests/unit/querybuf.tcl
```

Signed-off-by: Binbin <[email protected]>
madolson added a commit that referenced this pull request Feb 7, 2025
After introducing, #1387, we saw
a significant increase in other spurious wakeups because of the client
cron that was added, which affected the "instantaneous eventloops per
second" metric (showing it higher than before". All I did was increase
the server hz to get more samples and increase the target value. This
seems to work more consistently now. I also removed retries since the
instantaneous value isn't dependent on number of retries.

Additionally, `assert_lessthan $value [expr $retries*22000]` makes no
sense to me. The value is usually around 30-100us, since all it's doing
is waking up and running a little bit of cron. The retries doesn't make
much sense, since the retries don't impact the instantaneous value. I
just removed the retries and left the 22k value for now, maybe valgrind
is slow.

---------

Signed-off-by: Madelyn Olson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. release-notes This issue should get a line item in the release notes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants