move clientCron onto a separate timer #1387
Seems right to me. Are there any possible unintended behavior changes due to decoupling this from serverCron? It would be good to call them out.
Makes sense to me. The code looks correct.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

Coverage Diff

| | unstable | #1387 | +/- |
|---|---|---|---|
| Coverage | 70.68% | 70.87% | +0.19% |
| Files | 118 | 121 | +3 |
| Lines | 63550 | 65180 | +1630 |
| Hits | 44919 | 46197 | +1278 |
| Misses | 18631 | 18983 | +352 |
Signed-off-by: Jim Brunner <[email protected]>
Co-authored-by: Binbin <[email protected]> Signed-off-by: Jim Brunner <[email protected]>
I suppose I'm dubious about the new info field, but not sure I feel that strongly about it.
Add info field created by valkey-io/valkey#1387 Signed-off-by: Jim Brunner <[email protected]>
In valkey-io#1387, we moved clientCron onto a separate timer. The new timer should check whether pause_cron is on and, if so, skip the cron. Without this check, the following tests fail, because they rely on debug pause-cron. ``` [err]: query buffer resized correctly in tests/unit/querybuf.tcl [err]: query buffer resized correctly when not idle in tests/unit/querybuf.tcl ``` Signed-off-by: Binbin <[email protected]>
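The pause check described in that commit message can be sketched as follows. This is a minimal illustration, not the actual Valkey code: `struct cron_state`, `clients_time_proc`, and the `runs` counter are hypothetical names, and the only behavior assumed is that a timer callback returns the number of milliseconds until its next invocation.

```c
/* Hypothetical stand-in for the server state touched by the client timer. */
struct cron_state {
    int pause_cron; /* non-zero while "debug pause-cron" is in effect */
    int runs;       /* how many times client maintenance actually ran */
};

/* Sketch of a timer callback: when the cron is paused, skip the client
 * maintenance work but still return the delay so the timer stays armed. */
int clients_time_proc(struct cron_state *s, int delay_ms) {
    if (s->pause_cron) return delay_ms; /* paused: do nothing this tick */
    s->runs++;                          /* placeholder for the real client work */
    return delay_ms;
}
```

With the early return in place, a test that pauses the cron sees no client maintenance (such as query buffer resizing) until the pause is lifted.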
After introducing #1387, we saw a significant increase in spurious wakeups because of the client cron that was added, which affected the "instantaneous eventloops per second" metric (showing it higher than before). All I did was increase the server hz to get more samples and increase the target value. This seems to work more consistently now. I also removed retries, since the instantaneous value isn't dependent on the number of retries. Additionally, `assert_lessthan $value [expr $retries*22000]` makes no sense to me. The value is usually around 30-100us, since all it's doing is waking up and running a little bit of cron. The retries don't make much sense, since they don't impact the instantaneous value. I just removed the retries and left the 22k value for now; maybe valgrind is slow. --------- Signed-off-by: Madelyn Olson <[email protected]>
The `serverCron()` function contains a variety of maintenance functions and is set up as a timer job, configured to run at a certain rate (hz). The default rate is 10hz (every 100ms).

One of the things that `serverCron()` does is to perform maintenance functions on connected clients. Since the number of clients is variable, and can be very large, this could cause latency spikes when the 100ms `serverCron()` task gets invoked. To combat those latency spikes, a feature called "dynamic-hz" was introduced. This feature will run `serverCron()` more often, if there are more clients. The clients get processed up to 200 at a time. The delay for `serverCron()` is shortened with the goal of processing all of the clients every second.

The result of this is that some of the other (non-client) maintenance functions also get (unnecessarily) run more often. Like `cronUpdateMemoryStats()` and `databasesCron()`. Logically, it doesn't make sense to run these functions more often, just because we happen to have more clients attached.

This PR separates client activities onto a separate, variable, timer. The "dynamic-hz" feature is eliminated. Now, `serverCron` will run at a standard configured rate. The separate clients cron will automatically adjust based on the number of clients. This has the added benefit that often, the 2 crons will fire during separate event loop invocations and will usually avoid the combined latency impact of doing both maintenance activities together.

The new timer follows the same rules which were established with the dynamic HZ feature.

- (`MAX_CLIENTS_PER_CLOCK_TICK`)
- (`CLIENTS_CRON_MIN_ITERATIONS`)

The delay (ms) for the new timer is also more precise, computing the number of milliseconds needed to achieve the goal of reaching all of the clients every second. The old dynamic-hz feature just performs a doubling of the HZ until the clients processing rate is achieved (i.e. delays of 100ms, 50ms, 25ms, 12ms...)
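
To make the difference between the two delay calculations concrete, here is a rough sketch rather than the code from this PR. The function names are hypothetical, the 200-client batch size comes from the description above, and the cap of 500 hz (`SKETCH_MAX_HZ`) and the floor at the configured hz are assumptions. The `CLIENTS_CRON_MIN_ITERATIONS` rule is not modeled.

```c
#define MAX_CLIENTS_PER_CLOCK_TICK 200 /* batch size stated in the description */
#define SKETCH_MAX_HZ 500              /* assumed upper bound on hz */

/* Old behavior (sketch): double hz until one batch per tick is enough to
 * cover all clients in a second, then derive the delay from hz. */
int dynamic_hz_delay_ms(int config_hz, long numclients) {
    int hz = config_hz;
    while (numclients / hz > MAX_CLIENTS_PER_CLOCK_TICK) {
        hz *= 2;
        if (hz > SKETCH_MAX_HZ) {
            hz = SKETCH_MAX_HZ;
            break;
        }
    }
    return 1000 / hz;
}

/* New behavior (sketch): compute how many batches per second are needed
 * and divide one second by that count directly. */
int clients_cron_delay_ms(int config_hz, long numclients) {
    int max_delay = 1000 / config_hz; /* assumed: never slower than config hz */
    if (numclients <= 0) return max_delay;
    /* Ceiling division: ticks needed per second at one full batch per tick. */
    long ticks = (numclients + MAX_CLIENTS_PER_CLOCK_TICK - 1) / MAX_CLIENTS_PER_CLOCK_TICK;
    long delay = 1000 / ticks;
    if (delay > max_delay) delay = max_delay;
    if (delay < 1) delay = 1; /* a timer delay of 0ms is not meaningful here */
    return (int)delay;
}
```

Under these assumptions, with a configured hz of 10 and 3000 clients, the doubling approach jumps from 100ms straight to 50ms, while the direct computation needs 15 batches per second and settles on 66ms.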