scx_prev: a simple scheduler tested on OLTP workloads #1275

Merged: 1 commit, Jan 30, 2025

danieljordan10

A FIFO-only variation on scx_simple whose CPU selection prioritizes an idle previous CPU over a fully idle core (scx_simple and scx_rusty do the reverse and prefer a fully idle core).

scx_prev outperforms a few other schedulers on OLTP workloads run on systems with relatively flat topology (i.e. non-NUMA, single LLC) by changing CPU selection as above and by taking advantage of the more aggressive work conservation (i.e. idle balancing) that comes with sched_ext by default.
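
To make the difference in selection order concrete, here is a minimal user-space C sketch. It is not the BPF code from this PR. It assumes a hypothetical 8-CPU machine where CPUs 2k and 2k+1 are SMT siblings; `pick_prev_first`, `pick_core_first`, and the `idle` array are illustrative names, and the fallback steps after the first check are an assumption rather than a description of what scx_prev or scx_simple actually do.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8	/* hypothetical machine size */

/* Assumption: CPUs 2k and 2k+1 are SMT siblings on the same core. */
int sibling(int cpu)
{
	return cpu ^ 1;
}

/* True when both hardware threads of cpu's core are idle. */
bool core_fully_idle(const bool *idle, int cpu)
{
	return idle[cpu] && idle[sibling(cpu)];
}

/* Lowest-numbered CPU whose whole core is idle, or -1 if there is none. */
int first_fully_idle(const bool *idle)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (core_fully_idle(idle, cpu))
			return cpu;
	return -1;
}

/* Lowest-numbered idle CPU, or -1 if every CPU is busy. */
int first_idle(const bool *idle)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu])
			return cpu;
	return -1;
}

/* prev-first order: an idle previous CPU wins over any fully idle core. */
int pick_prev_first(const bool *idle, int prev_cpu)
{
	int cpu;

	assert(prev_cpu >= 0 && prev_cpu < NR_CPUS);
	if (idle[prev_cpu])
		return prev_cpu;
	cpu = first_fully_idle(idle);
	if (cpu >= 0)
		return cpu;
	cpu = first_idle(idle);
	return cpu >= 0 ? cpu : prev_cpu;
}

/* core-first order: a fully idle core wins even when prev_cpu is idle. */
int pick_core_first(const bool *idle, int prev_cpu)
{
	int cpu;

	assert(prev_cpu >= 0 && prev_cpu < NR_CPUS);
	if (core_fully_idle(idle, prev_cpu))
		return prev_cpu;
	cpu = first_fully_idle(idle);
	if (cpu >= 0)
		return cpu;
	if (idle[prev_cpu])
		return prev_cpu;
	cpu = first_idle(idle);
	return cpu >= 0 ? cpu : prev_cpu;
}
```

With CPU 2 idle, its sibling CPU 3 busy, and core 4/5 fully idle, `pick_prev_first(idle, 2)` keeps the task on CPU 2, while `pick_core_first(idle, 2)` moves it to CPU 4. The first choice keeps the task close to its warm cache at the cost of sharing a core with a busy sibling.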

It's far from being a full-fledged scheduler, but it demonstrates how a small change to an existing scheduler can improve performance in a real application.

Notes:

  • AMD EPYC 7J13 (16-CPU VM) server running v6.12-based UEK-next kernel, scx (688bffc "Merge pull request #1192 from devnexen/code_simpl3"), and MySQL Community Edition 8.4[0]
  • AMD EPYC 7551 (128-CPU BM) client running BMK[1] (a sysbench-based BenchMark Kit)
  • Each data point in the table below represents the average of ten one-minute runs done after a three-minute warmup. The server is rebooted between schedulers.
  • "cli" means the number of database clients.
  • Each %diff column is relative to eevdf.
Representative BMK testcase: sb11-OLTP_RO_10M_8tab-uniform-ps-notrx.sh

cli    eevdf (std%)    rusty (std%)     %diff    simple (std%)     %diff     prev (std%)     %diff
---    ------------    ------------     -----    -------------     -----     -----------     -----

throughput
16      4140 (  1%)     4224 (  1%)    (  2%)      4276 (  2%)    (  3%)     4263 (  1%)    (  3%)
32      7382 (  1%)     7259 (  1%)    ( -2%)      7314 (  1%)    ( -1%)     7919 (  1%)    (  7%)
48      9015 (  0%)     9644 (  0%)    (  7%)     10055 (  0%)    ( 12%)    10411 (  1%)    ( 15%)
64      9765 (  1%)     9601 (  0%)    ( -2%)     10214 (  0%)    (  5%)    10481 (  0%)    (  7%)

average latency
16         4 (  1%)        4 (  1%)    ( -2%)         4 (  2%)    ( -3%)        4 (  1%)    ( -3%)
32         4 (  1%)        4 (  1%)    (  2%)         4 (  1%)    (  1%)        4 (  1%)    ( -7%)
48         5 (  0%)        5 (  0%)    ( -7%)         5 (  0%)    (-10%)        5 (  1%)    (-13%)
64         7 (  1%)        7 (  0%)    (  2%)         6 (  0%)    ( -4%)        6 (  0%)    ( -7%)

95p latency
16         4 (  3%)        4 (  2%)    ( -4%)         4 (  4%)    ( -1%)        4 (  4%)    ( -7%)
32         5 (  2%)        5 (  1%)    (  1%)         5 (  2%)    (  1%)        4 (  2%)    (-11%)
48         7 (  1%)        6 (  1%)    (-16%)         5 (  1%)    (-24%)        5 (  1%)    (-26%)
64         9 (  3%)        8 (  0%)    (-12%)         7 (  0%)    (-26%)        7 (  1%)    (-26%)

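In the read-only workload, prev matches or beats eevdf on throughput and latency at every client count. It also matches or beats rusty and simple everywhere except throughput at 16 clients, where simple is marginally ahead (4276 vs. 4263, within the reported std%).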
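
For anyone re-deriving the %diff columns, the throughput figures can be checked with a small C sketch like the one below. The helper names `pct_diff` and `round_pct` are hypothetical, and rounding to the nearest integer is an assumption about how the table was produced. The latency %diff values cannot be recomputed this way, since the latency cells shown are already rounded.

```c
#include <assert.h>

/* Percent difference of `value` relative to `baseline` (here: eevdf). */
double pct_diff(double value, double baseline)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}

/* Round to the nearest integer, halves away from zero, without libm. */
int round_pct(double pct)
{
	return (int)(pct < 0 ? pct - 0.5 : pct + 0.5);
}
```

For example, `round_pct(pct_diff(7919, 7382))` gives 7, matching the prev throughput cell at 32 clients, and `round_pct(pct_diff(7259, 7382))` gives -2, matching rusty at the same client count.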

[0] https://github.com/mysql/mysql-server/tree/8.4
[1] http://dimitrik.free.fr/blog/posts/mysql-perf-bmk-kit.html

Signed-off-by: Daniel Jordan <[email protected]>
@etsal etsal self-requested a review January 30, 2025 16:28

@etsal etsal left a comment

Looks great! If you'd like to elaborate on the results and your use case, we have a Slack and meet every Tuesday at 11AM EST.

@etsal etsal added this pull request to the merge queue Jan 30, 2025
{
	s32 cpu;

	if (p->nr_cpus_allowed == 1) {

I think this condition is always false: ops.select_cpu() is always skipped if the task can only run on one CPU.

Contributor Author

That's true, thanks. I see that ->select_cpu() is always skipped in the in-kernel scheduler core when nr_cpus_allowed == 1. I'll send a follow-up deleting the unused branch.
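
As a rough illustration of why the branch is dead, here is a toy user-space C model of the behavior described in this thread. It is a sketch under stated assumptions, not kernel code: `toy_task`, `toy_select_cpu`, `toy_core_pick_cpu`, and the `only_cpu` field are all hypothetical names made up for the example.

```c
#include <assert.h>

/* Hypothetical stand-in for the fields of a task that matter here. */
struct toy_task {
	int nr_cpus_allowed;	/* how many CPUs the task may run on */
	int only_cpu;		/* the single allowed CPU when pinned */
};

/* Counts how often the scheduler's callback actually runs. */
int select_cpu_calls;

/* Toy stand-in for a scheduler's select_cpu callback. */
int toy_select_cpu(const struct toy_task *p, int prev_cpu)
{
	assert(p);
	select_cpu_calls++;
	/* A check for p->nr_cpus_allowed == 1 here could never be true. */
	return prev_cpu;
}

/*
 * Toy stand-in for the core's wakeup path: the callback is consulted
 * only when the task has more than one CPU to choose from.
 */
int toy_core_pick_cpu(const struct toy_task *p, int prev_cpu)
{
	assert(p);
	if (p->nr_cpus_allowed > 1)
		return toy_select_cpu(p, prev_cpu);
	return p->only_cpu;
}
```

In this model, waking a task with `nr_cpus_allowed == 1` returns its one allowed CPU and leaves `select_cpu_calls` unchanged, so a check for that case inside the callback is unreachable.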

Merged via the queue into sched-ext:main with commit 85f819d Jan 30, 2025
23 checks passed
danieljordan10 added a commit to danieljordan10/scx that referenced this pull request Jan 30, 2025
As Andrea points out[0], select_cpu() is never called for such
tasks, so this branch is dead code.  Remove it.

[0] sched-ext#1275

Signed-off-by: Daniel Jordan <[email protected]>