Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add /benchmark github command to comparison benchmark between base and pr commit #9461

Merged
merged 6 commits into from
Mar 13, 2024

Conversation

gruuya
Copy link
Contributor

@gruuya gruuya commented Mar 5, 2024

Which issue does this PR close?

Partially progresses #5504.

Rationale for this change

Enable automated comparative PR benchmarking, and thus make performance regression less likely to be introduced into the code.

What changes are included in this PR?

Add two new workflows:

  1. Benchmarks, that is triggered by a PR comment /benchmark, and which runs the standard repo benchmarks against the base and head commits.
  2. A generic PR Comment workflow, that in this case should post a message along the lines of

Benchmark results

Benchmarks comparing e5404a1 and 3bf178a
Comparing main-e5404a1 and ci-benches-3bf178a
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main-e5404a1 ┃ ci-benches-3bf178a ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │     445.21ms │          1446.96ms │  3.25x slower │
│ QQuery 2     │      60.82ms │          1062.72ms │ 17.47x slower │
│ QQuery 3     │     147.76ms │          1153.42ms │  7.81x slower │
│ QQuery 4     │      89.77ms │          1094.23ms │ 12.19x slower │
│ QQuery 5     │     208.05ms │          1215.64ms │  5.84x slower │
│ QQuery 6     │     109.18ms │           109.30ms │     no change │
│ QQuery 7     │     295.69ms │          1312.57ms │  4.44x slower │
│ QQuery 8     │     200.32ms │          1207.25ms │  6.03x slower │
│ QQuery 9     │     304.16ms │          1316.10ms │  4.33x slower │
│ QQuery 10    │     244.40ms │          1254.85ms │  5.13x slower │
│ QQuery 11    │      66.08ms │          1068.05ms │ 16.16x slower │
│ QQuery 12    │     127.32ms │          1132.51ms │  8.89x slower │
│ QQuery 13    │     179.09ms │          1191.68ms │  6.65x slower │
│ QQuery 14    │     129.67ms │           132.54ms │     no change │
│ QQuery 15    │     198.01ms │          1201.44ms │  6.07x slower │
│ QQuery 16    │      53.18ms │          1059.45ms │ 19.92x slower │
│ QQuery 17    │     323.30ms │           340.44ms │  1.05x slower │
│ QQuery 18    │     465.16ms │          1464.97ms │  3.15x slower │
│ QQuery 19    │     235.39ms │           235.56ms │     no change │
│ QQuery 20    │     193.32ms │          1214.60ms │  6.28x slower │
│ QQuery 21    │     458.16ms │          1349.42ms │  2.95x slower │
│ QQuery 22    │      54.80ms │          1059.73ms │ 19.34x slower │
└──────────────┴──────────────┴────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                 ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main-e5404a1)         │  4588.85ms │
│ Total Time (ci-benches-3bf178a)   │ 22623.43ms │
│ Average Time (main-e5404a1)       │   208.58ms │
│ Average Time (ci-benches-3bf178a) │  1028.34ms │
│ Queries Faster                    │          0 │
│ Queries Slower                    │         19 │
│ Queries with No Change            │          3 │
└───────────────────────────────────┴────────────┘

Are these changes tested?

Sort of.

There is a big catch-22 here at play:

  • The HEAD~1 commit relies on a pull_request event to run the benchmarks, as is officially recommended. The workflow fails just prior to posting a comment because in the case of pull_request events there are insufficient permission to write on the PR (trying to override the permissions of a job itself doesn't work).
  • Workaround 1 is to execute a workflow_call workflow, but the problem there is that this type of workflow inherits permissions from the caller workflow so again this isn't sufficient.
  • Workaround 2 is to use the pull_request_target which does have sufficient permissions and call a workflow_call workflow, but suffers from another problem, namely it must be present on the default branch to be run, so in this case it won't be executed until the PR is merged..
  • Workaround 3 is to rely on the workflow_run event that gets triggered by the benchmark workflow, but the same problem appears here (can't run until merged).

In the end I decided to go with a issue_comment-based (covers pull requests too) benchmarking workflow (which also needs to be on the main branch to be run, and then make that trigger the PR commenting workflow via a workflow_run event. I figured since there has to be a two-step motion to test this out anyway we might as well try not to spam and add noise between steps (plus I envision the PR benchmarking to be triggered by a comment anyway).

Are there any user-facing changes?

None (there will be dev-facing changes with the follow-up).

@gruuya gruuya force-pushed the ci-benches branch 30 times, most recently from b92732e to e371175 Compare March 6, 2024 12:23
@gruuya gruuya force-pushed the ci-benches branch 3 times, most recently from 522d19d to be9b644 Compare March 7, 2024 07:30
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @gruuya -- I think the basic pattern of this PR looks good to me. I wonder how we could test it... Maybe I could merge it to my fork and run it there 🤔

The basic idea looks great -- thank you. The major concern I have is that this PR seems to run the benchmark on github runners, as I understand it

The downside of using github runners is that I think they do not have stable base performance. I think they are shared and they also may vary from one run to the next (e.g. different processor). If the idea is to benchmark just the code change, keeping the runtime environment the same I think is important. For example if the tests report that performance slows down on a PR but the problem really was that the github runner was overloaded, that will be super confusing

I think I had in my mind that we would have a dedicated benchmarking machine / VM running somewhere and run the benchmarks on that VM. I am pretty sure InfluxData would be willing to cover the cost of this VM (or we can make a shared collection)

The beneift of this approach would be that the benchmark enviroment would be more controlled (the only change would be the software change), though it has monetary cost as well won't scale up/down with our jobs

@gruuya
Copy link
Contributor Author

gruuya commented Mar 11, 2024

The major concern I have is that this PR seems to run the benchmark on github runners, as I understand it

True, that is correct. My assumption was that any instabilities in the base performance would not vary greatly during a single run as both benchmarks are run within the same job in a relatively short time interval, but I guess this is not a given.

In addition, for this type of benchmarking (PR-vs-main) we're only interested in relative comparisons, and so the longitudinal variance component, which is undoubtedly large wouldn't come into play (unlike for the tracking of main perf across time).

That said the present workflows should be easily extendable to use self-hosted runners once those become available I believe. This would also bring additional benefits, such as shorter benchmark runs, both in terms of persisting the test data (e.g. downloading hits.parquet takes about ~12 minutes in CI), as well as the runs themselves (means we could use beefier SFs, i.e. 10 or even 100).

I wonder how we could test it... Maybe I could merge it to my fork and run it there 🤔

Oh that's a good idea, I believe I can test it out on our fork and provide the details here, thanks!

@gruuya
Copy link
Contributor Author

gruuya commented Mar 11, 2024

Ok I went ahead and tested it on our fork: splitgraph#1

Here it is catching a deliberate regression I introduced: splitgraph#1 (comment)

Here it is after removing the regression (and polishing the comment message a bit): splitgraph#1 (comment)

There are a couple of false positives in that last one, though it is running SF1, so with shorter run times it is more sensitive to any variations (though I believe such variations can be seen when running locally as well).

@alamb alamb changed the title Try running a basic comparison benchmark between base and pr commit Add /benchmark github command to comparison benchmark between base and pr commit Mar 13, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @gruuya -- I think this is a really neat step forward and we can refine this functionality over time.

I think we should also add a section in the benchmark documentation about this feature to make it easier to find.
https://github.com/apache/arrow-datafusion/tree/main/benchmarks#datafusion-benchmarks

We could do that as a follow on PR

Ok I went ahead and tested it on our fork: splitgraph#1

Here it is catching a deliberate regression I introduced: splitgraph#1 (comment)

Here it is after removing the regression (and polishing the comment message a bit): splitgraph#1 (comment)

Those results are very cool

There are a couple of false positives in that last one, though it is running SF1, so with shorter run times it is more sensitive to any variations (though I believe such variations can be seen when running locally as well).

Yes, you are correct, I have also seen similar variations locally.

.github/workflows/pr_benchmarks.yml Show resolved Hide resolved
@gruuya
Copy link
Contributor Author

gruuya commented Mar 13, 2024

I think we should also add a section in the benchmark documentation about this feature to make it easier to find.
https://github.com/apache/arrow-datafusion/tree/main/benchmarks#datafusion-benchmarks

We could do that as a follow on PR

Yup, makes sense! I'd also like to bump SF to 10 there to reduce the noise a bit, and perhaps add parquet/sorting benches as well (assuming those things don't take too long).

Eventually when we have a self-hosted runner we can add a selection of ClickBench queries too.

Thanks!

@alamb
Copy link
Contributor

alamb commented Mar 13, 2024

Yup, makes sense! I'd also like to bump SF to 10 there to reduce the noise a bit, and perhaps add parquet/sorting benches as well (assuming those things don't take too long).

I think ClickBench would be a good choice too (it doesn't take too long and is an excellent benchmark for aggregation / filtering performance)

So how about we merge this PR once CI has passed and then file follow on tickets for the remaining tasks?

@gruuya
Copy link
Contributor Author

gruuya commented Mar 13, 2024

So how about we merge this PR once CI has passed and then file follow on tickets for the remaining tasks?

Sounds good to me!

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thank you @gruuya !

@Dandandan Dandandan merged commit c2787c7 into apache:main Mar 13, 2024
23 checks passed
@gruuya gruuya deleted the ci-benches branch March 14, 2024 05:45
@alamb
Copy link
Contributor

alamb commented Aug 26, 2024

FWI this code was removed in #11165

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants