Adding new API that accepts reusable context. #196
Conversation
Force-pushed from 3a059ce to 7b89e02
Force-pushed from 34f4b5d to 5d2e0ac
Force-pushed from fcbc26b to d736593
I profiled the allocator a bit and got rid of some hashmaps in favor of using the key as an index, which improved things a little.
@jakubDoka could you summarize with a perf measurement (Sightglass or hyperfine of ...)?
I had to run the benchmarks separately because whatever I run last performs worse (if I run two identical programs, the first run always wins; my CPU throttles).
Let me know if full `perf stat` output could be useful.
Thanks for all this work! I'm happy to see the performance improvement. After reading through, I have a few high-level thoughts:
- I'd like to see some of the changes here pulled out into separate PRs. For example, you replace concatenate-and-sort steps with an in-place merge when we know the liverange lists are already sorted; that's a good optimization, but let's have it as a separate PR, so we can reason about this PR more easily and also test and revert those changes separately. Basically, I want this PR to be as close as possible to a mechanical replacement of data structures with arena-allocated or reused data structures, with no semantics changes. Some of the changes to the core allocation loop around the allocation-map iterator also make me very nervous.
- The replacement of the actual `BTreeMap` per physical reg with a sorted `Vec` makes me very nervous. It probably looks better on most benchmarks, but it has a quadratic worst case (inserting in the middle), where a true btree does not; and we consider quadratic worst cases bugs that we need to avoid. Is there an argument we can make about why this won't be seen in practice, or...?
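For context on the first point, the merge-instead-of-sort optimization might look like this sketch. This is not regalloc2's actual code; the function name and element type are illustrative, and real liverange lists would carry richer data.

```rust
/// Merge two already-sorted vectors into one sorted vector in O(n + m),
/// avoiding the O((n + m) log(n + m)) cost of concatenate-and-sort.
/// Illustrative sketch only; names and types are hypothetical.
fn merge_sorted(a: Vec<u32>, b: Vec<u32>) -> Vec<u32> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    // Repeatedly take the smaller head element from either input.
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    // At most one of these has anything left.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

The win is not only asymptotic: a single linear pass over two sorted runs is also branch-predictor and cache friendly compared with a general-purpose sort.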
I did not realize Wasmtime needs to deal with malicious code; good point. All of the changes here are guided by a simple principle: make the code do less. Using a `Vec` instead of a btree has many properties that CPUs like. Now that I think about it, quadratic behavior is a problem only if the function we compile is maliciously large, and that is not the common case, so what about a hybrid/adaptive approach? We could switch to `BTreeMap` dynamically if we detect the `Vec` costing too much (it exceeds a certain size). What do you think @cfallin?
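The hybrid/adaptive idea proposed above could be sketched as follows. Everything here is hypothetical: the threshold, the key/value types, and the API are stand-ins, not regalloc2's actual per-preg allocation map.

```rust
use std::collections::BTreeMap;

/// Hypothetical threshold at which sorted-Vec insertion is assumed to
/// cost more than btree insertion; a real value would be measured.
const THRESHOLD: usize = 64;

/// Sketch of an adaptive map: a sorted `Vec` while small (cache friendly,
/// fast in the common case), migrating to a `BTreeMap` once it grows,
/// which caps the worst case for maliciously large inputs at O(n log n).
enum Hybrid {
    Small(Vec<(u32, u32)>), // kept sorted by key
    Large(BTreeMap<u32, u32>),
}

impl Hybrid {
    fn new() -> Self {
        Hybrid::Small(Vec::new())
    }

    fn insert(&mut self, key: u32, val: u32) {
        match self {
            Hybrid::Small(v) => {
                match v.binary_search_by_key(&key, |&(k, _)| k) {
                    Ok(i) => v[i].1 = val,          // overwrite existing key
                    Err(i) => v.insert(i, (key, val)), // keep Vec sorted
                }
                if v.len() > THRESHOLD {
                    // One-time migration; later inserts are O(log n).
                    let map: BTreeMap<u32, u32> =
                        std::mem::take(v).into_iter().collect();
                    *self = Hybrid::Large(map);
                }
            }
            Hybrid::Large(m) => {
                m.insert(key, val);
            }
        }
    }

    fn get(&self, key: u32) -> Option<u32> {
        match self {
            Hybrid::Small(v) => v
                .binary_search_by_key(&key, |&(k, _)| k)
                .ok()
                .map(|i| v[i].1),
            Hybrid::Large(m) => m.get(&key).copied(),
        }
    }
}
```

One design question such a sketch leaves open is whether the migration cost shows up in profiles; since it happens at most once per map, it amortizes away over large inputs.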
After reverting to BTree I see a regression of ~500M cycles.
After reverting this I see a regression of another ~250M cycles.
Another, perhaps a little more out-there, idea is to make the whole algorithm generic over this container type and make two instantiations: one with `Vec` and one with `BTreeMap`. The failure checks and propagation overhead might dwarf any speedup, though.
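The generic-over-container idea could be sketched with a small trait exposing just the operations the core loop needs, implemented for both containers. The trait name, method set, and key/value types below are all hypothetical, not regalloc2's API.

```rust
use std::collections::BTreeMap;

/// Hypothetical trait abstracting the per-preg allocation map; the real
/// core loop would need more operations (range queries, removal, ...).
trait RangeMap {
    fn insert(&mut self, key: u32, val: u32);
    fn get(&self, key: u32) -> Option<u32>;
}

/// Sorted-Vec instantiation: fast in the common case, quadratic worst case.
impl RangeMap for Vec<(u32, u32)> {
    fn insert(&mut self, key: u32, val: u32) {
        match self.binary_search_by_key(&key, |&(k, _)| k) {
            Ok(i) => self[i].1 = val,
            Err(i) => Vec::insert(self, i, (key, val)),
        }
    }
    fn get(&self, key: u32) -> Option<u32> {
        self.binary_search_by_key(&key, |&(k, _)| k)
            .ok()
            .map(|i| self[i].1)
    }
}

/// BTreeMap instantiation: O(log n) insertion in all cases.
impl RangeMap for BTreeMap<u32, u32> {
    fn insert(&mut self, key: u32, val: u32) {
        BTreeMap::insert(self, key, val);
    }
    fn get(&self, key: u32) -> Option<u32> {
        BTreeMap::get(self, &key).copied()
    }
}

/// Stand-in for the core loop, instantiated once per container type.
fn run_core<M: RangeMap + Default>(pairs: &[(u32, u32)]) -> M {
    let mut m = M::default();
    for &(k, v) in pairs {
        m.insert(k, v);
    }
    m
}
```

One cost of this approach, as noted above, is code-size: monomorphizing the allocator's core loop twice duplicates a lot of machine code, which may matter more than the per-call dispatch it avoids.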
@cfallin are there any extra modifications needed? (Considering I will add the other changes in separate PRs.)
I realized there are some things I missed, so nvm |
Yes, please make sure all of my earlier comments were addressed; also one new one below from the start of my second pass.
Okay, @cfallin, I double-checked everything; hopefully nothing was missed (I found one thing in the diff).
Thanks! I have just a few more comments, but this is getting close.
Thanks for your updates! A few more comments below; hopefully these are the last ones!
deny.toml (Outdated)
@@ -1,7 +1,7 @@
 targets = [
     { triple = "x86_64-unknown-linux-gnu" },
     { triple = "x86_64-apple-darwin" },
-    { triple = "x86_64-pc-windows-msvc" },
+    { triple = "x86_64-pc-windows-msempty_vec" },
I think this was an errant find-replace problem? pc-windows-msempty_vec sounds like an interesting platform, but not one that Rust supports...
Yes, and it's outdated; I noticed it and fixed it.
src/lib.rs (Outdated)
@@ -1671,6 +1671,8 @@ impl<T> VecExt<T> for Vec<T> {
 }

+#[derive(Debug, Clone, Default)]
+/// Bump is a wrapper around `bumpalo::Bump` that can be cloned and also
+/// implements `Allocator`. Using this avoids lifetime pollution of `Ctx`.
From the PoV of API docs consumers, there should probably be something here about what this is used for... actually, it seems it's not exposed at all in the run_with_ctx signature; can we make this pub(crate) then?
The problem with that is that Rust will complain in many places, but I can change those to pub(crate) too.
src/lib.rs (Outdated)
    env: &MachineEnv,
    options: &RegallocOptions,
    ctx: &mut Ctx,
) -> Result<(), RegAllocError> {
Could we return the Output here as with run(), by mem::take'ing it from the Ctx? I'd much prefer that to an implicit result in ctx.output -- OK to be imperative and implicit inside RA2, but a functional API is best.
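The mem::take suggestion could look like this sketch. The `Output` and `Ctx` types here are simplified stand-ins for regalloc2's real types, and the body of `run_with_ctx` is a placeholder.

```rust
/// Simplified stand-ins for regalloc2's Output and Ctx types.
#[derive(Default)]
struct Output {
    allocs: Vec<u32>,
}

#[derive(Default)]
struct Ctx {
    output: Output,
    // ... other reusable scratch buffers ...
}

/// Sketch of the suggested API shape: fill `ctx.output` internally
/// (reusing its capacity), then hand ownership to the caller instead of
/// leaving the result implicitly inside the context.
fn run_with_ctx(ctx: &mut Ctx) -> Output {
    // Placeholder for the allocator's actual work.
    ctx.output.allocs.clear();
    ctx.output.allocs.push(42);
    // Move the result out; a fresh Default is left in its place.
    std::mem::take(&mut ctx.output)
}
```

The tradeoff this surfaces is the one debated below: `mem::take` leaves an empty `Output` in the `Ctx`, so to keep reusing the output's allocations across runs the caller would have to move it back in before the next call.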
It is implicit because the allocations inside the output are reused. Of course, if we assume the user will return the output to the Ctx, this is fine, but IMO I'd rather hint in the docs where the output is located; that way you don't need to write more code to get optimal performance.
Actually, returning a reference could be a good middle ground.
OK, I think this is finally good to go. Thanks so much!
If you'd like, feel free to make a PR over in bytecodealliance/wasmtime to make use of run_with_ctx in Cranelift as well...
This includes two major updates:
- The new single-pass fast allocator (bytecodealliance#181);
- An ability to reuse allocations across runs (bytecodealliance#196).

Sadly, due to how the code was structured, I needed to change the `Env` fields, so basically everything that used them was changed as well. I did not benchmark anything yet (work in progress).
Context -> https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/Using.20context.20for.20.60TargetIsa.3A.3Acompile_function.60