Performance regression with rustc 1.81 #203
cc @orlp @Voultapher (authors of the new sort implementation)
Fascinating, thanks for tracking this down! At least at a surface level, I'm not super-inclined to rewrite part of the register allocator based on a performance bug in […]
Hi, thanks @plusvic for raising the issue. Looking at the std library documentation:

> If your input is nearly sorted or all you want to do is merging, […]
To expand a little more, @orlp and I are aware of the append + sort use case and did include it in our design and benchmarking, under the pattern name random_s95 (95% sorted + 5% unsorted) in the unstable sort design document and the stable sort design document. The previous implementation of `slice::sort_unstable_by_key` happened to handle this pattern well. The robust and generic fix is to run detection + lazy merging, exactly what driftsort does, which is the current implementation of `slice::sort`.

For questions about methodology, design choices, and evaluation, please go and read the design documents.
Thanks for the info @Voultapher. I definitely respect that you and @orlp spent a ton of time on this update. It's unfortunate that it has edge cases where performance regresses; I've been in your position where one is trying to make a cross-ecosystem update, so I definitely understand the frustration of seeing these regressions.

Unfortunately I don't have any cycles to spend on this right now: it was an unexpected and unplanned performance regression (from my point of view), and I have no way of reproducing it (the above Wasm is a "schema" but not an actual file to test).

@plusvic since you were the reporter, would you be willing to take your regressing use-cases and perhaps try the fix suggested above ([…])? Thanks!
I can confirm that by using […] the problem goes away.
In summary: for this particular case I also decided to give a different solution a try, so I replaced this code:

```rust
self.bundles[to].ranges.extend_from_slice(&from_list[..]);
self.bundles[to]
    .ranges
    .sort_by_key(|entry| entry.range.from);
```

with this:

```rust
for entry in from_list.into_iter() {
    let pos = self.bundles[to]
        .ranges
        .binary_search_by_key(&entry.range.from, |entry| entry.range.from)
        .unwrap_or_else(|pos| pos);
    self.bundles[to].ranges.insert(pos, entry)
}
```

This ended up being 2x faster than the existing implementation, and performance doesn't vary with the rustc version.
Of course this is a very specific case, in which we have a very long sorted slice and we are adding only a handful of new items to it. I don't know if this is the most common case (probably not), as my knowledge about the […] is limited. If it turns out that in most cases the […]
@plusvic Would it be possible to log and post the distribution of […]?

I'm afraid that your proposed alternative implementation is not a correct merge, unless there are other specific properties of the ranges I'm not aware of. Consider the following input: […]
That said, I do think that if this is so performance-critical, a proper merge algorithm instead of concatenating + sorting is the play.
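A proper merge of the kind suggested here can be sketched as follows. This is purely illustrative, with plain `u32` values standing in for regalloc2's range entries (which would be compared by `entry.range.from`):

```rust
/// Merge two sorted slices into one sorted vector in O(n + m).
/// Illustrative sketch only, not regalloc2's actual data structures.
fn merge_sorted(a: &[u32], b: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    // Two-pointer walk: always take the smaller head element.
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    // One side is exhausted; the remainder is already sorted.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

fn main() {
    let merged = merge_sorted(&[1, 3, 5], &[2, 3, 6]);
    assert_eq!(merged, vec![1, 2, 3, 3, 5, 6]);
    println!("{:?}", merged);
}
```

Unlike concatenate + sort, each element is visited exactly once, regardless of how the two inputs interleave.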
@plusvic I find it surprising that Rust 1.80 […]. In the meantime, feel free to try out just using the merge function from the stdlib link.
This is what I get while logging the length of both slices. The first one is […]
While compiling less extreme code I got this: […]
My proposed alternative is not really a merge (at least not an efficient merge), but the result should be correct. It simply iterates the second vector, looks up the index in the first vector where each item should be inserted, and inserts it there (which implies displacing items in the first list to make room for the new item). The key of this implementation is that, when some item is not found in the first list, `binary_search_by_key` returns the position where it can be inserted while keeping the list sorted.
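For reference, this is documented behavior of the stdlib `binary_search` family: a hit returns `Ok(index)`, a miss returns `Err(pos)` where `pos` is the insertion point that keeps the slice sorted. A minimal standalone illustration (the helper name `insert_sorted` is mine, not from the thread):

```rust
/// Insert `x` into an already-sorted vector, keeping it sorted.
/// Relies on `binary_search` returning Err(pos) with the insertion point.
fn insert_sorted(v: &mut Vec<u32>, x: u32) {
    // `unwrap_or_else(|pos| pos)` collapses Ok(index) and Err(pos)
    // into a single usable index, the trick used in the snippet above.
    let pos = v.binary_search(&x).unwrap_or_else(|pos| pos);
    v.insert(pos, x);
}

fn main() {
    let sorted = vec![10, 20, 30, 40];
    assert_eq!(sorted.binary_search(&30), Ok(2)); // present: matching index
    assert_eq!(sorted.binary_search(&25), Err(2)); // absent: insertion point

    let mut v = sorted.clone();
    insert_sorted(&mut v, 25);
    assert_eq!(v, vec![10, 20, 25, 30, 40]);
    println!("{:?}", v);
}
```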
@Voultapher, you can find the project at https://github.com/VirusTotal/yara-x. I recommend compiling it with the […]
This runs the […]. Let me know if you need anything else.
@plusvic Yes, that's a really inefficient merge :) I missed that you iterated over each item.
I've tried a naive merge with two iterators that advance on one slice or the other depending on which item is smaller, adding the smaller item to a new vector, but that's slow due to the additional allocations. The best solution I've found so far is handling the case in which we are adding only one item as a special case. As was shown above, this is a very common case (not only in my edge case, but in general), so I think it's worth optimizing for it. I compiled a gigantic WASM module and saw a 6-8% performance improvement with respect to the current implementation, even with rustc 1.80.
I'll send a PR for your consideration.
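The single-item special case might be sketched roughly like this. This is my illustration of the idea, not the actual PR: plain `u32` keys stand in for regalloc2's range entries, and `append_sorted` is a hypothetical name:

```rust
/// Append `from_list` into the already-sorted `ranges`, special-casing
/// the single-element append that the logs above show is very common.
fn append_sorted(ranges: &mut Vec<u32>, from_list: &[u32]) {
    if let [item] = from_list {
        // One new item: binary search + insert, no re-sort of the
        // whole (potentially very long) vector.
        let pos = ranges.binary_search(item).unwrap_or_else(|p| p);
        ranges.insert(pos, *item);
    } else {
        // General case: keep the original concatenate + sort approach.
        ranges.extend_from_slice(from_list);
        ranges.sort_unstable();
    }
}

fn main() {
    let mut ranges = vec![1, 5, 9];
    append_sorted(&mut ranges, &[7]); // common single-item path
    assert_eq!(ranges, vec![1, 5, 7, 9]);
    println!("{:?}", ranges);
}
```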
Preliminary results: I was able to reproduce the issue locally, patched in my own version of regalloc2, and tested a couple of sort implementations. All tests were done with the same rustc version, to avoid other effects related to the impact of a new Rust version. […]
These first results tell us a couple of things. Logging the length of both sides […]

We could quick-and-dirty fix it by adding an `if self.bundles[to].ranges.len() == 1` special case, though it's not clear to me if that would be helpful generally or only in the specific cases encountered by @plusvic. I think the real fix would be to prepare the bundles once and call `merge_bundles` once:

```rust
for i in 0..self.blockparam_outs.len() {
    let BlockparamOut {
        from_vreg, to_vreg, ..
    } = self.blockparam_outs[i];
    let to_bundle = self.ranges[self.vregs[to_vreg].ranges[0].index].bundle;
    let from_bundle = self.ranges[self.vregs[from_vreg].ranges[0].index].bundle;
    self.merge_bundles(from_bundle, to_bundle);
}
```

->

```rust
for i in 0..self.blockparam_outs.len() {
    let BlockparamOut {
        from_vreg, to_vreg, ..
    } = self.blockparam_outs[i];
    // build bundles
}
self.merge_bundles(from_bundle, to_bundle);
```
@Voultapher I've submitted a PR with exactly the quick-and-dirty fix you mentioned. I believe it can help in many other cases, as I've compiled some other WASM code besides my specific edge case, and there are many calls where the vector being appended contains a single element.
Unfortunately this is not an option: the merges performed for each blockparam are, in general, separate merges of separate bundles. (In regalloc2 terminology, a "bundle" is a collection of liveranges that do not conflict; in the general case, all the separate blockparams flowing into a block will be separate values and so do conflict.) I'll take a look at the PR, thanks @plusvic!
@Voultapher, a clarification question on this comment: […]
I'm not sure I understand the subtext here. You seem to be implying there is some quadratic behavior of sorts (separate sorts for steps of the same process). But there is one sort per bundle-merge operation, and these are logically separate bundles with nothing in common. And while we see here that very small bundles are commonly merged into existing large bundles, it also sometimes happens that we merge two large bundles (e.g., there are merge cases where we merge the reused input of an x86 destructive src/dest instruction, and both sides will be large). I did extensively profile and compare against a manual merge loop in the pre-1.80 world, and found the sort was faster. Do you have another suggestion for the general case, aside from @plusvic's point fix?
@cfallin keep in mind that my register-allocation knowledge, and specifically regalloc2 knowledge, is very limited, so if what I say is nonsense please attribute it to that. This issue here arises only because […]
Ah, yes, that's true: we could pre-bin merges by destination bundle and do one N-way merge (or sort). If I get some more time to work on this in the next month I'll see if I can come up with a more general fix. Thanks @Voultapher and all for the ideas!
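The pre-binning idea might look roughly like this. Everything here is a hypothetical stand-in (integer bundle ids, `Vec<u32>` ranges, a `HashMap`), not regalloc2's actual `Bundle` machinery:

```rust
use std::collections::HashMap;

/// Apply many pending merges, binned by destination bundle, so each
/// destination is extended and sorted once rather than once per merge.
fn batched_merges(
    bundles: &mut HashMap<usize, Vec<u32>>, // dest bundle -> sorted ranges
    pending: Vec<(usize, Vec<u32>)>,        // (dest bundle, source ranges)
) {
    // Bin all sources by destination first...
    let mut bins: HashMap<usize, Vec<Vec<u32>>> = HashMap::new();
    for (dest, src) in pending {
        bins.entry(dest).or_default().push(src);
    }
    // ...then concatenate everything and sort once per destination,
    // instead of re-sorting the destination after every single merge.
    for (dest, sources) in bins {
        let ranges = bundles.entry(dest).or_default();
        for src in sources {
            ranges.extend(src);
        }
        ranges.sort_unstable();
    }
}

fn main() {
    let mut bundles: HashMap<usize, Vec<u32>> = HashMap::new();
    bundles.insert(0, vec![1, 4]);
    batched_merges(&mut bundles, vec![(0, vec![2]), (0, vec![3]), (1, vec![5])]);
    assert_eq!(bundles[&0], vec![1, 2, 3, 4]);
    println!("{:?}", bundles);
}
```

With k merges targeting the same destination of length n, this does one sort of n + k elements instead of k sorts of up to n + k elements each.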
rustc 1.81 completely replaced the implementation of `slice::sort_unstable_by_key`. These changes are mentioned in the release notes, and more information can be found in the corresponding PR: rust-lang/rust#124032.

The new sorting algorithm was supposed to be faster in the general case, but in some specific cases it looks like it is way slower, and this has impacted the `wasmtime` crate via `regalloc2`. I've experienced a 10x slowdown when compiling some WASM modules using `wasmtime` with rustc 1.81 (more details can be found in VirusTotal/yara-x#209).

I tracked down the issue to the `merge_bundles` function, which uses `sort_unstable_by_key`. According to my investigation, `sort_unstable_by_key` is being called with slices that are mostly sorted. Apparently, two already-sorted slices are being concatenated and then sorted again. This is confirmed by the following comment in the code of `merge_bundles`: […]

This pattern, which worked well with the previous implementation of `sort_unstable_by_key`, seems to be pathologically bad for the new one. The assumption that concatenating the two lists and sorting them again is faster than merging the two sorted lists likely no longer holds true.

If you are curious about which WASM code is triggering this issue, it is an implementation of the "switch" statement in WASM using multiple nested blocks and a `br_table` instruction that jumps out of the block corresponding to the input value. The overall schema is shown below. […]
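Stripped of regalloc2 specifics, the sorting pattern described above can be reduced to a sketch like this (plain `u32` keys and a hypothetical helper name, purely for illustration):

```rust
/// The shape of the call in `merge_bundles`: two already-sorted lists
/// are concatenated and the result is re-sorted. Illustrative sketch.
fn concat_and_sort(mut dst: Vec<u32>, src: &[u32]) -> Vec<u32> {
    // Both inputs are individually sorted, so the concatenation is
    // "mostly sorted": a long run followed by a (usually short) run.
    debug_assert!(dst.windows(2).all(|w| w[0] <= w[1]));
    debug_assert!(src.windows(2).all(|w| w[0] <= w[1]));
    dst.extend_from_slice(src);
    dst.sort_unstable(); // the call whose performance changed in 1.81
    dst
}

fn main() {
    let merged = concat_and_sort(vec![1, 3, 5], &[2, 6]);
    assert_eq!(merged, vec![1, 2, 3, 5, 6]);
    println!("{:?}", merged);
}
```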