-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Divergent behavior between regalloc_algorithm={backtracking,single_pass} #9980
Comments
For context: Alex and I briefly discussed this offline and concluded this was not a security issue as |
Additional context: it seems this only reproduces with x86-64; cannot reproduce with aarch64 (also infinitely recurses correctly). Successfully reproduces on macOS/x86-64 (via Rosetta 2), though, in addition to Linux. |
The strange thing is that it seems to be caused by the NaN. If you replace that NaN with any other value (for example -12.345) single-pass recurses infinitely as well. If you look at the disassembly of both variants, nothing is different except that NaN constant in single-pass and fuel with NaN
single-pass and fuel with -12.345 instead of NaN
|
Another test that came up in @Robbepop's differential fuzzing of wasmi vs wasmtime: (module
(type (;0;) (func (result i32)))
(global (;0;) (mut i32) i32.const 0)
(export "xxx" (func 0))
(func (;0;) (type 0) (result i32)
block (result i32) ;; label = @1
block (result i32) ;; label = @2
block (result i32) ;; label = @3
block (result i32) ;; label = @4
block (result i32) ;; label = @5
block (result i32) ;; label = @6
i32.const 1
i32.const 1
f32.convert_i32_s
f64.const -nan:0xffffffff80000 (;=NaN;)
f32.demote_f64
f32.ne
br_if 3 (;@3;)
drop
i32.const 1
end
global.get 0
i32.xor
global.set 0
i32.const 1
end
global.get 0
i32.xor
global.set 0
i32.const 0
end
global.get 0
i32.xor
global.set 0
i32.const 0
end
i32.const 1
i32.xor
global.set 0
i32.const 0
end
global.get 0
i32.xor
global.set 0
i32.const 1
end
global.get 0
i32.xor
)
) |
For the test case above I used The
Namely the The
It again looks like the So something about the phi nodes may be off? @cfallin does this look familiar at all? (or anything I can do to help dig in?) |
Ok I've done a bit of further digging here using the above module (which notably doesn't need fuel which cleans up IR slightly). First the two results I get are:
aka 1 is correct and 0 is wrong. Looking at the
Using the The VCode for this function I generated with:
The precise regalloc results are slightly different because
I think the problem here is that VCode is lying to regalloc2 in that we no longer have "extended basic blocks". Notably branch-on-float comparisons (and I think only float comparisons) are represented with two In this case I think regalloc2 is inserting code before In talking with @cfallin it looks like this problem is scoped to the So I think the long-and-short of it is that we need to refactor how float comparisons work, notably the |
Slightly more precisely (but we're still safe): it will never insert code before a terminator with multiple targets; this is such a case, because the branch sequence is "one-target branch ; two-target branch" ( |
In bytecodealliance#9980, we saw that code copmiled with the single-pass register allocator has incorrect behavior. We eventually narrowed this down to the fact that the single-pass allocator is inserting code meant to be at the end of a block, just before its terminator, *between* two branches that form the terminator sequence. The allocator is correct; the bug is with Cranelift's x64 backend. When we produce instructions into a VCode container, we maintain basic blocks, and we have the invariant (usual for basic block-based IR) that only the last -- terminator -- instruction is a branch that can leave the block. Even the conditional branches maintain this invariant: though VCode is meant to be "almost machine code", we emit *two-target conditionals* that are semantically like "jcond; jmp". We then are able to optimize this inline during binary emission in the `MachBuffer`: the buffer knows about unconditional and conditional branches and will "chomp" branches off the tail of the buffer whenever they target the fallthrough block. (We designed the system this way because it is simpler to think about BBs that are order-invariant, i.e., not bake the "fallthrough" concept into the IR.) Thus we have a simpler abstraction but produce optimal terminator sequences. Unfortunately, when adding a branch-on-floating-point-compare lowering, we had the need to branch to a target if either of *two* conditions were true, and rather than add a new kind of terminator instruction, we added a "one-armed branch": conditionally branch to label or fall through. We emitted this in sequence right before the actual terminator, so semantically it was almost equivalent. I write "almost" because the register allocator *is* allowed to insert spills/reloads/moves between any two instructions. Here the distinct pieces of the terminator sequence matter: the allocator might insert something just before the last instruction, assuming the basic-block "single in, single out" invariant means this will always run with the block. With one-armed branches this is no longer true. The backtracking allocator (our original RA2 algorithm, and still the default today) will never insert code at the end of a block when it has multiple terminators, because it associates such block-start/end insertions with *edges*; so in such conditions it inserts instructions into the tops of successor blocks instead. But the single-pass allocator needs to perform work at the end of every block, so it will trigger this bug. This PR removes `JmpIf` and converts the br-of-fcmp lowering to use `JmpCondOr` instead, which is a pseudoinstruction that does `jcc1; jcc2; jmp`. This maintains the BB invariant and fixes the bug. Note that Winch still uses `JmpIf`, so we cannot remove it entirely: this PR renames it to `WinchJmpIf` instead, and adds a mechanism to assert failure if it is ever added to `VCode` (rather than emitted directly, as Winch's macro-assembler does). We could instead write Winch's `jmp_if` assembler function in terms of `JmpCond` with a fallthrough label that is immediately bound, and let the MachBuffer always chomp the jmp; I opted not to regress Winch compiler performance by doing this. If one day we abstract out the assembler further, we can remove `WinchJmpIf`. This is one of two instances of a "one-armed branch"; the other is s390x's `OneWayCondBr`, used in `br_table` lowerings, which we will address separately. Once we do, that will address bytecodealliance#9980 entirely.
In bytecodealliance#9980, we saw that code copmiled with the single-pass register allocator has incorrect behavior. We eventually narrowed this down to the fact that the single-pass allocator is inserting code meant to be at the end of a block, just before its terminator, *between* two branches that form the terminator sequence. The allocator is correct; the bug is with Cranelift's x64 backend. When we produce instructions into a VCode container, we maintain basic blocks, and we have the invariant (usual for basic block-based IR) that only the last -- terminator -- instruction is a branch that can leave the block. Even the conditional branches maintain this invariant: though VCode is meant to be "almost machine code", we emit *two-target conditionals* that are semantically like "jcond; jmp". We then are able to optimize this inline during binary emission in the `MachBuffer`: the buffer knows about unconditional and conditional branches and will "chomp" branches off the tail of the buffer whenever they target the fallthrough block. (We designed the system this way because it is simpler to think about BBs that are order-invariant, i.e., not bake the "fallthrough" concept into the IR.) Thus we have a simpler abstraction but produce optimal terminator sequences. Unfortunately, when adding a branch-on-floating-point-compare lowering, we had the need to branch to a target if either of *two* conditions were true, and rather than add a new kind of terminator instruction, we added a "one-armed branch": conditionally branch to label or fall through. We emitted this in sequence right before the actual terminator, so semantically it was almost equivalent. I write "almost" because the register allocator *is* allowed to insert spills/reloads/moves between any two instructions. Here the distinct pieces of the terminator sequence matter: the allocator might insert something just before the last instruction, assuming the basic-block "single in, single out" invariant means this will always run with the block. With one-armed branches this is no longer true. The backtracking allocator (our original RA2 algorithm, and still the default today) will never insert code at the end of a block when it has multiple terminators, because it associates such block-start/end insertions with *edges*; so in such conditions it inserts instructions into the tops of successor blocks instead. But the single-pass allocator needs to perform work at the end of every block, so it will trigger this bug. This PR removes `JmpIf` and converts the br-of-fcmp lowering to use `JmpCondOr` instead, which is a pseudoinstruction that does `jcc1; jcc2; jmp`. This maintains the BB invariant and fixes the bug. Note that Winch still uses `JmpIf`, so we cannot remove it entirely: this PR renames it to `WinchJmpIf` instead, and adds a mechanism to assert failure if it is ever added to `VCode` (rather than emitted directly, as Winch's macro-assembler does). We could instead write Winch's `jmp_if` assembler function in terms of `JmpCond` with a fallthrough label that is immediately bound, and let the MachBuffer always chomp the jmp; I opted not to regress Winch compiler performance by doing this. If one day we abstract out the assembler further, we can remove `WinchJmpIf`. This is one of two instances of a "one-armed branch"; the other is s390x's `OneWayCondBr`, used in `br_table` lowerings, which we will address separately. Once we do, that will address bytecodealliance#9980 entirely.
* Cranelift/x64 backend: do not use one-way branches. In #9980, we saw that code copmiled with the single-pass register allocator has incorrect behavior. We eventually narrowed this down to the fact that the single-pass allocator is inserting code meant to be at the end of a block, just before its terminator, *between* two branches that form the terminator sequence. The allocator is correct; the bug is with Cranelift's x64 backend. When we produce instructions into a VCode container, we maintain basic blocks, and we have the invariant (usual for basic block-based IR) that only the last -- terminator -- instruction is a branch that can leave the block. Even the conditional branches maintain this invariant: though VCode is meant to be "almost machine code", we emit *two-target conditionals* that are semantically like "jcond; jmp". We then are able to optimize this inline during binary emission in the `MachBuffer`: the buffer knows about unconditional and conditional branches and will "chomp" branches off the tail of the buffer whenever they target the fallthrough block. (We designed the system this way because it is simpler to think about BBs that are order-invariant, i.e., not bake the "fallthrough" concept into the IR.) Thus we have a simpler abstraction but produce optimal terminator sequences. Unfortunately, when adding a branch-on-floating-point-compare lowering, we had the need to branch to a target if either of *two* conditions were true, and rather than add a new kind of terminator instruction, we added a "one-armed branch": conditionally branch to label or fall through. We emitted this in sequence right before the actual terminator, so semantically it was almost equivalent. I write "almost" because the register allocator *is* allowed to insert spills/reloads/moves between any two instructions. Here the distinct pieces of the terminator sequence matter: the allocator might insert something just before the last instruction, assuming the basic-block "single in, single out" invariant means this will always run with the block. With one-armed branches this is no longer true. The backtracking allocator (our original RA2 algorithm, and still the default today) will never insert code at the end of a block when it has multiple terminators, because it associates such block-start/end insertions with *edges*; so in such conditions it inserts instructions into the tops of successor blocks instead. But the single-pass allocator needs to perform work at the end of every block, so it will trigger this bug. This PR removes `JmpIf` and converts the br-of-fcmp lowering to use `JmpCondOr` instead, which is a pseudoinstruction that does `jcc1; jcc2; jmp`. This maintains the BB invariant and fixes the bug. Note that Winch still uses `JmpIf`, so we cannot remove it entirely: this PR renames it to `WinchJmpIf` instead, and adds a mechanism to assert failure if it is ever added to `VCode` (rather than emitted directly, as Winch's macro-assembler does). We could instead write Winch's `jmp_if` assembler function in terms of `JmpCond` with a fallthrough label that is immediately bound, and let the MachBuffer always chomp the jmp; I opted not to regress Winch compiler performance by doing this. If one day we abstract out the assembler further, we can remove `WinchJmpIf`. This is one of two instances of a "one-armed branch"; the other is s390x's `OneWayCondBr`, used in `br_table` lowerings, which we will address separately. Once we do, that will address #9980 entirely. * Add test for cascading branch-chomping behavior. * keep the paperclip happy
This is a followup to bytecodealliance#10086, this time removing the one-armed branch variant for s390x. This branch was only used as the default-target branch in the `br_table` lowering. This PR incorporates the branch into the `JTSequence` pseudo-instruction. Some care is needed to keep the `ProducesBool` abstraction; it is unwrapped into its `ProducesFlags` and the `JTSequence` becomes a `ConsumesFlags`, so the compare for the jump-table bound (for default target) is not part of the pseudoinst. (This is OK because regalloc-inserted moves never alter flags, by explicit contract; the same reason allows cmp/branch terminators.) Along the way I noticed that the comments on `JTSequence` claimed that `targets` included the default, but this is (no longer?) the case, as the targets are unwrapped by `jump_table_targets` which peels off the first (default) separately. Aside from comments, this only affected pretty-printing; codegen was correct. With this, we have no more one-armed branches; hence, this fixes bytecodealliance#9980.
This is a followup to bytecodealliance#10086, this time removing the one-armed branch variant for s390x. This branch was only used as the default-target branch in the `br_table` lowering. This PR incorporates the branch into the `JTSequence` pseudo-instruction. Some care is needed to keep the `ProducesBool` abstraction; it is unwrapped into its `ProducesFlags` and the `JTSequence` becomes a `ConsumesFlags`, so the compare for the jump-table bound (for default target) is not part of the pseudoinst. (This is OK because regalloc-inserted moves never alter flags, by explicit contract; the same reason allows cmp/branch terminators.) Along the way I noticed that the comments on `JTSequence` claimed that `targets` included the default, but this is (no longer?) the case, as the targets are unwrapped by `jump_table_targets` which peels off the first (default) separately. Aside from comments, this only affected pretty-printing; codegen was correct. With this, we have no more one-armed branches; hence, this fixes bytecodealliance#9980.
I've confirmed that post-#10086 the above test now shows the same behavior (i.e., "wasm trap: call stack exhausted") under the single-pass allocator as under backtracking. Strictly speaking the underlying bug is still present on s390x until #10087 also merges (so I guess I'll leave this issue open for now) but hopefully we'll see the fuzzbug close as fixed on the next build. |
* Cranelift/s390x: do not use one-way conditional branches. This is a followup to #10086, this time removing the one-armed branch variant for s390x. This branch was only used as the default-target branch in the `br_table` lowering. This PR incorporates the branch into the `JTSequence` pseudo-instruction. Some care is needed to keep the `ProducesBool` abstraction; it is unwrapped into its `ProducesFlags` and the `JTSequence` becomes a `ConsumesFlags`, so the compare for the jump-table bound (for default target) is not part of the pseudoinst. (This is OK because regalloc-inserted moves never alter flags, by explicit contract; the same reason allows cmp/branch terminators.) Along the way I noticed that the comments on `JTSequence` claimed that `targets` included the default, but this is (no longer?) the case, as the targets are unwrapped by `jump_table_targets` which peels off the first (default) separately. Aside from comments, this only affected pretty-printing; codegen was correct. With this, we have no more one-armed branches; hence, this fixes #9980. * Review feedback.
Found via oss-fuzz in https://issues.oss-fuzz.com/issues/387110342 I've minimized this to:
which I can show different behavior with:
That's expected, this infinitely recurses. With
single_pass
though:The text was updated successfully, but these errors were encountered: