Introduce zero overhead loop #46
fa781f4 to 0fd51cd
0fd51cd to 3ba34e1
for (Instruction &I : *BB) {
  if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
    if (const Function *F = cast<CallBase>(I).getCalledFunction()) {
      if (!isLoweredToCall(F))
Curious: Where does isLoweredToCall come from? Is that a generic LLVM function?
It's generic, with a note that it should be moved to a target-specific hook.
I think we should have a custom implementation for this PR. See:
void sum(double *a, double *b, double *c) {
  for (int i = 0; i < 30; i++) {
    c[i] = a[i] + b[i];
  }
}
It should walk in sync with GISel legalization rules for libcalls.
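To make the concern concrete, here is a minimal standalone sketch. It is not LLVM code: `Op`, `ToyTarget`, `becomesCall` and `loopBodyIsCallFree` are hypothetical names, and the sketch assumes that the double addition in `sum` is one of the operations the legalizer turns into a libcall. The point is only that the loop check has to consult the same rule the legalizer uses, not just look for explicit call instructions.

```cpp
#include <vector>

// Hypothetical, simplified instruction kinds; not LLVM's real IR classes.
enum class Op { Load, Store, IntAdd, FloatAdd, Call };

// Hypothetical target description. The flag stands in for the
// legalization rule "this operation is replaced by a library call".
struct ToyTarget {
  bool FloatAddIsLibcall; // assumption for illustration only
};

// An instruction ends up as a call if it is an explicit call, or if
// legalization would replace it with a libcall on this target.
bool becomesCall(Op O, const ToyTarget &T) {
  if (O == Op::Call)
    return true;
  return O == Op::FloatAdd && T.FloatAddIsLibcall;
}

// A loop body is a hardware-loop candidate only if nothing in it
// becomes a call.
bool loopBodyIsCallFree(const std::vector<Op> &Body, const ToyTarget &T) {
  for (Op O : Body)
    if (becomesCall(O, T))
      return false;
  return true;
}
```

With a body of two loads, a float add and a store, the check rejects the loop when `FloatAddIsLibcall` is set and accepts it otherwise, even though the body contains no explicit call.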
; CHECK-NEXT: nop
; CHECK-NEXT: nop
; CHECK-NEXT: mova r6, #0
; CHECK-NEXT: add.nc r5, r1, #-1
Do you understand the changes?
As far as I can see, profitability of low overhead loops was reduced to single block loops. I removed the issue-limit=1, which probably wasn't very clever for this particular test.
I have restored issue-limit=1.
; CHECK-NEXT: [[C3:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
; CHECK-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[ASSERT_ZEXT]], [[C3]]
; CHECK-NEXT: G_BRCOND [[AND]](s32), %bb.2
; CHECK-NEXT: G_BRCOND [[ASSERT_ZEXT]](s32), %bb.2
You did not change the legalizer, do you know why that test needed to be updated?
Edit: I didn't see it's not your change. But I think it still makes sense to move that diff to the first commit if that's where it belongs.
I have assumed that the bit analysis has improved in upstream LLVM, and now recognises that booleans are in range.
Hi, the following code can cause some problem in this PR:
With:
Gives:
Hi @martien-de-jong , another interesting case (sample.ll):
However, you need the following options (Elf emission, no loop scheduling):
Result:
3ba34e1 to 148cd00
@@ -0,0 +1,34 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
Liar 😏
@@ -0,0 +1,152 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# RUN: llc -mtriple=aie2 --start-after=instruction-select \
# RUN: --stop-before=aie-finalize-mi-bundles %s -o - | FileCheck %s
Super-Nit: I'm changing the scheduler so it outputs "correct" Bundles. Could you change that line to --stop-after=aie-finalize-mi-bundles? This way there won't be test updates.
Cond.push_back(MachineOperand::CreateImm(I->getOpcode()));
Cond.push_back(I->getOperand(0));
Did that hurt to keep Cond.push_back(I->getOperand(0));? I'm still struggling to understand what kind of API analyzeBranch has.
The API is that the target can push whatever it needs to reconstruct/invert a branch. We don't actually need that third operand, and I like Occam's razor.
if (isHardwareLoopEnd(Opc)) {
  CBranchBuilder.addMBB(TBB).add(Cond[1]);
} else {
  CBranchBuilder.add(Cond[1]).addMBB(TBB);
Super-nit: Maybe we could define PseudoLoopEnd to have the same operand order as other branches?
Deal!
I'm done with the review. I think it looks good! Please go through the remaining comments, I'd be happy if some tests are moved around, but it's not such a big deal :)
f67f04f to 8bbddd2
LGTM. Nice work, credits to you and the original author.
2c74d2a
8bbddd2 to 2c74d2a
lower symbol in MC lowering
Also make PseudoLoopEnd a meta instruction to simplify emit logic.
Make sure LoopStart/LoopEnd don't get duplicated in e.g. TailDuplication.
PseudoLoopEnd is very similar to a regular conditional branch. We need two Cond elements in order to reconstruct the instruction: one is the opcode, the other is the condition register for JZ/JNZ and the last-bundle label for PseudoLoopEnd.
The operand order of PseudoLoopEnd was swapped to make it congruent with the other conditional branches.
insertBranch needs to generate an unconditional branch for FBB even after PseudoLoopEnd.
Completely remove empty ZOL
2c74d2a to 12fc40d
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
# RUN: llc -O2 -mtriple=aie2 -run-pass=instruction-select %s -verify-machineinstrs -o - | FileCheck %s

---
A couple of tests here are missing the license & copyright header. Could you please add those in a fixup PR @martien-de-jong?
Final verdict: hwloop mostly causes significant PMsize expansion and frequent slight instruction count regressions (~ 10-100 cycles)
We have a few significant wins, e.g. GEMM_int8_1 InsnCount 55877 -> 53775
I didn't find any functional incorrectness and it is switched off by default.
I would like to commit now, and propose a follow-up to take the loop size into account at e.g. legalization time, where it is relatively easy to allocate a virtual loopcount register. This on the basis that we don't gain on big loops, since there the loop body is dominated by memory, move and vector instructions.
This is a direct port of Abnikant's original aie-private PR.
The representation of PseudoLoopEnd for analyzeBranch has changed significantly: we always push two components. The first is the opcode, the second is the additional operand, which cannot be derived from the target block.
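The two-component representation can be illustrated with a standalone round-trip sketch. It does not use LLVM: `Operand`, `Branch`, `encodeCond` and `rebuildBranch` are hypothetical stand-ins for `MachineOperand`, the branch instruction, and the `analyzeBranch`/`insertBranch` pair, and the opcode enum is a simplification. It only shows that the opcode plus one extra operand, together with the target block, is enough to rebuild the branch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical opcodes; stand-ins for the real target opcodes.
enum Opcode { JZ, JNZ, PseudoLoopEnd };

// Hypothetical stand-in for llvm::MachineOperand.
struct Operand {
  bool IsImm;
  int Imm;          // valid when IsImm is true
  std::string Name; // register or label name otherwise
};

struct Branch {
  Opcode Opc;
  Operand Extra;      // condition register, or last-bundle label
  std::string Target; // destination block (TBB)
};

// analyzeBranch side: record the opcode, then the one operand that
// cannot be derived from the target block.
std::vector<Operand> encodeCond(const Branch &B) {
  return {Operand{true, B.Opc, ""}, B.Extra};
}

// insertBranch side: rebuild the branch from the two Cond entries and TBB.
Branch rebuildBranch(const std::vector<Operand> &Cond,
                     const std::string &TBB) {
  assert(Cond.size() == 2 && Cond[0].IsImm && "expected {opcode, operand}");
  return Branch{static_cast<Opcode>(Cond[0].Imm), Cond[1], TBB};
}
```

Encoding a `PseudoLoopEnd` whose extra operand is a label and rebuilding it against the same target block yields the same opcode, label and target; a `JNZ` with a condition register goes through the same path.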
It should be noted that ZOL does not handle zero or negative loopcounts correctly. As such we need to establish that it is positive, e.g. by loop guarding or by interpreting pragma-like directives.
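A minimal sketch of what loop guarding means here, as a toy model and not the actual lowering. It assumes the hardware counter is checked after each pass over the body, so the body runs at least once; `runHardwareLoop` and `runGuardedLoop` are hypothetical names.

```cpp
#include <cassert>

// Toy model of a zero-overhead loop. Assumption: the counter is
// decremented and tested after the body, so the body always runs at
// least once and Count must be >= 1.
template <typename Body>
int runHardwareLoop(int Count, Body B) {
  assert(Count >= 1 && "ZOL model requires a positive trip count");
  int Iterations = 0;
  do {
    B();
    ++Iterations;
  } while (--Count != 0);
  return Iterations;
}

// Guarded form: skip the hardware loop entirely when the trip count is
// zero or negative, which establishes the precondition above.
template <typename Body>
int runGuardedLoop(int Count, Body B) {
  if (Count <= 0)
    return 0;
  return runHardwareLoop(Count, B);
}
```

When the trip count is already known to be positive, for instance from a constant bound or a pragma-like directive, the guard can be dropped and the hardware loop entered directly.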