Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce zero overhead loop #46

Merged
merged 10 commits into from
Jun 28, 2024
Merged

Introduce zero overhead loop #46

merged 10 commits into from
Jun 28, 2024

Conversation

martien-de-jong
Copy link
Collaborator

@martien-de-jong martien-de-jong commented May 21, 2024

Final verdict: hwloop mostly causes significant PMsize expansion and frequent slight instruction count regressions (~ 10-100 cycles)

We have a few significant wins, e.g. GEMM_int8_1 InsnCount 55877 -> 53775
I didn't find any functional incorrectness and it is switched off by default.

I would like to commit now, and propose a follow-up to take the loop size into account at e.g. legalization time, where it is relatively easy to allocate a virtual loopcount register. This on the basis that we don't gain on big loops, since there the loop body is dominated by memory, move and vector instructions.

This is a direct port of Abnikant's original aie-private PR.
The representation of PseudoLoopEnd for analyzeBranch has changed significantly; We always push two components. The first is the opcode, the second the additional operand which can not be derived from the target block.

It should be noted that ZOL does not handle zero or negative loopcounts correctly. As such we need to establish that it is positive, e.g. by loop guarding or by interpreting pragma-like directives.

@martien-de-jong martien-de-jong marked this pull request as draft May 21, 2024 15:39
@martien-de-jong martien-de-jong marked this pull request as ready for review May 31, 2024 13:28
for (Instruction &I : *BB) {
if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
if (const Function *F = cast<CallBase>(I).getCalledFunction()) {
if (!isLoweredToCall(F))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious: Where does isLoweredToCall come from? Is that a generic LLVM function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's generic, with a note that it should be moved to a target-specific hook.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a custom implementation for this PR. See:

void sum(double *a, double *b, double *c) {
    for(int i = 0; i < 30; i++) {
       c[i] = a[i] + b[i];
    }
}

It should walk in sync with GISel legalization rules for libcalls.

; CHECK-NEXT: nop
; CHECK-NEXT: nop
; CHECK-NEXT: mova r6, #0
; CHECK-NEXT: add.nc r5, r1, #-1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you understand the changes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, profitability of low overhead loops was reduced to single block loops. I removed the issue-limit=1, which probably wasn't very clever for this particular test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have restored issue-limit=1.

; CHECK-NEXT: [[C3:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
; CHECK-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[ASSERT_ZEXT]], [[C3]]
; CHECK-NEXT: G_BRCOND [[AND]](s32), %bb.2
; CHECK-NEXT: G_BRCOND [[ASSERT_ZEXT]](s32), %bb.2
Copy link
Collaborator

@gbossu gbossu Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did not change the legalizer, do you know why that test needed to be updated?
Edit: I didn't see it's not your change. But I think it still makes sense to move that diff to the first commit if that's where it belongs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have assumed that the bit analysis has improved in upstream llvm, and now recognises that booleans are inrange.

@andcarminati
Copy link
Collaborator

Hi, the following code can cause some problem in this PR:

target datalayout = "e-m:e-p:20:32-i1:8:32-i8:8:32-i16:16:32-i32:32:32-f32:32:32-i64:32-f64:32-a:0:32-n32"
target triple = "aie2"

define i32 @main() {
entry:
  br label %for.body

for.body:                                  ; preds = %for.body, %entry
  %i = phi i16 [ 0, %entry ], [ 1, %for.body ]
  %cmp = icmp ult i16 %i, 1
  br i1 %cmp, label %for.body, label %if.end

if.end:                                     ; preds = %for.body
  ret i32 0
}

With:

llc --march=aie2 sample.ll  --enable-aie-hardware-loops  --enable-aie-zero-overhead-loops

Gives:

llc: ../llvm/include/llvm/ADT/ilist_iterator.h:138: llvm::ilist_iterator::reference llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::MachineInstr, true, true, void, false>, false, false>::operator*() const [OptionsT = llvm::ilist_detail::node_options<llvm::MachineInstr, true, true, void, false>, IsReverse = false, IsConst = false]: Assertion `!NodePtr->isKnownSentinel()' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ./bin/llc --march=aie2 reduced.ll --enable-aie-hardware-loops --enable-aie-zero-overhead-loops
1.      Running pass 'Function Pass Manager' on module 'reduced.ll'.
2.      Running pass 'InstructionSelect' on function '@main'
 #0 0x0000000005d38f97 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) ./../llvm/lib/Support/Unix/Signals.inc:723:13
 #1 0x0000000005d36f70 llvm::sys::RunSignalHandlers() ./../llvm/lib/Support/Signals.cpp:106:18
 #2 0x0000000005d3964a SignalHandler(int) ./../llvm/lib/Support/Unix/Signals.inc:413:1
 #3 0x00007f6f6c1a5520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007f6f6c1f99fc __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #5 0x00007f6f6c1f99fc __pthread_kill_internal ./nptl/pthread_kill.c:78:10
 #6 0x00007f6f6c1f99fc pthread_kill ./nptl/pthread_kill.c:89:10
 #7 0x00007f6f6c1a5476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #8 0x00007f6f6c18b7f3 abort ./stdlib/abort.c:81:7
 #9 0x00007f6f6c18b71b _nl_load_domain ./intl/loadmsgcat.c:1177:9
#10 0x00007f6f6c19ce96 (/lib/x86_64-linux-gnu/libc.so.6+0x39e96)

@andcarminati
Copy link
Collaborator

Hi @martien-de-jong , another interesting case (sample.ll):

target datalayout = "e-m:e-p:20:32-i1:8:32-i8:8:32-i16:16:32-i32:32:32-f32:32:32-i64:32-f64:32-a:0:32-n32"
target triple = "aie2"

define i32 @main() {
entry:
  br label %for.body

for.body:                                     ; preds = %for.body, %entry
  %i = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %inc = add i32 %i, 1
  %exitcond = icmp eq i32 %inc, 0
  br i1 %exitcond, label %label, label %for.body

label:                                     ; preds = %label, %for.body
  br label %label
}

However, you need the following options (Elf emission, no loop scheduling):

llc --march=aie2 sample.ll  --enable-aie-hardware-loops  --enable-aie-zero-overhead-loops -aie-loop-aware=false -filetype=obj

Result:

<unknown>:0: error: Undefined temporary symbol .L_1120

@@ -0,0 +1,34 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Liar 😏

@@ -0,0 +1,152 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# RUN: llc -mtriple=aie2 --start-after=instruction-select \
# RUN: --stop-before=aie-finalize-mi-bundles %s -o - | FileCheck %s
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super-Nit: I'm changing the scheduler so it outputs "correct" Bundles. Could you change that line to --stop-after=aie-finalize-mi-bundles? This way there won't be test updates.

Cond.push_back(MachineOperand::CreateImm(I->getOpcode()));
Cond.push_back(I->getOperand(0));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that hurt to keep Cond.push_back(I->getOperand(0));? I'm still struggling to understand what kind of API analyzeBranch has.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The api is that the target can push whatever it needs to reconstruct/invert a branch. We don't actually need that third operand, and I like occam's razor.

if (isHardwareLoopEnd(Opc)) {
CBranchBuilder.addMBB(TBB).add(Cond[1]);
} else {
CBranchBuilder.add(Cond[1]).addMBB(TBB);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super-nit: Maybe we could define PseudoLoopEnd to have the same operand order as other branches?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deal!

Copy link
Collaborator

@gbossu gbossu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm done with the review. I think it looks good! Please go through the remaining comments, I'd be happy if some tests are moved around, but it's not such a big deal :)

gbossu
gbossu previously approved these changes Jun 27, 2024
andcarminati
andcarminati previously approved these changes Jun 28, 2024
Copy link
Collaborator

@andcarminati andcarminati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work, credits to you and the origin author.

Martien de Jong and others added 10 commits June 28, 2024 14:36
lower symbol in MC lowering
Also make PseudoLoopEnd a meta instruction to simplify emit logic

Make sure LoopStart/LoopEnd don't get duplicated in e.g. TailDuplication
PseudoLoopEnd is very similar to a regular conditional branch.
We need two Cond elements in order to reconstruct the instruction,
one is the opcode, the other is the condition register for JZ/JNZ and
the last-bundle label for PseudoLoopEnd

The operand order of PseudoLoopEnd was swapped to make it congruent to
the other conditonal branches

insertBranch needs to generate unconditional branch for FBB
even after PseudoLoopEnd.
Completely remove empty ZOL
@martien-de-jong martien-de-jong enabled auto-merge (rebase) June 28, 2024 13:56
@martien-de-jong martien-de-jong merged commit 8185e31 into aie-public Jun 28, 2024
8 checks passed
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
# RUN: llc -O2 -mtriple=aie2 -run-pass=instruction-select %s -verify-machineinstrs -o - | FileCheck %s

---
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of tests here are missing the license & copyright header. Could you please add those in a fixup PR @martien-de-jong?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants