LAA: version unit stride for stores #124567
Conversation
Since SymbolicStrides are available when we iterate on stores in analyzeLoop, use the information to refine the loop-invariant check, resulting in better vectorization of degenerate cases.
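For illustration, here is a minimal sketch of the kind of loop this targets (a hypothetical source loop, not taken from the PR; all names are invented): the store's stride is a symbolic value, and versioning on it being 1 makes the access unit-stride.

```c++
// Hypothetical example: 'scale' is symbolic, so the stride of the store
// is unknown at compile time. Stride versioning emits a runtime check
// (scale == 1) guarding a copy of the loop in which the access is the
// unit-stride dst[i], which LAA/LV can reason about exactly.
void store_symbolic_stride(int *dst, int scale, int n) {
  for (int i = 0; i < n; ++i)
    dst[i * scale] = i;
}
```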
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-llvm-analysis

Author: Ramkumar Ramachandra (artagnon)

Changes

Since SymbolicStrides are available when we iterate on stores in analyzeLoop, use the information to refine the loop-invariant check, resulting in better vectorization of degenerate cases.

Full diff: https://github.com/llvm/llvm-project/pull/124567.diff

4 Files Affected:
diff --git a/llvm/lib/Analysis/LoopAccessAnalysis.cpp b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
index 697b40403902cc..f8c8bfe87d23ea 100644
--- a/llvm/lib/Analysis/LoopAccessAnalysis.cpp
+++ b/llvm/lib/Analysis/LoopAccessAnalysis.cpp
@@ -2545,7 +2545,8 @@ bool LoopAccessInfo::analyzeLoop(AAResults *AA, const LoopInfo *LI,
for (StoreInst *ST : Stores) {
Value *Ptr = ST->getPointerOperand();
- if (isInvariant(Ptr)) {
+ const SCEV *PtrScev = replaceSymbolicStrideSCEV(*PSE, SymbolicStrides, Ptr);
+ if (PSE->getSE()->isLoopInvariant(PtrScev, TheLoop)) {
// Record store instructions to loop invariant addresses
StoresToInvariantAddresses.push_back(ST);
HasStoreStoreDependenceInvolvingLoopInvariantAddress |=
diff --git a/llvm/test/Analysis/LoopAccessAnalysis/invariant-dependence-before.ll b/llvm/test/Analysis/LoopAccessAnalysis/invariant-dependence-before.ll
index 2139804753ef55..123cfaca059fe6 100644
--- a/llvm/test/Analysis/LoopAccessAnalysis/invariant-dependence-before.ll
+++ b/llvm/test/Analysis/LoopAccessAnalysis/invariant-dependence-before.ll
@@ -784,3 +784,42 @@ loop:
exit:
ret void
}
+
+define void @stores_to_invariant_address(i32 %offset, ptr noalias %dst.1, ptr %dst.2) {
+; CHECK-LABEL: 'stores_to_invariant_address'
+; CHECK-NEXT: loop:
+; CHECK-NEXT: Memory dependences are safe
+; CHECK-NEXT: Dependences:
+; CHECK-NEXT: Run-time memory checks:
+; CHECK-NEXT: Grouped accesses:
+; CHECK-EMPTY:
+; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.
+; CHECK-NEXT: SCEV assumptions:
+; CHECK-NEXT: Equal predicate: %offset == 1
+; CHECK-EMPTY:
+; CHECK-NEXT: Expressions re-written:
+; CHECK-NEXT: [PSE] %gep = getelementptr i32, ptr %dst.2, i32 %iv.2.mul:
+; CHECK-NEXT: ((4 * (sext i32 {0,+,%offset}<%loop> to i64))<nsw> + %dst.2)
+; CHECK-NEXT: --> {%dst.2,+,4}<nw><%loop>
+;
+entry:
+ %add = add i32 %offset, 3
+ br label %loop
+
+loop:
+ %iv.2 = phi i32 [ 0, %entry ], [ %iv.2.next, %loop ]
+ %iv.3 = phi i32 [ 0, %entry ], [ %iv.3.next, %loop ]
+ %iv.mul = mul i32 %iv.3, %add
+ %gep.mul = getelementptr i8, ptr %dst.1, i32 %iv.mul
+ store i32 0, ptr %gep.mul, align 8
+ %iv.2.mul = mul i32 %iv.2, %offset
+ %gep = getelementptr i32, ptr %dst.2, i32 %iv.2.mul
+ store i32 0, ptr %gep, align 8
+ %iv.2.next = add i32 %iv.2, 1
+ %iv.3.next = add i32 %iv.3, 1
+ %ec = icmp eq i32 %iv.3, 200
+ br i1 %ec, label %exit, label %loop
+
+exit:
+ ret void
+}
diff --git a/llvm/test/Transforms/LoopVectorize/strided-accesses-interleave-only.ll b/llvm/test/Transforms/LoopVectorize/strided-accesses-interleave-only.ll
index bc6fb48fd2cef3..e065717240ad0d 100644
--- a/llvm/test/Transforms/LoopVectorize/strided-accesses-interleave-only.ll
+++ b/llvm/test/Transforms/LoopVectorize/strided-accesses-interleave-only.ll
@@ -5,26 +5,27 @@ define void @test_variable_stride(ptr %dst, i32 %scale) {
; CHECK-LABEL: define void @test_variable_stride
; CHECK-SAME: (ptr [[DST:%.*]], i32 [[SCALE:%.*]]) {
; CHECK-NEXT: entry:
-; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]
+; CHECK: vector.scevcheck:
+; CHECK-NEXT: [[IDENT_CHECK:%.*]] = icmp ne i32 [[SCALE]], 1
+; CHECK-NEXT: br i1 [[IDENT_CHECK]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[INDEX]], 1
-; CHECK-NEXT: [[TMP2:%.*]] = mul i32 [[TMP0]], [[SCALE]]
-; CHECK-NEXT: [[TMP3:%.*]] = mul i32 [[TMP1]], [[SCALE]]
-; CHECK-NEXT: [[TMP4:%.*]] = getelementptr i16, ptr [[DST]], i32 [[TMP2]]
-; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i16, ptr [[DST]], i32 [[TMP3]]
-; CHECK-NEXT: store i32 [[TMP0]], ptr [[TMP4]], align 2
-; CHECK-NEXT: store i32 [[TMP1]], ptr [[TMP5]], align 2
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr i16, ptr [[DST]], i32 [[TMP0]]
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i16, ptr [[DST]], i32 [[TMP1]]
+; CHECK-NEXT: store i32 [[TMP0]], ptr [[TMP2]], align 2
+; CHECK-NEXT: store i32 [[TMP1]], ptr [[TMP3]], align 2
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
-; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
-; CHECK-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
+; CHECK-NEXT: br i1 [[TMP4]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: middle.block:
; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
-; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ 1000, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ 1000, [[MIDDLE_BLOCK]] ], [ 0, [[VECTOR_SCEVCHECK]] ], [ 0, [[ENTRY:%.*]] ]
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
diff --git a/llvm/test/Transforms/LoopVectorize/version-stride-with-integer-casts.ll b/llvm/test/Transforms/LoopVectorize/version-stride-with-integer-casts.ll
index 791c995d88c146..2e231541563ee6 100644
--- a/llvm/test/Transforms/LoopVectorize/version-stride-with-integer-casts.ll
+++ b/llvm/test/Transforms/LoopVectorize/version-stride-with-integer-casts.ll
@@ -326,46 +326,23 @@ define void @test_versioned_with_non_ex_use(i32 %offset, ptr noalias %dst.1, ptr
; CHECK-NEXT: [[ADD:%.*]] = add i32 [[OFFSET]], 3
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]
; CHECK: vector.scevcheck:
-; CHECK-NEXT: [[TMP0:%.*]] = sub i32 -3, [[OFFSET]]
-; CHECK-NEXT: [[TMP1:%.*]] = icmp slt i32 [[ADD]], 0
-; CHECK-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i32 [[TMP0]], i32 [[ADD]]
-; CHECK-NEXT: [[MUL:%.*]] = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 [[TMP2]], i32 200)
-; CHECK-NEXT: [[MUL_RESULT:%.*]] = extractvalue { i32, i1 } [[MUL]], 0
-; CHECK-NEXT: [[MUL_OVERFLOW:%.*]] = extractvalue { i32, i1 } [[MUL]], 1
-; CHECK-NEXT: [[TMP3:%.*]] = sub i32 0, [[MUL_RESULT]]
-; CHECK-NEXT: [[TMP4:%.*]] = icmp slt i32 [[MUL_RESULT]], 0
-; CHECK-NEXT: [[TMP5:%.*]] = icmp sgt i32 [[TMP3]], 0
-; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP1]], i1 [[TMP5]], i1 [[TMP4]]
-; CHECK-NEXT: [[TMP7:%.*]] = or i1 [[TMP6]], [[MUL_OVERFLOW]]
; CHECK-NEXT: [[IDENT_CHECK:%.*]] = icmp ne i32 [[OFFSET]], 1
-; CHECK-NEXT: [[TMP8:%.*]] = or i1 [[TMP7]], [[IDENT_CHECK]]
-; CHECK-NEXT: br i1 [[TMP8]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
+; CHECK-NEXT: br i1 [[IDENT_CHECK]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
-; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[ADD]], i64 0
-; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT: [[TMP10:%.*]] = mul <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x i32> [[TMP10]], i32 0
-; CHECK-NEXT: [[TMP12:%.*]] = getelementptr i8, ptr [[DST_1]], i32 [[TMP11]]
-; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x i32> [[TMP10]], i32 1
-; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i8, ptr [[DST_1]], i32 [[TMP13]]
-; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x i32> [[TMP10]], i32 2
-; CHECK-NEXT: [[TMP16:%.*]] = getelementptr i8, ptr [[DST_1]], i32 [[TMP15]]
-; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x i32> [[TMP10]], i32 3
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = trunc i64 [[INDEX]] to i32
+; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[OFFSET_IDX]], 0
+; CHECK-NEXT: [[TMP17:%.*]] = mul i32 [[TMP1]], [[ADD]]
; CHECK-NEXT: [[TMP18:%.*]] = getelementptr i8, ptr [[DST_1]], i32 [[TMP17]]
-; CHECK-NEXT: store i32 0, ptr [[TMP12]], align 8
-; CHECK-NEXT: store i32 0, ptr [[TMP14]], align 8
-; CHECK-NEXT: store i32 0, ptr [[TMP16]], align 8
-; CHECK-NEXT: store i32 0, ptr [[TMP18]], align 8
+; CHECK-NEXT: [[TMP4:%.*]] = getelementptr i32, ptr [[TMP18]], i32 0
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[TMP4]], align 8
; CHECK-NEXT: [[TMP20:%.*]] = getelementptr i32, ptr [[DST_2]], i64 [[TMP9]]
; CHECK-NEXT: [[TMP21:%.*]] = getelementptr i32, ptr [[TMP20]], i32 0
; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[TMP21]], align 8
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
-; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], splat (i32 4)
; CHECK-NEXT: [[TMP22:%.*]] = icmp eq i64 [[INDEX_NEXT]], 200
; CHECK-NEXT: br i1 [[TMP22]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
; CHECK: middle.block:
In llvm/test/Transforms/LoopVectorize/strided-accesses-interleave-only.ll:

; CHECK-NEXT:    [[IDENT_CHECK:%.*]] = icmp ne i32 [[SCALE]], 1
; CHECK-NEXT:    br i1 [[IDENT_CHECK]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
This looks like a regression?
Unsure about this one, actually: nothing is actually vectorized, if you look further down. It's a degenerate case, and the motivation for the patch was to follow the same flow for stores as we do for loads. I'm still quite unsure and confused about the test updates, though.
I think it depends on whether you enter the loop or not, right? Certainly in the vector loop the code is faster without the mul. I would imagine that turning on symbolic stride predicates for loads would have done the same thing in the past? Is this not just a special case because we're (presumably) choosing VF=1, IC=2?
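As a sketch of that point (assuming the VF=1, IC=2 shape the comment presumes; names invented, not from the test):

```c++
// Under the scale == 1 guard, i * scale folds to i, so the two muls
// per iteration in the interleaved body disappear.
void interleaved_body(int *dst) {
  for (int i = 0; i < 1000; i += 2) {
    dst[i] = i;         // was dst[i * scale]
    dst[i + 1] = i + 1; // second interleaved lane
  }
}
```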
Yeah, it would be faster with the mul removed, but this is not the reason we added the predicate, and now we can only execute the interleaved loop if scale == 1.
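The tradeoff can be sketched as source-level control flow (a hypothetical rendering of the vector.scevcheck CHECK lines above; names invented):

```c++
// Sketch: the identity check gates the interleaved loop, so any
// scale != 1 now runs the scalar fallback instead.
void interleaved_body(int *dst); // the unit-stride sketch above

void versioned_dispatch(int *dst, int scale) {
  if (scale != 1) {                // vector.scevcheck: IDENT_CHECK
    for (int i = 0; i < 1000; ++i) // original scalar loop
      dst[i * scale] = i;
  } else {
    interleaved_body(dst);         // fast path, muls folded away
  }
}
```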
Hmm, it seems a shame to close this, because what you are doing here seems perfectly sensible. There seems to be a genuine improvement in llvm/test/Transforms/LoopVectorize/version-stride-with-integer-casts.ll.
We already replace the symbolic strides for loads, so it seems reasonable to do the same for stores. Any problems we see for stores are surely also problems for loads, right?
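To illustrate the symmetry being argued for (an invented example, not from the PR's tests): a single stride predicate already versions the load side, and the store side would be versioned by the same assumption.

```c++
// Hypothetical loop: both accesses share the symbolic stride 'scale'.
// Loads already benefit from the scale == 1 predicate; with this patch
// the store is versioned by the same assumption, so no new risk appears.
void copy_strided(int *dst, const int *src, int scale, int n) {
  for (int i = 0; i < n; ++i)
    dst[i * scale] = src[i * scale];
}
```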
I would defer to @fhahn's judgement on this: yes, the patch seems sensible to me too, but there seems to be an unfortunate regression (?). Will re-open if there is some way forward.
Since SymbolicStrides are available when we iterate on stores in analyzeLoop, use the information to version unit stride for stores, just as we do for loads.