Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LAA: version unit stride for stores #124567

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion llvm/lib/Analysis/LoopAccessAnalysis.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2545,7 +2545,8 @@ bool LoopAccessInfo::analyzeLoop(AAResults *AA, const LoopInfo *LI,
for (StoreInst *ST : Stores) {
Value *Ptr = ST->getPointerOperand();

if (isInvariant(Ptr)) {
const SCEV *PtrScev = replaceSymbolicStrideSCEV(*PSE, SymbolicStrides, Ptr);
artagnon marked this conversation as resolved.
Show resolved Hide resolved
if (PSE->getSE()->isLoopInvariant(PtrScev, TheLoop)) {
// Record store instructions to loop invariant addresses
StoresToInvariantAddresses.push_back(ST);
HasStoreStoreDependenceInvolvingLoopInvariantAddress |=
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -784,3 +784,42 @@ loop:
exit:
ret void
}

define void @stores_to_invariant_address(i32 %offset, ptr noalias %dst.1, ptr %dst.2) {
; CHECK-LABEL: 'stores_to_invariant_address'
; CHECK-NEXT: loop:
; CHECK-NEXT: Memory dependences are safe
; CHECK-NEXT: Dependences:
; CHECK-NEXT: Run-time memory checks:
; CHECK-NEXT: Grouped accesses:
; CHECK-EMPTY:
; CHECK-NEXT: Non vectorizable stores to invariant address were not found in loop.
; CHECK-NEXT: SCEV assumptions:
; CHECK-NEXT: Equal predicate: %offset == 1
; CHECK-EMPTY:
; CHECK-NEXT: Expressions re-written:
; CHECK-NEXT: [PSE] %gep = getelementptr i32, ptr %dst.2, i32 %iv.2.mul:
; CHECK-NEXT: ((4 * (sext i32 {0,+,%offset}<%loop> to i64))<nsw> + %dst.2)
; CHECK-NEXT: --> {%dst.2,+,4}<nw><%loop>
;
entry:
%add = add i32 %offset, 3
br label %loop

loop:
%iv.2 = phi i32 [ 0, %entry ], [ %iv.2.next, %loop ]
%iv.3 = phi i32 [ 0, %entry ], [ %iv.3.next, %loop ]
%iv.mul = mul i32 %iv.3, %add
%gep.mul = getelementptr i8, ptr %dst.1, i32 %iv.mul
store i32 0, ptr %gep.mul, align 8
%iv.2.mul = mul i32 %iv.2, %offset
%gep = getelementptr i32, ptr %dst.2, i32 %iv.2.mul
store i32 0, ptr %gep, align 8
%iv.2.next = add i32 %iv.2, 1
%iv.3.next = add i32 %iv.3, 1
%ec = icmp eq i32 %iv.3, 200
br i1 %ec, label %exit, label %loop

exit:
ret void
}
Original file line number Diff line number Diff line change
Expand Up @@ -5,26 +5,27 @@ define void @test_variable_stride(ptr %dst, i32 %scale) {
; CHECK-LABEL: define void @test_variable_stride
; CHECK-SAME: (ptr [[DST:%.*]], i32 [[SCALE:%.*]]) {
; CHECK-NEXT: entry:
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]
; CHECK: vector.scevcheck:
; CHECK-NEXT: [[IDENT_CHECK:%.*]] = icmp ne i32 [[SCALE]], 1
; CHECK-NEXT: br i1 [[IDENT_CHECK]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
Comment on lines +10 to +11
Copy link
Contributor

@fhahn fhahn Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a regression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure about this one actually: nothing is actually vectorized, if you look further down. It's a degenerate case, and the motivation for the patch was following the same flow for stores as we do for loads: still quite unsure and confused about the test updates though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends whether you enter the loop or not, right? Certainly in the vector loop the code is faster without the mul. I would imagine that turning on symbolic stride predicates for loads would have done the same thing in the past? Is this not just a special case because we're (presumably) choosing VF=1, IC=2?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it would be faster with the mul removed, but this is not the reason we added the predicate, and now we can only execute the interleaved loop if scale == 1

; CHECK: vector.ph:
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[INDEX]], 1
; CHECK-NEXT: [[TMP2:%.*]] = mul i32 [[TMP0]], [[SCALE]]
; CHECK-NEXT: [[TMP3:%.*]] = mul i32 [[TMP1]], [[SCALE]]
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr i16, ptr [[DST]], i32 [[TMP2]]
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i16, ptr [[DST]], i32 [[TMP3]]
; CHECK-NEXT: store i32 [[TMP0]], ptr [[TMP4]], align 2
; CHECK-NEXT: store i32 [[TMP1]], ptr [[TMP5]], align 2
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr i16, ptr [[DST]], i32 [[TMP0]]
; CHECK-NEXT: [[TMP3:%.*]] = getelementptr i16, ptr [[DST]], i32 [[TMP1]]
; CHECK-NEXT: store i32 [[TMP0]], ptr [[TMP2]], align 2
; CHECK-NEXT: store i32 [[TMP1]], ptr [[TMP3]], align 2
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP4]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: middle.block:
; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ 1000, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ 1000, [[MIDDLE_BLOCK]] ], [ 0, [[VECTOR_SCEVCHECK]] ], [ 0, [[ENTRY:%.*]] ]
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -326,46 +326,23 @@ define void @test_versioned_with_non_ex_use(i32 %offset, ptr noalias %dst.1, ptr
; CHECK-NEXT: [[ADD:%.*]] = add i32 [[OFFSET]], 3
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]
; CHECK: vector.scevcheck:
; CHECK-NEXT: [[TMP0:%.*]] = sub i32 -3, [[OFFSET]]
; CHECK-NEXT: [[TMP1:%.*]] = icmp slt i32 [[ADD]], 0
; CHECK-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i32 [[TMP0]], i32 [[ADD]]
; CHECK-NEXT: [[MUL:%.*]] = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 [[TMP2]], i32 200)
; CHECK-NEXT: [[MUL_RESULT:%.*]] = extractvalue { i32, i1 } [[MUL]], 0
; CHECK-NEXT: [[MUL_OVERFLOW:%.*]] = extractvalue { i32, i1 } [[MUL]], 1
; CHECK-NEXT: [[TMP3:%.*]] = sub i32 0, [[MUL_RESULT]]
; CHECK-NEXT: [[TMP4:%.*]] = icmp slt i32 [[MUL_RESULT]], 0
; CHECK-NEXT: [[TMP5:%.*]] = icmp sgt i32 [[TMP3]], 0
; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP1]], i1 [[TMP5]], i1 [[TMP4]]
; CHECK-NEXT: [[TMP7:%.*]] = or i1 [[TMP6]], [[MUL_OVERFLOW]]
; CHECK-NEXT: [[IDENT_CHECK:%.*]] = icmp ne i32 [[OFFSET]], 1
; CHECK-NEXT: [[TMP8:%.*]] = or i1 [[TMP7]], [[IDENT_CHECK]]
; CHECK-NEXT: br i1 [[TMP8]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
; CHECK-NEXT: br i1 [[IDENT_CHECK]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[ADD]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP10:%.*]] = mul <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x i32> [[TMP10]], i32 0
; CHECK-NEXT: [[TMP12:%.*]] = getelementptr i8, ptr [[DST_1]], i32 [[TMP11]]
; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x i32> [[TMP10]], i32 1
; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i8, ptr [[DST_1]], i32 [[TMP13]]
; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x i32> [[TMP10]], i32 2
; CHECK-NEXT: [[TMP16:%.*]] = getelementptr i8, ptr [[DST_1]], i32 [[TMP15]]
; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x i32> [[TMP10]], i32 3
; CHECK-NEXT: [[OFFSET_IDX:%.*]] = trunc i64 [[INDEX]] to i32
; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[OFFSET_IDX]], 0
; CHECK-NEXT: [[TMP17:%.*]] = mul i32 [[TMP1]], [[ADD]]
; CHECK-NEXT: [[TMP18:%.*]] = getelementptr i8, ptr [[DST_1]], i32 [[TMP17]]
; CHECK-NEXT: store i32 0, ptr [[TMP12]], align 8
; CHECK-NEXT: store i32 0, ptr [[TMP14]], align 8
; CHECK-NEXT: store i32 0, ptr [[TMP16]], align 8
; CHECK-NEXT: store i32 0, ptr [[TMP18]], align 8
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr i32, ptr [[TMP18]], i32 0
; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[TMP4]], align 8
; CHECK-NEXT: [[TMP20:%.*]] = getelementptr i32, ptr [[DST_2]], i64 [[TMP9]]
; CHECK-NEXT: [[TMP21:%.*]] = getelementptr i32, ptr [[TMP20]], i32 0
; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[TMP21]], align 8
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], splat (i32 4)
; CHECK-NEXT: [[TMP22:%.*]] = icmp eq i64 [[INDEX_NEXT]], 200
; CHECK-NEXT: br i1 [[TMP22]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
; CHECK: middle.block:
Expand Down
Loading