[AIE2P] Combine G_SHUFFLE_VECTOR to G_AIE_EXTRACT_SUBVECTOR #302
base: aie-public
978972b to 2903dc3
const unsigned NumDstElems = DstTy.getNumElements();
const unsigned NumSrcElems = Src1Ty.getNumElements();
if (NumDstElems < NumSrcElems) {
I guess that this could prevent a broadcast of an element of a bigger vector into a smaller one, no?
Yes, but as far as I can see the matchShuffleToBroadcast combine doesn't cover this case. Without this check it fails for the following test:
%0:_(<4 x s8>) = G_SHUFFLE_VECTOR %1(<8 x s8>), %2(<8 x s8>), shufflemask(-1, 5, -1, 3)
because createDuplicatePatternMask returns a mask of size NumDstElems / NumSrcElems = 0.
In general, this PR only combines G_SHUFFLE_VECTOR to G_AIE_EXTRACT_SUBVECTOR, but there will be a follow-up PR that extracts a subvector and broadcasts it, and I think I can try to cover this case (matchShuffleToBroadcast) there.
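The failure described above comes down to integer division: with 4 destination elements and 8 source elements, NumDstElems / NumSrcElems truncates to 0. A minimal standalone sketch of that arithmetic follows. The helpers duplicateCount and canBeDuplicatePattern are hypothetical names used only for illustration, and the sketch assumes the guard in the diff simply bails out of the broadcast match.

```cpp
#include <cassert>

// Hypothetical stand-in for the size computation mentioned in the comment:
// how many copies of the source pattern would fill the destination.
unsigned duplicateCount(unsigned NumDstElems, unsigned NumSrcElems) {
  assert(NumSrcElems != 0 && "source vector must have elements");
  return NumDstElems / NumSrcElems; // integer division truncates toward zero
}

// Assumed behaviour of the early exit in the diff: a destination that is
// narrower than the source is rejected before any mask is built.
bool canBeDuplicatePattern(unsigned NumDstElems, unsigned NumSrcElems) {
  if (NumDstElems < NumSrcElems)
    return false;
  return duplicateCount(NumDstElems, NumSrcElems) > 0;
}
```

For the <4 x s8> from <8 x s8> test case, duplicateCount(4, 8) is 0, which is why the unguarded path ends up with an empty mask.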
@@ -227,3 +227,129 @@ body: |
$x0 = COPY %0(<16 x s32>)
PseudoRET implicit $lr, implicit $x0
...
---
What about this case:
---
name: shuffle_vector_to_extract_subvec_4x8Dst
tracksRegLiveness: true
body: |
bb.1:
liveins: $l0, $l1
%1:_(<8 x s8>) = COPY $l0
%2:_(<8 x s8>) = COPY $l1
%0:_(<4 x s8>) = G_SHUFFLE_VECTOR %1(<8 x s8>), %2(<8 x s8>), shufflemask(8, 9, 10, 11)
PseudoRET implicit $lr, implicit %0
Right now we just discard this case, but we could return the first subregister of the second source register.
What do you think?
I also thought about it, but @konstantinschwarz said there is an upstream PR for canonicalization: if all mask indices refer to the second source vector, the source vectors are swapped and the indices are rewritten so that they refer to the new first source vector (previously source vector 2). This means we only need to implement the combines for source vector 1.
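The canonicalization described in this comment can be sketched on a plain mask, independent of LLVM. This is only an illustration of the idea, not the upstream implementation: CanonicalShuffle and canonicalizeToFirstSource are hypothetical names, and the sources are modelled by a single Swapped flag.

```cpp
#include <cassert>
#include <vector>

// Simplified model of a two-source shuffle after canonicalization.
struct CanonicalShuffle {
  bool Swapped;          // true if source 1 and source 2 were exchanged
  std::vector<int> Mask; // mask indices, rewritten if Swapped is true
};

// If every defined mask index refers to the second source, swap the sources
// and rebase the indices so they refer to the new first source.
CanonicalShuffle canonicalizeToFirstSource(const std::vector<int> &Mask,
                                           int NumSrcElems) {
  assert(NumSrcElems > 0 && "source vector must have elements");
  bool AnyDefined = false;
  for (int Idx : Mask) {
    if (Idx == -1) // undef lane, carries no information
      continue;
    AnyDefined = true;
    if (Idx < NumSrcElems) // reads from source 1, nothing to do
      return {false, Mask};
  }
  if (!AnyDefined)
    return {false, Mask};

  std::vector<int> NewMask;
  NewMask.reserve(Mask.size());
  for (int Idx : Mask)
    NewMask.push_back(Idx == -1 ? -1 : Idx - NumSrcElems);
  return {true, NewMask};
}
```

With two <8 x s8> sources, shufflemask(8, 9, 10, 11) becomes shufflemask(0, 1, 2, 3) on the swapped sources, so an extract combine written only for source 1 would then apply.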
BuildFnTy &MatchInfo) {
  assert(MI.getOpcode() == TargetOpcode::G_SHUFFLE_VECTOR);
  const Register DstReg = MI.getOperand(0).getReg();
  const Register Src1Reg = MI.getOperand(1).getReg();
I see a nice opportunity here by considering extracts from the second input operand.
2903dc3 to 4a12a1f
@@ -126,3 +126,10 @@ def G_AIE_VSHIFT_RIGHT : AIEGenericInstruction {
  let InOperandList = (ins type0:$src1, type0:$src2, type1:$shift_amt);
  let hasSideEffects = false;
}

// Extract 32-bit or 64-bit subvector.
def G_AIE_EXTRACT_SUBVECTOR : AIEGenericInstruction {
Since we are only extracting 32/64 bits, is it really a subvector extract? Can we use G_AIE_SEXT_EXTRACT_VECTOR_ELT?
Maybe we could reuse it and extend it to accept an immediate index. G_AIE_SEXT_EXTRACT_VECTOR_ELT also expands the index to a register, so as is, it is not interesting.
The comment must be changed. It extracts 32/64-bit vectors: 4 x s8 or 2 x s16 for 32-bit, and 8 x s8, 4 x s16 and 2 x s32 for 64-bit. I already discussed with @konstantinschwarz extending G_AIE_SEXT_EXTRACT_VECTOR_ELT to also cover output vectors, not only elements, but we came to the conclusion that it would be confusing for G_AIE_SEXT_EXTRACT_VECTOR_ELT to extract a vector and not an element, so we decided to introduce G_AIE_EXTRACT_SUBVECTOR for extracting vectors.
auto ExtractMask = createSequentialMask(Start, NumElems, 0);

for (unsigned I = 0; I < NumDstElems; I++) {
  if (Mask[I] == -1)
nit: comment about ignoring undefs.
if (Mask[I] == -1)
  continue;

if (Mask[I] != ExtractMask[I])
What about a more compact implementation:
auto CheckExtractMask = [&](unsigned Start, unsigned NumElems) -> bool {
  auto ExtractMask = createSequentialMask(Start, NumElems, 0);
  return std::equal(Mask.begin(), Mask.end(), ExtractMask.begin(),
                    ExtractMask.end(), [&](const int LHS, const int RHS) {
                      return (LHS == -1 || LHS == RHS);
                    });
};
@@ -719,6 +719,28 @@ def : Pat<(int_aie2p_vinsert64_bf512 VEC512:$s1, eR29:$idx, eL:$s0),
def : Pat<(int_aie2p_vinsert32_accfloat ACC512:$s1, eR29:$idx, eR:$s0),
          (COPY_TO_REGCLASS (VINSERT_32_mR29_insert (COPY_TO_REGCLASS ACC512:$s1, VEC512), eR29:$idx, eR:$s0), ACC512)>;

// VEXTRACT
We are selecting like this in the prelegalizer: (<4 x s8>) = G_AIE_EXTRACT_SUBVECTOR [[COPY]](<8 x s8>), [[C]](s32). The output type differs from the selection patterns below.
4a12a1f to cd13ed1
The only native source vector type for |
llvm/lib/Target/AIE/AIEInstrGISel.td
Outdated
@@ -126,3 +126,10 @@ def G_AIE_VSHIFT_RIGHT : AIEGenericInstruction {
  let InOperandList = (ins type0:$src1, type0:$src2, type1:$shift_amt);
  let hasSideEffects = false;
}

// Extract 32-bit or 64-bit subvector.
Can we include 16-bit as well, so that it can extract <2 x s8>?
I discussed it with @konstantinschwarz and we decided to introduce G_AIE_EXTRACT_SUBVECTOR to cover, e.g., the following case: shufflemask <0,1,0,1,0,1> -> bcst (extract (src <0,1>)). The input for bcst is a 32- or 64-bit register, so at least for now we only allow G_AIE_EXTRACT_SUBVECTOR to extract 32- and 64-bit vectors, because this combine is currently the only place where we use G_AIE_EXTRACT_SUBVECTOR.
if (Src1TySize == 32) {
  B.buildConcatVectors(
      {NewSrcReg}, {Src1Reg, ImplicitDef, ImplicitDef, ImplicitDef,
                    ImplicitDef, ImplicitDef, ImplicitDef, ImplicitDef,
                    ImplicitDef, ImplicitDef, ImplicitDef, ImplicitDef,
                    ImplicitDef, ImplicitDef, ImplicitDef, ImplicitDef});
} else if (Src1TySize == 64) {
  B.buildConcatVectors(
      {NewSrcReg}, {Src1Reg, ImplicitDef, ImplicitDef, ImplicitDef,
                    ImplicitDef, ImplicitDef, ImplicitDef, ImplicitDef});
} else if (Src1TySize == 128) {
  B.buildConcatVectors({NewSrcReg},
                       {Src1Reg, ImplicitDef, ImplicitDef, ImplicitDef});
} else { // Src1TySize == 256
  B.buildConcatVectors({NewSrcReg}, {Src1Reg, ImplicitDef});
Can we do better here, to avoid the code duplication? Maybe use a for loop to generate the concat vector operands.
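One way to read this suggestion is to compute the operand count from the sizes instead of spelling out each case. The sketch below models registers as plain integers so it can stand alone; Reg, buildPaddedConcatOps and the 512-bit default are assumptions for illustration, not code from the PR.

```cpp
#include <cassert>
#include <vector>

using Reg = unsigned; // stand-in for llvm::Register in this sketch

// Build the operand list for a concat that widens Src to TargetSizeInBits:
// the real source first, followed by as many undef registers as needed.
std::vector<Reg> buildPaddedConcatOps(Reg Src, Reg ImplicitDef,
                                      unsigned SrcSizeInBits,
                                      unsigned TargetSizeInBits = 512) {
  assert(SrcSizeInBits != 0 && TargetSizeInBits % SrcSizeInBits == 0 &&
         "target size must be a multiple of the source size");
  const unsigned NumOps = TargetSizeInBits / SrcSizeInBits;
  std::vector<Reg> Ops(NumOps, ImplicitDef); // fill with undef operands
  Ops[0] = Src;                              // first operand is the source
  return Ops;
}
```

This yields 16, 8, 4 and 2 operands for 32-, 64-, 128- and 256-bit sources, matching the four hand-written branches in the diff, and the resulting list could be passed to a single buildConcatVectors call.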
}

if (Src1TySize == 2048) {
  NewSubIdx = SubIdx % (NumSubVectors / 4);
Nit: Maybe using a separate variable for NumSubVectors / 4 would increase readability.
const Register NewSrcReg = MRI.createGenericVirtualRegister(NewSrc1Ty);

// 32, 64, 126, 256 bit source vectors
if (Src1TySize < 512) {
Why do we need to support 32- and 64-bit source vectors?
Why not?
The minimum extract subvector size is 32 bits, so what does extracting 32 bits from a 32-bit vector mean?
This covers this case: %0:_(<4 x s8>) = G_SHUFFLE_VECTOR %1(<4 x s8>), %2(<4 x s8>), shufflemask(0, 1, 2, 3)
Isn't that equivalent to %0 = COPY %1?
yes, I've added this already to my implementation. See this test: https://github.com/Xilinx/llvm-aie/pull/302/files#:~:text=%2D%2D%2D-,name%3A%20shuffle_vector_to_extract_subvec_32BitSrc,-tracksRegLiveness%3A%20true
; CHECK-NEXT: PseudoRET implicit $lr, implicit [[AIE_EXTRACT_SUBVECTOR]](<4 x s8>)
%1:_(<4 x s8>) = COPY $r0
%2:_(<4 x s8>) = COPY $r1
%0:_(<4 x s8>) = G_SHUFFLE_VECTOR %1(<4 x s8>), %2(<4 x s8>), shufflemask(0, 1, 2, 3)
It seems %0:_(<4 x s8>) = COPY %1 is more optimal.
Not sure whether this is a phase ordering problem or something we do not support yet.
It is a good point but we don't support this combine (G_SHUFFLE_VECTOR->COPY) yet.
But maybe I can cover this case in my function and then this will be more optimal and I won't need support for 32 bit vectors
I added this case to my implementation
cd13ed1 to c2e822d
c2e822d to 490891c
}

if (Src1TySize == 2048) {
  const unsigned NumSubVectors512Bits = NumSubVectors / 4;
I think this code can be simplified. Just one unmerge in the end.
Yes, put the UnusedSubregs in an array and build the operand vector from it.
@@ -145,6 +145,10 @@ struct AIEBaseInstrInfo : public TargetInstrInfo {
  virtual unsigned getGenericVShiftOpcode() const {
    llvm_unreachable("Target didn't implement getGenericVShiftOpcode!");
  }
  /// Return the opcode to be used for subvector extraction.
  virtual unsigned getGenericExtractSubvectorOpcode() const {
    llvm_unreachable("Target didn't implement getGenericVSelOpcode!");
nit: incorrect function name in the message
/// %1:_(<16 x s16>) = COPY $wl0
/// %2:_(<16 x s16>) = COPY $wl1
/// %0:_(<4 x s16>) = G_SHUFFLE_VECTOR %1(<16 x s16>), %2(<16 x s16>),
/// shufflemask(4, 5, 6, 7)
Copy paste of 16 x s16 here? I expect 4 x s16
if (SubIdx == -1)
  return false;

if (Src1TySize == DstTy.getSizeInBits()) {
Even though it's a simple case, I think it needs to be a separate combine with more tests. It could even be a generic combine that we could upstream.
ok, I'll just add a check and do nothing in my combine for this case.
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s16>) = COPY $l0
; CHECK-NEXT: [[DEF:%[0-9]+]]:_(<4 x s16>) = G_IMPLICIT_DEF
; CHECK-NEXT: [[CONCAT_VECTORS:%[0-9]+]]:_(<32 x s16>) = G_CONCAT_VECTORS [[COPY]](<4 x s16>), [[DEF]](<4 x s16>), [[DEF]](<4 x s16>), [[DEF]](<4 x s16>), [[DEF]](<4 x s16>), [[DEF]](<4 x s16>), [[DEF]](<4 x s16>), [[DEF]](<4 x s16>)
Not sure we handle G_CONCAT_VECTORS for these types.
It looks like we cannot legalize this. With 4 inputs we can custom legalize.
But for this type, we need a different solution:
llvm::AIELegalizerHelper::legalizeG_CONCAT_VECTORS(llvm::LegalizerHelper &, llvm::MachineInstr &) const: Assertion `SrcTy.getSizeInBits() >= 256 && "Input vector size does not match!"' failed.
I think a subregister copy is enough for the 64-bit source. There is no need to build a 512-bit register and extract a 32-bit element from it.
I discussed it with @konstantinschwarz: for now we don't need support for such small source vectors, so I just added an assertion for these cases and only cover source vector sizes greater than or equal to 256 bits.
; CHECK: liveins: $dm0, $dm1
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<64 x s32>) = COPY $dm0
; CHECK-NEXT: [[UV:%[0-9]+]]:_(<16 x s32>), [[UV1:%[0-9]+]]:_(<16 x s32>), [[UV2:%[0-9]+]]:_(<16 x s32>), [[UV3:%[0-9]+]]:_(<16 x s32>) = G_UNMERGE_VALUES [[COPY]](<64 x s32>)
As far as I can see, G_UNMERGE_VALUES is not legal for this.
Yes, it is illegal. I added the legalization for it to this PR, but the legalization is also part of PR #274.
if (!DstTy.isVector() || !Src1Ty.isVector() || Src1TySize < 32 ||
    (DstTy.getSizeInBits() != 32 && DstTy.getSizeInBits() != 64))
  return false;
Could we parameterize this function with all the sizes that are explicit constants here? I guess 32 is BitSize, 64 is 2 x BitSize, 512 is SomeFactor * BitSize, etc.
Since this is in a generic helper, it would make sense to cut the function into smaller, more useful methods.
return false;

auto CheckExtractMask = [=](unsigned Start, unsigned NumElems) -> bool {
  auto ExtractMask = createSequentialMask(Start, NumElems, 0);
I would just write this as a for-loop reading Mask, returning false as soon as an index doesn't match. And perhaps lift it to a top level static method checkSequentialMask(Mask, StartPos, NumElems, StartIdx).
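A minimal sketch of what such a helper could look like, using std::vector<int> in place of the LLVM mask type so that it stands alone. The parameter meanings are an assumption based on the names in the comment: StartPos is the first mask position to inspect, NumElems the number of positions, and StartIdx the expected index at StartPos.

```cpp
#include <cassert>
#include <vector>

// Returns true if Mask[StartPos .. StartPos + NumElems) reads the indices
// StartIdx, StartIdx + 1, ... in order. Undef lanes (-1) match anything.
bool checkSequentialMask(const std::vector<int> &Mask, unsigned StartPos,
                         unsigned NumElems, int StartIdx) {
  assert(StartPos + NumElems <= Mask.size() && "range exceeds mask size");
  for (unsigned I = 0; I < NumElems; ++I) {
    const int Idx = Mask[StartPos + I];
    if (Idx == -1) // ignore undef lanes
      continue;
    if (Idx != StartIdx + static_cast<int>(I))
      return false; // first mismatch decides, no temporary mask needed
  }
  return true;
}
```

Compared with building a sequential mask and comparing against it, this avoids the temporary vector and exits on the first mismatching lane.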
// Not an extract pattern
if (SubIdx == -1)
  return false;
This would read well as a lambda, returning SubVecIdx from the loop and -1 at the end.
const int SubIdx = getSubVectorIndx(...);
if (SubIdx < 0) {
return false;
}
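A standalone sketch of the search this comment describes, written as a free function over std::vector<int> rather than a lambda over the LLVM types. getSubVectorIndex is a hypothetical name, and the sketch assumes the destination size evenly divides the first source and that only the first source is considered.

```cpp
#include <cassert>
#include <vector>

// Returns the index of the destination-sized subvector of the first source
// that Mask extracts, or -1 if Mask is not such an extract pattern.
// Undef lanes (-1) match anything, so an all-undef mask yields index 0.
int getSubVectorIndex(const std::vector<int> &Mask, unsigned NumSrcElems) {
  assert(NumSrcElems > 0 && "source vector must have elements");
  const unsigned NumDstElems = static_cast<unsigned>(Mask.size());
  if (NumDstElems == 0 || NumSrcElems % NumDstElems != 0)
    return -1;

  const unsigned NumSubVectors = NumSrcElems / NumDstElems;
  for (unsigned SubIdx = 0; SubIdx < NumSubVectors; ++SubIdx) {
    const int Start = static_cast<int>(SubIdx * NumDstElems);
    bool Matches = true;
    for (unsigned I = 0; I < NumDstElems; ++I) {
      if (Mask[I] != -1 && Mask[I] != Start + static_cast<int>(I)) {
        Matches = false;
        break;
      }
    }
    if (Matches)
      return static_cast<int>(SubIdx); // found the extracted subvector
  }
  return -1; // not an extract pattern
}
```

For the example in the doc comment, shufflemask(4, 5, 6, 7) over a <16 x s16> source gives index 1, while shufflemask(-1, 5, -1, 3) over <8 x s8> gives -1.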
// 1024 and 2048 bit source vectors
const Register UnusedSubReg1 = MRI.createGenericVirtualRegister(NewSrc1Ty);
const Register UnusedSubReg2 = MRI.createGenericVirtualRegister(NewSrc1Ty);
const Register UnusedSubReg3 = MRI.createGenericVirtualRegister(NewSrc1Ty);
Could we bring the Src1TypeSize test outside, have two different MatchInfo lambdas and only create the VRs that we need? Note that we may create unused VRs now.
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<64 x s32>) = COPY $dm0
; CHECK-NEXT: [[UV:%[0-9]+]]:_(<16 x s32>), [[UV1:%[0-9]+]]:_(<16 x s32>), [[UV2:%[0-9]+]]:_(<16 x s32>), [[UV3:%[0-9]+]]:_(<16 x s32>) = G_UNMERGE_VALUES [[COPY]](<64 x s32>)
; CHECK-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
; CHECK-NEXT: [[AIE_EXTRACT_SUBVECTOR:%[0-9]+]]:_(<2 x s32>) = G_AIE_EXTRACT_SUBVECTOR [[UV]](<16 x s32>), [[C]](s32)
We have:
LLVM ERROR: unable to legalize instruction: %3:_(<16 x s32>), %4:_(<16 x s32>), %5:_(<16 x s32>), %6:_(<16 x s32>) = G_UNMERGE_VALUES %0:_(<64 x s32>) (in function: shuffle_vector_to_extract_subvec_2048BitSrc)
For this, we need two unmerges:
1024, 1024 = G_UNMERGE_VALUES 2048
512, 512 = G_UNMERGE_VALUES 1024
And then take the correct 512 to extract.
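The index arithmetic for this two-step split can be sketched as follows. UnmergePath and locateIn2048 are hypothetical names; the sketch only computes which 1024-bit half and which 512-bit piece hold the requested subvector, and what its index is inside that 512-bit piece.

```cpp
#include <cassert>

// Where a subvector of a 2048-bit source ends up after two unmerges.
struct UnmergePath {
  unsigned HalfIdx;    // which 1024-bit result of the first unmerge
  unsigned QuarterIdx; // which 512-bit result of the second unmerge
  unsigned NewSubIdx;  // subvector index inside the chosen 512-bit vector
};

// SubIdx counts destination-sized subvectors within the 2048-bit source,
// NumSubVectors is the total number of such subvectors.
UnmergePath locateIn2048(unsigned SubIdx, unsigned NumSubVectors) {
  assert(NumSubVectors != 0 && NumSubVectors % 4 == 0 &&
         SubIdx < NumSubVectors && "index out of range");
  const unsigned NumSubVectors1024Bits = NumSubVectors / 2;
  const unsigned NumSubVectors512Bits = NumSubVectors / 4;
  return {SubIdx / NumSubVectors1024Bits,
          (SubIdx % NumSubVectors1024Bits) / NumSubVectors512Bits,
          SubIdx % NumSubVectors512Bits};
}
```

As a worked case, a 64-bit destination gives 32 subvectors; subvector 19 starts at bit 1216, which lies in the second 1024-bit half, in its first 512-bit piece, at index 3.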
I added the legalization for it to this PR, but the legalization is also part of PR #274.
490891c to 6ea2d56
No description provided.