-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Ascii.Equals when widening #87141
Conversation
return false; | ||
} | ||
|
||
(Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(leftNotWidened); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is me adding value:
(Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(leftNotWidened); | |
var (lower, upper) = Vector256.Widen(leftNotWidened); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is me adding value:
Please stop adding value. :-P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BCL has a rule that var
can only be used when the type is apparent.
It's a battle that I lost before ever joining the team 😅
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
Outdated
Show resolved
Hide resolved
|
||
Vector256<TRight> leftValues; | ||
Vector256<TRight> rightValues; | ||
ref TRight oneVectorAwayFromRightEnd = ref Unsafe.Add(ref currentRightSearchSpace, length - (uint)Vector256<TLeft>.Count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not understanding why this is valid. We're subtracting from the "right" search space the number of "left" elements in a vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works because TLeft
and TRight
are either the same type, or we are in the widen case where Vector<TLeft>
is twice the size of Vector<TRight>
and the widen code will advance twice Vector<TRight>.Count
which is equal to 1 Vector<TLeft>.Count
.
But it is written in a confusing way. The whole TLoader
abstraction helps with code sharing but makes this part kind of yucky. Maybe if the compare method advanced the pointers it would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. At a minimum a comment explaining would be helpful.
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
Outdated
Show resolved
Hide resolved
if (!Vector256<ushort>.AllBitsSet.Equals( | ||
Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should prefer the operators where possible:
if (!Vector256<ushort>.AllBitsSet.Equals( | |
Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1)))) | |
if (Vector256<ushort>.AllBitsSet != (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1))) |
bool Equals()
in particular is the same as ==
for integral types, but isn't directly recognized as intrinsic and is an instance methjod. So the JIT has to inline and elide the reference taken
for this
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to just make this return Vector256<ushort>.AllBitsSet == (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1))
instead as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at the "Side-note on weird codegen observed" section at the bottom of the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be ==
if you noticed a suboptimal codegen - we'd better look and fix it in JIT instead of complicating C# code just to squeeze everything here and now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be as simple as:
return ((lower ^ rightValues0) | (upper ^ rightValues1)) == Vector256.Zero;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, sometimes it can be fixed with return cond ? true : false;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
Vector256<ushort> equals = Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1);
if (equals.AsByte().ExtractMostSignificantBits() != 0xffffffffu)
return false;
return true;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
Not needed, ((lower ^ rightValues0) | (upper ^ rightValues1)) == Vector256.Zero;
is just a canonical way to do that. Also, ExtractMostSignificantBits
is very expensive on arm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EgorBo Agreed, VPMOVMSKB
has no equivalent on ARM64 (see #87141 (comment)), but this method is guarded by [CompExactlyDependsOn(typeof(Avx))]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EgorBo Agreed,
VPMOVMSKB
has no equivalent on ARM64 (see #87141 (comment)), but this method is guarded by[CompExactlyDependsOn(typeof(Avx))]
.
still, == Vector.Zero
is expected to be lowered to MoveMask with SSE2 or to vptest with SSE41/AVX so no reason to do it by hands
Tagging subscribers to this area: @dotnet/area-system-text-encoding Issue DetailsWhile trying to replace some custom unsafe code in Kestrel in dotnet/aspnetcore#48368, we noticed that
Code-gen for the different if conditions
Much faster in the byte+char comparison case, the other two are likely improved due to removing the OR. But opening the PR now as draft to get feedback on the overall approach before expanding it.
We can likely get similar gains in the Side-note on weird codegen observedWhen doing if condition:
no if, return directly
|
See also for ARM64: Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more. |
rightValues = Vector256.LoadUnsafe(ref currentRightSearchSpace); | ||
|
||
if (leftValues != rightValues || !AllCharsInVectorAreAscii(leftValues | rightValues)) | ||
if (!TLoader.Compare256(ref currentLeftSearchSpace, ref currentRightSearchSpace)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TLoader
does now the comparison, so the name should be adjusted to reflect that.
(Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(Vector128.LoadUnsafe(ref ptr)); | ||
return Vector256.Create(lower, upper); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(Vector128.LoadUnsafe(ref ptr)); | |
return Vector256.Create(lower, upper); | |
return Vector256.WidenLower(Vector128.LoadUnsafe(ref ptr).ToVector256Unsafe()); |
This results in better codegen when Avx2
is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggests a set of missing System.Runtime.Intrinsics.Vector256
APIs:
public static System.Runtime.Intrinsics.Vector256<ushort> Widen (System.Runtime.Intrinsics.Vector128<byte> source);
public static System.Runtime.Intrinsics.Vector256<ushort> LoadWideningUnsafe (ref byte source);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This results in better codegen when
Avx2
is available.
Codegen on arm64
is pretty bad though, probably should wrap with Avx2.IsSupported
.
@BrennanConroy, when this lands, will that be enough to enable ASP.NET to switch away from its custom implementation? If so, @adamsitnik, can you help ensure this lands as soon as possible? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the changes LGTM, big thanks for your contribution @BrennanConroy.
I've left some comments, but they are all subjective and related only to naming. PTAL at them and either reject or apply them and mark the PR as ready for review. Then I am simply going to merge it.
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BrennanConroy is OOF so I applied my suggestions and added a test for BoundedMemory
Overall for cases where the inputs are equal, the perf has improved and is even faster than the ASP.NET implementation that we want to remove in dotnet/aspnetcore#48368
For cases where the inputs are not equal at first character, the perf has regressed, but it's on par with the ASP.NET implementation. It's acceptable if we want to unblock dotnet/aspnetcore#48368
Source code, results:
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1848)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-preview.4.23259.14
[Host] : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
Job-SJHYUI : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-GSESEP : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
LaunchCount=3 MemoryRandomization=True
Method | Job | Size | Equal | Mean | Ratio |
---|---|---|---|---|---|
SystemAscii | PR | 6 | False | 1.653 ns | 1.00 |
AspNet | PR | 6 | False | 3.335 ns | 2.03 |
SystemAscii | main | 6 | False | 1.646 ns | 1.00 |
SystemAscii | PR | 6 | True | 4.446 ns | 1.06 |
AspNet | PR | 6 | True | 4.180 ns | 0.99 |
SystemAscii | main | 6 | True | 4.202 ns | 1.00 |
SystemAscii | PR | 32 | False | 2.837 ns | 1.36 |
AspNet | PR | 32 | False | 2.814 ns | 1.35 |
SystemAscii | main | 32 | False | 2.080 ns | 1.00 |
SystemAscii | PR | 32 | True | 2.888 ns | 0.86 |
AspNet | PR | 32 | True | 3.023 ns | 0.90 |
SystemAscii | main | 32 | True | 3.356 ns | 1.00 |
SystemAscii | PR | 64 | False | 2.821 ns | 1.36 |
AspNet | PR | 64 | False | 2.796 ns | 1.35 |
SystemAscii | main | 64 | False | 2.072 ns | 1.00 |
SystemAscii | PR | 64 | True | 3.849 ns | 0.73 |
AspNet | PR | 64 | True | 4.446 ns | 0.84 |
SystemAscii | main | 64 | True | 5.277 ns | 1.00 |
the CI failure is unrelated (#73040)
While trying to replace some custom unsafe code in Kestrel in dotnet/aspnetcore#48368, we noticed that
Ascii.Equals
is slower than Kestrel's hand-rolled code. Upon investigation, 3 changes were found that could improve the performance.vpor ymm0,ymm0,ymm1
string
to abyte[]
the bytes were widened which results in 2 vectors that are half the size of the original vector. The way the code was written made it so we would fallback toVector128
comparisons in theVector256
case, andVector64
in theVector128
case. We can refactor the code to not fallback to smaller vector sizes resulting in half the loop iterations needed for the same input.if (lower != rightValues0 || upper != rightValues1)
if (!Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))
Code-gen for the different if conditions
if (lower != rightValues0 || upper != rightValues1)
if (!Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))
Much faster in the byte+char comparison case, the other two are likely improved due to removing the OR.
We can likely get similar gains in the
EqualsIgnoreCase
, and of course can be expanded to theVector128
paths. But opening the PR now as draft to get feedback on the overall approach before expanding it.Side-note on weird codegen observed
When doing
return Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1)))
instead of the if condition it looks like extra instructions are generatedif condition:
no if, return directly