Optimize Ascii.Equals when widening #87141

BrennanConroy · 2023-06-05T20:19:44Z

While trying to replace some custom unsafe code in Kestrel in dotnet/aspnetcore#48368, we noticed that Ascii.Equals is slower than Kestrel's hand-rolled code. Upon investigation, 3 changes were found that could improve the performance.

1. When checking for non-ASCII characters in the inputs we were ORing both sides together. This shouldn't be needed as we are already checking the two inputs for bitwise equality, so this was doing an unneeded vpor ymm0,ymm0,ymm1.
2. When comparing a string to a byte[] the bytes were widened, which results in 2 vectors that are half the size of the original vector. The way the code was written made it so we would fall back to Vector128 comparisons in the Vector256 case, and Vector64 in the Vector128 case. We can refactor the code to not fall back to smaller vector sizes, resulting in half the loop iterations needed for the same input.
3. Changing the equality condition after widening results in faster code:

Slower: if (lower != rightValues0 || upper != rightValues1)
Faster: if (!Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))
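
For orientation, here is a minimal sketch of how the three changes fit together for a single 32-element block. The names WidenCompareSketch and Block32Equals are hypothetical; this is not the CoreLib code, which works on refs and loops over the whole input. It assumes .NET 7 or later for the Vector256 helpers.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class WidenCompareSketch
{
    // Hypothetical helper (not the CoreLib implementation): compares the first
    // 32 bytes of `left` with the first 32 chars of `right`.
    public static bool Block32Equals(ReadOnlySpan<byte> left, ReadOnlySpan<char> right)
    {
        // Reinterpret the UTF-16 chars as ushort so they can be loaded into vectors.
        ReadOnlySpan<ushort> rightWide = MemoryMarshal.Cast<char, ushort>(right);

        // Vector256.Create throws if a span holds fewer elements than one vector.
        Vector256<byte> leftNotWidened = Vector256.Create<byte>(left);
        Vector256<ushort> rightValues0 = Vector256.Create<ushort>(rightWide);
        Vector256<ushort> rightValues1 = Vector256.Create<ushort>(rightWide.Slice(Vector256<ushort>.Count));

        // Change 1: only the left side is tested for non-ASCII bytes. If it is
        // ASCII and equal to the right side, the right side is ASCII as well.
        if ((leftNotWidened & Vector256.Create((byte)0x80)) != Vector256<byte>.Zero)
        {
            return false;
        }

        // Change 2: widen one full Vector256<byte> into two Vector256<ushort>
        // instead of falling back to a Vector128<byte> load.
        (Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(leftNotWidened);

        // Change 3: combine both element-wise comparisons and test the result once.
        return (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1))
            == Vector256<ushort>.AllBitsSet;
    }
}
```

Calling Block32Equals with 32 ASCII bytes and the matching 32-character string returns true; a single differing char, or a byte of 0x80 or above on the left, makes it return false.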

Code-gen for the different if conditions

if (lower != rightValues0 || upper != rightValues1)

M01_L00:
       vmovups   ymm1,[rax]
       vptest    ymm1,ymm0
       jne       short M01_L03
       vmovaps   ymm2,ymm1
       vpmovzxbw ymm2,xmm2
       vextracti128 xmm1,ymm1,1
       vpmovzxbw ymm1,xmm1
       vmovups   ymm3,[rdx]
       vmovups   ymm4,[rdx+20]
       vpcmpeqw  ymm2,ymm2,ymm3
       vpmovmskb r10d,ymm2
       cmp       r10d,0FFFFFFFF
       setne     r10b
       movzx     r10d,r10b
       vpcmpeqw  ymm1,ymm1,ymm4
       vpmovmskb r11d,ymm1
       cmp       r11d,0FFFFFFFF
       setne     r11b
       movzx     r11d,r11b
       or        r10b,r11b
       jne       short M01_L03
       add       rdx,40
       add       rax,20
       cmp       rdx,r9
       jbe       short M01_L00
       test      r8b,0F
       jne       short M01_L01
       mov       eax,1
       vzeroupper
       ret

if (!Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))

M01_L00:
       vmovups   ymm1,[rax]
       vptest    ymm1,ymm0
       jne       short M01_L03
       vmovaps   ymm2,ymm1
       vpmovzxbw ymm2,xmm2
       vextracti128 xmm1,ymm1,1
       vpmovzxbw ymm1,xmm1
       vpcmpeqw  ymm2,ymm2,[rdx]
       vpcmpeqw  ymm1,ymm1,[rdx+20]
       vpand     ymm1,ymm2,ymm1
       vpcmpeqd  ymm2,ymm2,ymm2
       vpcmpeqw  ymm1,ymm1,ymm2
       vpmovmskb r10d,ymm1
       cmp       r10d,0FFFFFFFF
       jne       short M01_L03
       add       rdx,40
       add       rax,20
       cmp       rdx,r9
       jbe       short M01_L00
       test      r8b,0F
       jne       short M01_L01
       mov       eax,1
       vzeroupper
       ret

The byte+char comparison case is much faster; the other two are likely improved due to removing the OR.

|             Method |        Job |              Toolchain | Size |      Mean |     Error |    StdDev |    Median |       Min |       Max | Ratio | MannWhitney(1%) | RatioSD |
|------------------- |----------- |----------------------- |----- |----------:|----------:|----------:|----------:|----------:|----------:|------:|---------------- |--------:|
|       Equals_Bytes | Job-SGZWKI | 8.0.0-fork\corerun.exe |  128 |  4.546 ns | 0.0757 ns | 0.1365 ns |  4.472 ns |  4.410 ns |  4.815 ns |  0.93 |          Faster |    0.03 |
|       Equals_Bytes | Job-JTGSGN | 8.0.0-base\corerun.exe |  128 |  4.887 ns | 0.0654 ns | 0.1259 ns |  4.869 ns |  4.700 ns |  5.344 ns |  1.00 |            Base |    0.00 |
|       Equals_Chars | Job-SGZWKI | 8.0.0-fork\corerun.exe |  128 |  9.503 ns | 0.1326 ns | 0.2425 ns |  9.432 ns |  9.183 ns |  9.831 ns |  0.98 |            Same |    0.03 |
|       Equals_Chars | Job-JTGSGN | 8.0.0-base\corerun.exe |  128 |  9.689 ns | 0.0439 ns | 0.0803 ns |  9.679 ns |  9.519 ns |  9.865 ns |  1.00 |            Base |    0.00 |
| Equals_Bytes_Chars | Job-SGZWKI | 8.0.0-fork\corerun.exe |  128 |  7.583 ns | 0.0622 ns | 0.1137 ns |  7.555 ns |  7.384 ns |  7.938 ns |  0.63 |          Faster |    0.01 |
| Equals_Bytes_Chars | Job-JTGSGN | 8.0.0-base\corerun.exe |  128 | 12.058 ns | 0.1172 ns | 0.2259 ns | 12.078 ns | 11.645 ns | 12.646 ns |  1.00 |            Base |    0.00 |

We can likely get similar gains in EqualsIgnoreCase, and the change can of course be expanded to the Vector128 paths. I'm opening the PR as a draft now to get feedback on the overall approach before expanding it.

Side-note on weird codegen observed

When doing return Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))) instead of the if condition, it looks like extra instructions are generated:

if condition:

M01_L00:
       vmovups   ymm1,[rax]
       vptest    ymm1,ymm0
       jne       short M01_L03
       vmovaps   ymm2,ymm1
       vpmovzxbw ymm2,xmm2
       vextracti128 xmm1,ymm1,1
       vpmovzxbw ymm1,xmm1
       vpcmpeqw  ymm2,ymm2,[rdx]
       vpcmpeqw  ymm1,ymm1,[rdx+20]
       vpand     ymm1,ymm2,ymm1
       vpcmpeqd  ymm2,ymm2,ymm2
       vpcmpeqw  ymm1,ymm1,ymm2
       vpmovmskb r10d,ymm1
       cmp       r10d,0FFFFFFFF
       jne       short M01_L03
       add       rdx,40
       add       rax,20
       cmp       rdx,r9
       jbe       short M01_L00
       test      r8b,0F
       jne       short M01_L01
       mov       eax,1
       vzeroupper
       ret

no if, return directly

M01_L00:
       vmovups   ymm1,[rax]
       vptest    ymm1,ymm0
       jne       short M01_L02
       vmovaps   ymm2,ymm1
       vpmovzxbw ymm2,xmm2
       vextracti128 xmm1,ymm1,1
       vpmovzxbw ymm1,xmm1
       vpcmpeqw  ymm2,ymm2,[rdx]
       vpcmpeqw  ymm1,ymm1,[rdx+20]
       vpand     ymm1,ymm2,ymm1
       vpcmpeqd  ymm2,ymm2,ymm2
       vpcmpeqw  ymm1,ymm1,ymm2
       vpmovmskb r10d,ymm1
       cmp       r10d,0FFFFFFFF
       sete      r10b          <-- extra
       movzx     r10d,r10b     <-- extra
       test      r10d,r10d     <-- extra
       je        short M01_L02
       add       rdx,40
       add       rax,20
       cmp       rdx,r9
       jbe       short M01_L00
       test      r8b,1F
       jne       short M01_L01
       mov       eax,1
       vzeroupper
       ret

davidfowl · 2023-06-05T21:09:29Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs

+                    return false;
+                }
+
+                (Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(leftNotWidened);


This is me adding value:

Suggested change

-                (Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(leftNotWidened);
+                var (lower, upper) = Vector256.Widen(leftNotWidened);

> This is me adding value:

Please stop adding value. :-P

The BCL has a rule that var can only be used when the type is apparent.

It's a battle that I lost before ever joining the team 😅

stephentoub · 2023-06-05T21:22:03Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs

-
-                Vector256<TRight> leftValues;
-                Vector256<TRight> rightValues;
+                ref TRight oneVectorAwayFromRightEnd = ref Unsafe.Add(ref currentRightSearchSpace, length - (uint)Vector256<TLeft>.Count);


I'm not understanding why this is valid. We're subtracting from the "right" search space the number of "left" elements in a vector?

It works because TLeft and TRight are either the same type, or we are in the widen case, where a vector of TLeft holds twice as many elements as a vector of TRight, and the widen code advances the right side by twice Vector<TRight>.Count, which equals one Vector<TLeft>.Count.

But it is written in a confusing way. The whole TLoader abstraction helps with code sharing but makes this part kind of yucky. Maybe if the compare method advanced the pointers it would be better?

Thanks. At a minimum a comment explaining would be helpful.
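
To make the element-count argument above concrete, here is a minimal sketch. The helper RightElementsPerIteration is hypothetical and not part of the PR; it only restates the reasoning from the reply, assuming the two cases named there (same element type, or byte widened to ushort).

```csharp
using System.Runtime.Intrinsics;

public static class WidenCounts
{
    // Hypothetical helper: how many TRight elements one loop iteration consumes.
    // Same-type case: one right vector is loaded per iteration.
    // Widening case: two right vectors are loaded per iteration.
    public static int RightElementsPerIteration<TLeft, TRight>()
        where TLeft : struct
        where TRight : struct
    {
        int rightVectorsPerIteration = typeof(TLeft) == typeof(TRight) ? 1 : 2;
        return rightVectorsPerIteration * Vector256<TRight>.Count;
    }
}
```

In both cases the result equals Vector256<TLeft>.Count, which is why subtracting Vector256<TLeft>.Count from the right-hand length yields the last position from which a full iteration can still read.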

tannergooding · 2023-06-05T21:27:56Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs

+                if (!Vector256<ushort>.AllBitsSet.Equals(
+                    Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))


We should prefer the operators where possible:

Suggested change

-                if (!Vector256<ushort>.AllBitsSet.Equals(
-                    Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))
+                if (Vector256<ushort>.AllBitsSet != (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1)))

bool Equals() in particular is the same as == for integral types, but it isn't directly recognized as an intrinsic and is an instance method, so the JIT has to inline it and elide the reference taken for this.

Might be nice to just make this return Vector256<ushort>.AllBitsSet == (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1)) instead as well

Look at the "Side-note on weird codegen observed" section at the bottom of the PR description.

I think it should be ==. If you noticed suboptimal codegen, we'd better look at it and fix it in the JIT instead of complicating the C# code just to squeeze everything out here and now.

It should be as simple as:

return ((lower ^ rightValues0) | (upper ^ rightValues1)) == Vector256.Zero;

Yep, sometimes it can be fixed with return cond ? true : false;

How about:

Vector256<ushort> equals = Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1);
if (equals.AsByte().ExtractMostSignificantBits() != 0xffffffffu)
    return false;
return true;

> How about:

Not needed; ((lower ^ rightValues0) | (upper ^ rightValues1)) == Vector256.Zero is just the canonical way to do that. Also, ExtractMostSignificantBits is very expensive on Arm.

@EgorBo Agreed, VPMOVMSKB has no equivalent on ARM64 (see #87141 (comment)), but this method is guarded by [CompExactlyDependsOn(typeof(Avx))].

> @EgorBo Agreed, VPMOVMSKB has no equivalent on ARM64 (see #87141 (comment)), but this method is guarded by [CompExactlyDependsOn(typeof(Avx))].

Still, == Vector.Zero is expected to be lowered to MoveMask with SSE2 or to vptest with SSE4.1/AVX, so there is no reason to do it by hand.
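
As an illustration of the two forms discussed in this thread, the sketch below puts them side by side. The class and method names are hypothetical, and the generic Vector256<ushort>.Zero property stands in for the Vector256.Zero shorthand used in the comments. Both methods should return the same answer for any inputs; which one produces better machine code is a JIT matter that this sketch does not measure.

```csharp
using System.Runtime.Intrinsics;

public static class EqualityForms
{
    // Form from the PR: element-wise compare, AND the masks, test for all bits set.
    public static bool ViaEqualsAnd(
        Vector256<ushort> lower, Vector256<ushort> upper,
        Vector256<ushort> rightValues0, Vector256<ushort> rightValues1)
        => (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1))
           == Vector256<ushort>.AllBitsSet;

    // Form suggested in review: XOR leaves a non-zero bit wherever the inputs
    // differ, so OR-ing both halves and comparing with zero detects any mismatch.
    public static bool ViaXorOr(
        Vector256<ushort> lower, Vector256<ushort> upper,
        Vector256<ushort> rightValues0, Vector256<ushort> rightValues1)
        => ((lower ^ rightValues0) | (upper ^ rightValues1)) == Vector256<ushort>.Zero;
}
```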

ghost · 2023-06-05T21:59:19Z

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Author:	BrennanConroy
Assignees:	BrennanConroy
Labels:	`area-System.Text.Encoding`
Milestone:	-

xtqqczze · 2023-06-08T17:08:33Z

See also for ARM64: Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more.

adamsitnik · 2023-06-16T05:41:35Z

cc @gfoidl who provided previous optimizations in #85926

gfoidl · 2023-06-16T08:53:29Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs

-                    rightValues = Vector256.LoadUnsafe(ref currentRightSearchSpace);
-
-                    if (leftValues != rightValues || !AllCharsInVectorAreAscii(leftValues | rightValues))
+                    if (!TLoader.Compare256(ref currentLeftSearchSpace, ref currentRightSearchSpace))


The TLoader now does the comparison as well, so the name should be adjusted to reflect that.

xtqqczze · 2023-06-16T17:22:07Z

src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs

                (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(Vector128.LoadUnsafe(ref ptr));
                return Vector256.Create(lower, upper);


Suggested change

-                (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(Vector128.LoadUnsafe(ref ptr));
-                return Vector256.Create(lower, upper);
+                return Vector256.WidenLower(Vector128.LoadUnsafe(ref ptr).ToVector256Unsafe());

This results in better codegen when Avx2 is available.

This suggests a set of missing System.Runtime.Intrinsics.Vector256 APIs:

~~public static System.Runtime.Intrinsics.Vector256<ushort> Widen (System.Runtime.Intrinsics.Vector128<byte> source);~~

public static System.Runtime.Intrinsics.Vector256<ushort> LoadWideningUnsafe (ref byte source);

> This results in better codegen when Avx2 is available.

Codegen on arm64 is pretty bad though, probably should wrap with Avx2.IsSupported.
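
A minimal sketch of what wrapping the suggestion with Avx2.IsSupported could look like. The names LoadAndWidenSketch and Load128AndWiden are hypothetical, and this is not necessarily what was merged. The fallback branch is the code the suggestion would replace.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class LoadAndWidenSketch
{
    // Hypothetical helper: loads 16 bytes and zero-extends them to 16 ushorts.
    public static Vector256<ushort> Load128AndWiden(ref byte ptr)
    {
        Vector128<byte> narrow = Vector128.LoadUnsafe(ref ptr);

        if (Avx2.IsSupported)
        {
            // WidenLower reads only the lower 128 bits, so the undefined upper
            // half produced by ToVector256Unsafe never reaches the result.
            return Vector256.WidenLower(narrow.ToVector256Unsafe());
        }

        // Portable path: widen into two 128-bit halves and stitch them together.
        (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(narrow);
        return Vector256.Create(lower, upper);
    }
}
```

Both branches are meant to produce the same vector, so the guard only selects the code shape per platform.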

stephentoub · 2023-07-05T03:51:58Z

@BrennanConroy, when this lands, will that be enough to enable ASP.NET to switch away from its custom implementation?

If so, @adamsitnik, can you help ensure this lands as soon as possible?

adamsitnik

Overall the changes LGTM, big thanks for your contribution @BrennanConroy.

I've left some comments, but they are all subjective and related only to naming. PTAL at them and either reject or apply them and mark the PR as ready for review. Then I am simply going to merge it.

adamsitnik

@BrennanConroy is OOF so I applied my suggestions and added a test for BoundedMemory

Overall for cases where the inputs are equal, the perf has improved and is even faster than the ASP.NET implementation that we want to remove in dotnet/aspnetcore#48368

For cases where the inputs are not equal at first character, the perf has regressed, but it's on par with the ASP.NET implementation. It's acceptable if we want to unblock dotnet/aspnetcore#48368

Source code, results:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1848)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-preview.4.23259.14
  [Host]     : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  Job-SJHYUI : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-GSESEP : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=3 MemoryRandomization=True

|      Method |  Job | Size | Equal |     Mean | Ratio |
|------------ |----- |-----:|------ |---------:|------:|
| SystemAscii |   PR |    6 | False | 1.653 ns |  1.00 |
|      AspNet |   PR |    6 | False | 3.335 ns |  2.03 |
| SystemAscii | main |    6 | False | 1.646 ns |  1.00 |
| SystemAscii |   PR |    6 |  True | 4.446 ns |  1.06 |
|      AspNet |   PR |    6 |  True | 4.180 ns |  0.99 |
| SystemAscii | main |    6 |  True | 4.202 ns |  1.00 |
| SystemAscii |   PR |   32 | False | 2.837 ns |  1.36 |
|      AspNet |   PR |   32 | False | 2.814 ns |  1.35 |
| SystemAscii | main |   32 | False | 2.080 ns |  1.00 |
| SystemAscii |   PR |   32 |  True | 2.888 ns |  0.86 |
|      AspNet |   PR |   32 |  True | 3.023 ns |  0.90 |
| SystemAscii | main |   32 |  True | 3.356 ns |  1.00 |
| SystemAscii |   PR |   64 | False | 2.821 ns |  1.36 |
|      AspNet |   PR |   64 | False | 2.796 ns |  1.35 |
| SystemAscii | main |   64 | False | 2.072 ns |  1.00 |
| SystemAscii |   PR |   64 |  True | 3.849 ns |  0.73 |
|      AspNet |   PR |   64 |  True | 4.446 ns |  0.84 |
| SystemAscii | main |   64 |  True | 5.277 ns |  1.00 |

The CI failure is unrelated (#73040).

cincuranet · 2023-07-11T17:14:08Z

Some improvements:

[Perf] Linux/x64: 3 Improvements on 7/6/2023 1:53:22 PM perf-autofiling-issues#19704
[Perf] Windows/x64: 1 Improvement on 7/6/2023 1:53:22 PM perf-autofiling-issues#19699
[Perf] Windows/x64: 1 Improvement on 7/6/2023 1:53:22 PM perf-autofiling-issues#19658

Optimize Ascii.Equals when widening

1467895

BrennanConroy requested review from stephentoub and adamsitnik June 5, 2023 20:19

ghost assigned BrennanConroy Jun 5, 2023

dotnet-issue-labeler bot added the needs-area-label label Jun 5, 2023

davidfowl reviewed Jun 5, 2023

EgorBo reviewed Jun 5, 2023

stephentoub reviewed Jun 5, 2023

tannergooding reviewed Jun 5, 2023

tannergooding reviewed Jun 5, 2023

danmoseley added area-System.Text.Encoding and removed needs-area-label labels Jun 5, 2023

build-analysis bot mentioned this pull request Jun 6, 2023

Tracking issue for CI build timeouts #76454

Closed

fb

6b1cb00

gfoidl reviewed Jun 16, 2023

xtqqczze reviewed Jun 16, 2023

adamsitnik reviewed Jul 5, 2023

adamsitnik added 2 commits July 6, 2023 09:34

add BoundedMemory tests to ensure that boundaries are respected + some minor renaming

60e0c2f

Merge remote-tracking branch 'upstream/main' into brecon/ascii

dd02aeb

adamsitnik marked this pull request as ready for review July 6, 2023 07:56

adamsitnik approved these changes Jul 6, 2023

adamsitnik merged commit bd63402 into dotnet:main Jul 6, 2023

adamsitnik mentioned this pull request Jul 6, 2023

use new System.Text.Ascii APIs, remove internal helpers dotnet/aspnetcore#48368

Merged

runfoapp bot mentioned this pull request Jul 5, 2023

Long Running Test: Interop/MonoAPI/MonoMono/PInvokeDetach/PInvokeDetach.sh #73040

Closed

cincuranet mentioned this pull request Jul 11, 2023

Regressions in System.Text.Perf_Ascii #88670

Closed

adamsitnik mentioned this pull request Jul 17, 2023

remove redundant OR #88993

Merged

This was referenced Jul 19, 2023

[Perf] Windows/x86: 3 Improvements on 7/6/2023 1:53:22 PM dotnet/perf-autofiling-issues#19669

Open

[Perf] Alpine/x64: 2 Improvements on 7/6/2023 1:53:22 PM dotnet/perf-autofiling-issues#19712

Open

BrennanConroy deleted the brecon/ascii branch July 31, 2023 16:47

ghost locked as resolved and limited conversation to collaborators Aug 30, 2023
