Dalek NEON v7 #691

Draft: wants to merge 17 commits into base branch main
Conversation


@rubdos rubdos commented Aug 22, 2024

Continuation of #457 for v7. @Tarinn will continue this while I patch some things in Rust and LLVM...

@rozbb rozbb (Contributor) left a comment


Hey hey, thanks for submitting this! I got around to looking at this and left some notes for you and myself.

First, this looks pretty great overall. There are just a few minor compile-time warnings that need to be resolved. I won't paste them here because you can see them yourself.

On that note, it'd be nice to get a CI test case for this backend. I can look into that.

Finally, this seems really slow on my MacBook Air M1. Am I doing something wrong? I ran with backend=serial, then backend=simd, and saw a 45-50% slowdown across the board. Here's a partial paste of the benchmarks.

edwards benches/EdwardsPoint compression
                        time:   [3.3320 µs 3.3329 µs 3.3339 µs]
                        change: [+1.1223% +1.3350% +1.5407%] (p = 0.00 < 0.05)
                        Performance has regressed.
edwards benches/EdwardsPoint decompression
                        time:   [3.5899 µs 3.5936 µs 3.5978 µs]
                        change: [+0.7537% +1.2902% +1.7433%] (p = 0.00 < 0.05)
                        Change within noise threshold.
edwards benches/Constant-time fixed-base scalar mul
                        time:   [9.2123 µs 9.2148 µs 9.2174 µs]
                        change: [-6.0098% -5.4511% -4.8764%] (p = 0.00 < 0.05)
                        Performance has improved.
edwards benches/Constant-time variable-base scalar mul
                        time:   [50.151 µs 50.160 µs 50.171 µs]
                        change: [+34.156% +42.306% +49.662%] (p = 0.00 < 0.05)
                        Performance has regressed.
edwards benches/Variable-time aA+bB, A variable, B fixed
                        time:   [48.128 µs 48.142 µs 48.157 µs]
                        change: [+62.406% +62.559% +62.715%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/1
                        time:   [50.362 µs 50.376 µs 50.392 µs]
                        change: [+48.626% +50.038% +50.915%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/2
                        time:   [66.040 µs 66.063 µs 66.091 µs]
                        change: [+49.459% +49.626% +49.793%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/4
                        time:   [96.888 µs 96.919 µs 96.954 µs]
                        change: [+47.723% +47.968% +48.186%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/8
                        time:   [159.00 µs 159.03 µs 159.07 µs]
                        change: [+46.456% +46.623% +46.774%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/16
                        time:   [283.98 µs 284.12 µs 284.30 µs]
                        change: [+46.400% +46.584% +46.797%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/32
                        time:   [533.07 µs 533.21 µs 533.36 µs]
                        change: [+45.954% +46.096% +46.227%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/64
                        time:   [1.0317 ms 1.0329 ms 1.0349 ms]
                        change: [+46.202% +47.030% +48.196%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/128
                        time:   [2.0357 ms 2.0364 ms 2.0372 ms]
                        change: [+44.702% +44.773% +44.844%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/256
                        time:   [4.0719 ms 4.1737 ms 4.3095 ms]
                        change: [+46.218% +49.911% +54.422%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/384
                        time:   [6.0426 ms 6.0450 ms 6.0474 ms]
                        change: [+45.247% +45.313% +45.378%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/512
                        time:   [8.0369 ms 8.0400 ms 8.0433 ms]
                        change: [+44.325% +45.007% +45.447%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/768
                        time:   [12.068 ms 12.079 ms 12.093 ms]
                        change: [+45.656% +45.797% +45.974%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/1024
                        time:   [16.070 ms 16.080 ms 16.090 ms]
                        change: [+45.287% +45.574% +45.767%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/1
                        time:   [43.677 µs 43.699 µs 43.724 µs]
                        change: [+61.865% +62.013% +62.168%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/2
                        time:   [53.354 µs 53.396 µs 53.451 µs]
                        change: [+60.683% +60.870% +61.053%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/4
                        time:   [72.161 µs 72.232 µs 72.358 µs]
                        change: [+58.561% +58.725% +58.914%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/8
                        time:   [110.14 µs 110.18 µs 110.21 µs]
                        change: [+56.365% +56.537% +56.744%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/16
                        time:   [186.40 µs 186.46 µs 186.52 µs]
                        change: [+54.205% +54.357% +54.516%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/32
                        time:   [339.54 µs 339.68 µs 339.84 µs]
                        change: [+52.412% +53.057% +53.697%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/64
                        time:   [653.32 µs 654.65 µs 655.88 µs]
                        change: [+51.725% +52.242% +52.851%] (p = 0.00 < 0.05)
                        Performance has regressed.

fn eq(&self, rhs: &$ty) -> bool {
    unsafe {
        let m = neon::$beq_intrinsic(self.0, rhs.0);
        Self(m).extract::<0>() != 0
    }
}

Why can you extract just the first element to check equality? The docs say that the vectors are equal iff _every_ bit in the output vector is set to 1.
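To illustrate the concern with a portable sketch (plain arrays stand in for NEON vectors; `lane_mask` is a hypothetical analogue of the per-lane compare intrinsic, not the dalek code):

```rust
// Portable sketch: a per-lane compare mask (as vceqq_u32 would produce)
// can have lane 0 all-ones while other lanes differ, so inspecting only
// lane 0 reports a false equality.

fn lane_mask(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    // Simulates the per-lane compare: each lane is all-ones iff it matches.
    let mut m = [0u32; 4];
    for i in 0..4 {
        m[i] = if a[i] == b[i] { u32::MAX } else { 0 };
    }
    m
}

fn main() {
    let a = [1, 2, 3, 4];
    let b = [1, 2, 3, 5]; // differs only in lane 3

    let m = lane_mask(a, b);
    let lane0_only = m[0] != 0; // the check as written in the PR
    let all_lanes = m.iter().all(|&x| x == u32::MAX); // checks every bit

    assert!(lane0_only); // reports "equal", which is wrong here
    assert!(!all_lanes); // the vectors are actually unequal
    println!("lane0_only={lane0_only}, all_lanes={all_lanes}");
}
```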

impl u32x4 {
    #[inline]
    pub fn new(x0: u32, x1: u32, x2: u32, x3: u32) -> Self {
        unsafe { core::mem::transmute::<[u32; 4], Self>([x0, x1, x2, x3]) }
    }
}

This should include a safety note, like:

// SAFETY: Transmuting between an array and a SIMD type is safe
// https://rust-lang.github.io/unsafe-code-guidelines/layout/packed-simd-vectors.html
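
As a hedged sketch of what applying that note would look like (the wrapper type here is a portable `#[repr(C)]` stand-in for the NEON vector wrapper, not the actual crate code):

```rust
// Portable stand-in for the NEON uint32x4_t wrapper, to show the
// constructor with the suggested safety comment attached.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(C)]
struct U32x4([u32; 4]);

impl U32x4 {
    #[inline]
    pub fn new(x0: u32, x1: u32, x2: u32, x3: u32) -> Self {
        // SAFETY: Transmuting between an array and a SIMD type is safe
        // https://rust-lang.github.io/unsafe-code-guidelines/layout/packed-simd-vectors.html
        unsafe { core::mem::transmute::<[u32; 4], Self>([x0, x1, x2, x3]) }
    }
}

fn main() {
    let v = U32x4::new(1, 2, 3, 4);
    assert_eq!(v, U32x4([1, 2, 3, 4]));
    println!("{:?}", v);
}
```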


Also, out of curiosity, why the transmute and not a vld* instruction? E.g., the x86 code says this set instruction is faster:

u64x4(core::arch::x86_64::_mm256_set_epi64x(


If you're going to transmute for everything, you may as well just call const_new from here instead of duplicating the code. Actually, even better: since they're all the same anyway, just call them new and splat and remove the const_* names entirely.
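A minimal sketch of the suggested deduplication, assuming a portable stand-in type (not the actual NEON wrapper): one `const fn` constructor serves both const and runtime callers, so the separate const_* names become unnecessary.

```rust
// Portable stand-in for the vector wrapper type.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(C)]
struct U32x4([u32; 4]);

impl U32x4 {
    // A single `const fn` constructor works in const contexts and at
    // runtime, so no separate `const_new` is needed.
    #[inline]
    pub const fn new(x0: u32, x1: u32, x2: u32, x3: u32) -> Self {
        U32x4([x0, x1, x2, x3])
    }

    // splat delegates to new() instead of duplicating construction logic.
    #[inline]
    pub const fn splat(x: u32) -> Self {
        Self::new(x, x, x, x)
    }
}

// Usable in const position, as const_new was.
const ONES: U32x4 = U32x4::splat(1);

fn main() {
    assert_eq!(U32x4::splat(7), U32x4::new(7, 7, 7, 7));
    assert_eq!(ONES, U32x4::new(1, 1, 1, 1));
    println!("dedup ok");
}
```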

assert_eq!(base_splits[3], b_splits[3]);
}
}
}

Self: File looks good. Diff between this and the AVX2 edwards impl is entirely aesthetic

let (b8, b9) = unpack_pair(self.0[4]);

FieldElement2625x4::reduce64([
u64x2x2::new(vmull_u32(b0.0.0, consts.0.into()).into(), vmull_u32(b0.0.1, consts.1.into()).into()),

It seems like this mul32 pattern comes up a lot. It might make sense to factor it into its own function, like in the AVX2 version?
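A hedged sketch of what such a helper could look like, using plain integer widening in place of `vmull_u32` (the name `mul32` follows the AVX2 code; the array types are portable stand-ins):

```rust
// Factoring the repeated widening-multiply pattern into one helper.
// vmull_u32 multiplies two u32 lanes into u64 lanes; plain casts model
// the same widening behavior portably.
fn mul32(a: [u32; 2], b: [u32; 2]) -> [u64; 2] {
    [a[0] as u64 * b[0] as u64, a[1] as u64 * b[1] as u64]
}

fn main() {
    // Widening keeps the full 64-bit product, e.g. (2^32 - 1)^2.
    let p = mul32([u32::MAX, 3], [u32::MAX, 5]);
    assert_eq!(p, [0xFFFF_FFFE_0000_0001, 15]);
    println!("{:x?}", p);
}
```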

assert_eq!(x2, splits[2]);
assert_eq!(x3, splits[3]);
}
}

Self: Ran this file through Difftastic and it seems mostly the same as the AVX2 version. Need to understand the blending functions better though. Also the rotating

u32x4::const_new(44524351, 50428429, 21904953, 12608048),
),
])),
]);

Self: seems fine. Spot-checked by perturbing constants in random places and seeing that tests failed. One place it didn't fail: if you change the shift in P_TIMES_16_LO to 3 instead of 4, everything still works. That's probably due to some approximation algorithm still succeeding.
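A toy illustration of why the perturbed shift can go unnoticed (a sketch with a small prime, not the dalek constants): the added constant only needs to be a multiple of p that is large enough to prevent underflow in the subtraction, and both 16p (shift 4) and 8p (shift 3) can satisfy that for in-range inputs.

```rust
fn main() {
    // Toy modulus standing in for 2^255 - 19; the real P_TIMES_16
    // constants are per-limb, but the congruence argument is the same.
    let p: u64 = 61;
    let a: u64 = 10;
    let b: u64 = 50; // a - b would underflow without adding a multiple of p

    for shift in [3u32, 4] {
        // Computing (a - b) as a + 2^shift * p - b stays nonnegative and
        // has the same residue mod p regardless of the shift, so tests
        // that only check values mod p cannot tell the two apart.
        let k = (p << shift) + a - b;
        assert_eq!(k % p, (a + p - b) % p);
    }
    println!("both shifts give the correct residue");
}
```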
