Dalek NEON v7 #691

Draft: wants to merge 17 commits into base branch main
Conversation


@rubdos rubdos commented Aug 22, 2024

Continuation of #457 for v7. @Tarinn will continue this while I patch some things in Rust and LLVM...

@rozbb rozbb (Contributor) left a comment


Hey hey, thanks for submitting this! I got around to looking at this and left some notes for you and myself.

First, this looks pretty great overall. There are just a few minor compile-time warnings that need to be resolved. I won't paste them here because you can see them yourself.

On that note, it'd be nice to get a CI test case for this backend. I can look into that.

Finally, this seems really slow on my MacBook Air M1. Am I doing something wrong? I ran with backend=serial, then backend=simd, and saw a 45-50% slowdown across the board. Here's a partial paste of the benchmarks.

edwards benches/EdwardsPoint compression
                        time:   [3.3320 µs 3.3329 µs 3.3339 µs]
                        change: [+1.1223% +1.3350% +1.5407%] (p = 0.00 < 0.05)
                        Performance has regressed.
edwards benches/EdwardsPoint decompression
                        time:   [3.5899 µs 3.5936 µs 3.5978 µs]
                        change: [+0.7537% +1.2902% +1.7433%] (p = 0.00 < 0.05)
                        Change within noise threshold.
edwards benches/Constant-time fixed-base scalar mul
                        time:   [9.2123 µs 9.2148 µs 9.2174 µs]
                        change: [-6.0098% -5.4511% -4.8764%] (p = 0.00 < 0.05)
                        Performance has improved.
edwards benches/Constant-time variable-base scalar mul
                        time:   [50.151 µs 50.160 µs 50.171 µs]
                        change: [+34.156% +42.306% +49.662%] (p = 0.00 < 0.05)
                        Performance has regressed.
edwards benches/Variable-time aA+bB, A variable, B fixed
                        time:   [48.128 µs 48.142 µs 48.157 µs]
                        change: [+62.406% +62.559% +62.715%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/1
                        time:   [50.362 µs 50.376 µs 50.392 µs]
                        change: [+48.626% +50.038% +50.915%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/2
                        time:   [66.040 µs 66.063 µs 66.091 µs]
                        change: [+49.459% +49.626% +49.793%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/4
                        time:   [96.888 µs 96.919 µs 96.954 µs]
                        change: [+47.723% +47.968% +48.186%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/8
                        time:   [159.00 µs 159.03 µs 159.07 µs]
                        change: [+46.456% +46.623% +46.774%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/16
                        time:   [283.98 µs 284.12 µs 284.30 µs]
                        change: [+46.400% +46.584% +46.797%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/32
                        time:   [533.07 µs 533.21 µs 533.36 µs]
                        change: [+45.954% +46.096% +46.227%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/64
                        time:   [1.0317 ms 1.0329 ms 1.0349 ms]
                        change: [+46.202% +47.030% +48.196%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/128
                        time:   [2.0357 ms 2.0364 ms 2.0372 ms]
                        change: [+44.702% +44.773% +44.844%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/256
                        time:   [4.0719 ms 4.1737 ms 4.3095 ms]
                        change: [+46.218% +49.911% +54.422%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/384
                        time:   [6.0426 ms 6.0450 ms 6.0474 ms]
                        change: [+45.247% +45.313% +45.378%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/512
                        time:   [8.0369 ms 8.0400 ms 8.0433 ms]
                        change: [+44.325% +45.007% +45.447%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/768
                        time:   [12.068 ms 12.079 ms 12.093 ms]
                        change: [+45.656% +45.797% +45.974%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Constant-time variable-base multiscalar multiplication/1024
                        time:   [16.070 ms 16.080 ms 16.090 ms]
                        change: [+45.287% +45.574% +45.767%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/1
                        time:   [43.677 µs 43.699 µs 43.724 µs]
                        change: [+61.865% +62.013% +62.168%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/2
                        time:   [53.354 µs 53.396 µs 53.451 µs]
                        change: [+60.683% +60.870% +61.053%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/4
                        time:   [72.161 µs 72.232 µs 72.358 µs]
                        change: [+58.561% +58.725% +58.914%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/8
                        time:   [110.14 µs 110.18 µs 110.21 µs]
                        change: [+56.365% +56.537% +56.744%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/16
                        time:   [186.40 µs 186.46 µs 186.52 µs]
                        change: [+54.205% +54.357% +54.516%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/32
                        time:   [339.54 µs 339.68 µs 339.84 µs]
                        change: [+52.412% +53.057% +53.697%] (p = 0.00 < 0.05)
                        Performance has regressed.
multiscalar benches/Variable-time variable-base multiscalar multiplication/64
                        time:   [653.32 µs 654.65 µs 655.88 µs]
                        change: [+51.725% +52.242% +52.851%] (p = 0.00 < 0.05)
                        Performance has regressed.

fn eq(&self, rhs: &$ty) -> bool {
    unsafe {
        let m = neon::$beq_intrinsic(self.0, rhs.0);
        Self(m).extract::<0>() != 0
    }
}

Why can you extract just the first element to check equality? The docs say that the vectors are equal iff _every_ bit in the output vector is set to 1.
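To illustrate the concern with a portable sketch (plain arrays stand in for NEON vectors; `lane_mask` is a hypothetical analogue of the per-lane compare intrinsic, not the dalek code):

```rust
// Portable sketch: a per-lane compare mask (as vceqq_u32 would produce)
// can have lane 0 all-ones while other lanes differ, so inspecting only
// lane 0 reports a false equality.

fn lane_mask(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    // Simulates the per-lane compare: each lane is all-ones iff it matches.
    let mut m = [0u32; 4];
    for i in 0..4 {
        m[i] = if a[i] == b[i] { u32::MAX } else { 0 };
    }
    m
}

fn main() {
    let a = [1, 2, 3, 4];
    let b = [1, 2, 3, 5]; // differs only in lane 3

    let m = lane_mask(a, b);
    let lane0_only = m[0] != 0; // the check as written in the PR
    let all_lanes = m.iter().all(|&x| x == u32::MAX); // checks every bit

    assert!(lane0_only); // reports "equal", which is wrong here
    assert!(!all_lanes); // the vectors are actually unequal
    println!("lane0_only={lane0_only}, all_lanes={all_lanes}");
}
```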

impl u32x4 {
    #[inline]
    pub fn new(x0: u32, x1: u32, x2: u32, x3: u32) -> Self {
        unsafe { core::mem::transmute::<[u32; 4], Self>([x0, x1, x2, x3]) }
    }
}

This should include a safety note, like:

// SAFETY: Transmuting between an array and a SIMD type is safe
// https://rust-lang.github.io/unsafe-code-guidelines/layout/packed-simd-vectors.html
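
As a hedged sketch of what applying that note would look like (the wrapper type here is a portable `#[repr(C)]` stand-in for the NEON vector wrapper, not the actual crate code):

```rust
// Portable stand-in for the NEON uint32x4_t wrapper, to show the
// constructor with the suggested safety comment attached.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(C)]
struct U32x4([u32; 4]);

impl U32x4 {
    #[inline]
    pub fn new(x0: u32, x1: u32, x2: u32, x3: u32) -> Self {
        // SAFETY: Transmuting between an array and a SIMD type is safe
        // https://rust-lang.github.io/unsafe-code-guidelines/layout/packed-simd-vectors.html
        unsafe { core::mem::transmute::<[u32; 4], Self>([x0, x1, x2, x3]) }
    }
}

fn main() {
    let v = U32x4::new(1, 2, 3, 4);
    assert_eq!(v, U32x4([1, 2, 3, 4]));
    println!("{:?}", v);
}
```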


Also, out of curiosity, why the transmute and not a vld* instruction? E.g., the x86 code says this set instruction is faster:

u64x4(core::arch::x86_64::_mm256_set_epi64x(


If you're going to transmute for everything, you may as well just call const_new from here instead of duplicating the code. Actually, even better: since they're all the same anyway, just call them new and splat and remove the const_* names entirely.
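A minimal sketch of the suggested deduplication, assuming a portable stand-in type (not the actual NEON wrapper): one `const fn` constructor serves both const and runtime callers, so the separate const_* names become unnecessary.

```rust
// Portable stand-in for the vector wrapper type.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(C)]
struct U32x4([u32; 4]);

impl U32x4 {
    // A single `const fn` constructor works in const contexts and at
    // runtime, so no separate `const_new` is needed.
    #[inline]
    pub const fn new(x0: u32, x1: u32, x2: u32, x3: u32) -> Self {
        U32x4([x0, x1, x2, x3])
    }

    // splat delegates to new() instead of duplicating construction logic.
    #[inline]
    pub const fn splat(x: u32) -> Self {
        Self::new(x, x, x, x)
    }
}

// Usable in const position, as const_new was.
const ONES: U32x4 = U32x4::splat(1);

fn main() {
    assert_eq!(U32x4::splat(7), U32x4::new(7, 7, 7, 7));
    assert_eq!(ONES, U32x4::new(1, 1, 1, 1));
    println!("dedup ok");
}
```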

assert_eq!(base_splits[3], b_splits[3]);
}
}
}

Self: File looks good. Diff between this and the AVX2 edwards impl is entirely aesthetic

let (b8, b9) = unpack_pair(self.0[4]);

FieldElement2625x4::reduce64([
u64x2x2::new(vmull_u32(b0.0.0, consts.0.into()).into(), vmull_u32(b0.0.1, consts.1.into()).into()),

It seems like this mul32 pattern comes up a lot. It might make sense to factor it into its own function, like in the AVX2 version?
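A hedged sketch of what such a helper could look like, using plain integer widening in place of `vmull_u32` (the name `mul32` follows the AVX2 code; the array types are portable stand-ins):

```rust
// Factoring the repeated widening-multiply pattern into one helper.
// vmull_u32 multiplies two u32 lanes into u64 lanes; plain casts model
// the same widening behavior portably.
fn mul32(a: [u32; 2], b: [u32; 2]) -> [u64; 2] {
    [a[0] as u64 * b[0] as u64, a[1] as u64 * b[1] as u64]
}

fn main() {
    // Widening keeps the full 64-bit product, e.g. (2^32 - 1)^2.
    let p = mul32([u32::MAX, 3], [u32::MAX, 5]);
    assert_eq!(p, [0xFFFF_FFFE_0000_0001, 15]);
    println!("{:x?}", p);
}
```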

assert_eq!(x2, splits[2]);
assert_eq!(x3, splits[3]);
}
}

Self: Ran this file through Difftastic and it seems mostly the same as the AVX2 version. Need to understand the blending functions better though. Also the rotating

u32x4::const_new(44524351, 50428429, 21904953, 12608048),
),
])),
]);

Self: seems fine. Spot-checked by perturbing constants in random places and seeing that tests failed. One place it didn't fail: if you change the shift in P_TIMES_16_LO to 3 instead of 4, everything still works. That's probably due to some approximation algorithm still succeeding.
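A toy illustration of why the perturbed shift can go unnoticed (a sketch with a small prime, not the dalek constants): the added constant only needs to be a multiple of p that is large enough to prevent underflow in the subtraction, and both 16p (shift 4) and 8p (shift 3) can satisfy that for in-range inputs.

```rust
fn main() {
    // Toy modulus standing in for 2^255 - 19; the real P_TIMES_16
    // constants are per-limb, but the congruence argument is the same.
    let p: u64 = 61;
    let a: u64 = 10;
    let b: u64 = 50; // a - b would underflow without adding a multiple of p

    for shift in [3u32, 4] {
        // Computing (a - b) as a + 2^shift * p - b stays nonnegative and
        // has the same residue mod p regardless of the shift, so tests
        // that only check values mod p cannot tell the two apart.
        let k = (p << shift) + a - b;
        assert_eq!(k % p, (a + p - b) % p);
    }
    println!("both shifts give the correct residue");
}
```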
