-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use embedded broadcast to replicate constants for AVX512. #147
Conversation
Signed-off-by: Nicola Torracca <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, thanks for the work!
sha256_mb/sha256_mb_x16_avx512.asm
Outdated
dd 0x90befffa | ||
dd 0xa4506ceb | ||
dd 0xbef9a3f7 | ||
dd 0xc67178f2 | ||
|
||
|
||
PSHUFFLE_BYTE_FLIP_MASK: dq 0x0405060700010203, 0x0c0d0e0f08090a0b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as sha1_mb, these 2 need to be aligned to 64 bytes
K00_19: dd 0x5A827999 | ||
K20_39: dd 0x6ED9EBA1 | ||
K40_59: dd 0x8F1BBCDC | ||
K60_79: dd 0xCA62C1D6 | ||
|
||
PSHUFFLE_BYTE_FLIP_MASK: dq 0x0405060700010203, 0x0c0d0e0f08090a0b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be aligned to 64 bytes (same for MASK2 below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't align PSHUFFLE_BYTE_FLIP_MASK because the 4 32bit constant above it are aligned to 64 bytes, so +4*32 bit makes it aligned to 16 bytes which is the natural size for the 128bit load. You're right about the transpose masks, better to keep them aligned to avoid a split-load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't align PSHUFFLE_BYTE_FLIP_MASK because the 4 32bit constant above it are aligned to 64 bytes, so +4*32 bit makes it aligned to 16 bytes which is the natural size for the 128bit load. You're right about the transpose masks, better to keep them aligned to avoid a split-load.
The problem is not the split-load. I got a seg fault because there are aligned load instructions that need this data to be 64-byte aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's interesting, i didn't get any SEGFAULT when doing `make test'. Anyway now they should be aligned, is it working ok for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now yes :) Will merge this soon, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, if you want i can also make a quick separate pull requests to use VPTERNLOG in SM3 for the boolean functions. It's such a nice instruction that it's a shame not to use it wherever possible ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you promise it's a short one that you can send in the next few hours :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give it a try :P
avoid cacheline splits for 64bytes loads. Signed-off-by: Nicola Torracca <[email protected]>
Code is now merged, thanks for the work @Shark64! |
Here is the minimal pull request. The only thing I've sneaked in this is switching `cmp reg, 0' to 'test reg, reg': while it's less important than years ago when only test/jcc was macrofused, it's still a byte shorter and looked ugly ;). Shouldn't create any problem, i think :D .