
Masked loads and stores are unimplementable for most platforms and should not be in the standard #69

DenisYaroshevskiy opened this issue May 31, 2023 · 33 comments

@DenisYaroshevskiy

DenisYaroshevskiy commented May 31, 2023

In eve we don't have full masked loads and stores; we only have:

load[ignore_first(3) && ignore_last(1)](ptr)
store[ignore_first(3) && ignore_last(1)](reg, ptr)

Here is why

  • Masked load.

Masked load is a very weird operation.
You never need to mask intermediate elements - because allocation happens in pages, it only makes sense to mask the elements on the sides.

How does the eve one work?

Well, those just ignore the beginning and the end, so you can store to the stack and then memcpy to the destination, which is a couple of overlapping loads and stores.
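
Roughly, the store side looks like this (a sketch of that stack-buffer-plus-memcpy approach, not eve's actual code; partial_store and its parameters are made up for illustration):

#include <cstddef>
#include <cstring>
#include <emmintrin.h> // SSE2

// Stores the lanes of `reg` to `dst`, skipping `ignore_first` bytes at the front
// and `ignore_last` bytes at the back (caller guarantees they sum to <= 16).
void partial_store(__m128i reg, char* dst,
                   std::size_t ignore_first, std::size_t ignore_last)
{
    alignas(16) char tmp[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), reg);      // spill to the stack
    const std::size_t count = 16 - ignore_first - ignore_last;  // elements to keep
    std::memcpy(dst + ignore_first, tmp + ignore_first, count); // small memcpy -> a couple of stores
}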

Proposal

Do not include any notion of the masked loads and stores

@mattkretz
Owner

I tend to agree. The interesting use cases are:

  1. pro-/epilogue loads & stores
  2. (de)compressing loads & stores
  3. masked gather & scatter

The remaining question: Should masked loads and stores exist for completeness of the interface or should it explicitly not exist in order to give a hint to users that they actually want to use something else?

@mattkretz
Owner

About implementability, of course it can always be implemented, just not with a single instruction / efficient instruction sequence.

@DenisYaroshevskiy
Author

DenisYaroshevskiy commented May 31, 2023

Should masked loads and stores exist for completeness of the interface or should it explicitly not exist in order to give a hint to users that they actually want to use something else?

Hooray. I think that in order to be in the standard, all platforms should provide a non-disastrous solution and most platforms should be usable.

Otherwise you will get "never use this standard thingy" advice.

Other

About implementability, of course it can always be implemented, just not with a single instruction / efficient instruction sequence.

I have spent a lot of time on this and I don't know how you can do it in a way that a scalar loop would not be substantially faster. You want at least some parity.

pro-/epilogue loads & stores

You don't need masked loads/stores for that. You want what eve has, which is the ability to ignore some elements at the front and some at the back.

This code: https://github.com/jfalcou/eve/blob/889469eed20b18d10158230b679d936c3fbda163/include/eve/module/algo/algo/find.hpp#L44-L52

Is called for tails in find for example.

See ignore? It can be: ignore_first(n), ignore_last(n), ignore_extrema(n, m) (a combination of ignore_first and ignore_last).
There are also keep_first, keep_last, keep_between.

And this you can implement for sse2 and such efficiently, because it is a simpler operation: https://godbolt.org/z/6dfM9Mdfe
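
The trick, roughly (a sketch, not eve's actual code): for ignore_first/ignore_last the keep mask is just a comparison of a constant lane-index vector against the runtime bounds, which SSE2 can do directly:

#include <emmintrin.h> // SSE2

// Lane i is kept iff first_kept <= i < last_kept (runtime values in [0, 16]).
__m128i keep_between_mask(int first_kept, int last_kept)
{
    const __m128i iota = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                       8, 9, 10, 11, 12, 13, 14, 15);
    const __m128i below_first = _mm_cmpgt_epi8(_mm_set1_epi8((char)first_kept), iota); // i < first_kept
    const __m128i below_last  = _mm_cmpgt_epi8(_mm_set1_epi8((char)last_kept), iota);  // i < last_kept
    return _mm_andnot_si128(below_first, below_last); // i >= first_kept && i < last_kept
}

The resulting mask can then be used to blend loaded values or zero out the ignored lanes.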

  • (de)compressing loads & stores

I don't believe masked load/stores have anything to do with it.

I have not heard of a compressing load, I would love to see some code that does it, sounds amazing.
Compress store does not need it.

The main operation there that I know can be done efficiently is compress_store_unsafe.
compress_store_unsafe is still allowed to write the whole register but will compress in the middle.

Here is remove if for arm: https://godbolt.org/z/xbbMxdvve
Here is the code: https://github.com/jfalcou/eve/blob/889469eed20b18d10158230b679d936c3fbda163/include/eve/module/algo/algo/remove.hpp#L35
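
In scalar terms the semantics are roughly this (an illustrative sketch only, not eve's implementation; types borrowed from GCC's std::experimental::simd):

#include <experimental/simd>
namespace stdx = std::experimental;

// Packs the selected lanes of v to the front of out and returns a pointer just
// past them. The "unsafe" part: it may write up to a full register width, so the
// lanes after the packed ones contain garbage the caller must ignore.
template <class T, class Abi>
T* compress_store_unsafe(const stdx::simd<T, Abi>& v,
                         const stdx::simd_mask<T, Abi>& k, T* out)
{
    constexpr int N = int(stdx::simd<T, Abi>::size());
    int n = 0;
    for (int i = 0; i < N; ++i)
        if (k[i]) out[n++] = v[i];   // packed, selected lanes
    for (int i = n; i < N; ++i)
        out[i] = v[i];               // filler writes - allowed by the "unsafe" contract
    return out + n;
}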

  • masked gather & scatter

Do you actually support gather/scatter here?
I didn't use them much.
Hmm.

I think the interface for an efficient masked gather/scatter needs "load this index when garbage". If this is something you really want to tackle, I can try to write one.

@mattkretz
Owner

I don't know how you can do it in a way that a scalar loop would not be substantially faster

You can test my implementation: https://compiler-explorer.com/z/bTcfn7MqY

You don't need masked loads/stores for that. You want what eve has, which is the ability to ignore some elements at the front and some at the back.

I fully agree! I didn't mean to say that this use case should motivate the existence of generic masked loads and stores.

And this you can implement for sse2 and such efficiently, because it is a simpler operation: https://godbolt.org/z/6dfM9Mdfe

Hmm, the run-time sized memcpy call is still rather expensive. There should be a better way still.

@DenisYaroshevskiy
Author

DenisYaroshevskiy commented May 31, 2023

You can test my implementation: https://compiler-explorer.com/z/bTcfn7MqY

I think this is what you call a compress load, isn't it? Or a masked_gather, if you will?

I looked at where(k, v).copy_to(ptr, stdx::vector_aligned), seems to use https://www.laruence.com/x86/MASKMOVDQU.html
_mm_maskmoveu_si128

This instruction will kill your performance, never use it:
https://stackoverflow.com/questions/62183557/how-to-most-efficiently-store-a-part-of-m128i-m256i-while-ignoring-some-num/62492369#62492369

Hmm, the run-time sized memcpy call is still rather expensive. There should be a better way still.

omg if you come up with something I'd be very happy.

memcpy for 16/32 bytes is 2 overlapped stores. We talk about it in that same Stack Overflow answer.
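
For reference, the overlapped-store trick looks like this (a sketch assuming 8 <= n <= 16; not the exact code from that answer):

#include <cstdint>
#include <cstring>

// Copies n bytes (8 <= n <= 16) with two 8-byte loads and two 8-byte stores;
// the second pair overlaps the first, so exactly [dst, dst + n) is written.
void copy_8_to_16(char* dst, const char* src, std::size_t n)
{
    std::uint64_t head, tail;
    std::memcpy(&head, src, 8);
    std::memcpy(&tail, src + n - 8, 8);
    std::memcpy(dst, &head, 8);
    std::memcpy(dst + n - 8, &tail, 8);
}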

@danieltowner
Collaborator

I have spent a lot of time on this and I don't know how you can do it in a way that a scalar loop would not be substantially faster. You want at least some parity.

What sizes are you handling? Small simd might be faster as a loop, but read-modify-write sequences on larger simd seem to win in my experience (e.g., char elements in 256-bit AVX2).

pro-/epilogue loads & stores

You don't need masked loads/stores for that. You want what eve has, which is the ability to ignore some elements at the front and some at the back.

One incomplete part of the proposal that was discussed at the last meeting, but not decided on, is to have resize/insert/extract functions. For example, in a loop tail you can do:

resize<First_N>(mysimd).copy_to(ptr);

If N is dynamic then this won't work, and you will have to switch to using a generated mask instead.

See ignore? It can be: ignore_first(n), ignore_last(n), ignore_extrema(n, m) (a combination of ignore_first and ignore_last). There are also keep_first, keep_last, keep_between

Do they allow run-time n?

In Intel's example implementation we have `simd_mask::first_n`, which allows a runtime value to be converted to a mask (and you can obviously shift and invert it too to get related selection operations). That mask can then be passed in to a load/store, reduction, blend, or whatever. I haven't written that up as a proposal to the committee yet.
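
Roughly this, although the following is just a portable sketch and not Intel's actual implementation (types from GCC's std::experimental::simd):

#include <experimental/simd>
namespace stdx = std::experimental;

// True for lanes 0 .. n-1, false for the rest.
template <class T, class Abi>
stdx::simd_mask<T, Abi> first_n_mask(int n)
{
    stdx::simd<T, Abi> iota([](auto i) { return static_cast<T>(i); }); // 0, 1, 2, ...
    return iota < stdx::simd<T, Abi>(static_cast<T>(n));
}

Shifting or inverting the comparison gives last_n, keep_between, and so on.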

I have not heard of a compressing load, I would love to see some code that does it, sounds amazing. Compress store does not need it.

Both show up a lot in our code bases: https://isocpp.org/files/papers/P2664R3.html#permute_by_mask

Do you actually support gather/scatter here? I didn't use them much. Hmm.

Those show up in our code bases too: https://isocpp.org/files/papers/P2664R3.html#memory_permutes

@mattkretz
Owner

You can test my implementation: https://compiler-explorer.com/z/bTcfn7MqY

I think this is what you call a compress load, isn't it? Or a masked_gather, if you will?

No, see how the load and store address calculation uses the same offset? It's not compressed. It simply uses tzcnt or bsf to avoid additional branching on the mask values. The only branch necessary is for the loop.
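
The bit-scan idea, roughly (my own sketch, not the code behind the link): iterate over the set bits of the mask with countr_zero, so the only branch left is the loop condition:

#include <bit>
#include <emmintrin.h>

// Stores only the selected char lanes of v; mask holds one bit per lane (16 bits used).
void masked_store_epi8(__m128i v, char* dst, unsigned mask)
{
    alignas(16) char lanes[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), v);
    while (mask != 0) {
        const int i = std::countr_zero(mask); // index of the next selected lane
        dst[i] = lanes[i];                    // touch only the selected bytes
        mask &= mask - 1;                     // clear the lowest set bit
    }
}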

@danieltowner
Collaborator

danieltowner commented May 31, 2023

  • Masked load.

Masked load is a very weird operation. You never need to mask intermediate elements

Maybe you don't, but that doesn't mean never :-) I've seen it in our code bases, and it was added to Intel's implementation at the request of users who wanted this feature.

You could do a conventional load, followed by a conditional operator (which doesn't exist in simd yet), which would generate identical code. The mask parameter allows all of that to be rolled together for convenience. Also, having a mask parameter allows a constructor to accept a mask too, which again is convenient. Masked loads are therefore more a simpler syntax than something that makes a difference to the generated code.

  • Masked stores are awful

Their style of use, or the code they generate? They are used when you want to overwrite selected elements, both hard-coded or run-time selected positions, and that is something that when you need it, you really need it. Packet processing or telecoms requires them, for example.

Code generation quality is going to depend upon the ISA - AVX512 is very good, AVX2 is good for some sizes, not so good further back!

Masked stores are possibly more important to support in std::simd than masked loads. Masked loads can be efficiently implemented as above, but without masked stores users of std::simd would need to try to build their own using std::simd's existing API. This might be a read-modify-write sequence, for example, or some sort of loop with element extraction. This would be unfortunate when targeting processors which actually have hardware support. On those targets users would have to fall back to using intrinsics, which then removes the portability of their code. By putting the masked store into the std::simd API the simd library implementation can choose to use hardware facilities when available, or fall back to alternative implementations when they aren't, where the library's implementation should be at least as good as the user could manage for themselves.

@mattkretz
Owner

I looked at where(k, v).copy_to(ptr, stdx::vector_aligned), seems to use _mm_maskmoveu_si128

Right. This instruction should only be used if the user asks for a non-temporal store. But since WG21 took that store flag out of my proposal... 🤷 what can I do?

The bad performance is probably due to this part of the instruction documentation:

The MASKMOVDQU instruction generates a non-temporal hint to the processor to minimize cache pollution. The non-temporal hint is implemented by using a write combining (WC) memory type protocol (see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10, of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1).

And looking at the manual (emphasis mine):

[...] the processor will do the following:
• If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted.
• The non-temporal data is written to memory with WC semantics.

If such a store happens for temporal data, then yes, that's going to be expensive.

So you're right. The generic masked store in libstdc++ should not use a non-temporal store. Will file a PR (and fix).

@mattkretz
Owner

Will file a PR

Done.

@mattkretz
Owner

@DenisYaroshevskiy Okay if we "retarget" this issue to explore the addition of pro-/epilogue load and store API instead of changing anything about the existing masked load/store?

@DenisYaroshevskiy
Author

Maybe you don't, but that doesn't mean never :-) I've seen it in our code bases, and it was added to Intel's implementation at the request of users who wanted this feature.

Sorry if it wasn't clear - a full masked load doesn't buy you anything: masking an element in between two ignored ones does not do anything.

Okay if we "retarget" this issue to explore the addition of pro-/epilogue load and store API instead of changing anything about the existing masked load/store?

Let's not? I don't want to expand the scope of your proposal more than it already is. My objection is that I believe this can be way too slow to be usable, that's it.

Can you give me the SSE4.2 and NEON/AArch64 implementation of a masked store for chars that is not horribly slow? I will test that and if it's really OK: a) I'm very happy, b) I'll leave this alone?

@DenisYaroshevskiy
Author

Code generation quality is going to depend upon the ISA - AVX512 is very good, AVX2 is good for some sizes, not so good further back!

We are on the same page. Plus NEON - it's very, very bad.

Do they allow run-time n?

Yes, precisely. It's like ::ignore_first(ptr - aligned_ptr).
I don't think it should be part of this paper, I'm just bringing it up to show that a more restrictive operation can get you there with better codegen.
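
For concreteness, that runtime n in a prologue typically comes from aligning the pointer down (illustrative only; make_prologue is a made-up helper):

#include <cstddef>
#include <cstdint>

struct prologue_info { const char* aligned; std::size_t ignore_first; };

// Round ptr down to the register boundary; the lanes before ptr are the ones to
// ignore on the first iteration. reg_width must be a power of two (e.g. 16).
prologue_info make_prologue(const char* ptr, std::size_t reg_width)
{
    const auto addr    = reinterpret_cast<std::uintptr_t>(ptr);
    const auto aligned = addr & ~(std::uintptr_t(reg_width) - 1);
    return { reinterpret_cast<const char*>(aligned), std::size_t(addr - aligned) };
}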

My point is: let's not have masked stores unless you know how to do them very well in most cases.

This would be unfortunate when targeting processors which actually have hardware support

I don't think we agree here: my take - if you can't implement something remotely well for a popular platform, you should not do it.

@mattkretz
Owner

a full masked load doesn't buy you anything: masking an element in between two ignored ones does not do anything.

Extreme case: You could have an "oversized" simd and mask off an entire cache line in between. Granted, that's a very unlikely use case, but it is currently possible.

Less extreme: As Daniel said, masked loads are a short-hand for replacing

simd<T> x(it);
x = k ? x : 0;

with

simd<T> x(it, k);

Otherwise, I agree with you.

I don't want to expand the scope of your proposal more than it already is.

No, the P1928 scope wouldn't change. It's either a new paper or the issue sits here forever. 😉

my take - if you can't implement something remotely well for a popular platform, you should not do it.

I don't agree. My design principle has always been to provide an API that solves use cases and that is complete. I don't add or omit operations because of hardware support. (I started with Vc when SSE2 was all there was for x86...)

The simd<char> masked stores without hardware support are really bad if you can't do a read-modify-write, I agree. So I would be open to discuss whether the wording allows a reasonable read-modify-write implementation. It currently says:

Precondition: For all selected indices i, [first, first + i) is a valid range. [...]
Effects: Copies the selected elements as if first[i] = static_cast<iter_value_t<Out>>(operator[](i)) for all selected indices i.

The precondition allows a read of all values in between first and first + i. The question is, whether the effects clause disallows stores to those memory locations. I believe they are not observable unless another thread stores to the masked-off elements. If that store happens to interleave between the read and write of the read-modify-write sequence then the values from the other thread would be gone. And it's not a data race according to the C++ memory model.

@danieltowner
Collaborator

Can you give me the SSE4.2 and NEON/AArch64 implementation of a masked store for chars that is not horribly slow? I will test that and if it's really OK: a) I'm very happy, b) I'll leave this alone?

Does a read-modify-write work for you? The issue with that is whether it is allowed to read more values than expected: #27.

@DenisYaroshevskiy
Author

Does a read-modify-write work for you?

No, unfortunately it doesn't; that breaks a lot of things.
The most obvious one is par_unseq: you might overwrite a result from a different thread.

The issue with that is whether it is allowed to read more values than expected:

I don't believe you can read more values than expected, because you might read past the end of the page, which will cause a signal. I believe, though, that reading the values in the middle of the register is OK.

NOTE: OK in the sense that it won't crash, not in the sense that you can push it through the committee.
NOTE2: the oversized extreme case stands - point taken.

Less extreme: As Daniel said, masked loads are a short-hand for replacing ...

simd<T> x(it);
x = k ? x : 0;

is much faster on most platforms, right? I mean, it can get optimized to a masked load where one is available and good, but in general it is a different and much more relaxed operation.

I don't agree. My design principle has always been to provide an API that solves use cases and that is complete.

Ooh. But what if the resulting code is bad to the point of being unusable in many portable cases? And people will have to be told "oh, don't use this".

Anyways if we end up with clear choice:

  • We have masked loads/stores, but on SSE2-AVX2 and ARM up to version 9 they will be too slow to be usable
  • We don't have masked loads/stores, and people who need them for very modern hardware will revert to intrinsics

People can vote on it. There is a committee of people with strong opinions I heard.

@danieltowner
Collaborator

This would be unfortunate when targeting processors which actually have hardware support

I don't think we agree here: my take - if you can't implement something remotely well for a popular platform, you should not do it.

In our experience, and that of users of our implementation, masked stores are required to solve certain problems. Because it is a common issue, simd should provide a way of accessing that feature which works consistently with the rest of the API.

If simd itself doesn't provide a masked store then when the users need to do that they are going to be forced to roll their own, either by:

  • Write the masked store themselves using std::simd. For example, use read-modify-write, or a loop which extracts masked elements one-by-one and writes them to memory individually.
  • or switch to intrinsics for their preferred target. This removes any portability, it obfuscates the intent of the code, it can often be tricky to implement as std::simd allows any size of simd and intrinsics only work with selected native sizes, and it requires the user to know the target platform in detail (e.g., know that they should avoid certain instructions even if they appear to solve their problem).

By putting masked stores into std::simd it allows the implementor of std::simd to provide the best possible experience for the user. The implementor will know all about the quirks of their platform, and when and how to use different code sequences or instructions to achieve the most efficient way of solving a particular masked store scenario. The user has no need to have that knowledge or experience, and can reasonably expect the std::simd implementation to be as good as anything they might do themselves.

In summary, masked stores solve a problem that people will encounter, and putting it into std::simd itself gives the best possible chance that when it is needed, the library will do a good job of delivering the desired effect.

@danieltowner
Collaborator

Does a read-modify-write work for you?

No, unfortunately it doesn't; that breaks a lot of things. The most obvious one is par_unseq: you might overwrite a result from a different thread.

I'm not an expert on threading models and C++ guarantees across threads, but expecting two threads to interact in a sane way at such fine granularity seems to be asking too much anyway.

What else breaks?

@DenisYaroshevskiy
Author

In summary, masked stores solve a problem that people will encounter, and putting it into std::simd itself gives the best possible chance that when it is needed, the library will do a good job of delivering the desired effect.

I believe we understand each other.

Well - except for the part about interaction with std::simd and intrinsics - I believed that to be rather seamless.

My objections are:

  • people will see something in the standard and assume it's good when it's not. I personally had a lot of problems with intrinsics being very slow
  • it's not about obscure platforms in my experience, it's about very popular platforms.
  • for the common case of tail handling the interface is not a good match.

I suggest:

  • if a reasonable solution is produced for SSE4.2 and NEON - I'll back off.
  • if there isn't - let's do a vote in the standard committee somewhere.

@DenisYaroshevskiy
Author

What else breaks?

Everything that breaks is around threading. In the olden days, before threads, compilers generated read-modify-write left, right, and center.

@DenisYaroshevskiy
Author

What else breaks?

I remembered: if you are at the end of the page, you can't do it. There is just nothing there to load/store.

@mattkretz
Owner

if you are at the end of the page, you can't do it. There is just nothing there to load/store

Recall:

Precondition: For all selected indices i, [first, first + i) is a valid range. [...]

Meaning you need to extract the last index out of the mask (end = first + reduce_max_index(mask)) and ensure the loads and stores of the read-modify-write access only [first, end). If, however, the load-store-flag indicating vector alignment is passed, you don't even have to do that and can simply access complete vectors. (I'd be happy to point that out explicitly in the wording.)
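
A minimal illustration of that range restriction (not an efficient implementation and not the proposed wording; assuming GCC's std::experimental::simd for the types):

#include <experimental/simd>
#include <cstring>
namespace stdx = std::experimental;

template <class T, class Abi>
void masked_store_rmw(const stdx::simd<T, Abi>& v,
                      const stdx::simd_mask<T, Abi>& k, T* first)
{
    if (stdx::none_of(k)) return;
    const int last = stdx::find_last_set(k);          // highest selected index
    T buf[stdx::simd<T, Abi>::size()];
    std::memcpy(buf, first, (last + 1) * sizeof(T));  // read only [first, first + last + 1)
    for (int i = 0; i <= last; ++i)
        if (k[i]) buf[i] = v[i];                      // modify the selected lanes
    std::memcpy(first, buf, (last + 1) * sizeof(T));  // write back the same range
}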

I don't believe it is too far-fetched for users to understand that concurrent writes to the same pointer but with disjoint masks are

  1. a very bad idea in any case
  2. a (potentially) racy operation, if we add an opt-in (Manage fault suppression for masked reads and writes #27)

@DenisYaroshevskiy
Author

I don't see at all how read-modify-write can be an acceptable implementation of a masked store, sorry.

@danieltowner
Collaborator

Can you elaborate, please? What behaviour makes it unacceptable? Can wording fix or clarify the issue?

@DenisYaroshevskiy
Author

Can you elaborate, please? What behaviour makes it unacceptable? Can wording fix or clarify the issue?

I feel like we are going around in circles a little bit.
(sorry for inventing the syntax - didn't learn the copy one yet)

store(test ? x : load(ptr), ptr)

In my mind should not be the same as

store(x, test, ptr)

The latter should never touch the memory the mask said not to touch.
Just because the spec says it can doesn't make it right.

It is not ok to touch that memory because:
a) that write of x is visible, so any other threads can see it
b) x might not entirely fit on the page

@danieltowner
Collaborator

The latter should never touch the memory the mask said not to touch.

This is discussed in #27. A flag could be provided to say whether the user expects the memory to be untouched, or whether they allow it to be touched if that allows an efficient implementation. I think by default it should be that read-modify-write is permitted (after all, if the mask is dynamic, then potentially any value within [ptr, ptr + N) might be touched anyway). A flag would then overrule that when it isn't acceptable for cases like the ones that you highlight.

It is not ok to touch that memory because: a) that write of x is visible, so any other threads can see it b) x might not entirely fit on the page

With #27 you could say:

v.copy_to(ptr, v > 0, std::only_touch_masked_elements);

Or for the page boundary case you could introduce a syntax more like yours for dealing specifically with that situation if that made things clearer.

@DenisYaroshevskiy
Author

I see where you are coming from, however I can't agree with ergonomics or defaults.

  • there exist masked stores, and they don't touch masked-off elements.
  • if you want to read_modify_write you can do that in two lines very clearly.
  • I would not object to a separate function that does read-modify-write and is called something different, if you have a lot of use for it. I don't have a good name.

@mattkretz
Owner

mattkretz commented Jun 1, 2023

there exist masked stores, and they don't touch masked-off elements

That's irrelevant wrt. design of a complete and consistent API.

if you want to read_modify_write you can do that in two lines very clearly.

Sure. But if you want read-modify-write only as fallback for a masked store instruction then what?

I would not object to a separate function that does read-modify-write and is called something different, if you have a lot of use for it. I don't have a good name.

I don't care for a function that does read-modify-write. I.e. I don't care for how exactly things are implemented. I care for use cases and semantics. And I care for readable end-user code, hiding all the implementation-specific madness behind the simd API.

Edit: I also care for performance, of course. So I care whether things are implemented efficiently. By whatever means necessary. But the means don't prescribe the API.

@DenisYaroshevskiy
Author

DenisYaroshevskiy commented Jun 1, 2023

EDIT: I think we understand each other. That can also be voted on, I think; everyone understands the other person's position.

That's irrelevant wrt. design of a complete and consistent API.

I meant there is an existing semantic to the "masked store".

Sure. But if you want read-modify-write only as fallback for a masked store instruction then what?

If I understand it correctly, store(test ? x : load(ptr), ptr)
can be easily optimised to a masked store when that's better. So I'd rely on that.

I don't care for a function that does read-modify-write. I.e. I don't care for how exactly things are implemented.

Read-modify-write has, to me, very different semantics from a masked store.

@mattkretz
Owner

If I understand it correctly, store(test ? x : load(ptr), ptr)
can be easily optimised to a masked store when that's better. So I'd rely on that.

Strike "easily", but yes, GCC can recognize the pattern and emits a masked store instruction without the load. Good point. https://compiler-explorer.com/z/rEhEdrhM7. Clang doesn't. I'm not sure whether relying on it is a safe bet performance-wise, but it's an important data point.

I meant there is an existing semantic to the "masked store".

I'm actually not sure there is. There is no precedent in (pseudo-) standard C++ APIs for masked stores, right? So if there is any precedent then it's the behavior of a few recent CPU instructions.

From the feedback I received this was a recurring question: people didn't know how masked load/store would behave. I.e. there was no expectation - either way seemed like reasonable behavior. But for them to use it for epilogues they needed to know. And when I told them it wouldn't read past the end of what their mask allowed, that was all they cared about as an answer. Nobody followed up about the in-betweens (IIRC). That's all anecdotal, not evidence. But from my point of view there is no "existing semantic". We can define it.

All that said, I believe the default of a masked store should be "thread-safe", i.e. no read-modify-write allowed. Because that'd be more consistent with the rest of the standard library.

(mask ? x : V(ptr)).copy_to(ptr) seems ok to me. That doesn't need a named function. Especially if I ever get

std::vector d;
d[offset, x.size] = mask ? x : d[offset, x.size];

into the standard...

@DenisYaroshevskiy
Author

I'm not sure whether relying on it is a safe bet performance-wise

Let me see if I can find someone to ask.
I have seen LLVM do much more complicated things; I'm pretty sure it will be fine, but I am not a compiler backend person.

The past-the-end case is not allowed but the middle is allowed

I skipped that a bit - do you have an implementation for some platform that efficiently uses this restriction?

@mattkretz
Owner

The past-the-end case is not allowed but the middle is allowed

I didn't say that. But maybe I'm misunderstanding your point?

@DenisYaroshevskiy
Author

Yes, you didn't; I didn't know what to precisely quote, sorry 😀. Your rules for what's allowed to be replaced and what's not are very peculiar. I was wondering why - specifically, what is the implementation you have in mind?
