Masked load and stores are unimplementable for most platforms and should not be in the standard #69
I tend to agree. The interesting use cases are:
The remaining question: should masked loads and stores exist for completeness of the interface, or should they explicitly not exist, in order to hint to users that they actually want to use something else?
About implementability, of course it can always be implemented, just not with a single instruction / efficient instruction sequence.
Hooray. I think that in order to be in the standard, all platforms should provide a non-disastrous solution and most platforms should be usable. Otherwise you will get "never use this standard thingy" advice. Other
I have spent a lot of time on this and I don't know how you can do it in a way where a scalar loop would not be substantially faster. You want at least rough parity.
You don't need masked loads/stores for that. You want what eve has, which is the ability to ignore some elements at the front and some elements at the back. It is used for tails in find, for example. See `ignore`? It can be: And this you can implement efficiently for SSE2 and the like, because it is a simpler operation: https://godbolt.org/z/6dfM9Mdfe
I don't believe masked loads/stores have anything to do with it. I have not heard of a compressing load; I would love to see some code that does it, sounds amazing. The main operation there that I know can be done efficiently is remove_if. Here is remove_if for ARM: https://godbolt.org/z/xbbMxdvve
Do you actually support gather/scatter here? I think the interface for efficient masked gather/scatter needs "load this index, which may be garbage when masked out". If this is something you really want to tackle, I can try to write one.
You can test my implementation: https://compiler-explorer.com/z/bTcfn7MqY
I fully agree! I didn't mean to say that this use case should motivate the existence of generic masked loads and stores.
Hmm, the run-time sized
I think this is what you call a compress load, isn't it? Or a masked_gather, if you will? I looked at `where(k, v).copy_to(ptr, stdx::vector_aligned)`; it seems to use MASKMOVDQU (https://www.laruence.com/x86/MASKMOVDQU.html). This instruction will kill your performance, never use it.
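For reference, here is a minimal, self-contained spelling of the expression being examined (the wrapper function and the choice of char elements are mine, just for illustration):

```c++
#include <experimental/simd>
namespace stdx = std::experimental;

// Masked store of the selected lanes of v to ptr. This is the expression
// that was reported above to lower to MASKMOVDQU with libstdc++ on x86
// for byte elements; ptr must satisfy the vector_aligned requirement.
void masked_store_bytes(stdx::native_simd<char> v,
                        stdx::native_simd_mask<char> k,
                        char* ptr)
{
    stdx::where(k, v).copy_to(ptr, stdx::vector_aligned);
}
```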
omg, if you come up with something I'd be very happy. memcpy for 16/32 bytes is 2 overlapped stores. In that same Stack Overflow answer we talk about it.
What sizes are you handling? Small simd might be faster as a loop, but read-modify-write sequences on larger simd seem to win in my experience (e.g., char elements in 256-bit AVX2).
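As an illustration of the read-modify-write approach for the AVX2 char case, a sketch with intrinsics (not taken from any particular library): load the destination, blend the selected bytes in, and store the whole vector back.

```c++
#include <immintrin.h>

// mask_bytes must be 0xFF in lanes to overwrite and 0x00 in lanes to keep.
// Note: this touches (reads and rewrites) the unselected bytes too, so it
// assumes the whole 32-byte range is valid and not concurrently modified.
void rmw_masked_store_epi8(__m256i value, __m256i mask_bytes, char* dst)
{
    __m256i old    = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(dst));
    __m256i merged = _mm256_blendv_epi8(old, value, mask_bytes);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), merged);
}
```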
One incomplete part of the proposal that has been discussed at the last meeting, but not decided on, is to have resize/insert/extract functions. For example, in a loop tail you can do:
If N is dynamic then this won't work, and you will have to switch to using a generated mask instead.
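For the compile-time-N case, a minimal sketch of what such a tail can already look like with the existing std::experimental API (the undecided resize/insert/extract functions would additionally allow moving between this tail-sized simd and a full-width one; since their spelling isn't fixed, they are not shown here):

```c++
#include <cstddef>
#include <experimental/simd>
namespace stdx = std::experimental;

// Process a loop tail of exactly N elements (N known at compile time)
// without any mask: load, compute on, and store a smaller fixed-size simd.
template <std::size_t N>
void scale_tail(float* ptr, float factor)
{
    stdx::fixed_size_simd<float, N> tail(ptr, stdx::element_aligned);
    tail *= factor;
    tail.copy_to(ptr, stdx::element_aligned);
}
```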
Do they allow run-time N?

In Intel's example implementation we have `simd_mask::first_n`, which allows a runtime value to be converted to a mask (and you can obviously shift and invert it too to get related selection operations). That mask can then be passed in to a load/store, reduction, blend, or whatever. I haven't written that up as a proposal to the committee yet.
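The same pattern can be sketched with the std::experimental API by generating the mask from a lane-index comparison; `simd_mask::first_n` would simply be a more direct spelling for that first step. The helper name `first_n_mask` below is mine, not part of any implementation:

```c++
#include <cstddef>
#include <experimental/simd>
namespace stdx = std::experimental;

using V = stdx::native_simd<float>;
using M = stdx::native_simd_mask<float>;

// Mask with the first n lanes set, built from a run-time n.
inline M first_n_mask(std::size_t n)
{
    const V iota([](auto i) { return static_cast<float>(i); });  // 0, 1, 2, ...
    return iota < V(static_cast<float>(n));
}

// Loop epilogue: store only the first `remaining` lanes of x.
inline void store_first_n(const V& x, float* ptr, std::size_t remaining)
{
    stdx::where(first_n_mask(remaining), x).copy_to(ptr, stdx::element_aligned);
}
```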
Both show up a lot in our code bases: https://isocpp.org/files/papers/P2664R3.html#permute_by_mask
Those show up in our code bases too: https://isocpp.org/files/papers/P2664R3.html#memory_permutes
No, see how the load and store address calculation uses the same offset? It's not compressed. It simply uses
Maybe you don't, but that doesn't mean never :-) I've seen it in our code bases, and it was added to Intel's implementation at the request of users who wanted this feature. You could do a conventional load, followed by a conditional operator (which doesn't exist in simd yet), which will generate identical code. The mask parameter allows that all to be rolled together for convenience. Also, having a mask parameter allows a constructor to accept a mask too, which again is convenient. Masked loads are therefore more of a syntactic convenience than something that makes a difference to the generated code.
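In std::experimental spelling (which has neither a conditional operator nor a mask-taking constructor yet), the two forms being compared might look like the following sketch; on a target with a masked-load instruction, both would be expected to lower to much the same code:

```c++
#include <experimental/simd>
namespace stdx = std::experimental;

using V = stdx::native_simd<float>;
using M = stdx::native_simd_mask<float>;

// Conventional load of all lanes, then zero the unselected ones.
// Note: this reads every lane, selected or not.
V load_then_blend(const float* ptr, const M& k)
{
    V x(ptr, stdx::element_aligned);
    stdx::where(!k, x) = 0.f;
    return x;
}

// The "rolled together" form: a masked load into a zeroed register.
V masked_load(const float* ptr, const M& k)
{
    V x = 0.f;
    stdx::where(k, x).copy_from(ptr, stdx::element_aligned);
    return x;
}
```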
Their style of use, or the code they generate? They are used when you want to overwrite selected elements, at both hard-coded and run-time-selected positions, and that is something that, when you need it, you really need it. Packet processing or telecoms requires them, for example. Code generation quality is going to depend upon the ISA - AVX-512 is very good, AVX2 is good for some sizes, not so good further back!

Masked stores are possibly more important to support in std::simd than masked loads. Masked loads can be efficiently implemented as above, but without masked stores users of std::simd would need to build their own using std::simd's existing API. This might be a read-modify-write sequence, for example, or some sort of loop with element extraction. This would be unfortunate when targeting processors which actually have hardware support: on those targets users would have to fall back to intrinsics, which removes the portability of their code.

By putting the masked store into the std::simd API, the simd library implementation can choose to use hardware facilities when available, or fall back to alternative implementations when they aren't, and the library's implementation should be at least as good as what the user could manage for themselves.
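For concreteness, a sketch of the kind of hand-rolled read-modify-write sequence meant here, written against the std::experimental API (the wrapper name is mine). It touches the unselected elements too, which is exactly the behaviour debated further down:

```c++
#include <experimental/simd>
namespace stdx = std::experimental;

// Hand-rolled masked store: read the destination, blend in the selected
// lanes, write the whole vector back. Assumes [ptr, ptr + simd size) is
// readable and writable and not concurrently modified.
template <class T>
void rmw_masked_store(const stdx::native_simd<T>& v,
                      const stdx::native_simd_mask<T>& k,
                      T* ptr)
{
    stdx::native_simd<T> old(ptr, stdx::element_aligned);  // read
    stdx::where(k, old) = v;                               // modify selected lanes
    old.copy_to(ptr, stdx::element_aligned);               // write everything back
}
```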
Right. This instruction should only be used if the user asks for a non-temporal store. But since WG21 took that store flag out of my proposal... 🤷 what can I do? The bad performance is probably due to this part of the instruction documentation:
And looking at the manual (emphasis mine):
If such a store happens for temporal data, then yes, that's going to be expensive. So you're right. The generic masked store in libstdc++ should not use a non-temporal store. Will file a PR (and fix).
@DenisYaroshevskiy Okay if we "retarget" this issue to explore the addition of pro-/epilogue load and store API instead of changing anything about the existing masked load/store?
Sorry if it wasn't clear - a masked load doesn't add anything here: masking an element in between two ignored ones does not do anything.
Let's not? I don't want to expand the scope of your proposal more than it already is. My objection is that I believe this can be way too slow to be usable, that's it. Can you give me an SSE4.2 and NEON/AArch64 implementation of a masked store for chars that is not horribly slow? I will test that, and if it's really OK: a) I'm very happy, b) I'll leave this alone.
We are on the same page. Plus NEON - it's very, very bad.
Yes, precisely. My point is: let's not have masked stores unless you know how to do them very well in most cases.
I don't think we agree here. My take: if you can't implement something remotely well for a popular platform, you should not do it.
Extreme case: You could have an "oversized"

Less extreme: As Daniel said, masked loads are a short-hand for replacing `simd<T> x(it); x = k ? x : 0;` with `simd<T> x(it, k);`.

Otherwise, I agree with you.
No, the P1928 scope wouldn't change. It's either a new paper or the issue sits here forever. 😉
I don't agree. My design principle has always been to provide an API that solves use cases and that is complete. I don't add or omit operations because of hardware support. (I started with Vc when SSE2 was all there was for x86...)

The precondition: For all selected indices i,

The precondition allows a read of all values in between.
Does a read-modify-write work for you? The issue with that is whether it is allowed to read more values than expected: #27.
No, unfortunately it doesn't, that breaks a lot of things.
I don't believe you can read more values than expected, because you could read past the end of the page, which will cause a signal. I believe, though, that reading the values in the middle of the register is OK. NOTE: OK in the sense that it won't crash, not in the sense that you can push it through the committee.
is much faster on most platforms, right? I mean - it can get optimized to the masked load where one is available and good, but in general it is a different and much more relaxed operation.
Ooh. But what if the resulting code is unusably bad in many portable cases? And people will have to be told "oh, don't use this". Anyway, if we end up with a clear choice:
People can vote on it. There is a committee of people with strong opinions, I heard.
In our experience, and that of users of our implementation, masked stores are required to solve certain problems. Because it is a common issue, simd should provide a way of accessing that feature which works consistently with the rest of the API. If simd itself doesn't provide a masked store then when the users need to do that they are going to be forced to roll their own, either by:
Putting masked stores into std::simd allows the implementor of std::simd to provide the best possible experience for the user. The implementor will know all about the quirks of their platform, and when and how to use different code sequences or instructions to solve a particular masked store scenario most efficiently. The user has no need for that knowledge or experience, and can reasonably expect the std::simd implementation to be as good as anything they might do themselves.

In summary, masked stores solve a problem that people will encounter, and putting them into std::simd itself gives the best possible chance that, when they are needed, the library will do a good job of delivering the desired effect.
I'm not an expert on threading models and C++ guarantees across threads, but expecting two threads to interact in a sane way at such fine granularity seems to be asking too much anyway. What else breaks?
I believe we understand each other. Well - except for the part about the interaction between std::simd and intrinsics - I believed that to be rather seamless. My objections are:
I suggest:
Everything that breaks is around threading. In the olden days, before threads, compilers generated read-modify-write left, right, and center.
Remembered: if you are at the end of the page, you can't do it. There is just nothing there to load/store.
Recall:
Meaning you need to extract the last index out of the mask (

I don't believe it is too far-fetched for users to understand that concurrent writes to the same pointer but with disjoint masks is
I don't see at all how read-modify-write can be an acceptable implementation for a masked store, sorry.
Can you elaborate, please? What behaviour makes it unacceptable? Can wording fix or clarify the issue?
I feel like we are going around in circles a little bit.
In my mind, a read-modify-write should not be the same as a masked store.

The latter should never touch the memory the mask said not to touch. It is not ok to touch that memory because:
This is discussed in #27. A flag could be provided to say whether the user expects the memory to be untouched, or whether they allow it to be touched if that allows an efficient implementation. I think by default it should be that read-modify-write is permitted (after all, if the mask is dynamic, then potentially any value within [ptr, ptr + N) might be touched anyway). A flag would then overrule that when it isn't acceptable, for cases like the ones that you highlight.
With #27 you could say:
Or, for the page boundary case, you could introduce a syntax more like yours for dealing specifically with that situation, if that made things clearer.
I see where you are coming from; however, I can't agree with the ergonomics or the defaults.
That's irrelevant w.r.t. the design of a complete and consistent API.
Sure. But if you want read-modify-write only as a fallback for a masked store instruction, then what?
I don't care for a function that does read-modify-write. I.e. I don't care for how exactly things are implemented. I care for use cases and semantics. And I care for readable end-user code, hiding all the implementation-specific madness behind the API.

Edit: I also care for performance, of course. So I care whether things are implemented efficiently. By whatever means necessary. But the means don't prescribe the API.
EDIT: I think we understand each other. That can also be voted on, I think; everyone understands the other person's position.
I meant there is an existing semantic to the "masked store".
If I understand it correctly
Read-modify-write is, to me, a very different semantics from a masked store.
Strike "easily", but yes, GCC can recognize the pattern and emit a masked store instruction without the load. Good point: https://compiler-explorer.com/z/rEhEdrhM7. Clang doesn't. I'm not sure whether relying on it is a safe bet performance-wise, but it's an important data point.
I'm actually not sure there is. There is no precedent in (pseudo-)standard C++ APIs for masked stores, right? So if there is any precedent, then it's the behavior of a few recent CPU instructions. From the feedback I received this was a recurring question: people didn't know how masked load/store would behave. I.e. there was no expectation - either way seemed like reasonable behavior. But for them to use it for epilogues they needed to know. And when I told them it wouldn't read past the end of what their mask allowed, that was all they cared for, for an answer. Nobody followed up about the in-betweens (IIRC). That's all anecdotal and not evidence. But from my point of view there is no "existing semantic". We can define it.

All that said, I believe the default of a masked store should be "thread-safe", i.e. no read-modify-write allowed. Because that'd be more consistent with the rest of the standard library.
Otherwise we'd effectively be putting `d[offset, x.size] = mask ? x : d[offset, x.size];` (for some `std::vector d`) into the standard...
Let me see if I can find someone to ask.
I skipped that a bit - do you have an implementation for some platform that efficiently uses this restriction?
I didn't say that. But maybe I'm misunderstanding your point?
Yes, you didn't; I didn't know what to quote precisely, sorry 😀. Your rules for what's allowed to be replaced and what's not are very peculiar, and I was wondering why. Specifically, what is the implementation you have in mind?
In eve we don't have full masked loads and stores; we only have `ignore` (ignoring some elements at the front and/or back).
Here is why:

Masked load is a very weird operation.

You never need to mask intermediate elements - because allocation happens in pages, it only makes sense to mask the elements on the sides.
Masked stores are awful
For chars and shorts there is no efficient implementation for x86 SSE2 through AVX2, nor for ARM NEON.
No one can do it to the best of my knowledge
(_mm_maskmoveu_si128 is a disaster)
https://stackoverflow.com/questions/62183557/how-to-most-efficiently-store-a-part-of-m128i-m256i-while-ignoring-some-num/62492369#62492369
In loops you know how many elements you should ignore at the beginning and end, so why do this?
How does the eve one work?
Well, those just ignore the beginning and end, so you can store to the stack and then memcpy to the destination, which is a couple of overlapping loads and stores.
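A minimal sketch of that mechanism for a single SSE2 register (the names and the exact interface are mine, not eve's): spill the whole register to a stack buffer, then memcpy just the kept middle part; a small variable-size memcpy compiles down to a few possibly-overlapping loads and stores rather than a per-element loop.

```c++
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2

// Store the bytes of v to dst, ignoring `ignore_first` bytes at the front
// and `ignore_last` bytes at the back. Only the kept middle bytes of the
// destination are written.
void store_ignoring(__m128i v, std::uint8_t* dst,
                    std::size_t ignore_first, std::size_t ignore_last)
{
    alignas(16) std::uint8_t tmp[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), v);

    const std::size_t count = 16 - ignore_first - ignore_last;
    std::memcpy(dst + ignore_first, tmp + ignore_first, count);
}
```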
Proposal
Do not include any notion of masked loads and stores.