[WIP] SVE-like flexible vectors #27
base: master
Conversation
Please limit the PR to the instructions you want to add or change - and provide some background on why you want the change.
My main issue is that the PR replaces the entire proposal :) Also, there are some instructions other than just masks that are added - do you want to discuss those?
This is already true. Sorry, it wasn't very clear before, I've updated the readme to make this a bit more explicit. The masks change is still welcome - let's just fit it inside what is already set up if possible. If you feel like it can't be reconciled with the current state of the proposal I will be happy to hear why, maybe we can do something about it.
Yeah, sorry about that. This PR is not intended to be a true PR: I don't expect it to be merged. The goal is to have a solid base on which we can discuss, and I believe that an issue would not be as convenient as a PR. As soon as we are confident enough this model can work and covers most use-cases, I will retro-fit this concept into your actual proposal.
Basically, I think we should discuss anything related to the flexible vector concept. Lane-wise operations are no issue, so I mostly skipped them, but shuffles and interleaving (for instance) need a proper definition. So basically, the points I wish to see discussed:
Nice ;)
There are a few instructions in the PR we should probably merge :)
I am not sure I have all the answers, and some of the answers might be trivial (sorry about that):
So basically, the points I wish to see discussed:
- The semantics of masks (eg: do we need more/less operations?)
I think this is tied to the "use cases" question - if we have use cases for more, we can add more, and we should not add operations we don't have use cases for.
- The semantics of narrowing/widening operations
What would be the concern here? With values encoding lane types we can go between lane sizes relatively easily.
- The semantics of swizzling
That is the most "interesting" aspect of flexible length vector operations - it is very hard to do general purpose swizzle and shuffle if you don't know the length (especially shuffle with compile-time masks). However it is definitely possible to do things like concat, unpack, shift lane, etc.
- Is this model enough for most use cases?
The devil is in the details - very basic algorithms would work with even less, but if we want to port code that already works with existing vector instruction sets, we need to come up with something compatible. What sort of algorithms do you have in mind? Also, there are some examples in #5.
- Is this model efficiently implementable on "legacy" architectures (namely SSE, and Neon)?
Masks are tricky on those architectures, however the "set length" approach is tricky as well (that's why it is now optional). Originally I thought of implementing set length by zeroing out the left-out portion of the vector, but that would still be quite expensive.
However, I don't know how it will play with your optional `set_length`.
That's why I made it very explicit that length is actually a runtime constant (I don't want a true `set_length`).
Sure, I see your point - setting length might mean setting mask length, or something like that, but that is not very practical.
- `vec.m32.sign(a: vec.v32) -> vec.m32`
- `vec.m64.sign(a: vec.v64) -> vec.m64`

## Inter-lane operations
I think we need at least some of those operations anyway. Do you mind if I pull this into a separate PR?
If you're confident enough about the usefulness of those, go on.
Out of curiosity, which ones do you want to pull?
The only operation currently in is lane shift, so most of them, at least in terms of categories.
Be warned, `vec.vX.lane_shift` is not equivalent to your shift operations. This one takes 2 input vectors. This makes it general enough to implement all shift-related operations, including rotation.
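For clarity, here is a scalar sketch of how I read this two-input lane shift (the function name and the range assumption on `shift` are mine, not the PR text): the result reads lanes from the concatenation of the two inputs starting at lane `shift`, so a rotation is just `lane_shift(a, a, shift)`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a two-input lane shift. */
void lane_shift_i32(int32_t *out, const int32_t *a, const int32_t *b,
                    size_t shift, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        size_t j = i + shift;                /* shift assumed to be in [0, n] */
        out[i] = (j < n) ? a[j] : b[j - n];  /* read from the concatenation a ++ b */
    }
}
```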
Yeah, there are two flavors of this - one taking two inputs and the other taking only one. And as usual, the two major architecture families have made two different choices.
### LUT1 zero

Elements whose index is out of bounds are set to `0`.
What would this be used for - to prepare input for swizzle? In Wasm SIMD, swizzle has built-in zeroing of elements referenced by out-of-range indices.
My bad, this description is not a full description. It is an actual swizzle operation (look at the python pseudo code).
The description only shows the out-of-bounds policy.
Sorry, thought it was an instruction preparing LUT access. And I thought it wasn't a bad idea, given the pains of defining overflow behavior in `simd128` :)
What is the difference between LUT and `swizzle`?
LUT and `swizzle`/`shuffle` are basically the same, but the concept of LUT can be extended to more inputs. For instance, Neon and SVE2 have LUT1, LUT2, LUT3 and LUT4. The naming is really easy to understand: LUTn is a lookup in a table of n registers.
I should probably add LUT2 to the spec. I don't think larger LUTs are that useful, so LUT1 and LUT2 should be enough.
Also, I don't mind using the terms `swizzle` and `shuffle` for LUT1 and LUT2, but I don't really like them either.
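As an illustration, a scalar sketch of how a LUT2 with the "zero" out-of-bounds policy could be read (the function name is illustrative; only LUT1 zero is spelled out in the diff):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of LUT2 with the "zero" policy: indices select
   from the 2*n-element concatenation of a and b; out-of-range indices give 0. */
void lut2_zero_u8(uint8_t *out, const uint8_t *idx,
                  const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        size_t j = idx[i];
        if (j < n)          out[i] = a[j];
        else if (j < 2 * n) out[i] = b[j - n];
        else                out[i] = 0;      /* out-of-bounds policy: zero */
    }
}
```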
A small note about your example - if `TBL`/`TBX` is what you have in mind for `LUTn`, then SVE2 supports only `LUT1` and `LUT2` (unlike Neon), but, of course, lookup into arbitrarily large tables can be synthesized with `TBX` (and probably vector `DEC*`).
Yes, LUTn would be the WASM equivalent of TBL/TBX.
Currently, only LUT1 and LUT2 are part of this PR. The reason is: only Neon provides the operation with more registers.
I agree it is tied to use cases.
There are others I'm not sure are useful, and some that I incorrectly ported from SVE:
Also, for now, lane-wise operations have 4 mask policies:
I don't know if we actually need all 4 of them. There is a last question about masks: in a masked ISA, almost all operations take a mask.
The question here is which lanes are combined when narrowing 2 vectors together, and which lanes are selected when widening half of a vector. Also, do narrowing operations take multiple inputs to form a full vector, or do we just fill half of the output?
Full-width shuffles with compile-time masks are out of the question. The best we could propose is full-width runtime shuffles, where the index vector comes from an eventual … The common swizzling operations are, however, already defined here:
Full-width runtime shuffles are handled with …
The latter makes it simpler to implement larger LUTs (ie: from more inputs), but might not be super useful as it is already possible with the former policy. Also, SVE and Neon provide LUT2, LUT3 and LUT4 (respectively taking 2, 3, and 4 input vectors). There is a potential issue with those LUT operations with …
I know that's the tricky bit... Here is a small list of algorithms that definitely need to be implementable:
I am also personally interested in more algorithms:
I think the best way would be to have a way for the engine to optimize away the mask when it is known to be full (or empty).
Right. SVE is limited to 2048 bits anyway, but this could bite on RVV. FYI, there seems to be a tendency there to impose at least a 64K element upper bound (previously there was none). I suspect that just wrapping around is the best we can do.
Out-of-bounds policy is completely orthogonal to maximal vector length. I think for LUTs (ie: shuffles), out-of-bounds as 0 (or fallback) is really useful to chain LUTs when the actual LUT is larger than a single vector.
For iota(), wraparound seems much more useful than 0 - it would still allow differentiating odd and even lanes.
Most operations on lanes use the wraparound policy, namely:
Yes, you can chain them with compare+select, but that could be less efficient on some architectures, especially Neon and SVE.
I think there is consensus, at least for some part, that narrowing concats the inputs and widening uses low/high.
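A scalar sketch of that convention, assuming truncating narrowing and sign-extending widening (the names and element types are only illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Narrowing concatenates: the narrowed lanes of a, then of b, fill the result. */
void narrow_i32_to_i16(int16_t *out, const int32_t *a, const int32_t *b, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i]     = (int16_t)a[i];
    for (size_t i = 0; i < n; ++i) out[n + i] = (int16_t)b[i];
}

/* Widening takes the low (or, symmetrically, the high) half and extends it. */
void widen_low_i16_to_i32(int32_t *out, const int16_t *a, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = (int32_t)a[i];
}
```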
Change by Florian Lemaitre, originally made in PR WebAssembly#27: WebAssembly#27
Hello,
I also had a chance to look at the proposal.
Several other comments:
... let mask types have sizes and have a way to load/store them in their natural, architectural representation, or have a way to load/store them in a compact format (1 bit per element).
The second would make the most sense as an abstraction layer, but could be slow on some architectures (Neon for sure, maybe SVE).
Not great for SVE either.
* `vec.mX.index_first|last`: transform a mask into a lane index, might be useful to chain with lane index operations (like insert or extract), but potentially hard to implement on SVE/RISC-V V.
IMHO it would not be hard to implement with SVE, but it would be very inefficient because it would involve a round-trip from the predicate registers to the general-purpose ones and back. Taking the combination of `vec.mX.index_first` and `extract_lane` as an example, my expectation would be that compilers would pattern-match the combination and just generate a `LASTB` instruction (simplifications are possible in specific cases). From SVE's perspective, the ideal solution would be to have an operation that extracts an element based on a mask value, but I am not sure how well that maps to other architectures.
- `vec.v16.UNOP(a: vec.v16) -> vec.v16`
- `vec.v32.UNOP(a: vec.v32) -> vec.v32`
- `vec.v64.UNOP(a: vec.v64) -> vec.v64`
- `vec.v128.UNOP(a: vec.v128) -> vec.v128`
Your proposal specifies 4 different `vec.v128` interpretations - how do we determine what the expected interpretation is (same for the binary arithmetic operators)?
This might have implications for the representation of 128-bit element masks (which are not supported natively by SVE).
Actually, there are many more ways to interpret a `vec.v128`. Basically, a `vec.v128` is a vector of `v128` (from fixed-size SIMD), so there are as many ways to interpret a `vec.v128` as there are to interpret a `v128`.
This section is a simplification of what the actual section will look like, and the interpretation of the content of the vector will be part of the instruction, eg:
- `vec.i32.add(a: vec.v32, b: vec.v32) -> vec.v32`
- `vec.f32.add(a: vec.v32, b: vec.v32) -> vec.v32`

As the semantics of element-wise unary and binary operations are trivial, they have not been detailed in this proposal yet.
OK, they do look kind of trivial, but they still raise some questions - for example, aren't the unmasked `vec.v128` operations completely redundant?
That's why I think it could be useful to state at least some of the masked operations concretely.
I don't expect `vec.i8x16.add` to be a valid operation. The `vec.v128.UNOP` operations are for operations that deal with a `v128` as a block, like reverse bytes. I'm not sure how many of those operations there will be.
Oh, that's completely different; I was using the examples discussed in the description, i.e. `neg` and `not`. If you have such block use cases in mind, it's probably worth fleshing out the proposal a bit.
Applies swizzle to each v128 of the vector.

- `vec.i8x16.swizzle(a: vec.v128, s: vec.v128) -> vec.v128`
Another small remark: the `LUT*` operations take the index vector as the first parameter, while in this case it is the second one - is that for consistency with the WebAssembly SIMD specification?
Yes, it is for consistency with the `i8x16.swizzle` operation.
Thanks!
After thinking about it, I think we should provide multiple instructions to let the user decide how they want to store the masks:
Extracting a value based on a mask value would be a nightmare on x86, as you would have to first extract the mask into a general-purpose register, then find the index of the first set bit. So what I proposed should somehow minimize the overhead across architectures. And yes, some pattern matching could make it more efficient.
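For illustration, a rough sketch of that x86 round-trip with SSE intrinsics (my own example, not code from the proposal): move the mask to a general-purpose register, find the first set bit, then pick the lane through memory.

```c
#include <immintrin.h>

/* Emulated "extract the first active 32-bit lane of v according to mask". */
float extract_first_active(__m128 v, __m128 mask) {
    int bits = _mm_movemask_ps(mask);   /* 1 bit per 32-bit lane */
    if (bits == 0) return 0.0f;         /* arbitrary choice for an empty mask */
    int idx = __builtin_ctz(bits);      /* index of the first set bit */
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[idx];                  /* round-trip through memory/GPRs */
}
```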
I think there is consensus, at least for some part, that narrowing concats the inputs and widening uses low/high.
While SVE makes it possible to implement different approaches, there is hardly a consensus. For example, of all the narrowing and widening conversion operations defined in this proposal, only the widening integer conversions map to a single instruction, `SUNPK*` or `UUNPK*`, e.g. `SUNPKHI` (these instructions happen to be the building block for converting between the native SVE approach, i.e. overlapping even/odd elements or bottom/top in SVE parlance, and other representations).
And yes, some pattern matching could make it more efficient.
Keep in mind that may require pattern-matching across basic blocks, which might prove a challenge, especially for simpler JIT compilers.
`idx` is interpreted modulo the length of the vector.

- `vec.s8.extract_lane(v: vec.v8, idx: i32) -> i32`
While you do provide a motivation for avoiding immediate parameters, I think it is worth mentioning the trade-off - immediates make it easy to specialize the operation (and this is one advantage of the original proposal), which is particularly important for SVE, since explicit indices are not its natural approach, as discussed. E.g. extracting lane 0 (which I suspect is going to be very common, if not the most common) can be implemented with just a Neon move. TBH I don't think Neon fares much better in this case (explicit indices are fine, but they have to be compile-time constants).
Is your expectation that most indices are going to be compile-time constants, in which case pattern-matching operations would help?
I do expect indices to be almost always constants. And I do expect WASM engines to detect such constants with a constant folding pass (which should be both simpler and more robust than pattern matching, I assume).
We could have two instructions: one with runtime indices, and one with immediates. But I see two problems:
- it doubles the number of opcodes,
- except for 0, the index will usually not be an immediate as it will most likely depend on the vector length.
The vector length is a "platform constant", so can be detected by the engine as being constant, but is not an immediate. So constant detection looks more useful than immediates.
* except for 0, the index will usually not be an immediate as it will most likely depend on the vector length.
BTW that might apply to more values, depending on how the discussion about the permissible vector lengths goes (given that you seem to be leaning towards powers of 2). E.g. in the case of SVE everything that fits within the lowest 16 bytes is fair game.
I know that pattern matching has limitations, and that's why the overhead without pattern matching should be minimal. Pattern matching should be the cherry on top of the cake. Most of the time, it would be possible to duplicate the instructions to overcome the genericity issue (in that case, one instruction taking a scalar index, and one taking a mask), but I think the number of opcodes is rather limited, so I chose the one that I thought had the least overhead overall.
Hi, I've done a little experiment to see what the effects are of using masked memory operations vs using a vector loop with a scalar remainder. I've used the most basic loop I could think of:
And this is compiled with clang-13 on Linux, using various pragmas to disable loop unrolling, set the vector width and enable/disable predication. Below is the relative execution speed of the loop with the different pragmas.
I have no idea about AVX-2 and Skylake, but the super-linear results for 4xi32 and 8xi32 surprised me. What wasn't so surprising was the negative effect that predicated memory operations have on performance. It would appear that using the native masking operations roughly halves the throughput compared to the non-predicated version. On the Neon-based Cortex-A72, we don't have any native masking support, so performance is significantly worse than the scalar loop. So I can't see how introducing masked memory operations into wasm is currently viable, when they're not supported well (or at all) on the majority of currently available hardware.
@sparker-arm I don't know what you measured, but I think you have put too many constraints on the compiler for your benchmark to be relevant. I don't get where you expect masks to be used in your code. The compiler should not generate mask instructions for the loop body because you have no conditionals. The only place it would make sense in your code would be in the epilog, but it is hard to measure the time of the epilog. If your loop is too long, the epilog time would be negligible; if it is too short, you will measure the time of your timer, not your code.

So I gave a try at measuring the performance of masked stores on x86_64, but in the most complex case: ie, masked stores at every single iteration, and with random masks (not the special-case ones from an epilog). The kernel measured looks like this:

```c
const int N = 1024;
int A[N], B[N];

void scalar() {
  for (int i = 0; i < N; ++i) {
    int b = B[i];
    if (b) A[i] = b;
  }
}
```

The complete list of versions and their descriptions is available in this gist: https://gist.github.com/lemaitre/581f3416dc3abc9944d576d08c2a444b

Here are the results on an i9-7900X (time in cycles per point):
Some caveats: the v0s (*) are only valid in a monothread context, and SSE4_v3 uses the maskmov instruction, which bypasses the cache and thus has a huge latency, but I only measured the throughput here. SSE4_v1 and v2 are plain old scalar emulation (so directly usable on ARM). So basically, scalar emulation is on par with plain scalar (SSE4_v2), and even possibly faster if we know there is a predictable pattern (SSE4_v4). For prologs, this can be made even faster because the masks have a special shape and can be further optimized.
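As a rough idea of what such a scalar emulation looks like, here is a minimal sketch (my own, not the code from the gist) of a masked 4x i32 store done lane by lane, so inactive lanes are never written:

```c
#include <immintrin.h>
#include <stdint.h>

/* Per-lane emulation of a masked store: a lane is written only if its mask
   lane is non-zero, so there is no race with other threads on inactive lanes. */
static void masked_store_epi32(int32_t *dst, __m128i v, __m128i mask) {
    int32_t vals[4], m[4];
    _mm_storeu_si128((__m128i *)vals, v);
    _mm_storeu_si128((__m128i *)m, mask);
    for (int i = 0; i < 4; ++i) {
        if (m[i]) dst[i] = vals[i];
    }
}
```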
Hi @lemaitre. Each loop was run over 100,000 elements and a minimum time was taken from 10 runs. I'm using predication to remove the epilogue, so all the memory operations in the loop are masked. I see this is one of the main reasons for having flexible vectors: so that wider vector units aren't penalised for smaller loops, and also so we don't have to distribute multiple versions of a loop. Or do you envision a compiler producing different versions of each loop? And pardon my ignorance, but why are you only looking at stores?
Over such a large array on a simple computation like yours, you should have been limited by L3 cache bandwidth, not by masking.
I don't think that is a viable way to generate efficient loops, as even on AVX512 (with native masking) the overhead of computing the mask every single loop iteration can become non-negligible (even though the masking itself is). So basically, the compiler would generate exactly 2 versions of the loop: without masks for provably full iterations, and with masks for the remainder. If we want, we could have specialized masks for more optimized epilog handling, but it is not even mandatory.
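A minimal sketch of this two-version structure, using SSE as a stand-in for the platform vector width (my own illustration, not code from this thread): the body runs unmasked and a single remainder iteration uses an emulated masked load/store.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void add_one(int32_t *a, size_t n) {
    size_t i = 0;
    /* Version 1: provably full iterations, no masks at all. */
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)&a[i]);
        _mm_storeu_si128((__m128i *)&a[i], _mm_add_epi32(v, _mm_set1_epi32(1)));
    }
    /* Version 2: one masked remainder, masking emulated lane by lane. */
    if (i < n) {
        size_t rem = n - i;
        int32_t buf[4] = {0, 0, 0, 0};
        for (size_t k = 0; k < rem; ++k) buf[k] = a[i + k];   /* masked load  */
        __m128i v = _mm_loadu_si128((__m128i *)buf);
        _mm_storeu_si128((__m128i *)buf, _mm_add_epi32(v, _mm_set1_epi32(1)));
        for (size_t k = 0; k < rem; ++k) a[i + k] = buf[k];   /* masked store */
    }
}
```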
Because that's the hardest part of masking. Masked loads could be done with a full load and in-register masking. For unaligned loads, it might be necessary to add a check to see if the full load might cross a page boundary, and fall back to a slower load in that case (because inactive lanes should not trigger a segfault). We could even skip the check, and just trap the segfault. The problem with stores is that you should not write to inactive lanes, because another thread might write to them: the full-width store can introduce race conditions. That's why my v0s are wrong. If I make the masked store emulation as fast as scalar, I consider it a win, because I can write the rest of my kernel with SIMD without any extra overhead.
Just a quick note - in practice aren't almost all loads effectively unaligned ones? The alignment specified by the …
I also think so. At least if we keep the hint semantic. Personally, I would much prefer to make it a constraint.
Sure, considering each branch in isolation, the branching pattern should be mostly regular, but the branch density might have negative interactions with branch prediction (which wouldn't be an issue with the scalar version). That could be compensated for by inserting no-op instructions, at the expense of code size.
First, the naive (branchy) scalar version will eat up the same amount of "branch predictor space" as my hypothetical branch on page-boundary crossing (if the scalar loop is unrolled, then scalar would be worse in this regard). Then, I wrote a branchless masked store that is on par with the branchless conditional scalar store. We could have the exact same thing for loads. Finally, I have already mentioned it multiple times, but we could trap segfaults and check if they come from a legitimate masked load or not. The probability that such a masked load actually segfaults on inactive lanes is so low that the overhead from the signal is not relevant.
Okay.... to make you happy, 1000 elements:
This is reasonable, but you'll still penalize SSE and Neon for kernels with small loop iterations, creating a performance cliff on certain architectures.
Right... and as mentioned, that's all of them?
Do you know of any compilers that do this currently?
This sounds like a bad design choice: a compiler shouldn't take valid code, make it invalid, and then rely on all the runtimes to handle the fault. We (runtimes) would most likely just have to scalarize, and the branch density would not be good.
No, I did not make myself clear enough. My point was that if such a simple kernel is limited by something other than memory bandwidth (or at least L3 bandwidth) for 12 MB of data, your benchmark may be flawed somewhere (or you have impossibly fast memory). But as you don't provide the code, I cannot tell you more.
How is it penalizing SSE and Neon? Can you write an epilog that is faster than using masked loads/stores? Scalar will be slower. And the larger the vector, the more expensive a scalar remainder will be. A vector remainder has no such problems, but is only possible using masked instructions. The real point is: not penalizing old archs just for the sake of making recent ones faster. But introducing masked instructions does not cripple SSE/Neon, because the alternative would be scalar, which is not faster. In actual code, scalar would even be slower because the whole loop would be scalar, while here only the masked memory accesses would be.
You don't get it. If your scalar code has a branch for the conditional access, my solution does not introduce any extra ones (one branch in both cases). If your scalar code does not have a branch, fine, we can also implement masked load/store without any branch (SSE4_v2); it would be just a bit slower than the branchy one on predictable patterns, but still as fast as the scalar one (or even faster on more complex code).
The compiler has nothing to do with any of that! The compiler will just generate the WASM bytecode that will contain masked memory instructions. Then, the WASM engine (ie: runtime) will convert this bytecode into actual machine code, and this is where the page-boundary crossing check and/or the signal trap would be integrated. As it is done by the runtime, the runtime itself can be sure that the trap will be there.
I don't get your point. First, it is possible to implement masked load/store without any branch (have you checked my code?), even if it is a "scalar emulation". Then, the signal approach does not introduce any branch either. Only when a load is at the edge of a system memory allocation would it be possible to trigger the fault. It is very unlikely to happen in real code, and thus this case can be slow, and we would not care.
The only paper I've seen on this approach is this, and it certainly shows potential. If you have some more links I'd be interested in having a look. But for all your assertions about this approach being faster, I'd like to see some examples of a compiler choosing to vectorize, using masked loads and stores, for an architecture that doesn't support them. Having an existing production system that is able to make sensible choices would appease me completely; otherwise it seems like a leap of faith in what we think a wasm compiler will be able to do. And if a wasm compiler can't use this stuff effectively, then it will just be disabled - and nobody wants that!
Treat my example as an epilogue: scalar is faster on my raspberry pi using the current magic in LLVM. I think the problem here is that we're thinking about masking from two different angles: I think you're mainly considering removing control flow, and I'm considering how to represent a vector loop in the 'flexible' manner. I'd suggest that your primary use case is a separate issue from enabling wider vector widths and is, instead, the next step in the Wasm SIMD saga.
I don't care about what compilers are doing now. I care about what we can do. And we can do much better in that regard. The kernel looks like this:

```c
void func() {
  for (int i = 0; i < N; ++i) {
    int a = A[i];
    for (int j = 0; j < a; ++j) {
      B[j] += 1;
    }
  }
}
```

I have 5 versions:
The number of iterations in the inner loop is small and random. The range is given in the header of the table:
We can clearly see that the scalar remainder is slow on tiny loops, while my branchless masked remainder does a great job and is quite close to the no-remainder code. The most impressive aspect of this is that the loop body is ridiculously tiny (a load, a plus, and a store), so this is where the cost of the masking will be the most visible. But it still outperforms the scalar remainder nonetheless.
I agree with that. And my results show that it can be effective. Ok, for now it is hand-coded, but if you look at the code, you will see that automatic generation of such masked memory accesses will be easy.
I don't consider your example at all, as your results show there is something weird with your implementation, yet you do not provide it for me to check why.
My stance is that masks are needed anyway, so we can just use them to implement remainders. I showed that, performance-wise, it is nice, even though the masks are emulated.
Oops, closed by accident.
@sparker-arm I am not sure about the point you are trying to make. Is it that AVX2 (which your CPU has) doesn't have predicates? Or that there are challenges with trying to emit them in autovectorized programs? There are obvious differences in predicate support between different ISA generations and different architectures, but I am confused about how your example is supposed to demonstrate those.
Specifically, I am not quite sure what you are observing when you are adding loop pragmas to a scalar …
(This is a copy of the original table from #27 (comment), before the reduction in size, to avoid scrolling up and down.) If the effect of vectorization were a 3 or 8 times drop in performance, as in this data, then nobody would really be using it :) Autovectorization is a multi-stage process, where first the compiler has to find enough parallelism, then select the right operations, and so on, all of it interacting with other transformations. By adding extra pragmas you are likely disrupting it somewhere. Two more symptoms are that the supposedly predicated code is actually running on a non-predicated instruction set, and that increasing the vector size beyond what the platform supports yields better performance on x86. In a way, the comparison between predicated and non-predicated vector code running times runs counter to the point you seem to be making about mask support, but I have doubts about this methodology in general. If you want this to be a bit more thorough, it would be good to list the assembly for each kernel, paired with the clang commands and full sources with the pragmas (in a gist or a separate repo).
Sorry @penzn, the table shows the relative performance / speedup so hopefully that makes a bit more sense now! This is how I'm using the pragmas, to force the vectorizer into the various settings, and these work when targeting aarch64 or x64.
@lemaitre I'm still trying to get my head around your benchmarks; it's a lot of code for someone who doesn't know x86! I don't think I understand how your examples are valid in the wider context of wasm though: you can write branchless remainder code for x86, whereas we cannot for aarch64. In aarch32/thumb-2 this would be okay, but I'm not sure what other architectures support conditional load/stores. I'm also thinking about the feasibility of using traps to handle 'bad' vector loads. For an OoB load, how would we (runtimes) determine that the load is actually 'okay'?
Even though …

Now, the code you generated with clang does work, but is horrendous: https://gcc.godbolt.org/z/zs38eMveK

EDIT: The x86 version looks better, but it falls back to scalar operations, not masked vectors: https://gcc.godbolt.org/z/79brKWoxz

The thing you seem to have missed is that the loop body does not need predication, only the remainder. This is a very simple way to make the predication cost negligible. In fact, that's what my code tries to benchmark. The inner loop is small and has a remainder (implemented either in scalar or with predicated load/store), and the outer loop just repeats the inner one with a random number of iterations. That way, I can measure the time taken by the loop while the remainder is not negligible. Note that my code is explicit about how to do the remainder. Of course, the method implies that the loop body is duplicated, but I think it is a "necessary evil" to have efficient loops.
We would need a list of masked memory loads. If the current instruction address is in the list, it was a masked load, and a more robust masked load should be used. In order to keep the list small, we could make this list per function. Or we could add some special no-ops before or after the load that the signal handler could check to see if the load was a masked one (and where the mask is). Finally, because WASM has only a single memory region, we could reserve an extra page at both ends of the region to ensure a masked load with at least one valid lane would never fault.
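A sketch of what that could look like on x86_64 Linux (the `masked_load_sites` table and `handle_masked_fault` helper are hypothetical names for what the engine would provide; the gregs/REG_RIP layout is glibc-specific):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <ucontext.h>

/* Hypothetical engine-provided data: addresses of full-width loads that
   emulate masked loads, and a fallback that redoes the load lane by lane. */
extern const uintptr_t masked_load_sites[];
extern size_t masked_load_site_count;
extern void handle_masked_fault(ucontext_t *uc);

static int is_masked_load(uintptr_t pc) {
    for (size_t i = 0; i < masked_load_site_count; ++i)
        if (masked_load_sites[i] == pc) return 1;
    return 0;
}

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)info;
    ucontext_t *uc = (ucontext_t *)ctx;
    uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    if (is_masked_load(pc)) {
        handle_masked_fault(uc);   /* emulate active lanes, skip the instruction */
    } else {
        signal(sig, SIG_DFL);      /* not one of ours: re-raise with default action */
        raise(sig);
    }
}

/* Registration (done once by the engine):
   struct sigaction sa = { .sa_sigaction = segv_handler, .sa_flags = SA_SIGINFO };
   sigaction(SIGSEGV, &sa, NULL); */
```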
By this I hope you mean that the source program will allocate extra memory in the wasm heap and not that the implementation will do this for you, since most practical Wasm implementations will insist on controlling the memory beyond the heap and use it for bounds check elimination / address calculation optimization purposes (among other things).

A couple of other drive-by comments about dealing with unaligned accesses that overlap the end of the heap in practical implementations:

It is true that one can use a signal handler to fix up a partially-out-of-bounds or unaligned access, and Firefox did that for a while for in-bounds unaligned accesses on ARM, but it's a mess in practice - kernel data structures necessary to do so are not always documented, and system headers are frequently not shipped in a usable form, requiring difficult workarounds and sometimes a reading of licenses that makes some people nervous.

For partial-OOB stores the CPU specs have some tricky wording. Unaligned stores on ARMv8 will revert to byte-at-a-time semantics, and it appears to be implementation-defined (since the Pixel 2 Snapdragon and Apple M1 do this differently in my experience, though I need to investigate details here further) whether an address violation / accessibility check is performed for the entire store before the first byte is stored or whether each byte store is checked individually. As a consequence, an unaligned store that overlaps the end of the heap may store partial data at the end of the heap, or not. This is a bit of a mess and may require a Wasm implementation that uses page fault tricks for bounds checking to insert its own guard before each store on some CPUs if it can't prove that the store is completely in-bounds.

(The alignment hint in wasm memory accesses is generally ignored by implementations, I believe - certainly we ignore these hints in all cases. Unaligned accesses have to be handled anyway, CPUs are increasingly able to deal with unaligned accesses transparently, toolchains don't produce the hints, and there's no benefit to generating special code for accesses hinted as unaligned.)
Ah, thanks, that makes sense now. I'll try this on my rpi.
Yes, I was thinking more about the complications of finding the mask value. So more like a map from a load to a compare, instead of a list, but then I'm still not sure what I'd do with that. But that may be because I've never written a handler before... For the case where we have a 'masked' load, which will be merged in-register, the compare for the select may not have even been executed yet for that iteration, so how do I calculate my mask and recover gracefully? If I had source-level exceptions, I think I could just have the scalar loop body in the 'catch' block, but I'm really not sure at the runtime level.
Interesting! And just to point out, V8 on aarch64 (android/linux) currently doesn't use trap handlers for OoB. I'm definitely not a fan of the idea of relying on trap handlers, and it's why I'm more inclined to think that the compiler would insert some runtime checks instead... Then I think we should be able to have a vectorized loop that wouldn't fault and a remainder to handle the rest. I need to think some more about this.
@sparker-arm thank you for clarification on pragmas.
OK, that makes better sense, I thought it was relative execution time. Raspberry Pi vector performance looks a bit low though.
I do mean that the runtime adds the extra page around the heap, because the runtime cannot be sure that the user did it, so it has to consider that they have not, and thus the runtime needs another mechanism to ensure it works in all circumstances. Also, I doubt that many oob accesses can be caught this way, because WASM has a single memory region, so most software oob accesses would be within the region and thus not catchable that way. This is very different from native processes, where you can easily detect stack overflows using this method.
You don't really need to access kernel data structures to do the fallback. You could just make the signal handler set the context inside a function where all registers are callee-saved, and that returns just after the "not-so-faulty" load. Of course, the signal handler should set the stack frame correctly, and the function should return the loaded value in the same register as the one used in the actual load, but it is definitely possible. And if this special function faults, that would be fine, because the memory accesses inside it are not registered as masked ones, and the signal handler will thus forward the signal. Maybe it would be possible to just execute the fallback inside the signal handler, but I don't know how current platforms handle nested faults.
I don't get your point here. To me, it should be fine to say oob (as in: outside the heap) is undefined behavior, instead of saying it traps. Also, this has nothing to do with masks or even flexible vectors, because this problem can already arise today with 128-bit SIMD. Also, SIMD memory accesses must be allowed to slice, because no arch makes any guarantee about it. So the fact that some archs fall back to byte accesses is actually fine.
I am disappointed about alignment handling in WASM SIMD. A hint only is of no use, and it would have been much better to make it a proper constraint with undefined behavior if not met at runtime. But I bet that's not the case because people don't want UB...
You got it. Basically, the map will tell you in which register the mask is, and the signal handler would just make the call to the fallback implementation (see above).
That is not the role of the compiler. The compiler would just generate WASM masked loads/stores (for the remainder only, or for the whole loop and skip the remainder, as it pleases). The runtime checks (if any) would be generated by the runtime (ie: WASM engine) in case there are no masked load/store instructions on the runtime platform. And just to be clear, I do not object to that. I just want to explore multiple ways to do it. Maybe some runtimes will choose different strategies, and that's fine.
That would definitely be the case if the loop body is not predicated, and only the remainder actually uses masks.
Sure. My point is merely that the runtime has to balance needs, and that the mechanism suggested in this thread to allow a wide read to succeed even when technically OOB is (so far as I can see) at odds with the mechanism runtimes already use to perform bounds checking. Many runtimes will probably prioritize the latter over the former. An argument that the former technique can be used to improve the performance of a computation has to come with the caveat that the technique may not be available in many runtimes.
I don't know what you mean by that. Firefox uses the page protection mechanism to catch almost all OOB accesses on 64-bit systems; normally we emit virtually no explicit bounds checks at all.
And for that you need access to "kernel" data structures. While integer registers are in the well-defined user-exposed part of the sigcontext and can be updated easily, floating point registers are not always so - at least not on ARM. For that you need to know the layout of the hidden bits of the sigcontext.
Undefined behavior is not acceptable on the web (and even implementation-defined behavior is undesirable), and yeah, this is a web context. In wasm an OOB access causes a trap which terminates the program, but the surrounding JS code can catch that trap, has access to the program's memory, and can in principle (and frequently will) run more code in it. You can argue about whether that's completely sensible, but it will happen. In that case, the rogue writes near the end of the heap need to be avoided.
So from a runtime implementer's perspective, it sounds like a big headache to try to emulate the masking, so I somewhat doubt it will be done - there's already enough to do! But my realisation yesterday was that, from a compiler implementer's perspective, we will treat WebAssembly the same as any other backend, and only use features when it is likely to make sense for the majority of users. This doesn't have to be as coarse-grained as a wasm proposal, but can be at the opcode level too. Depending on how fast this proposal moves, the majority of hardware and runtime support might not be ready to take advantage of everything here, and that's fine. So I think this would be a really great use case for feature detection - the compiler can be less pessimistic while producing performance-portable code for everyone.
Agreed.
Sorry, I mixed up "C oob" and "runtime oob". A realization I just had: if you use signals to catch oob accesses, you can also use them to handle masked accesses. So there would be no need for an extra page.
I thought that FP registers were neither saved nor restored automatically by a signal handler. If that's indeed the case, that means you can just access FP registers directly.
I believe you cannot ensure that "for an unaligned access, the CPU will trap before any write happens". But I think that's not an issue. If the code "segfaults", the dev should consider their code buggy, period. We should not make it harder for ourselves just because people are allowed to write buggy code.
I think you underestimate the effort that goes into the runtime in general. I don't see emulating masking as harder than most of the features implemented in the runtime. But it's true that it's a complex beast.
| ISA | SIMD width |
|:--------------|:-------------|
| SSE - SSE4.2 | 128 |
| AVX - AVX2 | 256 |
avx and avx2 instructions can work on both 128-bit and 256-bit vectors
I think the idea was that this is the maximum width
OK, fair enough, but at least in the upcoming AVX10.1 and AVX10.2 the maximum vector width (256-bit or 512-bit) will depend on the hardware, as the 512-bit width is optional in these standards.
The idea is that if you have native support for 256-bit vectors, you most likely want to get all those 256 bits from the flexible vectors. Basically, flexible vectors will always give you the largest vector supported by your platform.
So if your hardware with AVX10 has 512-bit vectors, you will have flexible vectors of 512 bits, but otherwise you will have only 256 bits. It works the same for SVE: you will get the maximum width.
|:--------------|:-------------|
| SSE - SSE4.2 | 128 |
| AVX - AVX2 | 256 |
| AVX512 - | 512 |
AVX512 instructions can work on 128-, 256- and 512-bit vectors. Also, there are AVX10.1 and AVX10.2 being prepared, which will be a superset of AVX512 when it comes to the number of operations, but they will make 512-bit vectors optional.
Yes, I need to update this.
It seems that there is a bit of activity back on this PR (omg, it's been 3 years). Is there any interest from you in me trying to make it compatible with the current status of flexible vectors?
If you have time, we have been working mostly on preparation for CG presentation for now. Do you want to join SIMD meetings again? :)
Hello everyone, this is an attempt to formalize flexible vectors in a model that mimics SVE, as an alternative to #21. The main goal of this PR is to discuss the model, leaving the bikeshedding aside (at least for now).
The main idea is the following: each target architecture has a runtime-constant SIMD width (ie: it cannot change during execution of the process), but this width is not known at compile time because we don't know the target architecture at that point.
All vectors have a width that is equal to this target SIMD width.
Smaller vectors are handled with masked operations. Each operation defines how it handles inactive elements. There are usually 3 policies: zero the inactive elements, forward inactive elements from the first input, or leave them undefined. This third policy might be controversial and could be removed, but it gives more flexibility to WASM engines in the cases where the user actually does not care.
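A scalar sketch of the three policies for, say, a lane-wise add (my own illustration of the text above, not spec pseudo-code):

```c
#include <stddef.h>
#include <stdint.h>

/* Policy 1: inactive lanes are zeroed. */
void add_zero(int32_t *r, const int32_t *a, const int32_t *b,
              const uint8_t *m, size_t n) {
    for (size_t i = 0; i < n; ++i) r[i] = m[i] ? a[i] + b[i] : 0;
}

/* Policy 2: inactive lanes are forwarded from the first input. */
void add_merge(int32_t *r, const int32_t *a, const int32_t *b,
               const uint8_t *m, size_t n) {
    for (size_t i = 0; i < n; ++i) r[i] = m[i] ? a[i] + b[i] : a[i];
}

/* Policy 3: inactive lanes are undefined (modeled here by not writing them). */
void add_undef(int32_t *r, const int32_t *a, const int32_t *b,
               const uint8_t *m, size_t n) {
    for (size_t i = 0; i < n; ++i) if (m[i]) r[i] = a[i] + b[i];
}
```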
There are 10 types: 5 vectors (`vec.v8`, `vec.v16`, `vec.v32`, `vec.v64`, `vec.v128`) and 5 masks (`vec.m8`, `vec.m16`, `vec.m32`, `vec.m64`, `vec.m128`). This is needed because of the different ways target architectures handle masks. Some have type-width-agnostic masks (eg: AVX512), while some have type-width-aware masks (eg: SSE, SVE).

Technically, it might not be necessary to have multiple vector types and only keep multiple mask types, but I think it simplifies the mental model to have a direct correspondence between vectors and masks (you cannot use a `vec.m8` mask with a `vec.v16` vector).

128-bit elements might not be necessary, but they would provide users a simpler way to deal with flexible vectors, as they could have all the SIMD operations on 128-bit blocks they are used to (including compile-time shuffles).
There are many operations missing, eg: gathers, scatters, compact, expand, narrowing arithmetic, widening arithmetic, some mask operations (equivalent to break control flow), first-fault loads.
Currently, mask sizes are not defined and thus no way to load/store masks from/to memory exists. It is always possible to convert a mask into a vector and have this vector in memory, but this looks contrived.
I see 2 orthogonal ways to solve this issue: let mask types have sizes and have a way to load/store them in their natural, architectural representation, or have a way to load/store them in a compact format (1 bit per element).
The second would make the most sense as an abstraction layer, but could be slow on some architectures (Neon for sure, maybe SVE).
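A scalar sketch of the second option, i.e. the compact 1-bit-per-element format (names are illustrative, and the sketch assumes at most 64 lanes):

```c
#include <stddef.h>
#include <stdint.h>

/* Pack a byte-per-lane mask into a bitmask. */
uint64_t mask_compact(const uint8_t *lanes, size_t n) {
    uint64_t bits = 0;
    for (size_t i = 0; i < n; ++i)
        if (lanes[i]) bits |= (uint64_t)1 << i;
    return bits;
}

/* Unpack a bitmask back into a byte-per-lane mask. */
void mask_expand(uint8_t *lanes, uint64_t bits, size_t n) {
    for (size_t i = 0; i < n; ++i)
        lanes[i] = (uint8_t)((bits >> i) & 1);
}
```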
More details in the README.
(edit)
The points I wish to see discussed: