- 1. Note
- 2. Implementation-defined constant parameters
- 3. Vector Register File
- 4. Vector CSRs
- 5. Vector type register, vtype
- 6. Vector Length register vl
- 7. Maximum vector Length register vlmax
- 8. vsetvli/vsetvl instructions
- 9. Vector element mapping to vector register state
- 10. Vector instruction formats
- 11. Vector masking
- 12. Vector Loads and Stores
- 13. Vector Arithmetic Instructions
- 13.1. Vector-Vector and Vector-Scalar Arithmetic Instructions
- 13.2. Vector-Immediate Arithmetic Instructions
- 13.3. Widening Vector Arithmetic Instructions
- 13.4. Mask encoding
- 13.5. Vector Arithmetic Operand encoding
- 13.6. Legal Vector Arithmetic Instructions
- 13.7. Vector Comparison Instructions
- 13.8. Vector Merge Instruction
- 13.9. Vector multiply/divide
- 14. Vector Narrowing instructions
- 15. Vector fused Multiply-Adds
- 16. Vector Mask Operations
- 17. Vector Permutation Instructions
- 18. Examples
- 19. Increasing VLMAX through register grouping and the vlmul field
- 20. Expanded SEW encoding
This is a draft of a stable proposal for the vector specification to be discussed at the RISC-V Summit. This version is intended to be stable enough to begin developing detailed encoding, toolchains, functional simulators, and initial implementations.
These parameters are fixed for a given machine. The ISA supports writing code, under certain constraints, that will be portable across machines with different values for these parameters.
Note: Code can be written that will expose implementation parameters.
The maximum size of a single vector element in bits, \(ELEN\). Must be a power of 2.
The number of bits in a vector register, \(VLEN \geq ELEN\). Must be a power of 2.
Note: Platform profiles may set further constraints on these parameters, for example, requiring that \(ELEN \geq max(XLEN,FLEN)\) or requiring a minimum \(VLEN\) value.

Note: Vector contexts cannot be migrated across vector units with different VLEN and ELEN settings.
There are 32 architectural vector registers, v0-v31.

Each vector register has a fixed \(VLEN\) bits of state.

If the system has floating-point registers, the floating-point register fx is contained in the low \(FLEN\) bits of vector register vx.
Example, FLEN=32, VLEN=64

bytes   7   6   5   4   3   2   1   0
        |              v0             |
                        |     f0      |

Note: To increase readability, vector register layouts are drawn with bytes ordered from right to left with increasing byte address. Bits within an element are numbered in a little-endian format, with increasing bit index from right to left corresponding to increasing magnitude.
Note: Zfinx ("F in X") is a new ISA option under consideration where floating-point instructions take their arguments from the integer register file. The vector extension is incompatible with this option. Overlaying vectors on the integer registers for Zfinx would need different code to avoid integer registers with special meanings in the ABI, e.g., x0 and x1.
The XLEN-wide vector type CSR, vtype, provides the default type of the values contained in the vector register file, and is used to provide a polymorphic interpretation for vector instructions. The vector type also determines the number of elements that are held in each vector register.

The vtype register has two fields, vrep and vsew[2:0].
vtype layout

Bits       Field
XLEN-1:4   Reserved (write 0)
3:1        vsew[2:0]
0          vrep
Note: Further standard and custom extensions to the vector base will extend these fields to support a greater variety of data types.
The vrep field specifies how the bit patterns stored in each element are to be interpreted by default. Instructions may explicitly override the default representation.

vrep representation field encoding

vrep   Representation
0      Signed two's-complement integer
1      IEEE-754/2008 floating-point
The value in vsew sets the dynamic standard element width (SEW). By default, a vector register is viewed as being divided into \(VLMAX = \frac{VLEN}{SEW}\) standard elements (always an integer power of 2). The VLMAX derived from SEW is used to control the number of iterations of standard stripmining loops.
vsew[2:0] (standard element width) encoding
vsew SEW
--- ----
000 8
001 16
010 32
011 64
100 128
101 256
110 512
111 1024
Note: For example, a machine with \(VLEN=128\) has the following \(VLMAX\) values for the following \(SEW\) values: (\(SEW=32b, VLMAX=4\)); (\(SEW=16b, VLMAX=8\)); (\(SEW=8b, VLMAX=16\)).
The vector extension does not modify the behavior of standard scalar floating-point instructions. Standard scalar floating-point instructions operate on the lower FLEN bits of each vector register, and perform NaN-boxing on floating-point results that are narrower than FLEN.
Note: The standard scalar floating-point loads and stores move uninterpreted bit patterns between memory and registers and can be used to load and store the lower bits of a vector register, using a wider immediate offset than the vector extension scalar load and store instructions. Implementations using floating-point recoding techniques might experience a performance penalty when using scalar floating-point loads and stores to move values used as non-floating-point values.
The \(XLEN\)-bit-wide read-only vl CSR can only be updated by the vsetvli and vsetvl instructions.

The vl register holds an unsigned integer specifying the number of elements to be updated by a vector instruction. Elements in the destination vector with indices \(\geq vl\) are not updated during execution of a vector instruction. As a degenerate case, when vl=0, no elements are updated in the destination vector.

The XLEN-wide vlmax CSR is a read-only register whose value is derived from other state in the system. The vlmax register holds an unsigned integer representing the largest number of elements that can be completed by a single vector instruction with the current vtype setting. The value in vlmax is \(\frac{VLEN}{SEW}\).
vsetvli rd, rs1, vtypei # rd = new vl, rs1 = AVL, vtypei = new vtype setting
                        # if rs1 = x0, then use maximum vector length

vsetvl  rd, rs1, rs2    # rd = new vl, rs1 = AVL, rs2 = new vtype value
                        # if rs1 = x0, then use maximum vector length
The vsetvli instruction sets the vtype, vl, and vlmax CSRs based on its arguments, and writes the new value of vl into rd.

The new vtype setting is encoded in the immediate field vtypei for vsetvli and in the rs2 register for vsetvl.
Suggested assembler names used for vtypei setting
vint8 # 8b signed integers
vint16 # 16b signed integers
vint32 # 32b signed integers
vint64 # 64b signed integers
vint128 # 128b signed integers
vfp16 # 16b IEEE FP
vfp32 # 32b IEEE FP
vfp64 # 64b IEEE FP
vfp128 # 128b IEEE FP
Note: The immediate argument vtypei can be a compressed form of the full vtype setting, capturing the most common use cases. For the base proposed here, it is assumed that at least four bits of immediate are available to write all standard values of vtype (vsew[2:0] and vrep).
The vtype setting must be supported by the implementation, and the vsetvl{i} instructions will raise an illegal instruction exception if the setting is not supported.

Note: Specifying that vtype is WARL is problematic, as that would hide errors. The current spec is problematic in that it requires a trap based on a data value in a CSR write. It would simplify pipelines if vtype value errors were flagged at use rather than at write, but we somehow need to catch errant code without requiring full XLEN bits in vtype when only a few bits are actually used. One alternative is to allow substitution of a fixed illegal value in vtype, e.g., all 1s, if an attempt is made to write an unsupported value. This would then cause a trap on use.
The requested application vector length (AVL) is passed in rs1 as an unsigned integer.

The vlmax register is set to \(VLMAX\) based on the new \(SEW\) in the vtype setting.
The resulting vl setting must satisfy the following constraints:

- vl = AVL if AVL <= VLMAX
- vl >= ceil(AVL / 2) if AVL < (2 * VLMAX)
- vl = VLMAX if AVL >= (2 * VLMAX)
- Deterministic on any given implementation for the same input AVL and vtype values
- These specific properties follow from the prior rules:
  - vl = 0 if AVL = 0
  - vl > 0 if AVL > 0
  - vl <= VLMAX
  - vl <= AVL
Note: For example, this permits an implementation to set vl to ceil(AVL/2) when VLMAX < AVL < 2*VLMAX, so that work is spread more evenly across the final two iterations of a stripmined loop.
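A minimal stripmining skeleton, as a sketch using the draft mnemonics defined in this document; the register assignments (a0 for the remaining AVL, a1/a2 for source and destination pointers to 32-bit elements) are illustrative:

stripmine:
vsetvli t0, a0, vint32 # t0 = vl for this iteration
vlw.v v0, (a1)         # Load vl 32-bit elements
# ... vector work on v0 ...
vsw.v v0, (a2)         # Store vl results
slli t1, t0, 2         # Bytes consumed this iteration
add a1, a1, t1         # Bump source pointer
add a2, a2, t1         # Bump destination pointer
sub a0, a0, t0         # Decrement remaining AVL
bnez a0, stripmine     # Repeat until all elements are processed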
The vsetvl variant operates similarly to vsetvli, except that it takes a vtype value from rs2; it can be used for context restore and when the vtypei field is too small to hold the desired setting.
Note: Several active complex types can be held in different x registers and swapped in as needed using vsetvl.
To represent a variety of different width datatypes in the same fixed-width vector registers, the mapping used between vector elements and bytes in a vector register depends on the runtime SEW setting.
Note: Previous RISC-V vector proposals hid this mapping from software, whereas this proposal has a specific mapping for all configurations, which reduces implementation flexibility but removes the need for zeroing on configuration changes. Making the mapping explicit also has the advantage of simplifying oblivious context save-restore code, as the code can save the configuration in vl, vlmax, and vtype, then reset vtype to a convenient value (e.g., vectors of ELEN) before saving all vector register bits without needing to parse the configuration. The reverse process will restore the state.
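As a sketch of the oblivious save sequence described in the note, assuming RV32 scalar stores, that vl and vtype can be read with csrr (as done in the strlen example later in this document), and that a0 points to the save area:

csrr t0, vl           # Capture current vl
csrr t1, vtype        # Capture current vtype
sw t0, 0(a0)          # Save vl and vtype to the context area (RV32 assumed)
sw t1, 4(a0)
addi a0, a0, 8
vsetvli t2, x0, vint8 # Byte elements at maximum length: t2 = VLEN/8
vsb.v v0, (a0)        # Dump v0 as raw bytes
add a0, a0, t2        # Advance by one register's worth of bytes
vsb.v v1, (a0)        # ... and so on through v31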
The following diagrams illustrate how different width elements are packed into the bytes of a vector register depending on current SEW setting.
The element index is shown placed at the least-significant byte of the stored element.
ELEN=32b

Byte      3   2   1   0
SEW=8b    3   2   1   0
SEW=16b       1       0
SEW=32b               0

ELEN=64b

Byte      7   6   5   4   3   2   1   0
SEW=8b    7   6   5   4   3   2   1   0
SEW=16b       3       2       1       0
SEW=32b               1               0
SEW=64b                               0

ELEN=128b

Byte      F   E   D   C   B   A   9   8   7   6   5   4   3   2   1   0
SEW=8b    F   E   D   C   B   A   9   8   7   6   5   4   3   2   1   0
SEW=16b       7       6       5       4       3       2       1       0
SEW=32b               3               2               1               0
SEW=64b                               1                               0
SEW=128b                                                              0
When \( VLEN > ELEN\), the element numbering continues into the following \(ELEN\)-wide units.
ELEN unit   |       3       |       2       |       1       |       0       |
Byte        | 3   2   1   0 | 3   2   1   0 | 3   2   1   0 | 3   2   1   0 |
SEW=8b      | F   E   D   C | B   A   9   8 | 7   6   5   4 | 3   2   1   0 |
SEW=16b     |     7       6 |     5       4 |     3       2 |     1       0 |
SEW=32b     |             3 |             2 |             1 |             0 |
Some vector instructions have operands that are wider than the current SEW setting. In this case, a group of vector registers is used to provide storage for the wider operands, as shown below.

When an instruction has an operand twice as wide as SEW, e.g., a vector load of 32-bit words when SEW=16b, an even-odd pair of vector registers is used to hold the double-width values as shown below:
Example 1: ELEN=32

ELEN unit   |       3       |       2       |       1       |       0       |
Byte        | 3   2   1   0 | 3   2   1   0 | 3   2   1   0 | 3   2   1   0 |
SEW=16b     |     7       6 |     5       4 |     3       2 |     1       0 |  <= 16-bit elements

v2*n        |             6 |             4 |             2 |             0 |  32-bit elements
v2*n+1      |             7 |             5 |             3 |             1 |
The even-numbered vector register holds the even-numbered elements of the double-width vector, while the odd-numbered vector register holds the odd-numbered elements of the double-width vector.
Note: The pattern of storing elements in the pair of vector registers is designed to simplify datapath alignment for mixed-width operations.
For quad-width operands that are \(4\times SEW\), a group of four aligned vector registers is used to hold the results:
ELEN unit   |       3       |       2       |       1       |       0       |
Byte        | 3   2   1   0 | 3   2   1   0 | 3   2   1   0 | 3   2   1   0 |
SEW=8b      | F   E   D   C | B   A   9   8 | 7   6   5   4 | 3   2   1   0 |  8b elements

v4*n        |             C |             8 |             4 |             0 |  32b elements
v4*n+1      |             D |             9 |             5 |             1 |
v4*n+2      |             E |             A |             6 |             2 |
v4*n+3      |             F |             B |             7 |             3 |
Note: A similar pattern is followed for octa-width operands (\(8\times SEW\)), though it is not clear that this is necessary in the mandatory base.
Additional vsetvli variants are provided to modify SEW to handle double-width elements in a loop.
setvl2ci rs1, vtypei # sets vtypei, then sets vl according to AVL=ceil(rs1/2)
setvl2fi rs1, vtypei # sets vtypei, then sets vl according to AVL=floor(rs1/2)
Example: Load 16-bit values, widening multiply to 32b, shift the 32b result right by 3, store 32b values.
loop:
vsetvli t0, a0, vint16 # vtype = 16-bit integer vectors
vlh.v v2, (a1)         # Get 16b vector
slli t0, t0, 1
add a1, a1, t0         # Bump pointer
vmulw.vs v0, v2, v3    # 32b results in <v0,v1> pair
setvl2ci a0, vint32    # Ceil half length in 32b (can fuse with following)
vsrl.vi v0, v0, 3      # Elements 0, 2, 4,...
setvl2fi a0, vint32    # Floor half length in 32b (can fuse with following)
vsrl.vi v1, v1, 3      # Elements 1, 3, 5,...
vsetvli t0, a0, vint16 # Back to 16b
vsw.v v0, (a2)         # Store vector of 32b values from the <v0,v1> pair
sub a0, a0, t0         # Decrement count
slli t0, t0, 2
add a2, a2, t0         # Bump pointer
bnez a0, loop          # Any more?
Alternative loop only using wider elements:
loop:
vsetvli t0, a0, vint32 # Use only 32-bit elements
vlh.v v0, (a1)         # Sign-extend 16b loaded values to 32b elements
slli t1, t0, 1
add a1, a1, t1         # Bump pointer
vmul.vs v0, v0, v3     # 32b multiply result
vsrl.vi v0, v0, 3      # Shift elements
vsw.v v0, (a2)         # Store vector of 32b results
slli t1, t0, 2
add a2, a2, t1         # Bump pointer
sub a0, a0, t0
bnez a0, loop          # Any more?
The first loop is more complex but may have greater performance on
machines where 16b widening multiplies are faster than 32b integer
multiplies. Also, the 16b vector load may run faster due to the
larger number of elements per iteration.
This technique allows multiple operations to be performed natively on each half of the wider vector. Conversion operations allow values to be copied into the double-width format, or back into the single-width format.
Other forms for quad (and octal) widths:
setvl4ci #set correct length for vector v4*n
setvl4di #set correct length for vector v4*n+1
setvl4ei #set correct length for vector v4*n+2
setvl4fi #set correct length for vector v4*n+3
Vector loads and stores move bit patterns between vector register elements and memory.
Vector arithmetic instructions operate on values held in vector register elements.
Vector instructions can have scalar or vector source operands and produce scalar or vector results. Scalar operands and results are located in element 0 of a vector register.
Masking is supported on almost all vector instructions producing vectors, with the mask supplied by vector register v0. The least-significant bit (LSB) of each \(SEW\)-wide element in v0 is used as the mask, in either true or complement form. Element operations that are masked off do not modify the destination vector register element and never generate exceptions. Instructions producing scalars are not maskable.

Masking is encoded in a two-bit m[1:0] field (inst[26:25]) for all vector instructions.

m[1:0]   Meaning
00       vector, where v0[0] = 0
01       vector, where v0[0] = 1
10       scalar operation
11       vector, always true
Scalar operations are written in assembler with a .s after the destination vector register specifier.

Vector masking is written as another vector operand, with .t or .f indicating whether the operation occurs when v0[0] is 1 or 0 respectively. If no masking operand is specified, unmasked vector execution (m=11) is assumed.

vop v1, v2, v3, vm implies the following combinations:

vop v1, v2, v3, v0.f # enabled where v0[0]=0, m=00
vop v1, v2, v3, v0.t # enabled where v0[0]=1, m=01
vop v1.s, v2, v3     # scalar operation, m=10
vop v1, v2, v3       # unmasked vector operation, m=11
Vector loads and stores are encoded within the scalar floating-point load and store major opcodes (LOAD-FP/STORE-FP).
The standard FDQ floating-point extensions' loads and stores retain their original meaning.
The standard floating-point loads (FLH, FLW, FLD, FLQ), read a single value from memory and update the low \(FLEN\) bits of the destination vector register. Floating-point types narrower than \(FLEN\) are NaN-boxed, setting upper bits to 1. If \(VLEN > FLEN\), the upper bits of the vector register are unchanged by the floating-point load.
The standard floating-point stores (FSH, FSW, FSD, FSQ) read the appropriate number of bits from the least-significant bits of the vector register and write these to memory.
The vector loads and stores are encoded using the width values that are not claimed by the standard scalar floating-point loads and stores.
                     Width   xv   Mem       Reg       opcode    uoffset5 scale
                     [2:0]        bits      bits                (set by width[1:0])
Standard scalar FP   001     x    16        FLEN      FLH/FSH   N/A
Standard scalar FP   010     x    32        FLEN      FLW/FSW   N/A
Standard scalar FP   011     x    64        FLEN      FLD/FSD   N/A
Standard scalar FP   100     x    128       FLEN      FLQ/FSQ   N/A
Vector byte          000     0    vl*8      vl*SEW    VxB       1
Vector halfword      101     0    vl*16     vl*SEW    VxH       2
Vector word          110     0    vl*32     vl*SEW    VxW       4
Vector doubleword    111     0    vl*64     vl*SEW    VxD       8
Vector single-width  000     1    vl*SEW    vl*SEW    VxE       1
Vector double-width  101     1    vl*2*SEW  vl*2*SEW  VxE2      2
Vector quad-width    110     1    vl*4*SEW  vl*4*SEW  VxE4      4
Vector octa-width    111     1    vl*8*SEW  vl*8*SEW  VxE8      8
- The one-bit xv field encodes fixed or variable element width, and is located in the imm12 field.
- Mem bits is the size of the element moved in memory.
- Reg bits is the size of the element accessed in the register.
- uoffset5 scale is the amount by which the five-bit unsigned immediate is multiplied to obtain a byte offset.
The vector load and store encodings repurpose a portion of the standard load/store 12-bit immediate field to provide further vector instruction encoding, with bits[26:25] holding the mask information.
Bits [31:27] hold a 5-bit unsigned offset that is added to the base register during vector addressing. The offset is scaled according to the low two bits of the width[2:0] field (effective offset = uoffset[4:0] * 2^width[1:0]), such that for fixed-width elements the offset is scaled by the element size. For dynamic-width elements, the offset is not affected by the vtype setting, to avoid a dependency between address generation and the dynamic vtype value.
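For example, with the fixed-width word encoding (width=110, scale 4), a five-bit offset of 3 contributes 12 bytes to the effective address; a hypothetical snippet:

vlw.v v0, 3(a1) # Effective address = a1 + 3*4 = a1 + 12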
Use of the 12b immediate field in vector load/store instruction encoding

Load immediate bits    31 30 29 28 27 | 26 25 | 24 23 22 21 20
Store immediate bits   31 30 29 28 27 | 26 25 | 11 10  9  8  7
Field                  uoffset5       | m1 m0 | funct5

funct5 encodes:

name   bits    encoding
xv     [4]     0 = fixed element size
               1 = variable element size
order  [3]     stores: 0 = sequential, 1 = unordered
               loads:  0 = unsigned,   1 = signed
mop    [2:0]   000  unit-stride
               001  unit-stride speculative loads (fault first)
               010  constant-stride
               011  indexed
               100  reserved
               101  reserved
               110  reserved
               111  reserved (AMO?)
Vector unit-stride, constant-stride, and indexed (scatter/gather) load/store instructions are supported.
Note: Vector AMO instructions are TBD.

Vector load/store base registers and strides are taken from the GPR x registers.
Vector load/store instructions move bit patterns between vector register elements and memory.
An illegal instruction exception is raised if the register element is narrower than the memory operand.
Note: Debate whether it is useful to allow, e.g., 64-bit loads into 32-bit register elements that retain only the LSBs, to accelerate stride-2 loads. This comes at the cost of additional control/verification complexity.

When vrep is set to integer, vector load instructions can optionally sign- or zero-extend narrower memory values into wider vector register element destinations.

When vrep is set to floating-point, loads will NaN-box narrower memory values into a wider register element, regardless of whether a signed or unsigned opcode is used.
When the m[1:0] field is set to scalar, the vector load/store instructions move a single value between element 0 of the vector register and memory.
The unit-stride fault-first load instructions are used to vectorize loops with data-dependent exit conditions (while loops). These instructions execute as a regular load except that they will only take a trap on element 0. If an element with index > 0 raises an exception, the result of that element and all following elements up to the active vector length are written with 0 results, and the vector length vl is reduced to the number of elements processed without a trap.
strlen example using unit-stride fault-first instruction
# size_t strlen(const char *str)
# a0 holds *str
mv a3, a0 # Save start
strlen:
vsetvli a1, x0, vint8 # Vector of bytes
vlbff.v v1, (a3) # Get bytes
csrr a1, vl # Get bytes read
add a3, a3, a1 # Bump pointer
vseq.vi v0, v1, 0 # Set v0[i] where v1[i] = 0
vmfirst a2, v0 # Find first set bit
bltz a2, strlen # Not found?
add a0, a0, a1 # Sum start + bump
add a3, a3, a2 # Add index
sub a0, a3, a0 # Subtract start address+bump
ret
Note: Strided and scatter-gather fault-first instructions are not provided as they represent a large security hole, allowing software to check multiple random pages for accessibility without experiencing a trap. The unit-stride versions only allow probing a region immediately contiguous to a known region.
# vd destination, rs1 base address, rs2=x0, vm is mask encoding
# fixed-size element
vlb.v vd, offset(rs1), vm # 8b
vlh.v vd, offset(rs1), vm # 16b
vlw.v vd, offset(rs1), vm # 32b
vld.v vd, offset(rs1), vm # 64b
vle.v vd, offset(rs1), vm # SEW
vle2.v vd, offset(rs1), vm # 2*SEW
vle4.v vd, offset(rs1), vm # 4*SEW
vle8.v vd, offset(rs1), vm # 8*SEW
# first fault versions
vlbff.v vd, offset(rs1), vm # 8b
vlhff.v vd, offset(rs1), vm # 16b
vlwff.v vd, offset(rs1), vm # 32b
vldff.v vd, offset(rs1), vm # 64b
vleff.v vd, offset(rs1), vm # SEW
vle2ff.v vd, offset(rs1), vm # 2*SEW
vle4ff.v vd, offset(rs1), vm # 4*SEW
vle8ff.v vd, offset(rs1), vm # 8*SEW
# Scalar versions
vlb.s vd, offset(rs1) # 8b scalar load into element 0
...
Note: Could encode unit-stride as constant-stride with rs2=x0, but this would add to decode complexity.
# vd destination, rs1 base address, rs2 byte stride
vlsb.v vd, offset(rs1), rs2, vm # 8b
vlsh.v vd, offset(rs1), rs2, vm # 16b
vlsw.v vd, offset(rs1), rs2, vm # 32b
vlsd.v vd, offset(rs1), rs2, vm # 64b
vlse.v vd, offset(rs1), rs2, vm # SEW
vlse2.v vd, offset(rs1), rs2, vm # 2*SEW
vlse4.v vd, offset(rs1), rs2, vm # 4*SEW
vlse8.v vd, offset(rs1), rs2, vm # 8*SEW
vlse8.s vd, offset(rs1), rs2, vm # 8*SEW scalar load
The stride is interpreted as an integer representing a byte offset.
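For example, a sketch that gathers every other 32-bit value (such as the real parts of interleaved real/imaginary pairs); the register choices are illustrative:

li t1, 8             # Byte stride: the 32-bit values of interest are 8 bytes apart
vlsw.v v0, 0(a1), t1 # v0[i] = 32-bit value at address a1 + i*8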
# vd destination, rs1 base address, vs2 indices
vlxb.v vd, offset(rs1), vs2, vm # 8b
vlxh.v vd, offset(rs1), vs2, vm # 16b
vlxw.v vd, offset(rs1), vs2, vm # 32b
vlxd.v vd, offset(rs1), vs2, vm # 64b
vlxe.v vd, offset(rs1), vs2, vm # SEW
vlxe2.v vd, offset(rs1), vs2, vm # 2*SEW
vlxe4.v vd, offset(rs1), vs2, vm # 4*SEW
vlxe8.v vd, offset(rs1), vs2, vm # 8*SEW
Scatter/gather indices are treated as signed integers representing byte offsets. If \(SEW < XLEN\), then indices are sign-extended to \(XLEN\) before adding to the base. If \(SEW > XLEN\), the indices are taken from the least-significant \(XLEN\) bits.
Note: \(SEW\) has to be wide enough to hold the indices, which could mandate a larger \(SEW\) than desired. Ideally we want to support index vectors wider than \(SEW\), by adding new vector indexed loads and stores with double-width or greater vector indices.
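A minimal gather sketch using only the mnemonics listed above; the index vector is assumed to already hold byte offsets:

vlw.v v2, (a2)       # Load 32-bit byte offsets
vlxw.v v0, 0(a1), v2 # v0[i] = 32-bit value at address a1 + v2[i]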
Vector stores move data values as bits taken from the LSBs of the source element. If the store datatype is wider than \(SEW\), then multiple vector registers are used to supply the data as described above.
vsb.v vs3, offset(rs1), vm # 8b
vsh.v vs3, offset(rs1), vm # 16b
vsw.v vs3, offset(rs1), vm # 32b
vsd.v vs3, offset(rs1), vm # 64b
vse.v vs3, offset(rs1), vm # SEW
vse2.v vs3, offset(rs1), vm # 2*SEW
vse4.v vs3, offset(rs1), vm # 4*SEW
vse8.v vs3, offset(rs1), vm # 8*SEW
vsb.s vs3, offset(rs1) # Scalar 8b store from element 0
...
vssb.v vs3, offset(rs1), rs2, vm # 8b
vssh.v vs3, offset(rs1), rs2, vm # 16b
vssw.v vs3, offset(rs1), rs2, vm # 32b
vssd.v vs3, offset(rs1), rs2, vm # 64b
vsse.v vs3, offset(rs1), rs2, vm # SEW
vsse2.v vs3, offset(rs1), rs2, vm # 2*SEW
vsse4.v vs3, offset(rs1), rs2, vm # 4*SEW
vsse8.v vs3, offset(rs1), rs2, vm # 8*SEW
vsxb.v vs3, offset(rs1), vs2, vm # 8b
vsxh.v vs3, offset(rs1), vs2, vm # 16b
vsxw.v vs3, offset(rs1), vs2, vm # 32b
vsxd.v vs3, offset(rs1), vs2, vm # 64b
vsxe.v vs3, offset(rs1), vs2, vm # SEW
vsxe2.v vs3, offset(rs1), vs2, vm # 2*SEW
vsxe4.v vs3, offset(rs1), vs2, vm # 4*SEW
vsxe8.v vs3, offset(rs1), vs2, vm # 8*SEW
vsuxb.v vs3, offset(rs1), vs2, vm # 8b
vsuxh.v vs3, offset(rs1), vs2, vm # 16b
vsuxw.v vs3, offset(rs1), vs2, vm # 32b
vsuxd.v vs3, offset(rs1), vs2, vm # 64b
vsuxe.v vs3, offset(rs1), vs2, vm # SEW
vsuxe2.v vs3, offset(rs1), vs2, vm # 2*SEW
vsuxe4.v vs3, offset(rs1), vs2, vm # 4*SEW
vsuxe8.v vs3, offset(rs1), vs2, vm # 8*SEW
Note: Dropped reverse-ordered scatter for now; vrgather can be used to reverse index order.

Note: There is redundancy between all the scalar variants of unit-stride, constant-stride, and scatter-gather vector load/store instructions.
Vector memory instructions appear to execute in program order on the local hart. Vector memory instructions follow RVWMO at the instruction level, and element operations are ordered within the instruction as if performed by an element-ordered sequence of syntactically independent scalar instructions. Vector indexed-ordered stores write elements to memory in element order.
The vector arithmetic instructions use a new major opcode (OP-V = binary 1010111), which neighbors OP-FP, but generally follow the encoding pattern of the scalar floating-point instructions under the OP-FP opcode.

Most vector arithmetic instructions have both vector-vector (.vv) forms, where both operands are vectors of elements, and vector-scalar (.vs) forms, where the second operand is a scalar taken from element 0 of the second source vector register. A few non-commutative operations (such as reverse subtract) are encoded with special opcodes.

Many vector arithmetic instructions have vector-immediate forms (.vi) where the second scalar argument is a 5-bit immediate encoded in the rs2 space. The immediate is sign-extended to the standard element width and interpreted according to the vtype setting.
vadd.vi vd, vrs1, 3
A few vector arithmetic instructions are defined to be widening operations, where the destination elements are \(2\times SEW\) wide and are stored in an even-odd vector register pair. The first operand can be either single- or double-width. These are generally written with a w suffix on the opcode.
All vector arithmetic instructions can be masked according to the m[1:0] field.
Mask encoding: m[1:0] is held in inst[26:25].

m[1:0]   Meaning
00       vector, where v0[0] = 0
01       vector, where v0[0] = 1
10       scalar
11       always true
The rm[2:0] field is held in inst[14:12].

Encoding of the operand pattern rm field for regular vector arithmetic instructions:

rm2 rm1 rm0
0   0   0   Vector-vector   SEW = SEW op SEW
0   0   1   Vector-vector
0   1   0   Vector-vector   2*SEW = SEW op SEW
0   1   1   Vector-vector   2*SEW = 2*SEW op SEW
1   0   0   Vector-scalar   SEW = SEW op s_SEW
1   0   1   Vector-imm      SEW = SEW op simm[4:0]
1   1   0   Vector-scalar   2*SEW = SEW op s_SEW
1   1   1   Vector-scalar   2*SEW = 2*SEW op s_SEW

Bit rm[2] selects between a vector second source and a scalar second source.

Bit rm[1] selects whether the destination is twice the width of \(SEW\).

Bit rm[0] selects whether the first operand is one or two times \(SEW\), or whether the second operand is a 5-bit sign-extended immediate held in the rs2 field.

The 5-bit immediate field is always treated as a signed integer and sign-extended to \(SEW\) bits, regardless of the vtype setting.

Note: For the floating-point representation, the 5-bit immediate can be used to supply 0.0.
Assembly syntax pattern for vector arithmetic instructions:

vop.vv  vd, vs1, vs2, vm # vector-vector operation
vop.vs  vd, vs1, rs2, vm # vector-scalar operation
vop.vi  vd, vs1, imm, vm # vector-immediate operation
vopw.vv vd, vs1, vs2, vm # 2*SEW = SEW op SEW
vopw.vs vd, vs1, rs2, vm # 2*SEW = SEW op SEW
vopw.wv vd, vs1, vs2, vm # 2*SEW = 2*SEW op SEW
vopw.ws vd, vs1, rs2, vm # 2*SEW = 2*SEW op SEW
The following vector arithmetic instructions are provided
.vv .vs .vi w.vv w.vs w.wv w.ws
VADD x x x x x x x
VSUB x x x x x x x
VAND x x x
VOR x x x
VXOR x x x
VSLL x x x
VSRL x x x
VSRA x x x
VSEQ x x x
VSNE x x x
VSLT x x x
VSLTU x x x
VSLE x x x
VSLEU x x x
VMUL x x x x x x x
VMULU x x x x x x x
VMULSU x x x x x x x
VMULH x x x
VDIV x x x
VDIVU x x x
VREM x x x
VREMU x x x
VSQRT x x x
VFSGNJ x x x
VFSGNJN x x x
VFSGNJX x x x
VMIN x x x
VMAX x x x
VFCLASS x x x
FMV*
FCVT*
The following compare instructions write 1 to the destination register element if the comparison evaluates to true, and write 0 otherwise.

Note: VSNE is not needed with complementing masks, but sometimes predicate results feed into things other than predicate inputs, so VSNE can save an instruction.

Note: Need to revisit vector floating-point unordered compare instructions.
vseq.vv vd, vs1, vs2, vm
vseq.vs vd, vs1, rs2, vm
vseq.vi vd, vs1, imm, vm
vsne.vv vd, vs1, vs2, vm
vsne.vs vd, vs1, rs2, vm
vsne.vi vd, vs1, imm, vm
...
These conditionals effectively AND in the mask when producing 0/1 in the output, e.g.,

# (a < b) && (b < c) in two instructions
vslt.vv v0, va, vb       # v0[i] = (va[i] < vb[i])
vslt.vv v0, vb, vc, v0.t # Only update elements where v0[i] was already set
The combination of VSLT and VSLE can cover all cases, including compares with scalars, by complementing the results:

v == s ,  !(v == s) = (v != s)
v <  s ,  !(v <  s) = (v >= s)
v <= s ,  !(v <= s) = (v >  s)
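For example, a greater-than-or-equal test against a scalar can be formed from VSLT followed by use of the complement mask form; the masked add that follows is purely illustrative:

vslt.vs v0, v1, v2      # v0[i] LSB = (v1[i] < v2[0])
vadd.vi v3, v1, 1, v0.f # Executes only where !(v1[i] < v2[0]), i.e., v1[i] >= v2[0]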
The vector merge instruction combines two vectors based on the mask field.
vmerge.vv vd, vs1, vs2, vm # vd[i] = vm[i] ? vs2[i] : vs1[i]
vmerge.vs vd, vs1, vs2, vm # vd[i] = vm[i] ? vs2[0] : vs1[i]
vmerge.vi vd, vs1, imm, vm # vd[i] = vm[i] ? imm : vs1[i]
The second operand is written where the mask is true.
Note: The vmerge.vi instruction can be used to initialize a vector register with an immediate value, and the vmerge.vs instruction can be used to splat a scalar value into all elements of a vector.
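For example, using the unmasked (m=11) forms:

vmerge.vi v1, v1, 0  # Initialize: v1[i] = 0 for all active elements
vmerge.vs v2, v2, v3 # Splat: v2[i] = v3[0] for all active elements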
These are all equivalent to the scalar integer multiply/divide instructions, and operate on SEW-wide source and destination elements.
vmul.vv vd, vs1, vs2, vm
vmulh.vv vd, vs1, vs2, vm
vmulhsu.vv vd, vs1, vs2, vm
vmulhu.vv vd, vs1, vs2, vm
vdiv.vv vd, vs1, vs2, vm
vdivu.vv vd, vs1, vs2, vm
vrem.vv vd, vs1, vs2, vm
vremu.vv vd, vs1, vs2, vm
The .vs and .vi variants are also provided.
A few instructions are provided to convert multi-width vectors into single-width vectors.
VSRN vector shift right narrowing
VSRAN vector shift right arithmetic narrowing
VCLIPN vector clip after shift right narrowing
VCLIPUN vector clip unsigned after shift right narrowing
vd[i] = clip(round(vs1[i] + rnd) >> vs2[i])
For VSRN/VSRAN, clip=nop, rnd = nop.
For VCLIPN, the value is treated as a signed integer and saturates if the result would overflow the destination.
For VCLIPUN, the value is treated as an unsigned integer and saturates if the result would overflow the destination.
For VCLIPN/VCLIPUN, the rounding mode is specified in a new vxrm[1:0] field in the fcsr. Rounding occurs around the LSB of the destination.

vxrm[1:0] holds the fixed-point rounding mode:
00 rup round-up (+0.5 LSB)
01 rne round to nearest-even
10 trn truncate
11 jam jam (OR bits into LSB)
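A small worked example, assuming the rounding increment is applied just below the destination LSB before the shifted result is truncated; narrowing the value 0x1238 with a right shift of 4 gives:

trn: 0x1238 >> 4         = 0x123
rup: (0x1238 + 0x8) >> 4 = 0x124
rne: shifted-out bits are 0x8 (a tie), so round to even = 0x124
jam: (0x1238 >> 4) | 1   = 0x123  (LSB already set)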
The narrowing instructions use a different operand encoding in rm[2:0].
# vs1 = 2*SEW, 4*SEW
rm2 rm1 rm0
0 0 0 Vector-vector SEW = 2*SEW op SEW
0 0 1 Vector-vector
0 1 0 Vector-vector SEW = 4*SEW op SEW
0 1 1 Vector-vector
1 0 0 Vector-scalar SEW = 2*SEW op SEW
1 0 1 Vector-imm SEW = 2*SEW op imm
1 1 0 Vector-scalar SEW = 4*SEW op SEW
1 1 1 Vector-imm SEW = 4*SEW op imm
vclipn.vv vd, vs1, vs2, vm # SEW = 2*SEW >> SEW
vclipn.vs vd, vs1, rs2, vm # SEW = 2*SEW >> SEW
vclipn.vi vd, vs1, imm, vm # SEW = 2*SEW >> imm
vclipn.wv vd, vs1, vs2, vm # SEW = 4*SEW >> SEW
vclipn.ws vd, vs1, rs2, vm # SEW = 4*SEW >> SEW
vclipn.wi vd, vs1, imm, vm # SEW = 4*SEW >> imm
vclipun.vv vd, vs1, vs2, vm # SEW = 2*SEW >> SEW
vclipun.vs vd, vs1, rs2, vm # SEW = 2*SEW >> SEW
vclipun.vi vd, vs1, imm, vm # SEW = 2*SEW >> imm
vclipun.wv vd, vs1, vs2, vm # SEW = 4*SEW >> SEW
vclipun.ws vd, vs1, rs2, vm # SEW = 4*SEW >> SEW
vclipun.wi vd, vs1, imm, vm # SEW = 4*SEW >> imm
vsrln.vv vd, vs1, vs2, vm # SEW = 2*SEW >> SEW
vsrln.vs vd, vs1, rs2, vm # SEW = 2*SEW >> SEW
vsrln.vi vd, vs1, imm, vm # SEW = 2*SEW >> imm
vsrln.wv vd, vs1, vs2, vm # SEW = 4*SEW >> SEW
vsrln.ws vd, vs1, rs2, vm # SEW = 4*SEW >> SEW
vsrln.wi vd, vs1, imm, vm # SEW = 4*SEW >> imm
vsran.vv vd, vs1, vs2, vm # SEW = 2*SEW >> SEW
vsran.vs vd, vs1, rs2, vm # SEW = 2*SEW >> SEW
vsran.vi vd, vs1, imm, vm # SEW = 2*SEW >> imm
vsran.wv vd, vs1, vs2, vm # SEW = 4*SEW >> SEW
vsran.ws vd, vs1, rs2, vm # SEW = 4*SEW >> SEW
vsran.wi vd, vs1, imm, vm # SEW = 4*SEW >> imm
The standard scalar floating-point fused multiply-adds occupy four major opcodes.
There are two unused rounding modes that can be used to encode vector fused multiply-adds, in both vector-vector and vector-scalar forms, where the scalar is one input to the multiply. When a scalar input to the add is needed, this can be provided by splatting the value to a vector.
rm2 rm1 rm0
1 0 1 Vector-vector vd = vs3 + vs1 * vs2
1 1 0 Vector-scalar vd = vs3 + vs1 * rs2
The FNMADD and FNMSUB variants are dropped in favor of widening vector operations, which treat the add input and final result as double-width.
VMADD SEW = SEW + SEW*SEW
VMSUB SEW = SEW + SEW*SEW
VMADDW 2*SEW = 2*SEW + SEW*SEW
VMSUBW 2*SEW = 2*SEW + SEW*SEW
vmadd.vvv vd, vs1, vs2, vs3, vm
vmadd.vvs vd, vs1, rs2, vs3, vm
vmaddw.vvv vd, vs1, vs2, vs3, vm
vmaddw.vvs vd, vs1, rs2, vs3, vm
vmsub.vvv vd, vs1, vs2, vs3, vm
vmsub.vvs vd, vs1, rs2, vs3, vm
vmsubw.vvv vd, vs1, vs2, vs3, vm
vmsubw.vvs vd, vs1, rs2, vs3, vm
Additional fused multiply-add operations can be provided as destructive operations in the regular vector arithmetic encoding space.
These instructions take a vector and a scalar (vs2[0]) as input, and produce a scalar result (vd[0]) that is a reduction over the source scalar and vector. Masked-off elements are ignored in the reduction.
vredsum.v vd, vs1, vs2, vm # SEW = SEW + sum(SEW)
vredsumw.v vd, vs1, vs2, vm # 2*SEW = 2*SEW + sum(SEW)
vredmax.v vd, vs1, vs2, vm
vredmaxu.v vd, vs1, vs2, vm
vredmin.v vd, vs1, vs2, vm
vredminu.v vd, vs1, vs2, vm
vredand.v vd, vs1, vs2, vm
vredor.v vd, vs1, vs2, vm
vredxor.v vd, vs1, vs2, vm
By default, when the operation is non-associative (e.g., floating-point addition), the reductions are specified to occur as if done in sequential element order, but a user fcsr mode bit can specify that unordered reductions are allowed. In this case, the reduction result must match some ordering of the individual sequential operations.
A widening form of the sum reduction is provided that writes a double-width reduction result.
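A stripmined sum-reduction sketch using the draft mnemonics above; the accumulator is kept in v4[0] and the register assignments are illustrative:

# a0 = element count, a1 = pointer to 32-bit values; result is left in v4[0]
vsetvli t0, x0, vint32 # Maximum length, to initialize the accumulator
vmerge.vi v4, v4, 0    # Clear v4, including the accumulator element v4[0]
redloop:
vsetvli t0, a0, vint32 # vl = elements this iteration
vlw.v v1, (a1)         # Load the next group of elements
vredsum.v v4, v1, v4   # v4[0] = v4[0] + sum(v1[0..vl-1])
slli t1, t0, 2
add a1, a1, t1         # Bump pointer
sub a0, a0, t0         # Decrement remaining count
bnez a0, redloop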
Several operations are provided to help operate on mask bits held in the LSB of elements of a vector register.
vmpopc rd, vs1, vm
The vmpopc instruction counts the number of elements among the first vl elements of the vector source that have their low bit set, excluding elements where the mask is false, and writes the result to a GPR.
A range of permutation instructions are provided.
The VIOTA instruction reads v0 and writes to each element of the destination the sum of the least-significant bits of all mask elements (selected by m[1:0]) with index less than that element, i.e., a parallel prefix sum of the mask values.

If the value would overflow the destination, the least-significant bits are retained. This instruction is not masked, so it writes all vl elements of the destination vector.
viota.v vd # Unmasked, writes index to each element, vd[i] = i
viota.v vd, v0.t # Writes to each element, sum of preceding true elements.
# Example
7 6 5 4 3 2 1 0 Element number
1 0 0 1 0 0 0 1 v0 contents
7 6 5 4 3 2 1 0 viota.v vd
2 2 2 1 1 1 1 0 viota.v vd, v0.t
5 4 3 3 2 1 0 0 viota.v vd, v0.f
Note: The viota instruction can be combined with scatter/gather instructions to perform vector compress/expand operations.
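As a sketch of the compress idiom mentioned in the note, assuming 32-bit elements and that the selected elements of v1 are packed contiguously starting at the address in a1:

viota.v v2, v0.t            # v2[i] = number of preceding selected elements
vsll.vi v2, v2, 2           # Convert destination slot numbers to byte offsets
vsxw.v v1, 0(a1), v2, v0.t  # Scatter only the selected elements of v1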
The first form of insert/extract operations transfers a single value between a GPR and one element of a vector register. A second scalar GPR operand gives the element index, treated as an unsigned integer. If the index is out of range on a vector extract, zero is returned for the element value. If the index is out of range (i.e., \(\geq VLMAX\)) for a vector insert, the write is ignored.
vmv.x.v rd, vs1, rs2 # rd = vs1[rs2]
vmv.v.x vd, rs1, rs2 # vd[rs2] = rs1
The second form of insert/extract transfers a single value between element 0 of one vector register and one indexed element of a second vector register.
vmv.s.v vd, vs1, rs2 # vd[0] = vs1[rs2]
vmv.v.s vd, vs1, rs2 # vd[rs2] = vs1[0]
The slide instructions move elements up and down a vector.
vslideup.vs vd, vs1, rs2, vm # vd[i+rs2] = vs1[i]
vslideup.vi vd, vs1, imm, vm # vd[i+imm] = vs1[i]
For vslideup, the value in vl specifies the number of source elements that are read. The destination elements below the start index are left undisturbed. Destination elements past vl can be written, but writes past the end of the destination vector are ignored.
vslidedown.vs vd, vs1, rs2, vm # vd[i] = vs1[i+rs2]
vslidedown.vi vd, vs1, imm, vm # vd[i] = vs1[i+imm]
For vslidedown, the value in vl specifies the number of destination elements that are written. Elements in the source vector can be read past vl. If a source vector index is out of range, zero is returned for the element.
This instruction reads elements from a source vector at locations given by a second source element-index vector. The values in the index vector are treated as unsigned integers. The number of elements to write to the destination register is given by vl. The source vector can be read at any index \(< VLMAX\).
vrgather.vv vd, vs1, vs2, vm # vd[i] = vs1[vs2[i]]
If the element indices are out of range ( \( vs2[i] \geq VLMAX\) ) then zero is returned for the element value.
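A minimal table-lookup sketch using only the mnemonics defined above; v2 holds unsigned element indices previously loaded from memory:

vlw.v v2, (a2)         # Load the element indices
vrgather.vv v3, v1, v2 # v3[i] = v1[v2[i]]; out-of-range indices return 0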
# vector-vector add routine of 32-bit integers
# void vvaddint32(size_t n, const int*x, const int*y, int*z)
# { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }
#
# a0 = n, a1 = x, a2 = y, a3 = z
# Non-vector instructions are indented
vvaddint32:
vsetvli t0, a0, vint32 # Set vector length based on 32-bit vectors
vlw.v v0, (a1)         # Get first vector
  sub a0, a0, t0       # Decrement number done
  slli t0, t0, 2       # Multiply number done by 4 bytes
  add a1, a1, t0       # Bump pointer
vlw.v v1, (a2)         # Get second vector
  add a2, a2, t0       # Bump pointer
vadd.vv v2, v0, v1     # Sum vectors
vsw.v v2, (a3)         # Store result
  add a3, a3, t0       # Bump pointer
  bnez a0, vvaddint32  # Loop back
  ret                  # Finished
# void *memcpy(void* dest, const void* src, size_t n)
# a0=dest, a1=src, a2=n
#
memcpy:
mv a3, a0 # Copy destination
loop:
vsetvli t0, a2, vint8 # Vectors of 8b
vlb.v v0, (a1) # Load bytes
add a1, a1, t0 # Bump pointer
sub a2, a2, t0 # Decrement count
vsb.v v0, (a3) # Store bytes
add a3, a3, t0 # Bump pointer
bnez a2, loop # Any more?
ret # Return
(int16) z[i] = ((int8) x[i] < 5) ? (int16) a[i] : (int16) b[i];
Fixed 16b SEW:
loop:
vsetvli t0, a0, vint16 # Use 16b elements.
vlb.v v0, (a1) # Get x[i], sign-extended to 16b
sub a0, a0, t0 # Decrement element count
add a1, a1, t0 # x[i] Bump pointer
vslt.vi v0, v0, 5 # Set mask in v0 where x[i] < 5
slli t0, t0, 1 # Multiply by 2 bytes
vlh.v v1, (a2), v0.t # z[i] = a[i] case
add a2, a2, t0 # a[i] bump pointer
vlh.v v1, (a3), v0.f # z[i] = b[i] case
add a3, a3, t0 # b[i] bump pointer
vsh.v v1, (a4) # Store z
add a4, a4, t0 # z[i] bump pointer
bnez a0, loop
An additional field can be added to the vsetvl configuration to increase vector length when fewer architectural vector registers are needed, by grouping vector registers together.
vlmul #vregs VLMAX
00 32 VLEN/SEW
01 16 2*VLEN/SEW
10 8 4*VLEN/SEW
11 4 8*VLEN/SEW
ELEN unit   |       3       |       2       |       1       |       0       |
Byte        | 3   2   1   0 | 3   2   1   0 | 3   2   1   0 | 3   2   1   0 |

vlmul=4, SEW=32b
v4*n        |             C |             8 |             4 |             0 |  32b elements
v4*n+1      |             D |             9 |             5 |             1 |
v4*n+2      |             E |             A |             6 |             2 |
v4*n+3      |             F |             B |             7 |             3 |
Note: This reuses the same element mapping pattern used in widening operations. It can probably replace vsetvl2fi etc.
As a later extension, the vsew field is extended with three upper bits.
vsew[2:0] (standard element width) encoding
vsew[2:0] SEW
--- ----
000 8
001 16
010 32
011 64
100 128
101 256
110 512
111 1024
vxsew[5:0] (expanded element width) encoding
vxsew[5:0] SEW
--- ----
000000 8
001000 1
... 1..8, steps of 1
111000 7
000001 16
001001 9
... 9..16, steps of 1
111001 15
000010 32
001010 18
... 18-32, steps of 2
111010 30
...TBD