v-spec.adoc

Vector extension

1. Note

This is a draft of a stable proposal for the vector specification to be discussed at the RISC-V Summit. This version is intended to be stable enough to begin developing detailed encoding, toolchains, functional simulators, and initial implementations.

2. Implementation-defined constant parameters

These parameters are fixed for a given machine. The ISA supports writing code, under certain constraints, that will be portable across machines with different values for these parameters.

Note
Code can be written that will expose implementation parameters.

The maximum size of a single vector element in bits, \(ELEN\). Must be a power of 2.

The number of bits in a vector register, \(VLEN \geq ELEN\). Must be a power of 2.

Note
Platform profiles may set further constraints on these parameters, for example, requiring that \(ELEN \geq max(XLEN,FLEN)\) or requiring a minimum \(VLEN\) value.
Note
Vector contexts cannot be migrated across vector units with different VLEN and ELEN settings.

3. Vector Register File

There are 32 architectural vector registers, v0-v31.

Each vector register has a fixed \(VLEN\) bits of state.

If the system has floating-point registers, the floating-point register fx is contained in the low \(FLEN\) bits of vector register vx.

Example, FLEN=32, VLEN=64

bytes        7 6 5 4 3 2 1 0
            |       v0      |
                    |  f0   |
Note
To increase readability, vector register layouts are drawn with bytes ordered from right to left with increasing byte address. Bits within an element are numbered in a little-endian format with increasing bit index from right to left corresponding to increasing magnitude.
Note
Zfinx ("F in X") is a new ISA option under consideration where floating-point instructions take their arguments from the integer register file. The vector extension is incompatible with this option. Overlaying vectors on the integer registers for Zfinx would need different code to avoid integer registers with special meanings in the ABI, e.g., x0 and x1.

4. Vector CSRs

Three new XLEN-wide unprivileged CSRs are added: vtype, vl, and vlmax.

5. Vector type register, vtype

The XLEN-wide vector type CSR, vtype, provides the default type of the values contained in the vector register file, and is used to give vector instructions a polymorphic interpretation. The vector type also determines the number of elements that are held in each vector register.

The type register has two fields, vrep and vsew[2:0].

vtype layout

XLEN-1:4    Reserved (write 0)
     3:1    vsew[2:0]
      0     vrep
Note
Further standard and custom extensions to the vector base will extend these three fields to support a greater variety of data types.

5.1. Vector representation vrep encoding

The vrep field specifies how the bit patterns stored in each element are to be interpreted by default. Instructions may explicitly override the default representation.

 'vrep' representation field encoding

 0  Signed two's-complement integer
 1  IEEE-754/2008 floating-point

5.2. Vector standard element width vsew

The value in vsew sets the dynamic standard element width (SEW). By default, a vector register is viewed as being divided into \(VLMAX = \frac{VLEN}{SEW}\) standard elements (always an integer power of 2). The VLMAX derived from SEW is used to control the number of iterations of standard stripmining loops.

  vsew[2:0] (standard element width) encoding

  vsew  SEW
  ---  ----
  000     8
  001    16
  010    32
  011    64
  100   128
  101   256
  110   512
  111  1024
Note
For example, a machine with \(VLEN=128\) has the following \(VLMAX\) values for the following \(SEW\) values: (\(SEW=32b, VLMAX=4\)); (\(SEW=16b, VLMAX=8\)); (\(SEW=8b, VLMAX=16\)).
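The VLMAX derivation can be sketched in a few lines of Python (an illustrative aid, not part of the specification; the function name is invented):

```python
def vlmax(vlen: int, sew: int) -> int:
    """Number of standard elements per vector register: VLMAX = VLEN / SEW.

    Both parameters are powers of two, so the division is always exact.
    """
    assert vlen >= sew and vlen % sew == 0
    return vlen // sew

# The VLEN=128 values from the note above:
assert vlmax(128, 32) == 4
assert vlmax(128, 16) == 8
assert vlmax(128, 8) == 16
```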

5.3. Interaction of vectors and standard scalar floating-point code

The vector extension does not modify the behavior of standard scalar floating-point instructions. Standard scalar floating-point instructions operate on the lower FLEN bits of each vector register, and perform NaN-boxing on floating-point results that are narrower than FLEN.

Note
The standard scalar floating-point loads and stores move uninterpreted bit patterns between memory and registers and can be used to load and store the lower bits of a vector register, using a wider immediate offset than the vector extension scalar load and store instructions. Implementations using floating-point recoding techniques might experience a performance penalty when using scalar floating-point loads and stores to move values used as non-floating-point values.

6. Vector Length register vl

The \(XLEN\)-bit-wide read-only vl CSR can only be updated by the vsetvli and vsetvl instructions.

The vl register holds an unsigned integer specifying the number of elements to be updated by a vector instruction. Elements in the destination vector with indices \(\geq vl\) are not updated during execution of a vector instruction. As a degenerate case, when vl=0, no elements are updated in the destination vector.
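The vl semantics above can be modeled with a small Python sketch (the helper name is hypothetical): elements at indices at or beyond vl keep their old destination values.

```python
def vadd_vv(vd, vs1, vs2, vl):
    """Model of an element-wise add bounded by vl: only elements with
    index < vl are updated; the rest of vd is left unchanged."""
    return [vs1[i] + vs2[i] if i < vl else vd[i] for i in range(len(vd))]

old = [9, 9, 9, 9]
# vl=2: elements 0 and 1 are updated, elements 2 and 3 are untouched.
assert vadd_vv(old, [1, 2, 3, 4], [10, 20, 30, 40], 2) == [11, 22, 9, 9]
# Degenerate case vl=0: no elements are updated.
assert vadd_vv(old, [1, 2, 3, 4], [10, 20, 30, 40], 0) == old
```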

7. Maximum vector Length register vlmax

The XLEN-wide vlmax CSR is a read-only register whose value is derived from the other state in the system. The vlmax register holds an unsigned integer representing the largest number of elements that can be completed by a single vector instruction with the current vtype setting: \(vlmax = \frac{VLEN}{SEW}\).

8. vsetvli/vsetvl instructions

 vsetvli rd, rs1, vtypei # rd = new vl, rs1 = AVL, vtypei = new vtype setting
                         # if rs1 = x0, then use maximum vector length
 vsetvl  rd, rs1, rs2    # rd = new vl, rs1 = AVL, rs2 = new vtype value
                         # if rs1 = x0, then use maximum vector length

The vsetvli instruction sets the vtype, vl, and vlmax CSRs based on its arguments, and writes the new value of vl into rd.

The new vtype setting is encoded in the immediate field vtypei for vsetvli and in the rs2 register for vsetvl.

 Suggested assembler names used for vtypei setting

 vint8    #   8b signed integers
 vint16   #  16b signed integers
 vint32   #  32b signed integers
 vint64   #  64b signed integers
 vint128  # 128b signed integers

 vfp16    #  16b IEEE FP
 vfp32    #  32b IEEE FP
 vfp64    #  64b IEEE FP
 vfp128   # 128b IEEE FP
Note
The immediate argument vtypei can be a compressed form of the full vtype setting, capturing the most common use cases. For the base proposed here, it is assumed that at least four bits of immediate are available to write all standard values of vtype (vsew[2:0] and vrep).

The vtype setting must be supported by the implementation, and the vsetvl{i} instructions will raise an illegal instruction exception if the setting is not supported.

Note
Specifying vtype as WARL is problematic, as that would hide errors. The current spec is also problematic in that it requires a trap based on a data value in a CSR write. It would simplify pipelines if vtype value errors were flagged at use rather than at write, but errant code must somehow be caught without requiring full XLEN bits in vtype when only a few bits are actually used. One alternative is to substitute a fixed illegal value in vtype (e.g., all 1s) when an attempt is made to write an unsupported value; this would then cause a trap on use.

The requested application vector length (AVL) is passed in rs1 as an unsigned integer.

The vlmax register is set to \(VLMAX\) based on the new \(SEW\) in the vtype setting.

8.1. Constraints on setting vl

The resulting vl setting must satisfy the following constraints:

  1. vl = AVL if AVL <= VLMAX

  2. vl >= ceil(AVL / 2) if AVL < (2 * VLMAX)

  3. vl = VLMAX if AVL >= (2 * VLMAX)

  4. Deterministic on any given implementation for same input AVL and vtype values

  5. These specific properties follow from the prior rules:

    1. vl = 0 if AVL = 0

    2. vl > 0 if AVL > 0

    3. vl <= VLMAX

    4. vl <= AVL

Note

The vl setting rules are designed to be sufficiently strict to preserve vl behavior across register spills and context swaps for AVL <= VLMAX, yet flexible enough to enable implementations to improve vector lane utilization for AVL > VLMAX.

For example, this permits an implementation to set vl = ceil(AVL / 2) for VLMAX < AVL < 2*VLMAX in order to evenly distribute work over the last two iterations of a stripmine loop. Requirement 2 ensures that the first stripmine iteration of reduction loops uses the largest vector length of all iterations, even in the case of AVL < 2*VLMAX. This allows software to avoid needing to explicitly calculate a running maximum of vector lengths observed during a stripmined loop.
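One conforming vl-setting policy (the even-split variant described in the note) can be sketched in Python; the function name and structure are illustrative, not normative:

```python
def set_vl(avl: int, vlmax: int) -> int:
    """A policy satisfying rules 1-3: exact for short AVL, split the
    last two stripmine iterations evenly, cap at VLMAX otherwise."""
    if avl <= vlmax:
        return avl                 # rule 1: vl = AVL
    if avl < 2 * vlmax:
        return (avl + 1) // 2      # ceil(AVL/2), satisfies rule 2
    return vlmax                   # rule 3: vl = VLMAX

# The derived properties of rule 5 hold, e.g., with VLMAX=4:
assert set_vl(0, 4) == 0           # vl = 0 if AVL = 0
assert set_vl(5, 4) == 3           # ceil(5/2): evens out the final two passes
assert set_vl(100, 4) == 4         # capped at VLMAX
```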

8.2. vsetvl instruction

The vsetvl variant operates similarly to vsetvli, except that it takes its vtype value from rs2. It can be used for context restore, or when the vtypei immediate field is too small to hold the desired setting.

Note
Several active complex types can be held in different x registers and swapped in as needed using vsetvl.

9. Vector element mapping to vector register state

To represent a variety of different width datatypes in the same fixed-width vector registers, the mapping used between vector elements and bytes in a vector register depends on the runtime SEW setting.

Note
Previous RISC-V vector proposals hid this mapping from software, whereas this proposal has a specific mapping for all configurations, which reduces implementation flexibility but removes the need for zeroing on configuration changes. Making the mapping explicit also has the advantage of simplifying oblivious context save/restore code: the code can save the configuration in vl, vlmax, and vtype, then reset vtype to a convenient value (e.g., vectors of ELEN) before saving all vector register bits without needing to parse the configuration. The reverse process restores the state.

The following diagrams illustrate how different width elements are packed into the bytes of a vector register depending on current SEW setting.

  The element index is shown placed at the least-significant byte of the stored element.

 ELEN=32b

 Byte         3 2 1 0

 SEW=8b       3 2 1 0
 SEW=16b        1   0
 SEW=32b            0

 ELEN=64b

 Byte        7 6 5 4 3 2 1 0

 SEW=8b      7 6 5 4 3 2 1 0
 SEW=16b       3   2   1   0
 SEW=32b           1       0
 SEW=64b                   0


 ELEN=128b

 Byte        F E D C B A 9 8 7 6 5 4 3 2 1 0

 SEW=8b      F E D C B A 9 8 7 6 5 4 3 2 1 0
 SEW=16b       7   6   5   4   3   2   1   0
 SEW=32b           3       2       1       0
 SEW=64b                   1               0
 SEW=128b                                  0

When \( VLEN > ELEN\), the element numbering continues into the following \(ELEN\)-wide units.

 ELEN unit        3       2       1       0
 Byte          3 2 1 0 3 2 1 0 3 2 1 0 3 2 1 0

 SEW=8b        F E D C B A 9 8 7 6 5 4 3 2 1 0
 SEW=16b         7   6   5   4   3   2   1   0
 SEW=32b             3       2       1       0
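Because element numbering continues contiguously across \(ELEN\) units, the byte offset of an element within a register is simply its index times the element size in bytes. A small sketch (the helper name is invented):

```python
def element_lsb_byte(index: int, sew: int) -> int:
    """Byte offset of the least-significant byte of element `index`
    for a given SEW in bits, matching the layout diagrams."""
    return index * (sew // 8)

# Cross-check against the ELEN=64 diagram: the SEW=16b element 3 sits at
# byte 6, and the SEW=32b element 1 sits at byte 4.
assert element_lsb_byte(3, 16) == 6
assert element_lsb_byte(1, 32) == 4
```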

Some vector instructions have some operands that are wider than the current SEW setting. In this case, a group of vector registers are used to provide storage for the wider operands as shown below.

When an instruction has an operand twice as wide as SEW, e.g., a vector load of 32-bit words when SEW=16b, then an even-odd pair of vector registers are used to hold the double-width value as shown below:

 Example 1: ELEN=32
 ELEN unit      3       2       1       0
 Byte        3 2 1 0 3 2 1 0 3 2 1 0 3 2 1 0
 SEW=16b       7   6   5   4   3   2   1   0   <=16-bit elements
 v2*n              6       4       2       0   32-bit elements
 v2*n+1            7       5       3       1

The even-numbered vector register holds the even-numbered elements of the double-width vector, while the odd-numbered vector register holds the odd-numbered elements of the double-width vector.

Note
The pattern of storing elements in the pair of vector registers is designed to simplify datapath alignment for mixed-width operations.
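The even/odd split can be sketched as a mapping from a double-width element index to its register and slot (hypothetical helper, illustration only):

```python
def pair_slot(i: int, n: int):
    """Double-width element i of the pair <v2n, v2n+1>: even elements go
    to v2n and odd elements to v2n+1, each at slot i // 2."""
    return 2 * n + (i & 1), i // 2

# Matches Example 1 above: element 6 is the slot-3 entry of v2*n,
# and element 5 is the slot-2 entry of v2*n+1.
assert pair_slot(6, 0) == (0, 3)
assert pair_slot(5, 0) == (1, 2)
```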

For quad-width operands that are \(4\times SEW\) a group of four aligned vector registers are used to hold the results:

 ELEN unit        3       2       1       0
 Byte          3 2 1 0 3 2 1 0 3 2 1 0 3 2 1 0

 SEW=8b        F E D C B A 9 8 7 6 5 4 3 2 1 0   8b elements
 v4*n                C       8       4       0   32b elements
 v4*n+1              D       9       5       1
 v4*n+2              E       A       6       2
 v4*n+3              F       B       7       3
Note
A similar pattern is followed for octa-width operands (\(8\times SEW\)), though it is not clear that this is necessary in the mandatory base.

9.1. Supporting Mixed-Width Operations at Full Throughput

Additional vsetvli variants are provided to modify SEW to handle double-width elements in a loop.

vsetvl2ci rs1, vtypei  # sets vtype, then sets vl according to AVL = ceil(rs1/2)
vsetvl2fi rs1, vtypei  # sets vtype, then sets vl according to AVL = floor(rs1/2)

Example: Load 16-bit values, widen multiply to 32b, shift 32b result
right by 3, store 32b values.

loop:
    vsetvli t0, a0, vint16 # vtype = 16-bit integer vectors
    vlh v2, (a1)              # Get 16b vector
      slli t0, t0, 1
      add a1, a1, t0          # Bump pointer
    vmulw.vs v0, v2, v3       # 32b in <v0,v1> pair
    vsetvl2ci a0, vint32   # Ceil half length in 32b (can fuse with following)
    vsrl.vi v0, v0, 3        # Elements 0, 2, 4,...
    vsetvl2fi a0, vint32   # Floor half length in 32b (can fuse with following)
    vsrl.vi v1, v1, 3        # Elements 1, 3, 5,...
    vsetvli t0, a0, vint16 # Back to 16b
    vsw v0, (a2)              # Store vector of 32b <v0,v1> pair
      sub a0, a0, t0          # Decrement count
      slli t0, t0, 2
      add a2, a2, t0          # Bump pointer
      bnez a0, loop           # Any more?

Alternative loop only using wider elements:

loop:
    vsetvli t0, a0, vint32 # Use only 32-bit elements
    vlh v0, (a1)            # Sign-extend 16b load values to 32b elements
      slli t1, t0, 1
      add a1, a1, t1        # Bump pointer
    vmul.vs  v0, v0, v3     # 32b multiply result
    vsrl.vi  v0, v0, 3      # Shift elements
    vsw v0, (a2)            # Store vector of 32b results
      slli t1, t0, 2
      add a2, a2, t1        # Bump pointer
      sub a0, a0, t0
      bnez a0, loop         # Any more?

The first loop is more complex but may have greater performance on
machines where 16b widening multiplies are faster than 32b integer
multiplies.  Also, the 16b vector load may run faster due to the
larger number of elements per iteration.

This technique allows multiple wider operations to be performed natively on each half of the wider vector. Conversion operations allow values to be copied into the double-width format, or back into the single-width format.

Other forms are provided for quad (and octa) widths:

vsetvl4ci    # set correct length for vector v4*n
vsetvl4di    # set correct length for vector v4*n+1
vsetvl4ei    # set correct length for vector v4*n+2
vsetvl4fi    # set correct length for vector v4*n+3

10. Vector instruction formats

Vector loads and stores move bit patterns between vector register elements and memory.

Vector arithmetic instructions operate on values held in vector register elements.

Vector instructions can have scalar or vector source operands and produce scalar or vector results. Scalar operands and results are located in element 0 of a vector register.

11. Vector masking

Masking is supported on almost all vector instructions producing vectors, with the mask supplied by vector register v0. The least-significant bit (LSB) of each \(SEW\)-wide element in v0 is used as the mask, in either true or complement form. Element operations that are masked off do not modify the destination vector register element and never generate exceptions. Instructions producing scalars are not maskable.

Masking is encoded in a two-bit m[1:0] field (inst[26:25]) for all vector instructions.

m[1:0]

  00    vector, where v0[0] = 0
  01    vector, where v0[0] = 1
  10    scalar operation
  11    vector, always true

11.1. Assembler syntax

Scalar operations are written in assembler with a .s after the destination vector register specifier. Vector masking is written as another vector operand, with .t or .f indicating whether the operation occurs where v0[0] is 1 or 0, respectively. If no masking operand is specified, unmasked vector execution (m=11) is assumed.

vop v1, v2, v3, vm implies the following combinations:

    vop    v1,   v2, v3, v0.f  # enabled where v0[0]=0,     m=00
    vop    v1,   v2, v3, v0.t  # enabled where v0[0]=1,     m=01
    vop    v1.s, v2, v3        # scalar operation,          m=10
    vop    v1,   v2, v3        # unmasked vector operation, m=11

12. Vector Loads and Stores

Vector loads and stores are encoded within the scalar floating-point load and store major opcodes (LOAD-FP/STORE-FP).

12.1. Operation of Floating-Point Load/Store Instructions in Vector Extension

The standard F/D/Q floating-point extensions' loads and stores retain their original meaning.

The standard floating-point loads (FLH, FLW, FLD, FLQ) read a single value from memory and update the low \(FLEN\) bits of the destination vector register. Floating-point values narrower than \(FLEN\) are NaN-boxed, setting the upper bits to 1. If \(VLEN > FLEN\), the upper bits of the vector register are unchanged by a floating-point load.

The standard floating-point stores (FSH, FSW, FSD, FSQ) read the appropriate number of bits from the least-significant bits of the vector register and write them to memory.

12.2. Vector Load/Store Instruction Encoding

The vector loads and stores are encoded using the width values that are not claimed by the standard scalar floating-point loads and stores.

                     Width xv  Mem     Reg       opcode uoffset5 scale
                     [2:0]     Bits    Bits             (set by width[1:0])

Standard scalar FP    001  x    16     FLEN      FLH/FSH N/A
Standard scalar FP    010  x    32     FLEN      FLW/FSW N/A
Standard scalar FP    011  x    64     FLEN      FLD/FSD N/A
Standard scalar FP    100  x   128     FLEN      FLQ/FSQ N/A
Vector byte           000  0  vl*8     vl*SEW    VxB     1
Vector halfword       101  0  vl*16    vl*SEW    VxH     2
Vector word           110  0  vl*32    vl*SEW    VxW     4
Vector doubleword     111  0  vl*64    vl*SEW    VxD     8
Vector single-width   000  1  vl*SEW   vl*SEW    VxE     1
Vector double-width   101  1  vl*2*SEW vl*2*SEW  VxE2    2
Vector quad-width     110  1  vl*4*SEW vl*4*SEW  VxE4    4
Vector octa-width     111  1  vl*8*SEW vl*8*SEW  VxE8    8

The one-bit xv field encodes fixed versus variable element width, and is located in the imm12 field.
Mem bits is the size of the element moved in memory.
Reg bits is the size of the element accessed in the register.
The uoffset5 scale is the amount by which the five-bit unsigned immediate is multiplied to obtain a byte offset.

The vector load and store encodings repurpose a portion of the standard load/store 12-bit immediate field to provide further vector instruction encoding, with bits[26:25] holding the mask information.

Bits [31:27] hold a 5-bit unsigned offset that is added to the base register during vector addressing. The offset is scaled according to the low two bits of the width[2:0] field (effective offset = uoffset[4:0] × 2^width[1:0]), such that for fixed-width elements the offset is scaled by the element size. For dynamic-width elements, the offset is not affected by the vtype setting, avoiding a dependency between address generation and the dynamic vtype value.
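The offset scaling can be sketched as (illustrative Python; names follow the encoding table above):

```python
def effective_offset(uoffset5: int, width: int) -> int:
    """Effective byte offset = uoffset[4:0] * 2**width[1:0]."""
    assert 0 <= uoffset5 < 32
    return uoffset5 << (width & 0b11)

# From the encoding table: byte width (000) scales by 1, halfword (101)
# by 2, word (110) by 4, doubleword (111) by 8.
assert effective_offset(3, 0b000) == 3
assert effective_offset(3, 0b110) == 12
assert effective_offset(3, 0b111) == 24
```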

 Use of 12b immediate field in vector load/store instruction encoding

  31 30 29 28 27 26 25 24 23 22 21 20  Load   immediate bits
  31 30 29 28 27 26 25 11 10  9  8  7  Store  immediate bits
       uoffset5  m1 m0       funct5    Field


funct5 encodes:

 name  bits   encoding
 xv    [4]    0 fixed element size
              1 variable element size

 order [3]    0 ordered (sequential) stores
 (stores)     1 unordered stores

 sign  [3]    0 unsigned load
 (loads)      1 signed load

 mop   [2:0]  000 unit-stride
              001 unit-stride speculative loads (fault first)
              010 constant-stride
              011 indexed
              100 reserved
              101 reserved
              110 reserved
              111 reserved (AMO?)

Vector unit-stride, constant-stride, and indexed (scatter/gather) load/store instructions are supported.

Note
Vector AMO instructions are TBD.

Vector load/store base registers and strides are taken from the GPR x registers.

Vector load/store instructions move bit patterns between vector register elements and memory.

An illegal instruction exception is raised if the register element is narrower than the memory operand.

Note
There is debate whether it is useful to allow, e.g., 64-bit loads into 32-bit register elements retaining only the LSBs, to accelerate stride-2 loads. This comes at the cost of additional control/verification complexity.

When vrep is set to integer, vector load instructions can optionally sign- or zero-extend narrower memory values into wider vector register element destinations.

When vrep is set to floating-point, then loads will NaN-box narrower memory values into a wider register element, regardless of signed or unsigned opcode.

When the m[1:0] field is set to scalar, the vector load/store instructions move a single value between element 0 of the vector register and memory.

The unit-stride fault-first load instructions are used to vectorize loops with data-dependent exit conditions (while loops). These instructions execute as a regular load except that they will only take a trap on element 0. If an element > 0 raises an exception, the result of that element and all following elements up to the active vector length are written with 0 results, and the vector length vl is reduced to the number of elements processed without a trap.
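A behavioral sketch of the fault-first rule, modeling an access fault as a missing address (all names are hypothetical; element 0 faults are shown trapping, not modeled in detail):

```python
def vldbff(mem: dict, base: int, vl: int):
    """Unit-stride fault-first byte load: a fault on element 0 traps;
    a fault on element i > 0 zeroes elements i..vl-1 and shrinks vl to i."""
    out = []
    for i in range(vl):
        if base + i not in mem:             # stand-in for an access fault
            if i == 0:
                raise MemoryError("trap on element 0")
            return out + [0] * (vl - i), i  # truncated data, reduced vl
        out.append(mem[base + i])
    return out, vl

mem = {0: 0x68, 1: 0x69, 2: 0x00}           # only three accessible bytes
# Requesting 4 elements: element 3 faults, so vl is reduced to 3 and the
# remaining element is written with 0.
assert vldbff(mem, 0, 4) == ([0x68, 0x69, 0x00, 0], 3)
```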

strlen example using unit-stride fault-first instruction

# size_t strlen(const char *str)
# a0 holds *str

    mv a3, a0             # Save start
strlen:
    vsetvli a1, x0, vint8 # Vector of bytes
    vldbff.v v1, (a3)     # Get bytes
    csrr a1, vl           # Get bytes read
    add a3, a3, a1        # Bump pointer
    vseq.vi v0, v1, 0     # Set v0[i] where v1[i] = 0
    vmfirst a2, v0        # Find first set bit
    bltz a2, strlen       # Not found?

    add a0, a0, a1        # Sum start + bump
    add a3, a3, a2        # Add index
    sub a0, a3, a0        # Subtract start address+bump

    ret
Note
Strided and scatter-gather fault-first instructions are not provided as they represent a large security hole, allowing software to check multiple random pages for accessibility without experiencing a trap. The unit-stride versions only allow probing a region immediately contiguous to a known region.

12.3. Vector load instructions assembler code

12.3.1. unit-stride instructions

    # vd destination, rs1 base address, rs2=x0, vm is mask encoding

    # fixed-size element
    vlb.v    vd, offset(rs1), vm # 8b
    vlh.v    vd, offset(rs1), vm # 16b
    vlw.v    vd, offset(rs1), vm # 32b
    vld.v    vd, offset(rs1), vm # 64b
    vle.v    vd, offset(rs1), vm # SEW
    vle2.v   vd, offset(rs1), vm # 2*SEW
    vle4.v   vd, offset(rs1), vm # 4*SEW
    vle8.v   vd, offset(rs1), vm # 8*SEW

    # first fault versions
    vlbff.v    vd, offset(rs1), vm # 8b
    vlhff.v    vd, offset(rs1), vm # 16b
    vlwff.v    vd, offset(rs1), vm # 32b
    vldff.v    vd, offset(rs1), vm # 64b
    vleff.v    vd, offset(rs1), vm # SEW
    vle2ff.v   vd, offset(rs1), vm # 2*SEW
    vle4ff.v   vd, offset(rs1), vm # 4*SEW
    vle8ff.v   vd, offset(rs1), vm # 8*SEW

    # Scalar versions
    vlb.s vd, offset(rs1)      # 8b scalar load into element 0
          ...
Note
Could encode unit-stride as constant-stride with rs2=x0, but this would add to decode complexity.

12.3.2. constant-stride instructions

    # vd destination, rs1 base address, rs2 byte stride
    vlsb.v    vd, offset(rs1), rs2, vm # 8b
    vlsh.v    vd, offset(rs1), rs2, vm # 16b
    vlsw.v    vd, offset(rs1), rs2, vm # 32b
    vlsd.v    vd, offset(rs1), rs2, vm # 64b
    vlse.v    vd, offset(rs1), rs2, vm  # SEW
    vlse2.v   vd, offset(rs1), rs2, vm  # 2*SEW
    vlse4.v   vd, offset(rs1), rs2, vm  # 4*SEW
    vlse8.v   vd, offset(rs1), rs2, vm  # 8*SEW

    vlse8.s   vd, offset(rs1), rs2, vm  # 8*SEW scalar load

The stride is interpreted as an integer representing a byte offset.

12.3.3. indexed (scatter-gather) instructions

    # vd destination, rs1 base address, vs2 indices
    vlxb.v    vd, offset(rs1), vs2, vm  # 8b
    vlxh.v    vd, offset(rs1), vs2, vm  # 16b
    vlxw.v    vd, offset(rs1), vs2, vm  # 32b
    vlxd.v    vd, offset(rs1), vs2, vm  # 64b
    vlxe.v    vd, offset(rs1), vs2, vm  # SEW
    vlxe2.v   vd, offset(rs1), vs2, vm  # 2*SEW
    vlxe4.v   vd, offset(rs1), vs2, vm  # 4*SEW
    vlxe8.v   vd, offset(rs1), vs2, vm  # 8*SEW

Scatter/gather indices are treated as signed integers representing byte offsets. If \(SEW < XLEN\), then indices are sign-extended to \(XLEN\) before adding to the base. If \(SEW > XLEN\), the indices are taken from the least-significant \(XLEN\) bits.
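The index treatment can be sketched as follows (a hedged model with XLEN fixed at 32; it assumes the truncated XLEN-bit index is still interpreted as signed, which the text does not state explicitly):

```python
XLEN = 32

def index_offset(raw: int, sew: int) -> int:
    """Reduce a SEW-bit index to a signed XLEN-bit byte offset:
    sign-extend when SEW < XLEN, truncate when SEW > XLEN."""
    bits = min(sew, XLEN)
    val = raw & ((1 << bits) - 1)
    if val & (1 << (bits - 1)):      # sign bit of the kept field
        val -= 1 << bits
    return val

assert index_offset(0xFF, 8) == -1           # sign-extended 8b index
assert index_offset(0x1_0000_0005, 64) == 5  # truncated to the low 32 bits
```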

Note
\(SEW\) has to be wide enough to hold the indices, which could mandate larger \(SEW\) than desired. Ideally want to support index vectors wider than \(SEW\), by adding new vector indexed loads and stores with double-width or greater vector indices.

12.4. Vector stores

Vector stores move data values as bits taken from the LSBs of the source element. If the store datatype is wider than \(SEW\), then multiple vector registers are used to supply the data as described above.

12.4.1. unit-stride store instructions

    vsb.v     vs3, offset(rs1), vm  # 8b
    vsh.v     vs3, offset(rs1), vm  # 16b
    vsw.v     vs3, offset(rs1), vm  # 32b
    vsd.v     vs3, offset(rs1), vm  # 64b
    vse.v     vs3, offset(rs1), vm  # SEW
    vse2.v    vs3, offset(rs1), vm  # 2*SEW
    vse4.v    vs3, offset(rs1), vm  # 4*SEW
    vse8.v    vs3, offset(rs1), vm  # 8*SEW

    vsb.s   vs3, offset(rs1)      # Scalar 8b store from element 0
    ...

12.4.2. constant-stride store instructions

    vssb.v    vs3, offset(rs1), rs2, vm  # 8b
    vssh.v    vs3, offset(rs1), rs2, vm  # 16b
    vssw.v    vs3, offset(rs1), rs2, vm  # 32b
    vssd.v    vs3, offset(rs1), rs2, vm  # 64b
    vsse.v    vs3, offset(rs1), rs2, vm  # SEW
    vsse2.v   vs3, offset(rs1), rs2, vm  # 2*SEW
    vsse4.v   vs3, offset(rs1), rs2, vm  # 4*SEW
    vsse8.v   vs3, offset(rs1), rs2, vm  # 8*SEW

12.4.3. indexed store (scatter) instructions (ordered by element)

    vsxb.v    vs3, offset(rs1), vs2, vm  # 8b
    vsxh.v    vs3, offset(rs1), vs2, vm  # 16b
    vsxw.v    vs3, offset(rs1), vs2, vm  # 32b
    vsxd.v    vs3, offset(rs1), vs2, vm  # 64b
    vsxe.v    vs3, offset(rs1), vs2, vm  # SEW
    vsxe2.v   vs3, offset(rs1), vs2, vm  # 2*SEW
    vsxe4.v   vs3, offset(rs1), vs2, vm  # 4*SEW
    vsxe8.v   vs3, offset(rs1), vs2, vm  # 8*SEW

12.4.4. unordered-indexed (scatter-gather) instructions

    vsuxb.v   vs3, offset(rs1), vs2, vm  # 8b
    vsuxh.v   vs3, offset(rs1), vs2, vm  # 16b
    vsuxw.v   vs3, offset(rs1), vs2, vm  # 32b
    vsuxd.v   vs3, offset(rs1), vs2, vm  # 64b
    vsuxe.v   vs3, offset(rs1), vs2, vm  # SEW
    vsuxe2.v  vs3, offset(rs1), vs2, vm  # 2*SEW
    vsuxe4.v  vs3, offset(rs1), vs2, vm  # 4*SEW
    vsuxe8.v  vs3, offset(rs1), vs2, vm  # 8*SEW
Note
Dropped reverse-ordered scatter for now, can use rgather to reverse index order.
Note
There is redundancy between all the scalar variants of unit-stride, constant-stride, and scatter-gather vector load/store instructions.

12.5. Vector memory model

Vector memory instructions appear to execute in program order on the local hart. Vector memory instructions follow RVWMO at the instruction level, and element operations are ordered within the instruction as if performed by an element-ordered sequence of syntactically independent scalar instructions. Vector indexed-ordered stores write elements to memory in element order.

13. Vector Arithmetic Instructions

The vector arithmetic instructions use a new major opcode (OP-V = 1010111₂) which neighbors OP-FP, but generally follow the encoding pattern of the scalar floating-point instructions under the OP-FP opcode.

13.1. Vector-Vector and Vector-Scalar Arithmetic Instructions

Most vector arithmetic instructions have both a vector-vector form (.vv), where both operands are vectors of elements, and a vector-scalar form (.vs), where the second operand is a scalar taken from element 0 of the second source vector register. A few non-commutative operations (such as reverse subtract) are encoded with special opcodes.

13.2. Vector-Immediate Arithmetic Instructions

Many vector arithmetic instructions have vector-immediate forms (.vi) where the second scalar argument is a 5-bit immediate encoded in rs2 space. The immediate is sign-extended to the standard element width, and interpreted according to the vtype setting.

vadd.vi vd, vrs1, 3

13.3. Widening Vector Arithmetic Instructions

A few vector arithmetic instructions are defined to be widening operations where the destination elements are \(2\times SEW\) wide and are stored in an even-odd vector register pair. The first operand can be either single or double-width. These are generally written with a w suffix on the opcode.

13.4. Mask encoding

All vector arithmetic instructions can be masked according to the m[1:0] field.

mask encoding m[1:0] is held in inst[26:25]

m[1:0]
  00    vector, where v0[0] = 0
  01    vector, where v0[0] = 1
  10    scalar
  11    always true

13.5. Vector Arithmetic Operand encoding

rm[2:0] field is held in inst[14:12]

Encoding of operand pattern rm field for regular vector arithmetic
instructions.

rm2 rm1 rm0

0     0   0      Vector-vector   SEW =   SEW op SEW
0     0   1      Vector-vector
0     1   0      Vector-vector 2*SEW =   SEW op SEW
0     1   1      Vector-vector 2*SEW = 2*SEW op SEW

1     0   0      Vector-scalar   SEW =   SEW op s_SEW
1     0   1      Vector-imm      SEW =   SEW op simm[4:0]
1     1   0      Vector-scalar 2*SEW =   SEW op s_SEW
1     1   1      Vector-scalar 2*SEW = 2*SEW op s_SEW

Bit rm[2] selects between vector second source or scalar second source.

Bit rm[1] selects whether the destination is twice the width of \(SEW\).

Bit rm[0] selects whether the first operand is one or two times the \(SEW\) or whether the second operand is a 5-bit sign-extended immediate held in the rs2 field.

The 5-bit immediate field is always treated as a signed integer and sign-extended to \(SEW\) bits, regardless of vtype setting.
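Decoding the 5-bit immediate can be sketched as (illustrative helper name):

```python
def simm5(field: int) -> int:
    """Interpret a 5-bit rs2-field value as a sign-extended integer,
    regardless of the vtype setting."""
    field &= 0x1F
    return field - 32 if field & 0x10 else field

assert simm5(0b00011) == 3     # as in: vadd.vi vd, vrs1, 3
assert simm5(0b11111) == -1
assert simm5(0b10000) == -16   # most negative encodable immediate
```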

Note
For floating-point representation, the 5-bit immediate can be used to supply 0.0.

Assembly syntax pattern for vector arithmetic instructions

vop.vv  vd, vs1, vs2, vm    # vector-vector operation
vop.vs  vd, vs1, rs2, vm    # vector-scalar operation
vop.vi  vd, vs1, imm, vm    # vector-immediate operation

vopw.vv  vd, vs1, vs2, vm    # 2*SEW = SEW op SEW
vopw.vs  vd, vs1, rs2, vm    # 2*SEW = SEW op SEW

vopw.wv  vd, vs1, vs2, vm    # 2*SEW= 2*SEW op SEW
vopw.ws  vd, vs1, rs2, vm    # 2*SEW= 2*SEW op SEW

The following vector arithmetic instructions are provided:

         .vv .vs .vi w.vv w.vs w.wv w.ws
VADD      x   x   x   x    x    x    x
VSUB      x   x   x   x    x    x    x

VAND      x   x   x
VOR       x   x   x
VXOR      x   x   x

VSLL      x   x   x
VSRL      x   x   x
VSRA      x   x   x

VSEQ      x   x   x
VSNE      x   x   x
VSLT      x   x   x
VSLTU     x   x   x
VSLE      x   x   x
VSLEU     x   x   x

VMUL      x   x   x   x    x    x    x
VMULU     x   x   x   x    x    x    x
VMULSU    x   x   x   x    x    x    x
VMULH     x   x   x

VDIV      x   x   x
VDIVU     x   x   x
VREM      x   x   x
VREMU     x   x   x

VSQRT     x   x   x

VFSGNJ    x   x   x
VFSGNJN   x   x   x
VFSGNJX   x   x   x

VMIN      x   x   x
VMAX      x   x   x

VFCLASS   x   x   x

FMV*
FCVT*

13.7. Vector Comparison Instructions

The following compare instructions write 1 to the destination element if the comparison evaluates to true and write 0 otherwise.

Note
VSNE is not needed when masks can be complemented, but predicate results sometimes feed consumers other than predicate inputs, so VSNE can save an instruction.

Note
The vector floating-point unordered compare instructions need to be revisited.

    vseq.vv    vd, vs1, vs2, vm
    vseq.vs    vd, vs1, rs2, vm
    vseq.vi    vd, vs1, imm, vm

    vsne.vv    vd, vs1, vs2, vm
    vsne.vs    vd, vs1, rs2, vm
    vsne.vi    vd, vs1, imm, vm

    ...

These conditionals effectively AND in the mask when producing the 0/1 output, e.g.,

    # (a < b) && (b < c) in two instructions
    vslt.vv    v0, va, vb
    vslt.vv    v0, vb, vc, v0.t

The combination of VSLT and VSLE covers all comparison cases, including compares with scalars, by complementing results:

v == s ,  !(v == s) = (v != s)
v <  s ,  !(v <  s) = (v >= s)
v <= s ,  !(v <= s) = (v >  s)
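These identities can be checked with a one-element model in C; the function names are illustrative, and the mask complement is modeled as logical negation:

```c
#include <stdbool.h>

/* Base predicates computed directly by the hardware. */
static bool seq(int v, int s) { return v == s; }
static bool slt(int v, int s) { return v <  s; }
static bool sle(int v, int s) { return v <= s; }

/* Derived predicates obtained by complementing the mask result. */
static bool sne(int v, int s) { return !seq(v, s); }
static bool sge(int v, int s) { return !slt(v, s); }
static bool sgt(int v, int s) { return !sle(v, s); }
```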

13.8. Vector Merge Instruction

The vector merge instruction combines two vectors based on the mask field.

vmerge.vv vd, vs1, vs2, vm  # vd[i] = vm[i] ? vs2[i] : vs1[i]
vmerge.vs vd, vs1, rs2, vm  # vd[i] = vm[i] ? rs2    : vs1[i]
vmerge.vi vd, vs1, imm, vm  # vd[i] = vm[i] ? imm    : vs1[i]

The second operand is written where the mask is true.

Note
The vmerge.vi instruction can be used to initialize a vector register with an immediate value, and the vmerge.vs instruction can be used to splat a scalar value into all elements of a vector.
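A minimal C model of the element-wise merge semantics; the 32-bit element width and byte-per-element mask layout are illustrative choices, not part of the specification:

```c
#include <stddef.h>
#include <stdint.h>

/* Model of vmerge.vv over the first vl elements: where the mask
 * LSB is set, take the second source; elsewhere take the first. */
void vmerge_vv(int32_t *vd, const int32_t *vs1, const int32_t *vs2,
               const uint8_t *vm, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        vd[i] = (vm[i] & 1) ? vs2[i] : vs1[i];
}
```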

13.9. Vector multiply/divide

These are all equivalent to the scalar integer multiply/divide instructions, and operate on \(SEW\)-wide source and destination elements.

    vmul.vv      vd, vs1, vs2, vm
    vmulh.vv     vd, vs1, vs2, vm
    vmulhsu.vv   vd, vs1, vs2, vm
    vmulhu.vv    vd, vs1, vs2, vm
    vdiv.vv      vd, vs1, vs2, vm
    vdivu.vv     vd, vs1, vs2, vm
    vrem.vv      vd, vs1, vs2, vm
    vremu.vv     vd, vs1, vs2, vm

The .vs and .vi variants are also provided.
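The element semantics of the high-half multiplies follow the scalar MULH family; a sketch for SEW=32, where the width and helper names are illustrative:

```c
#include <stdint.h>

/* High 32 bits of a 32x32 product: signed*signed, unsigned*unsigned,
 * and signed*unsigned, computed through a 64-bit intermediate. */
int32_t  mulh32  (int32_t a,  int32_t b)  { return (int32_t)(((int64_t)a * b) >> 32); }
uint32_t mulhu32 (uint32_t a, uint32_t b) { return (uint32_t)(((uint64_t)a * b) >> 32); }
int32_t  mulhsu32(int32_t a,  uint32_t b) { return (int32_t)(((int64_t)a * (int64_t)b) >> 32); }
```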

14. Vector Narrowing instructions

A few instructions are provided to convert multi-width vectors into single-width vectors.

 VSRN   vector shift right narrowing
 VSRAN  vector shift right arithmetic narrowing
 VCLIPN   vector clip after shift right narrowing
 VCLIPUN  vector clip unsigned after shift right narrowing

 vd[i] = clip((vs1[i] + rnd) >> vs2[i])

For VSRN/VSRAN, no clipping is performed and rnd = 0 (the shift truncates).

For VCLIPN, the value is treated as a signed integer and saturates if the result would overflow the destination.

For VCLIPUN, the value is treated as an unsigned integer and saturates if the result would overflow the destination.

For VCLIPN/VCLIPUN, the rounding mode is specified in the fcsr in a new vxrm[1:0] field. Rounding occurs around the LSB of the destination.

 `vxrm[1:0]`
 Holds fixed-point rounding mode.

 00      rup   round-up (+0.5 LSB)
 01      rne   round to nearest-even
 10      trn   truncate
 11      jam   jam (OR bits into LSB)
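The four modes can be sketched in C as a rounding step applied while shifting right by shamt; the bit-level definitions below (in particular for rne and jam) are an interpretation of the mnemonics, not normative:

```c
#include <stdint.h>

enum { RUP = 0, RNE = 1, TRN = 2, JAM = 3 };   /* vxrm encodings */

/* Shift x right by shamt, rounding around the LSB of the result. */
int64_t roundoff(int64_t x, unsigned shamt, int mode)
{
    if (shamt == 0)
        return x;
    int64_t  q    = x >> shamt;                          /* truncated result */
    uint64_t frac = (uint64_t)x & ((UINT64_C(1) << shamt) - 1);
    uint64_t half = UINT64_C(1) << (shamt - 1);
    switch (mode) {
    case RUP: return q + ((frac & half) != 0);           /* add 0.5 LSB */
    case RNE:                                            /* ties to even */
        if (frac > half)  return q + 1;
        if (frac == half) return q + (q & 1);
        return q;
    case JAM: return q | (frac != 0);                    /* OR into LSB */
    default:  return q;                                  /* truncate */
    }
}
```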

The narrowing instructions use a different operand encoding in rm[2:0].

# vs1 = 2*SEW, 4*SEW

 rm2 rm1 rm0

 0     0   0      Vector-vector  SEW =  2*SEW op SEW
 0     0   1      Vector-vector
 0     1   0      Vector-vector  SEW =  4*SEW op SEW
 0     1   1      Vector-vector

 1     0   0      Vector-scalar  SEW =  2*SEW op SEW
 1     0   1      Vector-imm     SEW =  2*SEW op imm
 1     1   0      Vector-scalar  SEW =  4*SEW op SEW
 1     1   1      Vector-imm     SEW =  4*SEW op imm

vclipn.vv vd, vs1, vs2, vm  # SEW = 2*SEW >> SEW
vclipn.vs vd, vs1, rs2, vm  # SEW = 2*SEW >> SEW
vclipn.vi vd, vs1, imm, vm  # SEW = 2*SEW >> imm

vclipn.wv vd, vs1, vs2, vm  # SEW = 4*SEW >> SEW
vclipn.ws vd, vs1, rs2, vm  # SEW = 4*SEW >> SEW
vclipn.wi vd, vs1, imm, vm  # SEW = 4*SEW >> imm

vclipun.vv vd, vs1, vs2, vm  # SEW = 2*SEW >> SEW
vclipun.vs vd, vs1, rs2, vm  # SEW = 2*SEW >> SEW
vclipun.vi vd, vs1, imm, vm  # SEW = 2*SEW >> imm

vclipun.wv vd, vs1, vs2, vm  # SEW = 4*SEW >> SEW
vclipun.ws vd, vs1, rs2, vm  # SEW = 4*SEW >> SEW
vclipun.wi vd, vs1, imm, vm  # SEW = 4*SEW >> imm

vsrln.vv vd, vs1, vs2, vm  # SEW = 2*SEW >> SEW
vsrln.vs vd, vs1, rs2, vm  # SEW = 2*SEW >> SEW
vsrln.vi vd, vs1, imm, vm  # SEW = 2*SEW >> imm

vsrln.wv vd, vs1, vs2, vm  # SEW = 4*SEW >> SEW
vsrln.ws vd, vs1, rs2, vm  # SEW = 4*SEW >> SEW
vsrln.wi vd, vs1, imm, vm  # SEW = 4*SEW >> imm

vsran.vv vd, vs1, vs2, vm  # SEW = 2*SEW >> SEW
vsran.vs vd, vs1, rs2, vm  # SEW = 2*SEW >> SEW
vsran.vi vd, vs1, imm, vm  # SEW = 2*SEW >> imm

vsran.wv vd, vs1, vs2, vm  # SEW = 4*SEW >> SEW
vsran.ws vd, vs1, rs2, vm  # SEW = 4*SEW >> SEW
vsran.wi vd, vs1, imm, vm  # SEW = 4*SEW >> imm
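The saturation step can be sketched for a 32-bit intermediate narrowed to 16 bits (the widths and helper names are illustrative): VCLIPN clamps to the signed destination range and VCLIPUN to the unsigned range.

```c
#include <stdint.h>

/* Signed saturation of a 2*SEW intermediate into SEW=16 bits. */
int16_t clip_s16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Unsigned saturation of a 2*SEW intermediate into SEW=16 bits. */
uint16_t clip_u16(uint32_t x)
{
    return (x > UINT16_MAX) ? UINT16_MAX : (uint16_t)x;
}
```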

15. Vector fused Multiply-Adds

The standard scalar floating-point fused multiply-adds occupy four major opcodes.

There are two unused rounding modes that can be used to encode vector fused multiply-adds, in both vector-vector and vector-scalar forms, where the scalar is one input to the multiply. When a scalar input to the add is needed, this can be provided by splatting the value to a vector.

rm2 rm1 rm0
 1   0   1      Vector-vector  vd = vs3 + vs1 * vs2
 1   1   0      Vector-scalar  vd = vs3 + vs1 * rs2

The FNMADD and FNMSUB variants are dropped in favor of widening vector operations, which treat the add input and final result as double-width.

VMADD     SEW = SEW + SEW*SEW
VMSUB     SEW = SEW + SEW*SEW
VMADDW  2*SEW = 2*SEW + SEW*SEW
VMSUBW  2*SEW = 2*SEW + SEW*SEW
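One element of the widening form can be modeled directly; SEW=16 with a 32-bit addend and result is an illustrative choice:

```c
#include <stdint.h>

/* One element of vmaddw: the multiply operands are SEW (16b) wide
 * while the addend and result are 2*SEW (32b) wide, so the full
 * product is retained without intermediate overflow. */
int32_t vmaddw_elem(int32_t addend, int16_t a, int16_t b)
{
    return addend + (int32_t)a * (int32_t)b;
}
```

Note that 300 * 300 would overflow a 16-bit result but is exact in the widened form.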

15.1. Vector fused-multiply-add instructions

  vmadd.vvv vd, vs1, vs2, vs3, vm
  vmadd.vvs vd, vs1, rs2, vs3, vm
  vmaddw.vvv vd, vs1, vs2, vs3, vm
  vmaddw.vvs vd, vs1, rs2, vs3, vm
  vmsub.vvv vd, vs1, vs2, vs3, vm
  vmsub.vvs vd, vs1, rs2, vs3, vm
  vmsubw.vvv vd, vs1, vs2, vs3, vm
  vmsubw.vvs vd, vs1, rs2, vs3, vm

Additional fused multiply-add operations can be provided as destructive operations in the regular vector arithmetic encoding space.

15.2. Vector Reduction Operations

These instructions take a vector and a scalar (vs2[0]) as input, and produce a scalar result (vd[0]) that is a reduction over the source scalar and vector. Masked-off elements are ignored in the reduction.

    vredsum.v   vd, vs1, vs2, vm #   SEW = SEW   + sum(SEW)
    vredsumw.v  vd, vs1, vs2, vm # 2*SEW = 2*SEW + sum(SEW)
    vredmax.v   vd, vs1, vs2, vm
    vredmaxu.v  vd, vs1, vs2, vm
    vredmin.v   vd, vs1, vs2, vm
    vredminu.v  vd, vs1, vs2, vm
    vredand.v   vd, vs1, vs2, vm
    vredor.v    vd, vs1, vs2, vm
    vredxor.v   vd, vs1, vs2, vm

By default, when the operation is non-associative (e.g., floating-point addition), the reduction is specified to occur as if performed in sequential element order, but a user fcsr mode bit can specify that unordered reductions are allowed. In that case, the reduction result must match some ordering of the individual sequential operations.

A widening form of the sum reduction is provided that writes a double-width reduction result.
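The ordering requirement matters because floating-point addition is non-associative; a C sketch of the default sequential-order sum (the function name is illustrative):

```c
#include <stddef.h>

/* Ordered sum reduction: fold elements 0..vl-1 into the scalar
 * operand in element order, as the default behavior requires. */
double redsum_ordered(double acc, const double *v, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        acc += v[i];
    return acc;
}
```

With acc = 1e16 and v = {1.0, 1.0}, the ordered sum stays 1e16 (each 1.0 is rounded away), while the reassociated 1e16 + (1.0 + 1.0) yields a larger value; an unordered reduction may legally return either.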

16. Vector Mask Operations

Several operations are provided to help operate on mask bits held in the LSB of elements of a vector register.

16.1. vmpopc mask population count

    vmpopc rd, vs1, vm

The vmpopc instruction counts the number of elements of the first vl elements of the vector source that have their low bit set, excluding elements where the mask is false, and writes the result to a GPR.
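A C model of the count, assuming one byte per element with the mask and data bits held in the LSBs (an illustrative layout):

```c
#include <stddef.h>
#include <stdint.h>

/* Count elements among the first vl whose LSB is set, skipping
 * elements whose mask bit is false. */
size_t vmpopc(const uint8_t *vs1, const uint8_t *vm, size_t vl)
{
    size_t count = 0;
    for (size_t i = 0; i < vl; i++)
        if ((vm[i] & 1) && (vs1[i] & 1))
            count++;
    return count;
}
```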

16.2. vmfirst find first set mask bit

    vmfirst rd, vs1, vm

The vmfirst instruction finds the lowest-numbered element of the source vector that has its LSB set excluding elements where the mask is false, and writes that element’s index to a GPR. If no element has an LSB set, it writes -1 to the GPR.
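The search can be modeled the same way (byte-per-element layout again illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the lowest-numbered unmasked element with its LSB set,
 * or -1 if there is none (a signed type stands in for the GPR). */
long vmfirst(const uint8_t *vs1, const uint8_t *vm, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        if ((vm[i] & 1) && (vs1[i] & 1))
            return (long)i;
    return -1;
}
```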

17. Vector Permutation Instructions

A range of permutation instructions are provided.

17.1. Vector Iota instruction

The VIOTA instruction reads v0 and writes to each element of the destination the sum of the least-significant bits of all mask elements, in the sense selected by m[1:0], with index less than that element, i.e., an exclusive parallel prefix sum of the mask values.

If the value would overflow the destination, only the least-significant bits are retained. This instruction is not masked, and writes all vl elements of the destination vector.

 viota.v vd        # Unmasked, writes index to each element, vd[i] = i
 viota.v vd, v0.t  # Writes to each element, sum of preceding true elements.

 # Example

     7 6 5 4 3 2 1 0   Element number
     1 0 0 1 0 0 0 1   v0 contents

     7 6 5 4 3 2 1 0   viota.v vd
     2 2 2 1 1 1 1 0   viota.v vd, v0.t
     5 4 3 3 2 1 0 0   viota.v vd, v0.f
Note
The viota instruction can be combined with scatter/gather instructions to perform vector compress/expand instructions.
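The prefix-sum semantics, including the example above, can be reproduced with a short C model (byte-per-element layout is illustrative; true_sense = 1 models v0.t and 0 models v0.f):

```c
#include <stddef.h>
#include <stdint.h>

/* Masked viota: vd[i] = number of elements j < i whose mask LSB
 * matches the selected sense -- an exclusive prefix sum. */
void viota(uint8_t *vd, const uint8_t *v0, int true_sense, size_t vl)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < vl; i++) {
        vd[i] = sum;
        sum += ((v0[i] & 1) == (true_sense & 1));
    }
}
```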

17.2. Insert/Extract

The first form of insert/extract operations transfer a single value between a GPR and one element of a vector register. A second scalar GPR operand gives the element index, treated as an unsigned integer. If the index is out of range (i.e., \(\geq VLMAX\)) on a vector extract, zero is returned for the element value. If the index is out of range on a vector insert, the write is ignored.

vmv.x.v rd, vs1, rs2  # rd = vs1[rs2]
vmv.v.x vd, rs1, rs2  # vd[rs2] = rs1

The second form of insert/extract transfers a single value between element 0 of one vector register and one indexed element of a second vector register.

vmv.s.v vd, vs1, rs2 # vd[0] = vs1[rs2]
vmv.v.s vd, vs1, rs2 # vd[rs2] = vs1[0]

17.3. Slides

The slide instructions move elements up and down a vector.

 vslideup.vs vd, vs1, rs2, vm   # vd[i+rs2] = vs1[i]
 vslideup.vi vd, vs1, imm, vm   # vd[i+imm] = vs1[i]

For vslideup, the value in vl specifies the number of source elements that are read. The destination elements below the start index are left undisturbed. Destination elements past vl can be written, but writes past the end of the destination vector are ignored.

 vslidedown.vs vd, vs1, rs2, vm # vd[i] = vs1[i+rs2]
 vslidedown.vi vd, vs1, imm, vm # vd[i] = vs1[i+imm]

For vslidedown, the value in vl specifies the number of destination elements that are written. Elements in the source vector can be read past vl. If a source vector index is out of range, zero is returned for the element.
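Both slide directions can be modeled in C (32-bit elements, illustrative); note the asymmetry: slideup drops writes past the register, while slidedown substitutes zero for out-of-range reads.

```c
#include <stddef.h>
#include <stdint.h>

/* Slideup: vd[i+off] = vs1[i]; destination elements below off are
 * left undisturbed and writes past vlmax are dropped. */
void vslideup(int32_t *vd, const int32_t *vs1, size_t off,
              size_t vl, size_t vlmax)
{
    for (size_t i = 0; i < vl; i++)
        if (i + off < vlmax)
            vd[i + off] = vs1[i];
}

/* Slidedown: vd[i] = vs1[i+off]; out-of-range reads return zero. */
void vslidedown(int32_t *vd, const int32_t *vs1, size_t off,
                size_t vl, size_t vlmax)
{
    for (size_t i = 0; i < vl; i++)
        vd[i] = (i + off < vlmax) ? vs1[i + off] : 0;
}
```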

17.4. Register Gather

This instruction reads elements from a source vector at locations given by a second source element index vector. The values in the index vector are treated as unsigned integers. The number of elements to write to the destination register is given by vl. The source vector can be read at any index less than \(VLMAX\).

vrgather.vv vd, vs1, vs2, vm # vd[i] = vs1[vs2[i]]

If the element indices are out of range ( \( vs2[i] \geq VLMAX\) ) then zero is returned for the element value.
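An element-level C model of the gather, with out-of-range indices returning zero (32-bit data width is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* vrgather: indexed read from vs1; indices at or beyond vlmax
 * return zero for the element value. */
void vrgather(int32_t *vd, const int32_t *vs1, const uint32_t *vs2,
              size_t vl, size_t vlmax)
{
    for (size_t i = 0; i < vl; i++)
        vd[i] = (vs2[i] < vlmax) ? vs1[vs2[i]] : 0;
}
```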

18. Examples

18.1. Vector-vector add example

    # vector-vector add routine of 32-bit integers
    # void vvaddint32(size_t n, const int*x, const int*y, int*z)
    # { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }
    #
    # a0 = n, a1 = x, a2 = y, a3 = z
    # Non-vector instructions are indented
vvaddint32:
    vsetvli t0, a0, vint32 # Set vector length based on 32-bit vectors
    vlw.v v0, (a1)           # Get first vector
      sub a0, a0, t0         # Decrement number done
      slli t0, t0, 2         # Multiply number done by 4 bytes
      add a1, a1, t0         # Bump pointer
    vlw.v v1, (a2)           # Get second vector
      add a2, a2, t0         # Bump pointer
    vadd.vv v2, v0, v1       # Sum vectors
    vsw.v v2, (a3)           # Store result
      add a3, a3, t0         # Bump pointer
      bnez a0, vvaddint32    # Loop back
      ret                    # Finished
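The loop's stripmine bookkeeping can be expressed in C, with vsetvli modeled as vl = min(remaining, VLMAX); the vlmax parameter is a stand-in for the hardware limit:

```c
#include <stddef.h>

/* Stripmine model of vvaddint32: each iteration processes
 * vl = min(n, VLMAX) elements, mirroring what vsetvli returns
 * in t0, then bumps the pointers by vl elements. */
void vvaddint32_model(size_t n, const int *x, const int *y, int *z,
                      size_t vlmax)
{
    while (n > 0) {
        size_t vl = n < vlmax ? n : vlmax;   /* vsetvli */
        for (size_t i = 0; i < vl; i++)      /* vlw/vadd/vsw */
            z[i] = x[i] + y[i];
        x += vl; y += vl; z += vl; n -= vl;  /* bump pointers */
    }
}
```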

18.2. Memcpy example

    # void *memcpy(void* dest, const void* src, size_t n)
    # a0=dest, a1=src, a2=n
    #
  memcpy:
      mv a3, a0 # Copy destination
  loop:
    vsetvli t0, a2, vint8  # Vectors of 8b
    vlb.v v0, (a1)              # Load bytes
      add a1, a1, t0            # Bump pointer
      sub a2, a2, t0            # Decrement count
    vsb.v v0, (a3)              # Store bytes
      add a3, a3, t0            # Bump pointer
      bnez a2, loop             # Any more?
      ret                       # Return

18.3. Conditional example

       (int16) z[i] = ((int8) x[i] < 5) ? (int16) a[i] : (int16) b[i];

Fixed 16b SEW:
loop:
    vsetvli t0, a0, vint16  # Use 16b elements.
    vlb.v v0, (a1)               # Get x[i], sign-extended to 16b
      sub a0, a0, t0           # Decrement element count
      add a1, a1, t0           # x[i] Bump pointer
    vslt.vi v0, v0, 5          # Set mask in v0
      slli t0, t0, 1             # Multiply by 2 bytes
    vlh.v v1, (a2), v0.t         # z[i] = a[i] case
      add a2, a2, t0           # a[i] bump pointer
    vlh.v v1, (a3), v0.f         # z[i] = b[i] case
      add a3, a3, t0           # b[i] bump pointer
    vsh.v v1, (a4)               # Store z
      add a4, a4, t0           # z[i] bump pointer
      bnez a0, loop

19. Increasing VLMAX through register grouping and vlmul field.

An additional field can be added to the vsetvl configuration to increase the vector length when fewer architectural vector registers are needed, by grouping vector registers together.

 vlmul  #vregs   VLMAX
 00         32   VLEN/SEW
 01         16   2*VLEN/SEW
 10          8   4*VLEN/SEW
 11          4   8*VLEN/SEW
 ELEN unit        3       2       1       0
 Byte          3 2 1 0 3 2 1 0 3 2 1 0 3 2 1 0

 vlmul=4,  SEW=32b
 v4*n                C       8       4       0   32b elements
 v4*n+1              D       9       5       1
 v4*n+2              E       A       6       2
 v4*n+3              F       B       7       3
Note
This reuses the same element mapping pattern used in widening operations, and can probably replace vsetvl2fi etc.
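The table reduces to a simple relation: each vlmul increment halves the number of usable registers and doubles \(VLMAX\). A sketch:

```c
/* VLMAX as a function of the 2-bit vlmul field:
 * VLMAX = (VLEN/SEW) << vlmul, with vlmul in 0..3. */
unsigned vlmax(unsigned vlen, unsigned sew, unsigned vlmul)
{
    return (vlen / sew) << vlmul;
}
```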

20. Expanded SEW encoding

As a later extension, the vsew field is extended with three upper bits.

  vsew[2:0] (standard element width) encoding

  vsew[2:0]   SEW
  ---        ----
  000           8
  001          16
  010          32
  011          64
  100         128
  101         256
  110         512
  111        1024

  vxsew[5:0] (expanded element width) encoding

  vxsew[5:0]  SEW
  ---        ----
  000000       8
  001000       1
    ...          1..8, steps of 1
  111000       7
  000001      16
  001001       9
    ...          9..16, steps of 1
  111001      15
  000010      32
  001010      18
    ...          18-32, steps of 2
  111010      30

  ...TBD