Compiler ABI
An LLVM backend targets this architecture and supports scalar and vector operations. While the LLVM infrastructure can serve as a backend for any language, the current focus is the clang C/C++ compiler. The port is based on LLVM trunk and supports many recent language features, up to C++1z (http://clang.llvm.org/cxx_status.html).
This backend does not support:
- Exceptions
- Thread local storage (__thread variable attribute)
- Position independent code
There is no C++ standard library port yet.
The compiler defines the preprocessor macro `__NYUZI__`.
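Code can test this macro to guard target-specific paths, for example:

```c
#ifdef __NYUZI__
// Nyuzi-specific code path (e.g. explicit vector operations).
#else
// Portable fallback for other targets.
#endif
```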
The toolchain is installed by default in /usr/local/llvm-nyuzi. The tools are in the bin/ directory:
- clang/clang++: C/C++ compiler with integrated assembler
- ld.lld: LLD linker (this symlink invokes the 'ld' flavor of LLD)
- lldb: Symbolic debugger http://lldb.llvm.org/
- elf2hex: Converts ELF executables into a format that can be run in simulator/emulator/FPGA
- llvm-ar: LLVM version of ar for creating static libraries
- llvm-objdump: Object dump utility. Useful for seeing an assembly listing of a generated file.
Inline assembler statements are available using GCC's extended assembler syntax (http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html). The 'v' constraint is used for vector operands and 'r' for scalar operands, for example:
asm("store_v %0, (%1)" : : "v" (value), "r" (address));
The compiler supports vector types using GCC syntax (http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). The 'ext_vector_type' attribute declares vector types:
```c
typedef int veci16_t __attribute__((ext_vector_type(16)));
typedef unsigned int vecu16_t __attribute__((ext_vector_type(16)));
typedef float vecf16_t __attribute__((ext_vector_type(16)));
```
Vectors are first class types that can be local variables, global variables, parameters, or struct/class members. The compiler uses registers to store these wherever possible.
If a vector is a member of a structure, that structure must be aligned on a 64-byte boundary. The compiler automatically aligns vector members within a structure, as well as stack-allocated local variables and global variables. However, if a structure is heap allocated, the heap implementation must align it.
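A minimal sketch of heap allocating such a structure, assuming the C library provides C11's aligned_alloc (any allocator that returns 64-byte-aligned blocks works the same way):

```c
#include <stdlib.h>

typedef float vecf16_t __attribute__((ext_vector_type(16)));

struct Particles {
    int count;
    vecf16_t positions;  // the compiler pads this member to a 64-byte offset
};

struct Particles *alloc_particles(void)
{
    // The structure itself must start on a 64-byte boundary so the
    // vector member is correctly aligned.
    return aligned_alloc(64, sizeof(struct Particles));
}
```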
Standard arithmetic operators are available for vector operations. For example, to add two vectors:
```c
veci16_t foo;
veci16_t bar;
veci16_t baz;
...
foo = bar + baz;
```
Individual elements of a vector are set and read using the array operator. These accesses compile to the getlane instruction and masked register moves.
```c
veci16_t foo;
int total = 0;
for (int i = 0; i < 16; i++)
{
    total += foo[i];
    foo[i] += i;
}
```
Vectors can be initialized using curly bracket syntax. If the elements are constant, the vector is loaded from the constant pool.
```c
const veci16_t steps = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
```
You can also use non-constant elements, in which case the compiler generates a series of masked moves to load the vector.
```c
int a, b, c, d;
...
veci16_t values = { a, b, c, d, a, b, c, d, a, b, c, d, a, b, c, d };
```
Scalar and vector values can be mixed:
```c
veci16_t foo;
int bar;
veci16_t baz;
foo = baz + bar;
```
The backend recognizes when it can use mixed vector/scalar instructions. For example:
```
add_i v0, v0, s0
```
In some situations, you may need to cast a scalar type to widen it:
```c
void somefunc(vecu16_t f);
somefunc((vecu16_t) 12);
```
Floating point conversions are a little weird. Consider the following code:
```c
veci16_t i;
vecf16_t f;
f = (vecf16_t) i;
```
If these were scalar types, this would perform an integer-to-floating-point conversion (e.g. 1 would become 1.0). However, because they are vectors, it performs a bitcast instead. This is standard GCC behavior. Use __builtin_convertvector to convert the element values:
```c
vecu16_t a;
vecf16_t b = __builtin_convertvector(a, vecf16_t);
```
The GCC syntax supports vector comparisons that result in another vector type. For example:
```c
veci16_t a, b, c;
a = b > c;
```
The instruction set does not support this natively: its comparison instructions set bitmasks in scalar registers. The compiler emulates the vector-result behavior using masked move instructions. Builtins expose the native bitmask comparisons directly (f variants are for floats, i variants for integers), for example:
```c
veci16_t b, c;
uint32_t a = __builtin_nyuzi_mask_cmpi_sgt(b, c); // Signed greater than
```
Two flexible compiler builtins support predicated execution: __builtin_nyuzi_vector_mixf and __builtin_nyuzi_vector_mixi. Each takes a mask and two vectors. Each of the low 16 bits in the mask selects whether the corresponding vector lane comes from the first parameter or the second: a one bit pulls from the first, a zero from the second. These builtins don't necessarily emit a separate instruction; the compiler folds them into predicated instructions where possible. For example:
```c
a = __builtin_nyuzi_vector_mixf(mask, a + b, a);
```
Generates a single instruction:
```
add_f_mask v0, s0, v0, v1
```
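Combining a mask comparison with a mix gives a vectorized select. A minimal sketch of an element-wise maximum, using only the builtins described here:

```c
typedef int veci16_t __attribute__((ext_vector_type(16)));

// For each lane, pick whichever of a or b is larger.
veci16_t vector_max(veci16_t a, veci16_t b)
{
    int mask = __builtin_nyuzi_mask_cmpi_sgt(a, b); // one bit set per lane where a > b
    return __builtin_nyuzi_vector_mixi(mask, a, b); // 1 selects from a, 0 from b
}
```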
The __builtin_nyuzi_shufflei and __builtin_nyuzi_shufflef builtins allow rearranging vector contents. Each takes two vector parameters: the first is a source vector, and the second is a set of lane indices (0-15) into the first.
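For example, a sketch that reverses the lane order of a vector:

```c
typedef int veci16_t __attribute__((ext_vector_type(16)));

veci16_t reverse_lanes(veci16_t v)
{
    // Constant index vector, loaded from the constant pool.
    const veci16_t reversed = { 15, 14, 13, 12, 11, 10, 9, 8,
                                7, 6, 5, 4, 3, 2, 1, 0 };
    return __builtin_nyuzi_shufflei(v, reversed);
}
```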
While the LLVM toolchain supports auto-vectorization, the backend for this processor doesn't use it; the focus is on explicit vectorization in code. For reference, the full set of Nyuzi builtins is:
```c
int __builtin_nyuzi_read_control_reg(int index);
void __builtin_nyuzi_write_control_reg(int index, int value);
veci16_t __builtin_nyuzi_vector_mixi(int mask, veci16_t a, veci16_t b);
vecf16_t __builtin_nyuzi_vector_mixf(int mask, vecf16_t a, vecf16_t b);
veci16_t __builtin_nyuzi_shufflei(veci16_t sourceVector, veci16_t laneIndices);
vecf16_t __builtin_nyuzi_shufflef(vecf16_t sourceVector, veci16_t laneIndices);
veci16_t __builtin_nyuzi_gather_loadi(veci16_t sourcePtrs);
veci16_t __builtin_nyuzi_gather_loadi_masked(veci16_t sourcePtrs, int mask);
vecf16_t __builtin_nyuzi_gather_loadf(veci16_t pointers);
vecf16_t __builtin_nyuzi_gather_loadf_masked(veci16_t pointers, int mask);
void __builtin_nyuzi_scatter_storei(veci16_t destPtrs, veci16_t sourceValue);
void __builtin_nyuzi_scatter_storei_masked(veci16_t destPtrs, veci16_t sourceValue, int mask);
void __builtin_nyuzi_scatter_storef(veci16_t destPtrs, vecf16_t sourceValue);
void __builtin_nyuzi_scatter_storef_masked(veci16_t destPtrs, vecf16_t sourceValue, int mask);
veci16_t __builtin_nyuzi_block_loadi_masked(veci16_t *source, int mask);
vecf16_t __builtin_nyuzi_block_loadf_masked(vecf16_t *source, int mask);
void __builtin_nyuzi_block_storei_masked(veci16_t *dest, veci16_t values, int mask);
void __builtin_nyuzi_block_storef_masked(vecf16_t *dest, vecf16_t values, int mask);
int __builtin_nyuzi_mask_cmpi_ugt(vecu16_t a, vecu16_t b);
int __builtin_nyuzi_mask_cmpi_uge(vecu16_t a, vecu16_t b);
int __builtin_nyuzi_mask_cmpi_ult(vecu16_t a, vecu16_t b);
int __builtin_nyuzi_mask_cmpi_ule(vecu16_t a, vecu16_t b);
int __builtin_nyuzi_mask_cmpi_sgt(veci16_t a, veci16_t b);
int __builtin_nyuzi_mask_cmpi_sge(veci16_t a, veci16_t b);
int __builtin_nyuzi_mask_cmpi_slt(veci16_t a, veci16_t b);
int __builtin_nyuzi_mask_cmpi_sle(veci16_t a, veci16_t b);
int __builtin_nyuzi_mask_cmpi_eq(veci16_t a, veci16_t b);
int __builtin_nyuzi_mask_cmpi_ne(veci16_t a, veci16_t b);
int __builtin_nyuzi_mask_cmpf_gt(vecf16_t a, vecf16_t b);
int __builtin_nyuzi_mask_cmpf_ge(vecf16_t a, vecf16_t b);
int __builtin_nyuzi_mask_cmpf_lt(vecf16_t a, vecf16_t b);
int __builtin_nyuzi_mask_cmpf_le(vecf16_t a, vecf16_t b);
int __builtin_nyuzi_mask_cmpf_eq(vecf16_t a, vecf16_t b);
int __builtin_nyuzi_mask_cmpf_ne(vecf16_t a, vecf16_t b);
```
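As an illustration of the gather builtins, here is a minimal table-lookup sketch. It assumes each lane of the pointer vector holds a full byte address, so the indices are scaled by the 4-byte element size and added to the splatted base pointer (splatting via a scalar-to-vector cast, as in the widening-cast example above):

```c
typedef int veci16_t __attribute__((ext_vector_type(16)));

// Fetch 16 table entries at once: result[i] = table[indices[i]].
veci16_t gather_lookup(const int *table, veci16_t indices)
{
    // Splat the base address across all lanes, then add byte offsets.
    veci16_t ptrs = (veci16_t)(int)table + indices * 4;
    return __builtin_nyuzi_gather_loadi(ptrs);
}
```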
It is often useful to see a disassembled listing of the executable to debug issues. The llvm-objdump command disassembles ELF output files from the compiler.
```
/usr/local/llvm-nyuzi/bin/llvm-objdump --disassemble program.elf
...
vsnprintf:
    95b8: bd 03 ff 02    add_i sp, sp, -64
    95bc: 1d f3 00 88    store_32 s24, 60(sp)
    95c0: 3d e3 00 88    store_32 s25, 56(sp)
    95c4: 5d d3 00 88    store_32 s26, 52(sp)
    95c8: dd c3 00 88    store_32 ra, 48(sp)
    95cc: 1d 60 00 88    store_32 s0, 24(sp)
```
Primitive types have the following sizes:

Type | Size |
---|---|
char | 1 byte |
short | 2 bytes |
int | 4 bytes |
long | 4 bytes |
long long | 8 bytes |
void* | 4 bytes |
float | 4 bytes |
double | 4 bytes (see below) |
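A small sketch of compile-time checks matching the table, using C11's _Static_assert:

```c
// Sanity checks for the Nyuzi type sizes listed above.
_Static_assert(sizeof(long) == 4, "long is 32 bits on Nyuzi");
_Static_assert(sizeof(long long) == 8, "long long is 64 bits");
_Static_assert(sizeof(void *) == 4, "pointers are 32 bits");
_Static_assert(sizeof(double) == 4, "double is single precision (see below)");
```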
- The compiler passes the first 8 scalar and vector function arguments in registers. It pushes the remaining arguments onto the stack in order, aligned by size.
- 64-bit values are passed in two adjacent scalar registers, with the lower numbered register holding the least significant word.
- If a function has a variable number of arguments, all arguments are pushed on the stack.
- When a function returns a struct by value, the caller reserves space for the result in its own stack frame and passes the address of that region in s0. The parameters of the function then start at s1.
- Scalar registers 24-27 and vector registers 26-31 are callee save. The others are caller save.
- s28 is used as a frame pointer when needed. Most of the time it isn't: it is only needed if the function accesses the frame or return address via the __builtin_frame_address or __builtin_return_address intrinsics, or if it makes variable-sized stack allocations.
- s29 is the stack pointer, which is 64-byte (vector width) aligned.
- The 'double' type is 32 bits wide and is actually an IEEE single precision float. This is because there is no hardware support for double precision floating point, and the compiler defaults to double for many operations. While unusual, I believe this is technically spec compliant.
- The compiler emits constants (global addresses, floating point values, large integers, and vector values) immediately before the function that uses them. Code loads them using PC-relative addressing.
- Integer modulus and division are not supported in hardware and generate calls to the library functions __udivsi3, __divsi3, __umodsi3, and __modsi3. These are in compiler_rt.a (software/libs/compiler-rt).
- Floating point division emits a reciprocal estimate instruction followed by two Newton-Raphson iterations (9 instructions for a reciprocal, 10 for a division); see the sketch after this list.
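Conceptually, each Newton-Raphson step refines an estimate x of 1/b via x' = x * (2 - b * x), roughly doubling the number of correct bits. A C sketch of the math (reciprocal_estimate is a hypothetical stand-in for the hardware estimate instruction, not a real intrinsic, and this is not the exact instruction sequence the backend emits):

```c
// Approximate a / b: hardware reciprocal estimate refined by two
// Newton-Raphson iterations, then one multiply.
float fp_divide(float a, float b)
{
    float x = reciprocal_estimate(b);  // hypothetical estimate instruction
    x = x * (2.0f - b * x);            // iteration 1
    x = x * (2.0f - b * x);            // iteration 2
    return a * x;
}
```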
The Nyuzi ELF format supports the following relocation types:
Name | Description |
---|---|
R_NYUZI_ABS32 | 32 bit absolute relocation |
R_NYUZI_BRANCH | 20 bit PC-relative offset (branch instruction) |
R_NYUZI_PCREL_MEM_EXT | 15 bit PC-relative offset (memory instruction without mask) |
R_NYUZI_PCREL_LEA | 13 bit PC-relative offset (load effective address) |
The ELF machine ID for Nyuzi is currently set to 9999, as it doesn't have an official ID.
The elf2hex tool builds memory images that tools load:
- It uses the format that the Verilog $readmemh system task understands. Each line is eight hexadecimal ASCII characters, encoding four bytes. The processor is little endian, so if a line is "002600f6", the processor reads the instruction as 0xf6002600 (see the decode sketch after this list).
- It unpacks the ELF file into a flat memory representation with the segments at their proper addresses, BSS regions cleared, etc.
- It clobbers the first word of the unused ELF header with a jump instruction to the start address.
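To make the byte order concrete, here is a small sketch that decodes one hex line the way the processor would read it (decode_hex_line is an illustrative helper, not part of elf2hex):

```c
#include <stdint.h>
#include <stdio.h>

// Interpret one 8-character $readmemh line as a little-endian word.
// The first two hex digits are the byte at the lowest address.
uint32_t decode_hex_line(const char *line)
{
    unsigned bytes[4];
    sscanf(line, "%2x%2x%2x%2x", &bytes[0], &bytes[1], &bytes[2], &bytes[3]);
    return bytes[0] | (bytes[1] << 8) | (bytes[2] << 16) | ((uint32_t)bytes[3] << 24);
}

int main(void)
{
    printf("0x%08x\n", decode_hex_line("002600f6"));  // prints 0xf6002600
    return 0;
}
```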