Compiler ABI
An LLVM backend targets this architecture and supports scalar and vector operations. The LLVM infrastructure can serve as a backend for any language, but the current focus is on the clang C/C++ compiler. The port is based on LLVM trunk and supports many recent features, up to C++1z (http://clang.llvm.org/cxx_status.html).
This backend does not support:
- Exceptions
- Thread local storage (__thread variable attribute)
- Position independent code (programs can only be statically linked and do not use GOT/PLT for dynamic relocation)
There is not a C++ standard library port yet.
The compiler defines the preprocessor macro __NYUZI__.
The toolchain is installed by default in /usr/local/llvm-nyuzi. The tools are in the bin/ directory:
- clang/clang++: C/C++ compiler with integrated assembler
- ld.lld: LLD linker (this symlink invokes the 'ld' flavor of LLD)
- lldb: Symbolic debugger http://lldb.llvm.org/
- elf2hex: Converts ELF executables into a format that can be run in simulator/emulator/FPGA
- llvm-ar: LLVM version of ar for creating static libraries
- llvm-objdump: Object dump utility. Useful for seeing assembly listing of generated file.
Inline assembler statements are available using GCC's extended assembler syntax (http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html). The 'v' constraint is used for vector operands and 'r' for scalar operands, for example:
asm("store_v %0, (%1)" : : "v" (value), "r" (address));
The compiler supports vector types using GCC syntax http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html. The 'ext_vector_type' attribute indicates vector types:
typedef int veci16_t __attribute__((ext_vector_type(16)));
typedef unsigned int vecu16_t __attribute__((ext_vector_type(16)));
typedef float vecf16_t __attribute__((ext_vector_type(16)));
Vectors are first class types that can be local variables, global variables, parameters, or struct/class members. The compiler uses registers to store these wherever possible.
If a vector is a member of a structure, you must align that structure on a 64-byte boundary. The compiler automatically aligns vector members within structures, stack-allocated local variables, and global variables. However, if a structure is heap allocated, the heap implementation must align it (most implementations do not do this by default).
Standard arithmetic operators are available for vector operations. For example, to add two vectors:
veci16_t foo;
veci16_t bar;
veci16_t baz;
...
foo = bar + baz;
Individual elements of a vector are set/read using the array operator. These compile to the getlane instruction and mask register moves.
veci16_t foo;
int total;
for (int i = 0; i < 16; i++)
{
total += foo[i];
foo[i] += i;
}
Vectors can be initialized using curly bracket syntax. If the members are constant, this is loaded from the constant pool.
const veci16_t steps = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
You can also use non-constant members, in which case the compiler will generate a series of masked moves to load the vector.
int a, b, c, d;
...
veci16_t values = { a, b, c, d, a, b, c, d, a, b, c, d, a, b, c, d };
Scalar and vector values can be mixed:
veci16_t foo;
int bar;
veci16_t baz;
foo = baz + bar;
The backend recognizes when it can use mixed vector/scalar instructions. For example:
add_i v0, v0, s0
In some situations, you may need to cast a scalar type to widen it:
void somefunc(vecu16_t f);
somefunc((vecu16_t) 12);
Floating point conversions are a little weird. Consider the following code:
veci16_t i;
vecf16_t f;
f = (vecf16_t) i;
If these were scalar types, this would cause an integer-to-floating-point conversion (e.g. 1 would become 1.0). However, because they are vectors, it performs a bitcast instead. This is standard GCC behavior. Use __builtin_convertvector to convert the element type:
vecu16_t a;
vecf16_t b = __builtin_convertvector(a, vecf16_t);
The GCC syntax supports vector comparisons that result in another vector type. For example:
veci16_t a, b, c;
a = b > c;
The instruction set does not natively support this: comparisons set bitmasks in scalar registers. The compiler emulates the vector-result behavior using masked move instructions. Builtins provide native bitmask comparisons ('f' variants are for floats, 'i' for integers), for example:
veci16_t b, c;
uint32_t a = __builtin_nyuzi_mask_cmpi_sgt(b, c); // Signed greater than
Two flexible compiler builtins support predicated instructions: __builtin_nyuzi_vector_mixf and __builtin_nyuzi_vector_mixi. Each takes a mask and two vectors. Each of the low 16 bits in the mask selects whether the corresponding lane value comes from the first vector parameter or the second: a one bit pulls from the first, a zero from the second. These builtins don't necessarily emit instructions; the compiler inserts predicated instructions where possible. For example:
vecf16_t a = __builtin_nyuzi_vector_mixf(mask, a + b, a);
Generates a single instruction:
add_f_mask v0, s0, v0, v1
The __builtin_nyuzi_shufflei and __builtin_nyuzi_shufflef builtins allow rearranging vector contents. Each takes two vector parameters: the first is a source vector and the second is a set of indices (0-15) into the first.
While the LLVM toolchain supports auto-vectorization, the backend for this processor doesn't. The focus is on explicit vectorization in code.
int __builtin_nyuzi_read_control_reg(int index);
void __builtin_nyuzi_write_control_reg(int index, int value);
veci16_t __builtin_nyuzi_vector_mixi(unsigned short mask, veci16_t a, veci16_t b);
vecf16_t __builtin_nyuzi_vector_mixf(unsigned short mask, vecf16_t a, vecf16_t b);
veci16_t __builtin_nyuzi_shufflei(veci16_t sourceVector, veci16_t laneIndices);
vecf16_t __builtin_nyuzi_shufflef(vecf16_t sourceVector, veci16_t laneIndices);
veci16_t __builtin_nyuzi_gather_loadi(veci16_t sourcePtrs);
veci16_t __builtin_nyuzi_gather_loadi_masked(veci16_t sourcePtrs, unsigned short mask);
veci16_t __builtin_nyuzi_gather_loadf(veci16_t pointers);
veci16_t __builtin_nyuzi_gather_loadf_masked(veci16_t pointers, unsigned short mask);
void __builtin_nyuzi_scatter_storei(veci16_t destPtrs, veci16_t sourceValue);
void __builtin_nyuzi_scatter_storei_masked(veci16_t destPtrs, veci16_t sourceValue, unsigned short mask);
void __builtin_nyuzi_scatter_storef(veci16_t destPtrs, vecf16_t sourceValue);
void __builtin_nyuzi_scatter_storef_masked(veci16_t destPtrs, vecf16_t sourceValue, unsigned short mask);
void __builtin_nyuzi_block_storei_masked(veci16_t *dest, veci16_t values, unsigned short mask);
void __builtin_nyuzi_block_storef_masked(vecf16_t *dest, vecf16_t values, unsigned short mask);
unsigned short __builtin_nyuzi_mask_cmpi_ugt(vecu16_t a, vecu16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_uge(vecu16_t a, vecu16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_ult(vecu16_t a, vecu16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_ule(vecu16_t a, vecu16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_sgt(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_sge(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_slt(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_sle(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_eq(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpi_ne(veci16_t a, veci16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_gt(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_ge(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_lt(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_le(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_eq(vecf16_t a, vecf16_t b);
unsigned short __builtin_nyuzi_mask_cmpf_ne(vecf16_t a, vecf16_t b);
It is often useful to see a disassembled listing of the executable to debug issues. The llvm-objdump command disassembles ELF output files from the compiler.
/usr/local/llvm-nyuzi/bin/llvm-objdump --disassemble program.elf
...
vsnprintf:
    95b8: bd 03 ff 02   add_i sp, sp, -64
    95bc: 1d f3 00 88   store_32 s24, 60(sp)
    95c0: 3d e3 00 88   store_32 s25, 56(sp)
    95c4: 5d d3 00 88   store_32 s26, 52(sp)
    95c8: dd c3 00 88   store_32 ra, 48(sp)
    95cc: 1d 60 00 88   store_32 s0, 24(sp)
type | size |
---|---|
char | 1 byte |
short | 2 bytes |
int | 4 bytes |
long | 4 bytes |
long long | 8 bytes |
void* | 4 bytes |
float | 4 bytes |
double | 4 bytes (see below) |
- The compiler passes the first 8 scalar and vector function arguments in registers. It pushes the remaining arguments onto the stack in order, each aligned to its size.
- 64-bit values are passed in two adjacent scalar registers, with the lower numbered register being the least significant word.
- If a function has a variable number of arguments, all arguments are pushed on the stack.
- When a function returns a struct by value, the caller reserves space for the result in its own stack frame and passes the address of that region in s0. The parameters of the function then start at s1.
- Scalar registers 24-28 and vector registers 26-31 are callee-saved. The others are caller-saved.
- s29 is used as a frame pointer when needed. Most of the time it isn't; it is only needed if the function uses the frame or return address via the __builtin_frame_address or __builtin_return_address intrinsics, or if the function makes variable-sized stack allocations.
- s30 is the stack pointer, which is 64 byte (vector width) aligned.
- The hardware uses s31 as the return address register. It sets this when a call instruction is executed.
- The 'double' type is 32 bits wide and is actually an IEEE single precision float. This is because there is no hardware support for double precision floating point, and the compiler defaults to double for many operations. While unusual, I believe this is technically spec compliant.
- Integer modulus and division are not supported in hardware and generate calls to the library functions __udivsi3, __divsi3, __umodsi3, and __modsi3. These are in compiler_rt.a (software/libs/compiler-rt).
- Floating point division emits a reciprocal estimate instruction followed by two Newton-Raphson iterations (9 instructions for a reciprocal, 10 for a division).
The Nyuzi ELF format supports the following relocation types:
ID | Name | Description |
---|---|---|
1 | R_NYUZI_ABS32 | 32 bit absolute relocation |
2 | R_NYUZI_BRANCH20 | 20 bit PC-relative offset (branch instruction) |
3 | R_NYUZI_BRANCH25 | 25 bit PC-relative offset (branch instruction) |
4 | R_NYUZI_HI19 | movehi instruction that loads upper 19 bits of absolute address |
5 | R_NYUZI_IMM_LO13 | updates immediate field of RI instruction with low 13 bits of absolute address |
The ELF machine ID for Nyuzi is currently set to 9999, as it doesn't have an official ID.
The elf2hex tool builds memory images that tools load:
- It uses the format that the Verilog $readmemh system function understands. Each line is 8 hexadecimal ASCII characters, encoding four bytes. The processor is little endian, so if a line is "002600f6", the processor reads the instruction as 0xf6002600.
- It unpacks the ELF file into a flat memory representation with the segments at their proper addresses, BSS regions cleared, etc.
- It clobbers the first word of the unused ELF header with a jump instruction to the start address.