fast-SVD


We are using Verilog to implement an architecture which optimizes CORDIC-SVD computations, described in this paper: http://scholarship.rice.edu/bitstream/handle/1911/20042/Kot1999Aug2Architectu.PDF

The current goal is to have a CPU that computes the SVD of any 2x2 matrix.

The ultimate goal is to have a CPU and a grid of systolic arrays that compute the SVD of any matrix that we can adapt the architecture to support, in n log(n) time.


#Setting Up

This CPU was developed in Icarus Verilog on an Ubuntu machine. Icarus Verilog's installation guide can be found here: http://iverilog.wikia.com/wiki/Installation_Guide

Icarus Verilog has two main executables:

  • iverilog - compiles .v modules into .vvp binaries
  • vvp - executes .vvp binaries and produces outputs from the simulated hardware

If you care about waveform output, you can use GTKWave (http://gtkwave.sourceforge.net/) to view waveforms from the data produced by vvp. Personally, I just looked at the printouts as needed.

We use shell scripting to test modules, covered in the next section.


#Workflow

All components relevant to the SVD processor are in the directory labeled cpu.

Here is a typical workflow, assuming we have a Verilog module called module.v:

vim module.v          #edit the module with whatever changes you are looking for
./test.sh             #use shell scripting to compile module.v and run the resulting binary

Sample output of test.sh for a CPU that is told to load some numbers into registers r[0] to r[2], with r[0] and r[1] initialized to 00000001 and 00000002:

WARNING: fetcher.v:13: $readmemh: Standard inconsistency, following 1364-2005.
 $t clk PC		instr	  r[0]	  r[1]	   r[2]	    r[3]	   rs rt rd
   0 0 00000000 xxxxxxxx  00000001 00000002 xxxxxxxx xxxxxxxx  xx xx xx
 100 1 00000000 3c000005  00000001 00000002 xxxxxxxx xxxxxxxx  00 00 00
 200 0 00000001 3c000005  00000005 00000002 xxxxxxxx xxxxxxxx  00 00 00
 300 1 00000001 3c200006  00000005 00000002 xxxxxxxx xxxxxxxx  01 00 00
 400 0 00000002 3c200006  00000005 00000006 xxxxxxxx xxxxxxxx  01 00 00
 500 1 00000002 3c400006  00000005 00000006 xxxxxxxx xxxxxxxx  02 00 00
 600 0 00000003 3c400006  00000005 00000006 00000006 xxxxxxxx  02 00 00

How did we tell the CPU to execute these instructions? We wrote a file called assem.txt, which looks like this:

li $0 5
li $1 6
li $2 6

We ran assem.txt through compiler.py with the following command:

python compiler.py assem.txt > test.hex

Note that the output is directed into test.hex.
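
The instruction words in the sample trace above (3c000005, 3c200006, 3c400006) are consistent with opcode 0x0f in the top six bits, the register number in bits 25:21, and the immediate in the low 16 bits. The Python sketch below reproduces those three words under that assumption. The name encode_li is hypothetical, and the real compiler.py may be organized quite differently.

```python
LI_OPCODE = 0x0F  # assumption: inferred from the 0x3c prefix in the sample trace

def encode_li(line):
    """Encode one 'li $reg imm' line as an 8-digit hex word (sketch only)."""
    mnemonic, reg, imm = line.split()
    if mnemonic != "li":
        raise ValueError("this sketch only handles li")
    reg_num = int(reg.lstrip("$"))
    # opcode in bits 31:26, register in bits 25:21, immediate in bits 15:0
    word = (LI_OPCODE << 26) | (reg_num << 21) | (int(imm, 0) & 0xFFFF)
    return format(word, "08x")

for line in ["li $0 5", "li $1 6", "li $2 6"]:
    print(encode_li(line))
```

Running it on the three lines of assem.txt prints the same words that appear in the instr column of the trace.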

test.hex is read by the fetcher.v module. The register within cpu.v called "programCounter" points to the line number of the next instruction from test.hex to grab.
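
As a rough software model of that lookup, the sketch below parses hex text into a list and indexes it with a program counter. It assumes one instruction word per line, which is a guess about the output of compiler.py; load_program is a hypothetical helper, not part of the repository.

```python
def load_program(hex_text):
    """Parse one hex word per line into a list indexed by program counter."""
    return [int(line, 16) for line in hex_text.splitlines() if line.strip()]

program = load_program("3c000005\n3c200006\n3c400006\n")
program_counter = 1  # points at the second line of the hex file
print(format(program[program_counter], "08x"))
```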

Now we'll talk about how the CPU uses this stuff.


#CPU

Note that the test bench (cpuTest.v) has only one function: initialize a cpu.v module and an external clock, and make the clock go up and down a bunch of times.

cpu.v depends on the following modules: fetcher.v, decode.v, and alu.v.

  • fetcher.v is set to grab the next instruction from program memory at every positive clock edge.
  • decode.v is set to decode the instruction as soon as fetcher.v is done fetching.
  • cpu.v is set to prepare the input data for the ALU at every negative clock edge (after decode.v is finished).
  • alu.v is set to compute inputs and outputs in place as soon as the data (a, b, out) are prepared.
  • cpu.v is then set to update the registers based on the alu as soon as the alu is finished.

It runs like clockwork!
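
To make the ordering concrete, here is a small Python model of the two clock phases. It is only a sketch: it handles the li instruction alone, assumes the field layout suggested by the sample trace (opcode in bits 31:26, register in bits 25:21, immediate in bits 15:0), and advances the PC on the negative edge because that is where the PC changes in the trace. The function name run_li_program is hypothetical.

```python
def run_li_program(program, regs, cycles):
    """Step a toy fetch/decode/execute loop; returns final registers and a trace."""
    regs = list(regs)
    pc = 0
    trace = []
    for _ in range(cycles):
        # Positive edge: fetch the instruction, then decode its fields.
        instr = program[pc]
        opcode = instr >> 26
        rs = (instr >> 21) & 0x1F
        imm = instr & 0xFFFF
        trace.append((1, pc, instr, tuple(regs)))
        # Negative edge: ALU inputs are prepared, the result is written back,
        # and the program counter moves on (assumption based on the trace).
        if opcode == 0x0F:
            regs[rs] = imm
        pc += 1
        trace.append((0, pc, instr, tuple(regs)))
    return regs, trace

final, trace = run_li_program([0x3C000005, 0x3C200006, 0x3C400006],
                              [1, 2, None, None], 3)
print(final)
```

With r[0] and r[1] starting at 1 and 2, the six trace entries follow the same clk, PC, instr, and register pattern as the rows at $t = 100 through 600 in the sample output (None stands in for the x values).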


#Description

A processor with an assembly math library. Designed to compute singular value decompositions via CORDIC operations. Includes a compiler for assembly input (compiler.py) and a decompiler (decipher.py) to decompose binary machine instructions into recognizable MIPS assembly parameters (e.g. opcode, shamt, funct).
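
For illustration, splitting a 32-bit word into the standard MIPS R-type fields could look like the sketch below. The function decode_fields is hypothetical; the actual output format of decipher.py is not shown here, and this assumes the processor follows the usual MIPS bit positions.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into MIPS R-type style fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31:26
        "rs": (word >> 21) & 0x1F,      # bits 25:21
        "rt": (word >> 16) & 0x1F,      # bits 20:16
        "rd": (word >> 11) & 0x1F,      # bits 15:11
        "shamt": (word >> 6) & 0x1F,    # bits 10:6
        "funct": word & 0x3F,           # bits 5:0
    }

print(decode_fields(0x3C200006))
```

For the word 3c200006 this gives rs = 1, rt = 0, and rd = 0, matching the rs, rt, and rd columns of the sample trace in the Workflow section.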

It supports branch, jump, ALU, and load/store word instructions. It'll be a RISC, Harvard-architecture design. The library is modeled on MIPS assembly. CORDIC takes a unitless I1Q31 input and gives 16b angles with 16b precision.


The following operations are required to calculate the SVD of a 2x2 matrix:

  • 2x2 matrix multiplication
  • 2x2 matrix transpose
  • Inverse tangent
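
One way these three operations combine is the two-sided rotation scheme common in Jacobi-style SVD: two inverse tangents give a left and a right rotation angle, and multiplying by the transposed left rotation and the right rotation zeroes the off-diagonal entries. The floating-point Python sketch below assumes that formulation; the helper names are hypothetical, and it is not a description of the Verilog implementation.

```python
import math

def matmul2(p, q):
    """2x2 matrix product."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(m):
    """2x2 matrix transpose."""
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def diagonalize2(m):
    """Return (D, L, R) with D = L^T m R diagonal, using two inverse tangents."""
    (a, b), (c, d) = m
    theta_sum = math.atan2(c + b, d - a)    # theta_r + theta_l
    theta_diff = math.atan2(c - b, d + a)   # theta_r - theta_l
    left = rotation((theta_sum - theta_diff) / 2)
    right = rotation((theta_sum + theta_diff) / 2)
    return matmul2(matmul2(transpose2(left), m), right), left, right

diag, left, right = diagonalize2([[1.0, 2.0], [3.0, 4.0]])
print(diag)
```

The diagonal entries may come out negative or unsorted; their absolute values are the singular values.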

To calculate CORDIC, the following calculations are required:

  • three comparisons to determine the appropriate value with which to normalize the input value to between 0 and pi/2.
  • a maximum of one subtraction where the input theta is subtracted from the previously determined value
  • a maximum of one store of the previously determined offset value to a known memory location or place on the stack
  • one multiply / divide by constant to convert input I1Q31 number (b/t -1 and 1) to a CORDIC-manipulatable integer

For each precision-increasing iteration of the algorithm:

  • three right shifts (one to get sign, two to k)
  • six add/subtracts
  • three xors (against d)
  • one 32b fetch from LUT memory (CORDIC constant)
  • increment iteration counter (k)

At the end of iterating:

  • one 32b fetch of normalization / offset value
  • one 32b addition of normalization / offset value

This gives a total of eight operations plus fourteen operations per loop iteration.
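
The tally can be written out as a small Python helper, counting each "maximum of one" item at its worst case. The constant and function names are made up for this sketch.

```python
SETUP_OPS = 3 + 1 + 1 + 1              # comparisons, subtraction, store, multiply/divide
PER_ITERATION_OPS = 3 + 6 + 3 + 1 + 1  # shifts, add/subtracts, xors, LUT fetch, counter increment
FINAL_OPS = 1 + 1                      # fetch and addition of the offset value

def total_ops(iterations):
    """Worst-case operation count for a given number of CORDIC iterations."""
    return SETUP_OPS + FINAL_OPS + PER_ITERATION_OPS * iterations

print(total_ops(1))
```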

Trials (see cordic.c and make.sh) indicate that at least ten iterations are required to deliver reasonable results.

With ten iterations, this gives a total of 8 + 14 × 10 = 148 operations per calculation of sine and cosine.
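
The per-iteration counts above (three shifts, three xors against d, six add/subtracts, one LUT fetch) match a branchless rotation-mode CORDIC loop, where d is the sign mask of the remaining angle and (v ^ d) - d conditionally negates v. The Python sketch below is one such loop under assumptions that are not taken from the repository: a 16-bit fractional fixed-point scale, an input angle already normalized to between 0 and pi/2, and a starting vector pre-scaled by the CORDIC gain. It is not a port of cordic.c.

```python
import math

FRAC_BITS = 16            # assumed fixed-point scale for this sketch
ONE = 1 << FRAC_BITS

def cordic_sin_cos(theta, iterations=10):
    """Approximate (sin, cos) of theta in [0, pi/2] with shift/xor/add steps."""
    lut = [round(math.atan(2.0 ** -k) * ONE) for k in range(iterations)]
    gain = 1.0
    for k in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * k))
    x = round(ONE / gain)  # pre-scaled so the final vector has unit length
    y = 0
    z = round(theta * ONE)
    for k in range(iterations):
        d = z >> 31                 # 0 if z >= 0, -1 if z < 0 (sign mask)
        dx = ((y >> k) ^ d) - d     # conditional negate via xor and subtract
        dy = ((x >> k) ^ d) - d
        dz = (lut[k] ^ d) - d       # LUT fetch, negated when rotating backwards
        x, y, z = x - dx, y + dy, z - dz
    return y / ONE, x / ONE

s, c = cordic_sin_cos(math.pi / 6)
print(s, c)
```

With ten iterations the remaining angle error is bounded by roughly atan(2^-9), about 0.002 radians, so the results agree with math.sin and math.cos to two or three decimal places.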