Eval bytecode

Eval programs can be compiled to a bytecode, to be executed on the Eval VM: a stack-based virtual machine. (The VM is also referred to below as the EVM, and its bytecode interpreter as EBCI.)

Consider this definition, which serves as a running example.

def foo = let x = 5 in fun y -> x + y

Registers

EBCI uses a fixed number of special registers. It does not have any general-purpose registers.

  • PC, the program counter, is the offset into the code segment of the next instruction to execute.
  • BP, the base pointer, is an offset into the argument stack that identifies a function call boundary. It is used for calculating the addresses of function arguments.
  • SP, the stack pointer, is the offset into the argument stack of the next unused slot. (Seen differently, it is the current size of the stack.)

Memory segments

A bytecode program for the EVM consists of multiple memory segments.

  1. Heap. Large, variably-sized objects have to be stored somewhere and referred to on the stack by address. These objects are constructors and closures. The heap is prepopulated with any large constants that appear directly in the program text.
  2. Code. This holds all the instructions of the program.
  3. Argument stack. Holds temporary values and function arguments.
  4. Call stack. Holds return addresses and environments. Instructions affecting this stack generally have an r prefix (as in FORTH, for 'return').
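
To make the segments and registers concrete, here is a minimal sketch of the interpreter state in OCaml. The record and field names are illustrative, not the actual implementation:

type value = int64   (* tagged: LSB 1 = unboxed int, LSB 0 = heap address *)

type vm = {
  mutable pc : int;    (* program counter: offset into the code segment *)
  mutable bp : int;    (* base pointer: current call boundary in the argument stack *)
  mutable sp : int;    (* stack pointer: offset of the next unused argument-stack slot *)
  code  : int array;   (* all the instructions of the program *)
  heap  : value array; (* heap objects, prepopulated with large constants *)
  args  : value array; (* the argument stack *)
  calls : value array; (* the call stack: return addresses and environments *)
}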

Values

  • Each item on the argument stack is called a value. Each value is a 64-bit integer.
  • There are two kinds of values, distinguished by the LSB of the integer:
    • An unboxed integer ends in 1, so it's really only 63 bits wide.
    • A heap address ends in 0.
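
To make the tagging concrete, here is a minimal sketch in OCaml; the helper names are our own, and the shift-based encoding is an assumption consistent with the description above:

let tag_int (n : int) : int64 =
  Int64.logor (Int64.shift_left (Int64.of_int n) 1) 1L  (* low bit 1: unboxed int *)

let untag_int (v : int64) : int =
  Int64.to_int (Int64.shift_right v 1)                  (* arithmetic shift keeps the sign *)

let is_int (v : int64) : bool = Int64.logand v 1L = 1L

let is_addr (v : int64) : bool = not (is_int v)         (* heap addresses end in 0 *)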

(Heap) objects

Stack items are inherently temporary and scoped to function calls. Items that must outlive a function or are too large to fit into a value are stored on the heap, and are referred to by an address that can be moved around on the stack. These items are called heap objects or just objects.

Every heap object is made up of several values, each 64 bits wide.

  • First, it has one value making up a header:
    • 8 bits: a tag identifying what kind of object it is (CON, CLO, PAP, or ARR);
    • every following octet represents a count or index according to the type of object: 8 bits for count 1, 8 bits for count 2, and so on, for a total of 7 counts.
  • Then it has some values, usually an amount specified in a count, or up to some limit specified in a count.
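
A sketch of the header encoding, assuming the tag lives in the least-significant octet with the counts in the octets above it (the actual octet order is an implementation choice):

let mk_header ~tag ~count1 ~count2 : int64 =
  Int64.logor (Int64.of_int tag)
    (Int64.logor
       (Int64.shift_left (Int64.of_int count1) 8)
       (Int64.shift_left (Int64.of_int count2) 16))

let hdr_tag (h : int64) : int = Int64.to_int (Int64.logand h 0xFFL)

let hdr_count (h : int64) (i : int) : int =  (* i ranges over 1..7 *)
  Int64.to_int (Int64.logand (Int64.shift_right_logical h (8 * i)) 0xFFL)

For example, mk_header ~tag:0 ~count1:3 ~count2:2 would describe a CON object for constructor 3 with 2 fields, per the layout below.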

CON - Constructors

  • ID: 0

  • count 1: constructor ID

  • count 2: number of fields

Each field is a value.

CLO - Closures

  • ID: 1

  • count 1: env size

  • first value: closure body address

  • subsequent values: environment values

PAP - Partial application

  • ID: 2

  • count 1: held arguments

  • count 2: missing arguments

  • first value: address of a CLO object

  • subsequent values: held arguments

ARR - Arrays

  • ID: 3

  • count 1: capacity of the array, expressed as a count of values

  • count 2: length of the data, expressed as a count of bytes

  • Capacity-many values follow.

  • Each value, being 64 bits wide, stores 8 entries (bytes) of the array, so e.g. 10 bytes of data require a capacity of 2.

The code segment

The code segment of a program contains all the bytecode instructions. It is divided into two sections:

  • function definitions
    • these correspond to definitions in the program of nonzero arity, e.g. def foo = fun x -> x + 2 does not need to compute anything upfront, so we store the code for x+2 into a subsection named foo.
  • the program entrypoint
    • the entrypoint code contains all the code that runs "while the module is being loaded"
    • these correspond to definitions in the program "of arity 0", e.g. def four = 2 + 2 needs to compute 2+2 and store it in a well-known location named four.

All the code of each function is stored sequentially in the code segment. Functions (whether pure functions or closure bodies) are referred to by addresses into the code segment.

The heap

Constructors and closures cause heap allocations: constructors for their fields and closures for their environments.

The example program earlier forms a closure, so its code would look like this:

push param int 5
push param addr $foo      ; load the address of the function code, which is statically known
mkclo 1                   ; construct a closure capturing one value into the environment
                          ; this consumes the function address and the one value on the stack
                          ; now the stack holds just the address of the closure on the heap
store well-known $clo_foo ; move the closure heap address into the well-known location

Since the example program was a top-level definition, it isn't anonymous! Other parts of the program will want to refer to foo and need to obtain the address of the closure that was created. Programs 'communicate' through the store_v / load_v pair of instructions (written store well-known / load well-known in the examples). In the human-readable bytecode, we use the symbol $clo_foo to refer to the well-known name; in the actual bytecode this symbol is represented as an integer encoded into the store_v instruction.
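
A sketch of how the well-known store might look inside the interpreter, assuming a flat array of module-level slots indexed by that integer (the names here are illustrative):

type vm = { args : int64 Stack.t; globals : int64 array }

let store_v vm slot = vm.globals.(slot) <- Stack.pop vm.args  (* store well-known *)
let load_v  vm slot = Stack.push vm.globals.(slot) vm.args    (* load well-known *)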

Calling a known closure

Recall the example def foo = let x = 5 in fun y -> x + y.

This allocates a closure and stores its address in the well-known location foo. The body of the closure, x + y, is compiled into a function foo_closure_1. Closure bodies are special, in that user code can never call them directly: jumps into them happen from within the interpreter, once the environment has been loaded. Likewise, closure bodies use a special return instruction that tells the interpreter to clean up the environment.

Now later on, there is an application in the program that looks like foo 2. This is what we mean by "calling a known closure".

This needs to be translated into:

push param int 2    ; the literal 2
load well-known foo ; load the closure address from the well-known location `foo`
call closure 1      ; invoke the closure with 1 argument

The "call closure" instruction deferences the closure address to fetch the environment address and the closure body address. The environment is copied into the call stack (since we can't use the argument stack for this) and the closure body address is jumped to. Since the closure body knows how many environment variables there were, it contains the necessary code to remove that many entries from the call stack before it returns. In particular, it will use the ret closure N K instruction, where N is the size of the environment and K is the number of function parameters. The interpreter will:

  • pop N values from the call stack, thus cleaning up the environment
  • move the top of the param stack (the return value of the function) onto the call stack
  • pop K values from the param stack, thus cleaning up the parameters
  • move the top of the call stack (the function return value) back onto the param stack (as expected by the caller)
  • pop the return address off of the call stack and jump there
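
A minimal sketch of this return sequence in OCaml, over a toy interpreter state (the real stack representation may differ):

type vm = {
  args  : int64 Stack.t;  (* the argument (param) stack *)
  calls : int64 Stack.t;  (* the call stack *)
  mutable pc : int;
}

let ret_closure vm ~env_size:n ~params:k =
  for _ = 1 to n do ignore (Stack.pop vm.calls) done;  (* drop the environment *)
  Stack.push (Stack.pop vm.args) vm.calls;             (* park the return value *)
  for _ = 1 to k do ignore (Stack.pop vm.args) done;   (* drop the parameters *)
  Stack.push (Stack.pop vm.calls) vm.args;             (* return value back for the caller *)
  vm.pc <- Int64.to_int (Stack.pop vm.calls)           (* jump to the return address *)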

Remark on degenerate closures and unhoisting

Closure conversion will turn every (sequence of) Fun abstraction(s) into an MkClo node, so def foo = fun x -> x + 2 will actually be turned into an MkClo node, with the code for x+2 hoisted to a definition foo_closure_1. However, this closure does not capture any environment variables; we call it a degenerate closure. Degenerate closures associated with top-level definitions get unhoisted to define a pure function rather than a closure body. (Static) calls to pure functions are optimized since we don't need to load/destroy an environment, and we avoid one indirection through the heap to get the function code address.

For reference, the body of a non-degenerate closure like foo_closure_1 (computing x + y from the running example) looks like:

load_e 0        ; load captured env var 0 (x)
load_p 0        ; load parameter var 0 (y)
add
ret closure 1 1 ; pops the environment and return address from the call stack, cleans up the parameter, sets PC to the return address

What if there are too many arguments?

In this case, we're calling a known closure that is returning an unknown function.

The code might look like f a b c, where f refers to a statically-known function that requires one argument. The generated bytecode is then:

<get a on the stack>
<get b on the stack>
<get c on the stack>
load well-known $clo_f
call closure 1
; the call removes `f` and `a` from the stack
; when the call completes, the address of a closure is on the stack instead
; so to continue, we just make a dynamic call with the remaining number of arguments
push return 2
call dynamic

Calling a known pure function

Calling known pure functions is pretty much the same as calling known closures. The difference is that when calling a pure function, we can avoid doing any work involving environments as an optimization.

The mnemonic becomes call func N, which doesn't need to set up an environment, and the code of that function returns with ret func N, which doesn't need to tear one down.

Calling an unknown function

When a function is passed as an argument to another function, we don't know how many arguments it really needs before we can actually enter it.

For example, in the expression f x y, the function f might only really take one argument, then return another function that should be applied to y. Or maybe f needs three arguments, so we need to construct a partial application holding x and y but waiting for one more argument.

The approach we use in EBCI is called eval/apply, in which the caller examines (evaluates) the function and determines whether to actually call it, to form a partial application, or to continue applying more arguments to the result of the call. The bytecode looks like this:

<get the value of x onto the stack>
<get the value of y onto the stack>
<get the value of f onto the stack>
push return 2
call dynamic

The small integer 2 represents the statically-known number of arguments passed to the function. All the hard stuff happens inside the implementation of call dynamic.

The basic logic is the same for calling a CLO or a PAP: examine the number of required arguments and compare with the given number of arguments.

  • Fewer arguments than required are given: form a PAP
  • Exactly the required number of arguments is given: make an exact call
  • Too many arguments are given: make an exact call with the required number of arguments, then repeat with the result until we run out of arguments

This last case is the most challenging to implement, since we need to store a 'remaining number of arguments' count and keep calling until it reaches zero. We use the return stack to hold this count.

icall():
    n <- rpop()
    if n == 0: jump out
    -- we consumed all arguments, so what's on the stack is only the last call's return value
    -- there are no more arguments on the stack, and the count `n` has just been removed from the
    -- call stack.

    hdr <- *(stack[SP-1]) -- the top of the stack holds the function's heap address; dereference it to get the header

    if is_closure(hdr):
        compare n with hdr.param_cnt:
            - LT: jump to implementation of mkpap_clo(n, hdr.param_cnt - n)
            - EQ: jump to implementation of dcall_clo(n)
            - GT:
                rpush(n - hdr.param_cnt) -- update count of remaining args to consume
                rpush(PC - 1) -- address of the currently executing instruction
                jump to implementation of dcall_clo(hdr.param_cnt) _after_ rpush

    if is_pap(hdr):
        compare n with hdr.missing_cnt:
            - LT: jump to implementation of mkpap_pap(n, hdr.missing_cnt - n)
            - EQ: jump to implementation of dcall_pap(n)
            - GT:
                rpush(n - hdr.missing_cnt) -- update count of remaining args to consume
                rpush(PC - 1) -- address of the currently executing icall instruction
                jump to implementation of dcall_pap(hdr.missing_cnt) _after_ rpush

The LT and EQ cases are straightforward. The GT case is the tricky one, because we need to continue by calling the returned function with the remaining number of arguments. The returned function is itself an unknown function, so the trick is to push the address of the currently executing icall onto the return stack, and then perform a 'half-call'. Normally, a call begins by pushing the address of the next instruction onto the call stack before entering the function body. Instead, we replace that push of the next instruction's address with a push of the current instruction's address. This way, when we return, we are back at call dynamic, but with the count of remaining arguments decreased!
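
For the LT branches (mkpap_clo / mkpap_pap), here is a sketch of forming a PAP over a toy heap, reusing the header layout from earlier; alloc and the storage order of the held arguments are our own choices, not the actual implementation:

type vm = {
  args : int64 Stack.t;
  heap : (int, int64 array) Hashtbl.t;  (* toy heap: even addresses map to objects *)
  mutable next_addr : int;
}

let alloc vm (obj : int64 array) : int64 =
  let a = vm.next_addr in
  vm.next_addr <- a + 2;                (* keep addresses even: LSB 0 marks a pointer *)
  Hashtbl.add vm.heap a obj;
  Int64.of_int a

(* pop the function and the `given` arguments, then allocate a PAP object
   that holds them and still waits for `missing` more arguments *)
let mkpap vm ~given ~missing =
  let fn = Stack.pop vm.args in
  let held = Array.make given 0L in
  for i = 0 to given - 1 do held.(i) <- Stack.pop vm.args done;
  let hdr = Int64.(logor 2L (logor (shift_left (of_int given) 8)
                                   (shift_left (of_int missing) 16))) in
  Stack.push (alloc vm (Array.append [| hdr; fn |] held)) vm.args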

Pattern matching compilation

How the fuck are we going to compile patterns?

At the point that we arrive at match t with p1 -> t1 | ... | pN -> tN:

  • we evaluate t, so its value is on the top of the stack
  • then we need to arrange to compile p1 so that when matching succeeds, t's value is gone from the stack and is replaced with the values corresponding to the variables in p1; but if matching fails, then the value of t needs to remain on the stack so that the next pattern can try.

These base cases are trivial because they can't fail:

  • a wildcard pattern just drops the value from the stack and succeeds
  • a variable pattern just does nothing, since the value to bind is already in position 0

There is a less obvious base case, because it can fail:

  • For a literal (let's concentrate only on integer literals for now), we check whether the top of the stack equals the literal. If it does, we succeed by doing nothing further, since the comparison will have popped the int off. If it fails, then we need to jump... to a failure continuation!

Yes, we need continuations. We have to build up in the failure continuation all the work required to destroy the partially-matched value before jumping to the next case to try. The initial continuation jumps to code that crashes on a failed match.

The only step case is for a constructor pattern. In this case, the value on the top of the stack is a heap address of a CON object, so we perform a special constructor load instruction to bring onto the stack the constructor tag (as an integer) and all the constructor's N fields.

Matching the constructor tag performs a subtraction, removing the tag from the stack, and branches on whether the result is zero.

  • If it is zero, then the constructor matches, so the first field is on top of the stack, and we recurse into the field patterns.
  • If it is not, then the constructor failed to match. The N fields of the constructor need to be removed from the stack, so the failure continuation for the field patterns jumps to a label (remove_n) that drops those N values before trying the next case.
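
A sketch of this compilation scheme over a toy pattern type and an invented instruction set (it glosses over the cleanup code that a real failure continuation has to accumulate for already-loaded fields, and folds the tag subtraction into a compare-and-branch):

type pat =
  | PWild                   (* matches anything, discards the value *)
  | PVar                    (* binds the value in place *)
  | PLit of int             (* integer literal *)
  | PCon of int * pat list  (* constructor tag and field patterns *)

type instr =
  | Drop
  | PushInt of int
  | EqJumpIfFalse of string (* compare the top two values; pop them; jump on inequality *)
  | LoadCon                 (* replace a CON address with its tag and fields *)

let rec compile_pat (p : pat) ~(on_fail : string) : instr list =
  match p with
  | PWild -> [ Drop ]
  | PVar -> []              (* the value to bind is already in position *)
  | PLit n -> [ PushInt n; EqJumpIfFalse on_fail ]
  | PCon (tag, fields) ->
      [ LoadCon; PushInt tag; EqJumpIfFalse on_fail ]
      @ List.concat_map (fun f -> compile_pat f ~on_fail) fields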

Recursive closures

def rec foo = 1 + foo

This one should rely on the idea that, at some point, a load is going to have to happen for foo; but that load would happen before the first store to foo has executed, so the interpreter can detect this and crash.

What about this:

def rec foo = let n = 5 in fun y -> foo n

During closure conversion of foo, we're going to have to convert the call foo n, but what's the arity of foo? Also, what kind of value is foo? This impacts how foo will be called.

We can calculate the static arity of a term. Oh wait, actually we can't. Look at this:

def foo = let f = fun x -> x in f

Actually, f would need to be eta-expanded once we have type information.

Why does the compiler check arity anyway?

We have a problem with variables

In running this code

def rec fib = fun n -> match n with
    | 0 -> 0
    | 1 -> 1
    | n -> fib (n + -1) + fib (n + -2)

def fib_5 = fib 5

I found a problem.

The bytecode generated for fib (n + -2) looks like this:

push param int -2
load param 0
push param addr <fib>
call func

This is wrong! The load param 0 is supposed to load the value of the variable n, but we already pushed -2 onto the stack at this point, so the load would need to be offset by one.

In general this seems to be the case with any kind of application, since those are the only constructs that introduce stuff "temporarily in the way" like this on the stack. I guess one approach to fix this during compilation is to track a stack offset for variables while compiling a spine: after compiling each spine item (from right to left), we increase the offset by one. The offset goes away after the App is compiled, because the application consumes all the values of the spine.

So I implemented this offset approach and it works for fib, but I think that in general it doesn't. Consider this:

(match l with Cons x xs -> x) + 2

The prim application spine gets compiled first, so 2 gets pushed to the stack before the code for the match runs. The code for the match will then load the CON object that l refers to onto the stack -- its fields will be on top when the branch body runs. We would want to emit a load param 1 (to skip over xs and get to x). However, because we're inside an application spine, there will be an offset of 1. Therefore, the load param 1 will actually be a load param 2, which is incorrect.

A simple offset is insufficient. We actually need a full renaming. (Remember that an offset is a special case of a renaming!) To see why the full renaming is necessary, and why it works, consider a slightly more complex example.

let y = 5 in
(match l with Cons x xs -> V) + 2

When V = x, we need to make sure we get a load param 0. When V = y, we need to make sure we get a load param 3 (and not 2), because we need to skip over the 2 that was pushed onto the stack during the construction of the prim application spine.

So how does this renaming work? It maps variable indices to stack indices. Every binding construct adds a new item to the renaming and shifts everything in the renaming up by one. Every time a temporary value is added to the stack, the renaming simply gets shifted up by one.
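
A sketch of the renaming, under de Bruijn conventions of our own: entry i of the list is the current stack offset of variable index i.

type renaming = int list

let bind (r : renaming) : renaming = 0 :: List.map succ r  (* a binder puts its value on top *)
let shift (r : renaming) : renaming = List.map succ r      (* a pushed temporary buries everything one deeper *)
let lookup (r : renaming) (i : int) : int = List.nth r i   (* Var i compiles to: load param (lookup r i) *)

The plain offset trick is then exactly the special case where only shift has been applied since the variables were bound.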

Ok, so I think this approach works, but is it the best?

The issue ultimately arises due to some operations producing anonymous results. For example, in fib, we have fib (n + -1) + fib (n + -2) which has a lot of anonymous results. If we expressed this as

let m1 = -1 in
let n1 = n + m1 in
let f1 = fib n1 in
let m2 = -2 in
let n2 = n + m2 in
let f2 = fib n2 in
f1 + f2

Then we establish a one-to-one correspondence between stack indices and variable indices, which makes the renaming unnecessary during compilation. In other words, the requirement is that spines consist only of variables.

This is essentially A-normal form, and close in spirit to CPS conversion / SSA, since we introduce a name for every intermediate result and make the order of operations completely explicit via the chain of lets.
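
A sketch of such a let-insertion pass, over a toy term type of our own; named variables and left-to-right evaluation are simplifying assumptions, not the real compiler's conventions:

type term =
  | Var of string
  | Int of int
  | App of term * term
  | Add of term * term
  | Let of string * term * term

let fresh : unit -> string =
  let n = ref 0 in
  fun () -> incr n; Printf.sprintf "t%d" !n

(* `anf t k` names every intermediate result; `k` receives an atom
   (a Var or Int) standing for the value of t *)
let rec anf (t : term) (k : term -> term) : term =
  match t with
  | Var _ | Int _ -> k t
  | Add (a, b) ->
      anf a (fun va -> anf b (fun vb ->
        let x = fresh () in Let (x, Add (va, vb), k (Var x))))
  | App (f, a) ->
      anf f (fun vf -> anf a (fun va ->
        let x = fresh () in Let (x, App (vf, va), k (Var x))))
  | Let (x, e, body) ->
      anf e (fun ve -> Let (x, ve, anf body k))

Running anf on the body of fib's third branch produces a chain of lets like the one above.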

Primitives: strings

One primitive in Eval is the string. Strings are variably-sized, and so need to be heap-allocated. However, they're a kind of heap object different from CON, CLO, and PAP objects. To represent strings, we introduce the ARR heap object for arrays. These are contiguous blobs of opaque binary data.
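
As a sketch, packing a string into an ARR object might look like the following, assuming little-endian byte order within each value, the header layout from earlier, and that both counts fit in their 8-bit fields:

let arr_of_string (s : string) : int64 array =
  let len = String.length s in
  let cap = (len + 7) / 8 in  (* 8 bytes per 64-bit value *)
  let obj = Array.make (1 + cap) 0L in
  (* header: tag 3 (ARR), count 1 = capacity in values, count 2 = length in bytes *)
  obj.(0) <- Int64.(logor 3L (logor (shift_left (of_int cap) 8)
                                    (shift_left (of_int len) 16)));
  String.iteri
    (fun i c ->
      let word = 1 + i / 8 and bit = 8 * (i mod 8) in
      obj.(word) <- Int64.logor obj.(word)
                      (Int64.shift_left (Int64.of_int (Char.code c)) bit))
    s;
  obj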