Add more thoughts about interfaces & latency counting
VonTum committed Mar 4, 2024
1 parent e5ae820 commit ebe0108
Showing 5 changed files with 81 additions and 31 deletions.
9 changes: 0 additions & 9 deletions philosophy/combinatorial_dependency.md

This file was deleted.

File renamed without changes.
File renamed without changes.
47 changes: 47 additions & 0 deletions philosophy/interfaces.md
@@ -0,0 +1,47 @@
# Interfaces

A module can have one or more interfaces, each of which consists of multiple input and output ports.

Interfaces form the only method by which one can cross latency and clock domains. Each interface has its associated hardware, which operates within a single clock and latency counting domain. Wires that belong to the same latency counting group should be placed in the same interface.

The code in one interface can read wires from other interfaces, provided that the proper crossing method is used when a clock domain boundary is crossed. Writes, however, can naturally only be done by the interface that owns the wire.

To transfer data from one interface to another, use the `cross` keyword. To ensure that multiple wires stay in sync when needed, you can cross multiple wires together: `cross wire_a, wire_b, wire_c;`. This ensures that any relative latencies are maintained in the target interface. Latencies are not kept in sync for wires in separate `cross` statements.

## Examples
Example implementation of `memory_block`:
```Verilog
module memory_block<gen int DEPTH, T> {
interface write : T data, int addr, bool wr {
T[DEPTH] memory;
if wr {
memory[addr] = data;
}
}
interface read : int addr -> T data {
cross memory;
data = memory[addr];
}
}
```

Actually, we can implement `rebase_latency` from [latency.md](latency.md).

Example implementation of `rebase_latency`:
```Verilog
// This module rebases the latency by an offset DELTA without adding registers.
module rebase_latency<gen int DELTA, T> : T data_in'0 -> T data_out'DELTA {
cross cross_o;
data_out = cross_o;
// Create an anonymous interface, such that we can break the latency dependency.
interface _ {
cross data_in;
T cross_o = data_in;
}
}
```

56 changes: 34 additions & 22 deletions philosophy/latency.md
@@ -1,7 +1,7 @@
# On Latency
# Latency Counting
For state see [state](state.md)

## Latency Counting
## Theory
Inserting latency registers on every path that requires them is an incredibly tedious job, especially if one has many signals that have to be kept in sync for every latency register added. This is why I propose a terse pipelining notation. Simply add the `reg` keyword to any critical path, and any paths running parallel to it will get latency added to compensate. This is accomplished by adding a 'latency' field to every path. Starting from an arbitrary starting point, all locals connected to it can then be given an 'absolute' latency value, where locals dependent on multiple paths take the maximum latency of their source paths. From this we can then recompute the path latencies to be exact latencies, and add the necessary registers.

Example:
@@ -13,43 +13,55 @@ B -- reg --/------------------/
```

### Combinatorial loops with latency are still combinatorial loops
This is in my opinion a big benefit to making the distinction. When inserting latency registers, we are saying in effect "If we could perform these computations instantaneously, we would", and thus, a loop containing latency registers would still be a combinatorial loop.
This is, in my opinion, a big benefit of making the distinction. When inserting latency registers, we are saying in effect "If we could perform these computations instantaneously, we would", and thus a loop containing latency registers would still be a combinatorial loop. Sadly, this does break down a little when explicitly building a pipelined loop. Also, combinatorial dependencies could show up across interfaces as well. Perhaps we should rethink this feature.

### Latency counting with state
Of course, state registers are also moved around by latency. This means that while it appears like two state modules get updated at the same time, if they are independent they need not.
Of course, state registers are also moved around by latency. This means that while it appears as if (and we want the programmer's mental model to say that) two state registers get updated at the same time, they may actually be offset from one another in time.

However, state registers should not count towards the latency count. So specifying `reg reg` should increase the latency count by 2, but specifying `state` does not. This makes sense, because this means a feedback loop to a state register has a latency of 0, which it requires to stay within. Also, this maintains that by removing all latency registers, the total latency count becomes 0 on all ports.

If this rule holds for all possible hardware designs is up for further research.
However, state registers do not count towards the latency count. So specifying `reg` increases the latency count by 1, but specifying `state` does not. This makes sense, because state registers are meant to carry data across cycles, whereas latency registers are only meant for meeting timing closure, and don't allow sequential data packets to interact.
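
To make the counting rule concrete, here is a small illustrative sketch (in Python, with made-up names rather than the compiler's actual data structures) of how latency-graph edges would be weighted, and why a `state` feedback loop legally has a roundtrip latency of 0:

```Python
# Illustrative sketch only, not the compiler's code. Every edge in the latency graph
# carries the latency it adds: +1 for each `reg` on the path, 0 for a `state` register.
REG = 1    # a latency register adds one cycle of counted latency
STATE = 0  # a state register carries data across cycles but adds no counted latency

def roundtrip_latency(edge_weights):
    """Total latency around a feedback loop; the loop is only legal if this is <= 0."""
    return sum(edge_weights)

# A state register fed back through purely combinatorial logic (as in the Accumulator
# example further down): no `reg` in the loop, so the roundtrip latency is 0 and legal.
assert roundtrip_latency([0, STATE]) <= 0

# Putting a `reg` inside that same loop would push the roundtrip latency above 0,
# which the latency counting system must reject.
assert roundtrip_latency([REG, STATE]) > 0
```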

### Maximum Latency Requirements
It's the intention of the language to hide fixed-size latency as much as possible, making it easy to create pipelined designs.

Often however, there are limits to how long latency is allowed to be. The most common case is a state to itself feedback loop. If a state register must be updated every cycle, and it depends on itself, the loopback computation path may not include any latency.
Often however, there are limits to how long latency is allowed to be. The most common case is a `state` to itself feedback loop. If a state register must be updated every cycle, and it depends on itself, the loopback computation path may not include any latency.

For example, a FIFO with an almost_full threshold of _N_ may have at most a `ready_out -> valid_in` latency of _N_.
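
To make the arithmetic behind that bound explicit, here is a small illustrative check (Python, hypothetical numbers; it assumes the threshold _N_ means that _N_ slots are still free at the moment `almost_full` asserts):

```Python
# Illustrative only. If almost_full asserts while `free_slots_at_warning` slots remain,
# then up to `ready_to_valid_latency` further writes can still arrive after ready_out
# deasserts, so the FIFO stays within capacity exactly when that latency is <= N.
def can_overflow(depth: int, free_slots_at_warning: int, ready_to_valid_latency: int) -> bool:
    occupancy_at_warning = depth - free_slots_at_warning
    worst_case_occupancy = occupancy_at_warning + ready_to_valid_latency
    return worst_case_occupancy > depth

assert not can_overflow(depth=16, free_slots_at_warning=4, ready_to_valid_latency=4)
assert can_overflow(depth=16, free_slots_at_warning=4, ready_to_valid_latency=5)
```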

For state to state paths, this could be relaxed in several ways:
- If it is proven the register won't be read for some cycles, then the latency can be hidden in these cycles. (Requires complex validity checking)
- Slow the rate of state updating to the maximum latency, possibly allow automatic C-Slowing.

#### Implementation using negative back-edges

We can provide a primitive construct to the user to enable them to use latencies at all, which is the _negative latency backedge_. The user simply instantiates this module betwheen two points. The negative backedge requires that it is part of a combinatorial loop.
#### Breaking out of latency counting
Feed-forward pipelines are an important and useful construct, but they can't describe all hardware. We need to be able to break out of the latency counting system and tell it that there exists a latency differential between two nodes, without instantiating registers between them. In the case where the dependent latency is made earlier than the source one, we call this edge a 'negative backedge'. This should be provided as a template module in the standard library:
```Verilog
module rebase_latency<T, gen int delta> : T i'0 -> T o'delta {/*...*/}
```

Usually it negative backedges would be found on the output of a module, which is combinatorially 'before' the inputs that should be returned to it (See [Combinatorial Dependency](combinatorial_dependency.md)). As an example, we can have a module with an internal negative backedge of -3, which itself contains some state that the backedge can originate from. The module wraps the backedge this way, and proves all of the safety requirements that come with using it. The user then is free to connect the output of this module combinatorially with the input, and with at most 3 cycles of latency.
As an example, we can have a module with an internal negative backedge of -3, which itself contains some state that the backedge can originate from. The module wraps the backedge this way, and proves all of the safety requirements that come with using it. The user is then free to connect the output of this module combinatorially back to the input, with at most 3 cycles of latency.

![Negative Backedge Concept](images/negativeBackedgeConcept.png)

As a more concrete example, consider the write side of a FIFO.
![FIFO Negative Backedge](images/fifoExample.png)

### Latency specification
Specifying latencies on every single input and output is a bit cumbersome, so we wish to infer latencies as much as possible. Sometimes, however, specific constructions with indeterminable latencies require the user to explicitly specify them. We will explore such cases in later chapters.

When the user specifies latencies on multiple nodes, they are in effect stating that they know what the exact latencies between the given nodes are. Theoretically, we should include this in the algorithm as extra edges between each pair of nodes `a` and `b`, of latency `lb - la` forwards and `-(lb - la)` backwards. This fixes the latency differential to be as the user specifies it. In practice, the Latency Counting Algorithm doesn't actually do it this way, but instead handles specified latencies in a dedicated part of the algorithm, to provide the user with more readable error messages. In any case, the latencies assigned by the algorithm should be equivalent to those obtained with the extra edges.
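
Sketched conceptually in Python (illustrative names only; as noted, the real algorithm handles specified latencies in a dedicated pass rather than through these literal edges):

```Python
# Conceptual encoding only. An edge (src, dst, w) constrains dst to be at least w cycles
# later than src; the opposing pair below pins the differential to exactly lb - la.
def specified_latency_edges(a: str, la: int, b: str, lb: int):
    delta = lb - la
    return [(a, b, delta), (b, a, -delta)]

# The user declared i'0 and o'5, so o must end up exactly 5 cycles later than i.
assert specified_latency_edges("i", 0, "o", 5) == [("i", "o", 5), ("o", "i", -5)]
```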

If the user does not provide any latency annotations, then we allow the compiler to set an arbitrary node's latency to 0 as a seed for the algorithm. Because input and output nodes have to be fully constrained, the compiler picks one of these if possible.

## Requirements for the latency counting system
- Addition or removal of latency registers that do not violate a constraint must not affect the operation of the design.
- Feedback loops containing only latency are considered combinatorial feedback loops, and are therefore forbidden. Feedback loops must therefore have at least one state register in them.
- When the user specifies a latency of 1 somewhere using the `reg` keyword, this instructs the compiler that the _minimum_ latency between these points is now 1. The compiler is allowed to insert additional latency registers between any two points as it sees fit.
- When the user specifies a latency of 1 somewhere using the `reg` keyword, this instructs the compiler that the _minimum_ latency between these points is now 1. The compiler is allowed to insert additional latency registers between any two nodes as it sees fit.
- State registers do not impact the latency counting. They count as 0 latency.
- Any loop (which must contain at least one state register) must have a roundtrip latency ≤ 0. Negative values are permitted, and are simply attributed to the use of negative back edges.
- Specified latencies must be matched exactly.

### Extra requirements to allow latency inference
- The latency between input and output nodes that have a combinatorial connection must be *minimal*. In other words, if an output `o'lo` is reachable from an input `i'li` by only following forward dependencies, then `|lo|-|li|` is exactly the latency of the longest path between them.
- Nodes that are not an input or output, don't have a latency specified, and have multiple options for their latency are set to the earliest possible latency.

## Latency Counting Graph Algorithm
We are given a directed graph of all wires and how they combinatorially depend on each other. Each edge can have a number of latency registers placed on it.
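
A minimal sketch of that counting pass (Python, illustrative data structures only; it assumes an acyclic graph whose nodes are all reachable from the seed, whereas the real algorithm also propagates constraints backwards and handles `state` registers and specified latencies):

```Python
# Illustrative sketch, not the compiler's implementation. `edges` maps each node to a
# list of (successor, min_latency) pairs, where min_latency counts the `reg`s the user
# already placed on that edge. `topo_order` is a topological ordering of the nodes.
def assign_absolute_latencies(topo_order, edges, seed):
    absolute = {seed: 0}  # latencies are relative, so the seed may be pinned to 0
    for node in topo_order:
        if node not in absolute:
            continue
        for successor, min_latency in edges.get(node, []):
            candidate = absolute[node] + min_latency
            # a node fed by multiple paths takes the maximum latency of its sources
            absolute[successor] = max(absolute.get(successor, candidate), candidate)
    return absolute

def registers_to_insert(edges, absolute):
    # the registers inserted on an edge make up the difference between the assigned
    # latency differential and the `reg`s already written on that edge
    return {
        (node, successor): absolute[successor] - absolute[node] - min_latency
        for node, successors in edges.items()
        for successor, min_latency in successors
    }
```

With the absolute latencies assigned, the compiler can then materialise the counts from `registers_to_insert` as the actual pipeline registers it adds to the design.
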
@@ -60,10 +72,10 @@ Example:
module Accumulator : int term, bool done -> int total_out {
state int total;
reg int term_twice = term * 2;
int new_total = total + term;
if done {
reg total_out = total + term_twice;
reg total_out = new_total;
total = 0;
} else {
total = new_total;
@@ -78,12 +90,12 @@ Nodes are coloured by role. Blue nodes are inputs, green nodes are outputs, oran

On the edges, the minimum latency offsets are noted in black. These are given. The goal of the algorithm is to compute a set of 'absolute latencies', which are all relative to an arbitrary node. These are shown in red in the picture. Because these absolute latencies are relative to an arbitrary reference point, we accept any constant shift applied to all absolute latencies as equivalent.

### Non Deterministic inference of Input and Output absolute latencies
Sadly, while it appears reasonable to think it's possible to assign a deterministic latency. Observe this contrived example:
### Non Determinable inference of Input and Output absolute latencies
Sadly, while it appears reasonable to think it's possible to assign a determinable latency, it is not. Observe this contrived example:

```Verilog
// timeline is omitted here, not important
module NonDeterministic : int a, int b -> int x, int y {
module NonDeterminable : int a, int b -> int x, int y {
reg int a_d = a;
reg int t = a_d + b;
reg reg reg int a_ddd = a;
@@ -93,14 +105,14 @@
```

Simplified latency graph:
![Non Determinism](images/nonDeterminism.png)
![Non Uniquely Determinable Example](images/nonDeterminable.png)

The issue starts when the inputs and outputs don't have predefined absolute latency. We are tempted to add maximization and minimization to the input and output absolute latencies, to force the module's latency span to be as compact as possible, and therefore maximize how free the user of this module is in using it. But sadly, we cannot make a deterministic latency assignment for our inputs and outputs, as in this example b and y press against each other, permitting two possible implementations.
The issue starts when the inputs and outputs don't have predefined absolute latency. We are tempted to add maximization and minimization to the input and output absolute latencies, to force the module's latency span to be as compact as possible, and therefore maximize how free the user of this module is in using it. But sadly, we cannot make a uniquely determinable latency assignment for our inputs and outputs, as in this example b and y press against each other, permitting two possible implementations.

One may think the solution would simply be to prefer inputs over outputs or something, just to get a deterministic latency assignment. Just move b to be the earliest of the available latencies, but even in this case, if we instead looked at the possibilities of a, and fixed b, we would again make b later by making a earlier. And since there's no way to distinguish meaningfully between inputs, there's no deterministic solution either.
One may think the solution would simply be to prefer inputs over outputs, just to get a unique latency assignment: move `b` to the earliest of the available latencies. But even in this case, if we instead looked at the possibilities for `a` and fixed `b`, we would again make `b` later by making `a` earlier. And since there's no way to distinguish meaningfully between inputs, there's no uniquely determinable solution either.

To this problem I only really see three options:
- Still perform full latency computation when compiling each module separately. In the case of non-deterministic latency assignment, reject the code and require the programmer to add explicit latency annotations. The benefit is better encapsulation, the programmer requires only the module itself to know what latencies are. The downside is of course less flexible modules. Though is this flexibility _really_ needed?
- Still perform full latency computation when compiling each module separately. In the case of a non-determinable latency assignment, reject the code and require the programmer to add explicit latency annotations. The benefit is better encapsulation: the programmer needs only the module itself to know what the latencies are. The downside is of course less flexible modules. Though is this flexibility _really_ needed?
- Infer absolute latencies on the inputs and outputs of submodules through template parameters that can be inferred. This would be really handy for letting latency information flow back into the templating system, thus allowing a FIFO that alters its almostFull threshold based on its input latency. Of course, this makes absolute latency information flow top-down instead of bottom-up, so getting the latency information back from the module would be impossible. The issue is that templates can't be instantiated partially: either the submodule takes all of its port latencies from the calling module, or it determines its latencies itself.
- Perform latency computation at the integration level: we don't define the absolute latencies on the ports of a module unless the programmer explicitly does so. For simplicity, this requires that every single module instantiation now compiles to its own Verilog module, which is less than ideal for debugging.

