Merge pull request #956 from pascalgouedo/dev_dd_pgo_doc
Some User Manual updates.
davideschiavone authored Mar 12, 2024
2 parents e0772a0 + 3a3b4c4 commit 1ad59cb
Showing 4 changed files with 23 additions and 11 deletions.
3 changes: 3 additions & 0 deletions docs/source/conf.py
@@ -98,6 +98,9 @@
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None

+# Tags for conditional text
+#tags.add('USER')
+#tags.add('PMP')

# -- Options for HTML output -------------------------------------------------

9 changes: 6 additions & 3 deletions docs/source/corev_hw_loop.rst
@@ -63,9 +63,12 @@ The HWLoop constraints are:

- HWLoop body must contain at least 3 instructions.

-- When both loops are nested, the End address of the outermost HWLoop (must be #1) must be at least 2
-instructions further than the End address of the innermost HWLoop (must be #0),
-i.e. HWLoop[1].endaddress >= HWLoop[0].endaddress + 8.
+- When both loops are nested, at least 1 instruction should be present between the last instruction of the innermost HWLoop (must be #0) and
+the last instruction of the outermost HWLoop (must be #1). In other words, the End address of the outermost HWLoop must be at least 8
+bytes further than the End address of the innermost HWLoop (HWLoop[1].endaddress >= HWLoop[0].endaddress + 8).

+In the example below, the first "addi %[j], %[j], 2;" instruction is the one added due to this constraint.
+The code could have been simpler with a single "addi %[j], %[j], 4;" instruction, but it has been split into two instructions to respect this constraint.

- HWLoop must always be entered from its start location (no branch/jump to a location inside a HWLoop body).
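
The nested-loop End address rule quoted in this hunk can be checked mechanically. The following is a minimal sketch, assuming the two End addresses are available as plain byte addresses; the function and constant names are hypothetical and not part of any CV32E40P tool.

```python
# Minimum distance taken from the constraint text:
# HWLoop[1].endaddress >= HWLoop[0].endaddress + 8
MIN_END_GAP_BYTES = 8


def nested_hwloop_ends_ok(inner_end: int, outer_end: int) -> bool:
    """Return True when the outer loop (#1) ends far enough after the inner loop (#0).

    Both arguments are byte addresses (assumption of this sketch).
    """
    return outer_end >= inner_end + MIN_END_GAP_BYTES


if __name__ == "__main__":
    # Outer End address is 8 bytes past the inner one: constraint satisfied.
    print(nested_hwloop_ends_ok(0x1010, 0x1018))
    # Outer End address is only 4 bytes past the inner one: constraint violated.
    print(nested_hwloop_ends_ok(0x1010, 0x1014))
```

A check like this could run over a disassembly listing before simulation, but how the End addresses are extracted is left open here.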

4 changes: 2 additions & 2 deletions docs/source/instruction_set_extensions.rst
@@ -789,15 +789,15 @@ General ALU operations
| | |
| | else rD = rs1 |
| | |
-| | Note: rs2 is unsigned. |
+| | Note: rs2 is unsigned and must be in the range (0x0-0x7FFFFFFF). |
+-------------------------------------------+------------------------------------------------------------------------+
| **cv.clipur rD, rs1, rs2** | if rs1 <= 0, rD = 0, |
| | |
| | else if rs1 >= rs2, rD = rs2, |
| | |
| | else rD = rs1 |
| | |
-| | Note: rs2 is unsigned. |
+| | Note: rs2 is unsigned and must be in the range (0x0-0x7FFFFFFF). |
+-------------------------------------------+------------------------------------------------------------------------+
| **cv.addN rD, rs1, rs2, Is3** | rD = (rs1 + rs2) >>> Is3 |
| | |
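
The cv.clipur row in this hunk can be modelled in a few lines of Python. This is only a sketch of the semantics as written in the table: the function name `cv_clipur` is hypothetical, `rs1` is assumed to be interpreted as a signed value, and raising an exception for an out-of-range `rs2` is a modelling choice (the note says the value must be in range, not what the hardware does otherwise).

```python
RS2_MAX = 0x7FFFFFFF  # upper bound of the rs2 range stated in the updated note


def cv_clipur(rs1: int, rs2: int) -> int:
    """Model of the cv.clipur row: clip rs1 to the interval [0, rs2].

    rs1 is treated as a signed integer (assumption of this sketch).
    """
    if not 0 <= rs2 <= RS2_MAX:
        # Modelling choice: reject values outside the documented range.
        raise ValueError("rs2 must be in the range 0x0-0x7FFFFFFF")
    if rs1 <= 0:
        return 0
    if rs1 >= rs2:
        return rs2
    return rs1


if __name__ == "__main__":
    for value in (-5, 7, 15):
        print(value, "->", cv_clipur(value, 10))
```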
18 changes: 12 additions & 6 deletions docs/source/integration.rst
@@ -259,21 +259,27 @@ be provided.
FPGA Synthesis
^^^^^^^^^^^^^^^

-FPGA synthesis is only supported for CV32E40P.
-The user needs to provide a technology specific implementation of a clock gating cell as described
-in :ref:`clock-gating-cell`.
+FPGA synthesis is supported for CV32E40P and has been successfully implemented using both AMD® Vivado® and Intel® Quartus® Prime Pro Edition tools.

+Due to some advanced SystemVerilog features used by the CV32E40P RTL design, Intel® Quartus® Prime Standard Edition isn't able to parse some CV32E40P SystemVerilog files.

+The user needs to provide a technology-specific implementation of a clock gating cell as described in :ref:`clock-gating-cell`.

.. _synthesis_with_fpu:

Synthesizing with the FPU
^^^^^^^^^^^^^^^^^^^^^^^^^

-By default the pipeline of the FPU is purely combinatorial (FPU_*_LAT = 0). In this case FPU instructions latency is the same as simple ALU operations (except FP multicycle DIV/SQRT ones).
+By default the pipeline of the FPU is purely combinatorial (FPU_*_LAT = 0). In this case FPU instructions latency is the same as simple ALU operations (except multicycle FDIV/FSQRT ones).
But as FPU operations are much more complex than ALU ones, the maximum achievable frequency is much lower when the FPU is enabled.

While this can be fine for low-frequency systems, it is possible to specify how many pipeline registers are instantiated in the FPU to reach a higher target frequency.
-This is done with FPU_*_LAT CV32E40P parameters setting to perfectly fit target frequency.
+This is done by adjusting the FPU_*_LAT CV32E40P parameter settings to fit the target frequency.

+It should be noted that any additional pipeline register increases FPU instruction latency and could degrade performance, depending on how applications use floating-point operations.

Those pipeline registers are all added at the end of the FPU pipeline with all operators before them. Optimal frequency is only achievable using automatic retiming commands in implementation tools.
-This can be achieved with the following command for Synopsys Design Compiler:
+As an example, this can be done for Synopsys® Design Compiler with the following command:

“set_optimize_registers true -designs [get_object_name [get_designs "\*cv32e40p_fp_wrapper\*"]]”.
