Clarification updates to IOMMU v1.0.0 #243

Merged
merged 17 commits into from
Sep 11, 2024
Commits
e57a159
A set of typographic errors and editorial updates were made.
ved-rivos Jul 24, 2023
8b64911
Clarified that translations cached in IOMMU ATC do not require explicit
ved-rivos Aug 15, 2023
5b3a480
Clarified that memory faults encountered by commands also set the `cq…
ved-rivos Aug 28, 2023
89c53f5
Clarified that values tested by the algorithm in the SW Guidelines se…
ved-rivos Sep 8, 2023
51fb752
Included SW guidelines for modifying non-leaf PDT entries.
ved-rivos Sep 8, 2023
f8b00bf
Clarified the behavior for in-flight transactions observed at the tim…
ved-rivos Sep 19, 2023
cd6c798
Clarified the behavior when `IOTINVAL` is invoked with an invalid add…
ved-rivos Feb 24, 2024
c343c04
Stated that faults leading to UR/CA ATS responses are reported in the…
ved-rivos Apr 9, 2024
02fee6d
Added a detailed description of the `capabilities.PAS` field.
ved-rivos Apr 9, 2024
61a3527
Included software guidelines for changing IOMMU modes and provided
ved-rivos Apr 17, 2024
628db60
Stated that the PCIe specification requires granting execute permission
ved-rivos Apr 19, 2024
371eb3d
Clarified the handling of hardware implementations that internally split
ved-rivos Apr 19, 2024
1e3e13e
Noted that shadow stack encodings introduced by Zicfiss are reserved
ved-rivos Apr 21, 2024
29147aa
Listed the fault codes reported for faults detected by Page Request.
ved-rivos Apr 25, 2024
173d29e
Updated Fig 31 to remove the unused Destination ID field for ATS.PRGR
ved-rivos Jun 1, 2024
efea91a
Included a software guideline for IOMMU emulation.
ved-rivos Jun 12, 2024
0076b5f
Include QoS ID standard extension
ved-rivos Jul 13, 2024
6 changes: 1 addition & 5 deletions src/images/ddt-base.svg
6 changes: 1 addition & 5 deletions src/images/ddt-ext.svg
6 changes: 1 addition & 5 deletions src/images/guest-OS.svg
6 changes: 1 addition & 5 deletions src/images/hypervisor.svg
6 changes: 1 addition & 5 deletions src/images/msi-imsic.svg
6 changes: 1 addition & 5 deletions src/images/non-virt-OS.svg
6 changes: 1 addition & 5 deletions src/images/pdt.svg
48 changes: 48 additions & 0 deletions src/iommu.bib
@@ -17,3 +17,51 @@ @electronic{AIA
title = {RISC-V Advanced Interrupt Architecture},
url = {https://github.com/riscv/riscv-aia}
}
@electronic{CFI,
title = {RISC-V Shadow Stacks and Landing Pads},
url = {https://github.com/riscv/riscv-cfi}
}
@electronic{PR243,
title = {Clarification updates to IOMMU v1.0.0},
url = {https://github.com/riscv-non-isa/riscv-iommu/pull/243/commits}
}
@electronic{CBQRI,
title = {RISC-V Capacity and Bandwidth QoS Register Interface},
url = {https://github.com/riscv-non-isa/riscv-cbqri}
}
@article{PTCAMP,
author = {Du Bois, Kristof and Eyerman, Stijn and Eeckhout, Lieven},
title = {Per-Thread Cycle Accounting in Multicore Processors},
year = {2013},
issue_date = {January 2013},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {9},
number = {4},
issn = {1544-3566},
url = {https://doi.org/10.1145/2400682.2400688},
doi = {10.1145/2400682.2400688},
journal = {ACM Trans. Archit. Code Optim.},
month = {jan},
articleno = {29},
numpages = {22},
}
@inproceedings{HERACLES,
author = {Lo, David and Cheng, Liqun and Govindaraju, Rama and Ranganathan, Parthasarathy and Kozyrakis, Christos},
title = {Heracles: Improving Resource Efficiency at Scale},
year = {2015},
isbn = {9781450334020},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2749469.2749475},
doi = {10.1145/2749469.2749475},
booktitle = {Proceedings of the 42nd Annual International Symposium on Computer Architecture},
pages = {450–462},
numpages = {13},
location = {Portland, Oregon},
series = {ISCA '15}
}
@electronic{SSQOSID,
title = {RISC-V Quality-of-Service (QoS) Identifiers},
url = {https://github.com/riscv/riscv-ssqosid}
}
120 changes: 82 additions & 38 deletions src/iommu_data_structures.adoc
@@ -140,6 +140,8 @@ is not prohibited by this specification.
The DDT is a 1, 2, or 3-level radix-tree indexed using the device directory
index (DDI) bits of the `device_id` to locate a `DC`.
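As an illustration of the indexing described above, the following sketch splits a `device_id` into its DDI fields. The field widths are assumptions that do not appear in this hunk: they follow from 4 KiB tables holding 8-byte non-leaf entries (9 index bits) and either 32-byte base-format or 64-byte extended-format `DC` entries (7 or 6 leaf index bits). The function name is hypothetical.

```python
# Illustrative sketch only. The DDI widths are assumptions derived from
# 4 KiB tables, 8-byte non-leaf entries, and 32-byte (base) or 64-byte
# (extended) device contexts; they are not stated in this hunk.
def split_device_id(device_id, extended_format):
    """Return (DDI[0], DDI[1], DDI[2]) for a device_id of up to 24 bits."""
    if not 0 <= device_id < (1 << 24):
        raise ValueError("device_id must fit in 24 bits")
    ddi0_bits = 6 if extended_format else 7  # leaf table index width
    ddi1_bits = 9                            # non-leaf table index width
    ddi0 = device_id & ((1 << ddi0_bits) - 1)
    ddi1 = (device_id >> ddi0_bits) & ((1 << ddi1_bits) - 1)
    ddi2 = device_id >> (ddi0_bits + ddi1_bits)  # remaining upper bits
    return ddi0, ddi1, ddi2
```

A 1-level DDT would use only `DDI[0]`, a 2-level DDT `DDI[1]` and `DDI[0]`, and a 3-level DDT all three.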

<<<

The following diagrams illustrate the DDT radix-tree. The PPN of the root
device-directory-table is held in a memory-mapped register called the
device-directory-table pointer (`ddtp`).
@@ -150,7 +152,7 @@ next device-directory-table.
A valid leaf device-directory-table entry holds the device-context (`DC`).

.Three, two and single-level device directory with extended format `DC`
image::ddt-ext.svg[width=800,height=400]
image::ddt-ext.svg[width=800,height=400, align="center"]
//["ditaa",shadows=false, separation=false, font=courier, fontsize: 16]
//....
// +-------+-------+-------+ +-------+-------+ +-------+
@@ -174,7 +176,7 @@ image::ddt-ext.svg[width=800,height=400]
//....

.Three, two and single-level device directory with base format `DC`
image::ddt-base.svg[width=800,height=400]
image::ddt-base.svg[width=800,height=400, align="center"]
//["ditaa",shadows=false, separation=false, font=courier, fontsize: 16]
//....
// +-------+-------+-------+ +-------+-------+ +-------+
@@ -213,6 +215,8 @@ A valid (`V==1`) non-leaf DDT entry provides the PPN of the next level DDT.
], config:{lanes: 2, hspace:1024, fontsize: 16}}
....

<<<

==== Leaf DDT entry
The leaf DDT page is indexed by `DDI[0]` and holds the device-context (`DC`).

@@ -312,17 +316,22 @@ Such addresses also cannot be routed within the device when peer-to-peer
transactions within the device (e.g. between functions of a device) are
supported.

Use of `T2GPA` set to 1 may not be compatible with devices that implement caches
tagged by the translated address returned in response to a PCIe ATS Translation
Request.
====

<<<

[NOTE]
====
Hypervisors that configure `T2GPA` to 1 must ensure through protocol-specific
means that translated accesses are routed through the host such that the IOMMU
may translate the GPA and then route the transaction based on PA to memory or
to a peer device. For PCIe, for example, the Access Control Service (ACS) must
be configured to always redirect peer-to-peer (P2P) requests upstream to the
host.

Use of `T2GPA` set to 1 may not be compatible with devices that implement caches
tagged by the translated address returned in response to a PCIe ATS Translation
Request.

As an alternative to setting `T2GPA` to 1, the hypervisor may establish a trust
relationship with the device if authentication protocols are supported by the
device. For PCIe, for example, the PCIe component measurement and authentication
@@ -406,8 +415,7 @@ When `SXL` is 1, the following rules apply:
* If the first-stage is not Bare, then a page fault corresponding to the original
access type occurs if the `IOVA` has bits beyond bit 31 set to 1.
* If the second-stage is not Bare, then a guest page fault corresponding to the
original access type occurs if the incoming GPA has bits beyond bit 33 set to
1.
original access type occurs if the incoming GPA has bits beyond bit 33 set to 1.
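The two range rules above can be sketched as a single check. This is a minimal illustration, assuming the caller already knows that `SXL` is 1 and that the stage in question is not Bare; the function name and the `stage` strings are hypothetical.

```python
# Illustrative sketch of the SXL == 1 range rules. The caller is assumed
# to have already established that the relevant stage is not Bare.
def sxl_address_in_range(addr, stage):
    """Return True if addr has no bits set beyond the limit for the stage.

    stage "first"  checks an IOVA against bit 31 (page fault otherwise).
    stage "second" checks a GPA against bit 33 (guest page fault otherwise).
    """
    limit_bit = {"first": 31, "second": 33}[stage]
    return (addr >> (limit_bit + 1)) == 0
```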

===== IO hypervisor guest address translation and protection (`iohgatp`)

@@ -437,11 +445,11 @@ encodings are as follows:

[[IOHGATP_MODE_ENC]]
.Encodings of `iohgatp.MODE` field
[width=75%]
[%header, cols="3,3,20"]
[%autowidth,float="center",align="center"]
[%header, cols="^3,^3,20"]
|===
3+^| `fctl.GXL=0`
|Value | Name | Description
^|Value ^| Name ^| Description
| 0 | `Bare` | No translation or protection.
| 1-7 | -- | Reserved for standard use.
| 8 | `Sv39x4` | Page-based 41-bit virtual addressing (2-bit extension
@@ -476,6 +484,7 @@ the PTEs from the first page table or the second page table. These are the only
expected behaviors.
====

[[DC_TA]]
===== Translation attributes (`ta`)

.Translation attributes (`ta`) field
@@ -484,7 +493,9 @@ expected behaviors.
{reg: [
{bits: 12, name: 'reserved'},
{bits: 20, name: 'PSCID'},
{bits: 32, name: 'reserved'},
{bits: 8, name: 'reserved'},
{bits: 12, name: 'RCID'},
{bits: 12, name: 'MCID'},
], config:{lanes: 2, hspace: 1024, fontsize: 16}}
....

@@ -494,6 +505,21 @@ fences on a per-address-space basis. The `PSCID` field in `ta` is used as the
address-space ID if `DC.tc.PDTV` is 0 and the `iosatp.MODE` field is not `Bare`.
When `DC.tc.PDTV` is 1, the `PSCID` field in `ta` is ignored.

The `RCID` and `MCID` fields are added by the QoS ID extension. If
`capabilities.QOSID` is 0, these bits are reserved and must be set to 0.
IOMMU-initiated requests for accessing the following data structures use the
value configured in the `RCID` and `MCID` fields of `DC.ta`.

* Process directory table (`PDT`)
* Second-stage page table
* First-stage page table
* MSI page table
* Memory-resident interrupt file (`MRIF`)

The `RCID` and `MCID` configured in `DC.ta` are provided to the IO bridge on
successful address translations. The IO bridge should associate these QoS IDs
with device-initiated requests.
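To make the updated `ta` layout concrete, the sketch below packs and unpacks the three fields. The bit positions are read off the register diagram in this hunk, counting from the least significant bit: `PSCID` at 31:12, `RCID` at 51:40, `MCID` at 63:52. The helper names are hypothetical and not part of the specification.

```python
# Illustrative sketch of the 64-bit DC.ta layout after this change.
# (shift, width) pairs are derived from the register diagram, LSB first.
PSCID = (12, 20)
RCID = (40, 12)
MCID = (52, 12)

def _get(value, field):
    shift, width = field
    return (value >> shift) & ((1 << width) - 1)

def _put(value, field):
    shift, width = field
    if value >> width:
        raise ValueError("value does not fit in field")
    return value << shift

def encode_ta(pscid, rcid=0, mcid=0):
    """Pack the fields into a ta value; reserved bits stay 0."""
    return _put(pscid, PSCID) | _put(rcid, RCID) | _put(mcid, MCID)

def decode_ta(ta):
    """Return (PSCID, RCID, MCID) extracted from a ta value."""
    return _get(ta, PSCID), _get(ta, RCID), _get(ta, MCID)
```

When `capabilities.QOSID` is 0, software would leave `rcid` and `mcid` at their default of 0, since those bits are then reserved.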

===== First-Stage context (`fsc`)
If `DC.tc.PDTV` is 0, the `DC.fsc` field holds the `iosatp` that provides
the controls for first-stage address translation and protection.
@@ -524,11 +550,11 @@ address.

[[IOSATP_MODE_ENC]]
.Encodings of `iosatp.MODE` field
[width=75%]
[%header, cols="3,3,20"]
[%autowidth,float="center",align="center"]
[%header, cols="^3,^3,20"]
|===
3+^| `DC.tc.SXL=0`
|Value | Name | Description
^|Value ^| Name ^| Description
| 0 | `Bare` | No translation or protection.
| 1-7 | -- | Reserved for standard use.
| 8 | `Sv39` | Page-based 39-bit virtual addressing.
@@ -571,11 +597,11 @@ directly edit the PDT to associate a virtual-address space identified by a
first-stage page table with a `process_id`.

[[PDTP_MODE_ENC]]
.Encoding of `pdtp.MODE` field
[width=75%]
[%header, cols="3,3,20"]
.Encodings of `pdtp.MODE` field
[%autowidth,float="center",align="center"]
[%header, cols="^3,^3,20"]
|===
|Value | Name | Description
^|Value ^| Name ^| Description
| 0 | `Bare` | No first-stage address translation or protection.
| 1 | `PD8` | 8-bit process ID enabled. The directory has 1 level with
256 entries. The bits 19:8 of `process_id` must be 0.
@@ -607,11 +633,13 @@ defined by the Advanced Interrupt Architecture specification.

The `msiptp.MODE` field is used to select the MSI address translation scheme.

.Encoding of `msiptp.MODE` field
[width=75%]
[%header, cols="3,3,20"]
<<<

.Encodings of `msiptp.MODE` field
[%autowidth,float="center",align="center"]
[%header, cols="^3,^3,20"]
|===
|Value | Name | Description
^|Value ^| Name ^| Description
| 0 | `Off` | Recognition of accesses to
a virtual interrupt file using MSI address mask and
pattern is not performed.
@@ -706,6 +734,8 @@ misconfigured" (cause = 259).
. `DC.tc.SBE` value is not a legal value. If `fctl.BE` is writable
then `DC.tc.SBE` may be 0 or 1. If `fctl.BE` is not writable then
`DC.tc.SBE` must be the same as `fctl.BE`.
. `capabilities.QOSID` is 1 and `DC.ta.RCID` or `DC.ta.MCID` values
are wider than that supported by the IOMMU.
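The last two conditions in the list above can be sketched as a predicate. This is a minimal illustration, not the specified algorithm: `rcid_width` and `mcid_width` are hypothetical parameters standing in for the ID widths an implementation supports, and the other checks in the list are not modeled.

```python
# Illustrative sketch of two of the "DDT entry misconfigured" (cause = 259)
# checks. rcid_width and mcid_width are hypothetical parameters for the
# ID widths supported by the implementation.
def dc_sbe_or_qos_misconfigured(sbe, fctl_be, fctl_be_writable,
                                qosid, rcid, mcid, rcid_width, mcid_width):
    """Return True if either check would flag the device context."""
    # SBE rule: 0 or 1 is legal when fctl.BE is writable; otherwise
    # DC.tc.SBE must equal fctl.BE.
    if not fctl_be_writable and sbe != fctl_be:
        return True
    # QoS rule: with capabilities.QOSID == 1 the IDs must fit the
    # supported widths. The QOSID == 0 case (bits reserved, must be 0)
    # is a separate reserved-bit check and is not modeled here.
    if qosid == 1 and ((rcid >> rcid_width) != 0 or (mcid >> mcid_width) != 0):
        return True
    return False
```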

[NOTE]
====
@@ -882,6 +912,8 @@ misconfigured" (cause = 267).
. `DC.tc.SXL` is 1 and `PC.fsc.MODE` is not one of the supported modes
.. `capabilities.Sv32` is 0 and `PC.fsc.MODE` is `Sv32`

<<<

[NOTE]
====
Some `PC` fields hold supervisor physical addresses or
@@ -991,7 +1023,9 @@ The process to translate an `IOVA` is as follows:
. Translation process is complete

When checking the `U` bit in a second-stage PTE, the transaction is treated as
not requesting supervisor privilege.
not requesting supervisor privilege. The `pte.xwr=010` encoding, as specified by
the Zicfiss cite:[CFI] extension for the Shadow Stack page type in single-stage
and VS-stage page tables, remains a reserved encoding for IO transactions.

When the translation process reports a fault, and the request is an Untranslated
request or a Translated request, the IOMMU requests the IO bridge to abort the
@@ -1151,8 +1185,8 @@ file and translating the address using the MSI page table is as follows:
process are equivalent to that of a regular RISC-V second-stage PTE with
`R`=`W`=`U`=1 and `X`=0. Similar to a second-stage PTE, when checking the `U`
bit, the transaction is treated as not requesting supervisor privilege.
. If the transaction is an Untranslated or Translated read-for-execute then stop
and report "Instruction access fault" (cause = 1).
.. If the transaction is an Untranslated or Translated read-for-execute then stop
and report "Instruction access fault" (cause = 1).
. MSI address translation process is complete.

[NOTE]
@@ -1182,6 +1216,8 @@ PTEs atomically. When updating of A and D bits in second-stage PTEs is enabled
memory access from the device using the translated address becomes globally
visible.

<<<

[NOTE]
====
The A and D bits are never cleared by the IOMMU. If the supervisor software does
@@ -1246,18 +1282,21 @@ process-context is 0 then a Success response with R and W bits set to 0 is
generated.

If the translation could be successfully completed but the requested
permissions are not present (Execute requested but no execute permission;
permissions are not present in either stage (Execute requested but no execute permission;
no-write not requested and no write permission; no read permission)
then a Success response is returned with the denied permission (R, W or X)
set to 0 and the other permission bits set to the value determined from the
page tables. The X permission is granted only if the R permission is also
granted. Execute-only translations are not compatible with PCIe ATS as PCIe
requires read permission to be granted if the execute permission is granted.
granted and the execute permission was requested. Execute-only translations are
not compatible with PCIe ATS as PCIe requires read permission to be granted
if the execute permission is granted.
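The revised rule for the X bit can be illustrated with a small sketch. It assumes the inputs are the effective permissions already combined across both stages, and it deliberately leaves the no-write request flag unmodeled; the function and parameter names are hypothetical.

```python
# Illustrative sketch of the permission bits in an ATS Success response.
# Inputs are assumed to be the effective page-table permissions across
# both stages. Handling of the no-write request flag is not modeled.
def ats_success_permissions(pt_r, pt_w, pt_x, exec_requested):
    """Return the (R, W, X) bits reported in the Success response."""
    r = pt_r
    w = pt_w
    # X is granted only if R is also granted and execute was requested.
    x = pt_x and pt_r and exec_requested
    return r, w, x
```

For an execute-only page (`pt_r` false, `pt_x` true), the sketch reports no X permission, matching the note that execute-only translations are not compatible with PCIe ATS.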

When a Success response is generated for an ATS translation request, no fault
records are reported to software through the fault/event reporting mechanism,
even when the response indicates no access was granted or some permissions were
denied.
denied. Conversely, when a UR or CA response is generated for an ATS translation
request, the corresponding fault is reported to software through the fault/event
reporting mechanism.

If the translation request has an address determined to be an MSI address using
the rules defined by the <<MSI_ID>> but the MSI PTE is configured in MRIF
@@ -1346,11 +1385,14 @@ of "Page Request".
a "Page Request Group Response" message to the device.

When the IOMMU generates the response, the status field of the response depends
on the cause of the error.
on the cause of the error. If a fault condition prevents locating a valid device
context then the `PRPR` value assumed is 0.

<<<

The status is set to Response Failure if the following faults are encountered:

* `ddtp.iommu_mode` is `Off`
* `ddtp.iommu_mode` is `Off` (cause = 256)
* DDT entry load access fault (cause = 257)
* DDT entry misconfigured (cause = 259)
* DDT entry not valid (cause = 258)
@@ -1359,8 +1401,8 @@ The status is set to Response Failure if the following faults are encountered:

The status is set to Invalid Request if the following faults are encountered:

* `ddtp.iommu_mode` is `Bare`
* `EN_PRI` is set to 0
* `ddtp.iommu_mode` is `Bare` (cause = 260)
* `EN_PRI` is set to 0 (cause = 260)
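The cause-to-status mapping in the two lists above can be sketched as a lookup. Only the cause codes visible in this excerpt are included, so the table is partial by construction: the Response Failure list continues in lines collapsed by the diff view. The names are hypothetical.

```python
# Illustrative, partial sketch: only cause codes visible in this excerpt
# are listed. The Response Failure list has further entries in the
# collapsed part of the diff, which are intentionally not guessed here.
RESPONSE_FAILURE_CAUSES = {256, 257, 258, 259}
INVALID_REQUEST_CAUSES = {260}

def prgr_status_for_cause(cause):
    """Map a fault cause to the Page Request Group Response status."""
    if cause in RESPONSE_FAILURE_CAUSES:
        return "Response Failure"
    if cause in INVALID_REQUEST_CAUSES:
        return "Invalid Request"
    raise ValueError(f"cause {cause} is not covered by this partial table")
```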

The status is set to Success if no other faults were encountered but the
"Page Request" could not be queued due to the page-request queue being full
@@ -1399,6 +1441,8 @@ the following conditions:
* "Page Request" could not be queued due to the page-request queue being full
(`pqt == pqh - 1`) or had an overflow (`pqcsr.pqof == 1`).

<<<

[[CACHING]]
=== Caching in-memory data structures

@@ -1424,10 +1468,10 @@ more IDs to tag the cached entries to identify a specific entry or a
group of entries.

.Identifiers used to tag IOATC entries
[width=90%]
[%autowidth,float="center",align="center"]
[%header, cols="8,10,10"]
|===
|Data Structure cached |IDs used to tag entries | Invalidation command
^|Data Structure cached ^|IDs used to tag entries ^| Invalidation command
|Device Directory Table |`device_id` | <<IDDT, IODIR.INVAL_DDT>>
|Process Directory Table|`device_id`, `process_id` | <<IPDT, IODIR.INVAL_PDT>>
|First-stage page table
@@ -1498,8 +1542,8 @@ determined by `fctl.BE` or by `DC.tc.SBE` as follows:

[[ENDIAN_CONFIG]]
.Endianness of memory access to data structures
[width=75%]
[%header, cols="16,8"]
[%autowidth,float="center",align="center"]
[%header, cols="10,8"]
|===
^|Data Structure ^| Controlled by
| Device directory table | `fctl.BE`
6 changes: 3 additions & 3 deletions src/iommu_debug.adoc
@@ -42,10 +42,10 @@ when the process completes (successfully or due to encountering a fault). When
the `Go/Busy` bit goes from 1 to 0, a response is valid in the `tr_response`
register.

The IOMMU behavior is `UNSPECIFIED` if:
When the `Go/Busy` bit is 1, the IOMMU behavior is `UNSPECIFIED` if:

* The `tr_req_iova` or `tr_req_ctl` are modified when the `Go/Busy` bit is 1.
* IOMMU configurations such as `ddtp.iommu_mode`, etc. are modified.
* The `tr_req_iova` or `tr_req_ctl` are modified.
* IOMMU configurations, such as `ddtp.iommu_mode`, are modified.

The time to complete a translation request through this debug interface is
`UNSPECIFIED` but is required to be finite. If the IOMMU is serving translation