Components, such as a RISC-V hart or a memory controller, in a system that support error detection may implement one or more banks of error records. Each error bank may implement one or more error records. Each error record corresponds to one or more hardware units of the component and reports errors detected by those hardware units. A hardware unit may implement multiple error records. One or more error records may be valid at any given time due to one or more hardware units in the component detecting an error or due to a hardware unit having detected one or more errors.
Each error bank is memory-mapped starting at an 8-byte aligned physical address and may include up to 63 error records. Each error record is a set of registers used to control that error record and to report status, address, and other information relevant to the error recorded in that error record.
Note
|
Implementations may use a coarser alignment for the start address of an error bank. For example, some implementations may locate the error bank within a naturally aligned 4-KiB region (a page) of physical address space for each error bank, i.e., one page per bank. Coarser alignments may enable register decoding to be implemented without a hardware adder circuit. |
The behavior for register accesses where the address is not aligned to
the size of the access, or if the access spans multiple registers, or if the
size of the access is not 4 bytes or 8 bytes, is UNSPECIFIED. An aligned
4-byte access to a RERI register must be single-copy atomic. Whether an 8-byte
access to a RERI register is single-copy atomic is UNSPECIFIED, and such an
access may appear, internally to the RERI implementation, as if two separate
4-byte accesses were performed.
Note
|
The RERI registers are defined in such a way that software can perform two individual 4-byte accesses, or hardware can perform two independent 4-byte transactions resulting from an 8-byte access, to the high and low halves of the register as long as the register’s semantics, with regard to side effects, are respected between the two software accesses, or two hardware transactions, respectively. |
The RERI registers have little-endian byte order (even for systems where all harts are big-endian-only).
Note
|
Big-endian-configured harts using RERI may implement the |
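As an illustration of the access rules above, the following Python sketch models an error bank as a byte array and composes a 64-bit register value from two aligned 4-byte little-endian reads. The `FakeBank` class and the `read64_split` helper are hypothetical names used only for this example; a real driver would issue memory-mapped loads instead, and the order in which the two halves are read is a choice made here, not a requirement.

```python
import struct


class FakeBank:
    """Hypothetical stand-in for a memory-mapped error bank (not a real API)."""

    def __init__(self, size=128):
        self.mem = bytearray(size)

    def read32(self, offset):
        # Only aligned 4-byte accesses are modeled here; unaligned accesses
        # have UNSPECIFIED behavior, so the model simply rejects them.
        if offset % 4 != 0:
            raise ValueError("unaligned access: behavior is UNSPECIFIED")
        # "<I" decodes an unsigned 32-bit value in little-endian byte order.
        return struct.unpack_from("<I", self.mem, offset)[0]


def read64_split(bank, offset):
    """Read an 8-byte register as two 4-byte halves (low half first)."""
    lo = bank.read32(offset)
    hi = bank.read32(offset + 4)
    return (hi << 32) | lo


bank = FakeBank()
# Place an arbitrary 64-bit test pattern at offset 8, stored little-endian.
struct.pack_into("<Q", bank.mem, 8, 0x0100000000410003)
value = read64_split(bank, 8)
```

Because the two halves are read separately, software using this pattern has to account for any side effects or hardware updates that may occur between the two accesses.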
An implementation-specific response occurs if the error bank and/or record is unavailable (e.g., powered down) to memory-mapped accesses. For example, an error bank and/or record may respond with all zero data on reads and may ignore writes. Other implementations may, for example, signal an error response on the attempted transaction.
An error bank that is otherwise available for memory-mapped accesses must respond with all zero data on reads and must ignore writes to unimplemented registers in the page.
The error bank registers are organized as a 64-byte header providing information
about the error bank followed by an array of 64-byte error records. The offset
of the error record numbered i in the bank is (64 + i * 64) where i may
range from 0 to 62.
Offset | Name | Size | Description |
---|---|---|---|
0 | vendor_n_imp_id | 8 | Vendor and implementation ID. |
8 | bank_info | 8 | Error bank information. |
16 | valid_summary | 8 | Summary of valid error records. |
24 | Reserved | 32 | Reserved for future standard use. |
56 | Custom | 8 | Designated for custom use. |
64 + 64 * i | control_i | 8 | Control register of error record i. |
72 + 64 * i | status_i | 8 | Status register of error record i. |
80 + 64 * i | addr_info_i | 8 | Address-or-information register of error record i. |
88 + 64 * i | info_i | 8 | Information register of error record i. |
96 + 64 * i | suppl_info_i | 8 | Supplemental information register of error record i. |
104 + 64 * i | timestamp_i | 8 | Timestamp register of error record i. |
112 + 64 * i | Reserved | 16 | Reserved for future standard use. |
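The register map above can be turned into a small offset calculator. The sketch below is illustrative only: the helper names `record_offset` and `register_offset` are hypothetical, and the register names are those used in the register descriptions later in this section.

```python
HEADER_SIZE = 64   # bytes occupied by the bank header
RECORD_SIZE = 64   # bytes occupied by each error record
MAX_RECORDS = 63   # a bank may hold up to 63 error records

# Byte offsets of the registers within a single error record.
RECORD_REGS = {
    "control_i": 0,
    "status_i": 8,
    "addr_info_i": 16,
    "info_i": 24,
    "suppl_info_i": 32,
    "timestamp_i": 40,
}


def record_offset(i):
    """Offset of error record i from the start of the bank: 64 + 64 * i."""
    if not 0 <= i < MAX_RECORDS:
        raise ValueError("record index must be in the range 0..62")
    return HEADER_SIZE + RECORD_SIZE * i


def register_offset(i, name):
    """Offset of a named register of error record i from the start of the bank."""
    return record_offset(i) + RECORD_REGS[name]
```

With 63 records, the last record ends at offset 64 + 63 * 64 = 4096, so a fully populated bank fits exactly in one 4-KiB page.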
All registers and register fields defined by this specification are WARL unless noted otherwise. While all registers and register fields of an error bank and the error records in an error bank must exist, it is legal to implement a register and/or register field as read-only zero or a read-only legal value if it is not required to report error information in an implementation.
Note
|
The number of error banks, the number of error records in an error bank and the amount of information reported in an error record may be implemented to meet the needs of the implementation. The error records are only required to implement the registers and register fields needed to report error information that is legally produced by the implementation. A minimal implementation with one error bank, which contains one error record
only consumes 128 bytes of address space. In terms of storage, the minimal
implementation requires only two bits of storage, for the |
The reset value is UNSPECIFIED
for RERI registers.
The registers of an error bank may preserve their value across certain types of reset. For example, a warm reset or a RAS initiated reset may preserve the register values whereas a cold reset may reset the values back to their initial state.
Note
|
Under normal circumstances, when an error is signaled, the RAS handler retrieves the logged errors to process the error condition. In some cases, the RAS handler may not be able to do such processing. For example, the system may be unable to support execution of the RAS handler and cause a RAS initiated reset. Preserving the information logged in error records across such resets allows reporting of unhandled errors that occurred in a previous boot of the system. |
All registers in an error bank must have the same reset behavior.
The vendor_n_imp_id
register is a read-only register and its layout is:
{reg: [ {bits: 32, name: 'vendor_id'}, {bits: 32, name: 'imp_id'}, ], config:{lanes: 1, hspace:1024}}
The vendor_id
field follows the encoding as defined by mvendorid
CSR and
provides the JEDEC manufacturer ID of the provider of the component hosting the
error bank. A value of 0 may be returned to indicate the field is not
implemented or that this is a non-commercial implementation.
The imp_id
provides a unique identity, defined by the vendor, to identify the
component and revisions of the component implementation hosting the error bank.
A value of 0 may be returned to indicate that the field is not implemented. The
value returned should reflect the design of the component itself and not of the
surrounding system.
Note
|
The |
The bank_info
is a read-only register and its layout is as follows:
{reg: [ {bits: 16, name: 'inst_id'}, {bits: 6, name: 'n_err_recs'}, {bits: 2, name: 'layout'}, {bits: 32, name: 'WPRI'}, {bits: 8, name: 'version'}, ], config:{lanes: 2, hspace:1024}}
The version
field returns the version of the architectural register layout
specification implemented by the error bank. The version defined by this
specification is 0x01. The encodings 0xF0 through 0xFF of this field are
designated for custom use.
The layout
field along with the version
field indicates the layout of the
registers in the error bank and the error records. The layout
encoding 0
indicates the registers are arranged and have meaning as defined by this
specification.
Note
|
The offset of the The |
The inst_id
field identifies a unique instance of an error bank, within a
package or at least a silicon die, of the component; ideally unique in the whole
system. The inst_id
is defined by the vendor of the system as a unique
identifier for the component. A value of 0 may be returned to indicate the field
is not implemented.
Note
|
The |
The n_err_recs field indicates the number of error records implemented by the
error bank. The field is allowed to have an unsigned value between 1 and 63. The
error records of an error bank are located in the memory mapped region reserved
for the error bank such that the first error record is at offset 64 and the last
error record at offset (64 * n_err_recs).
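A minimal sketch of decoding bank_info follows. It assumes that the layout listing above enumerates fields starting from bit 0 (least-significant field first), which places inst_id in bits 15:0, n_err_recs in bits 21:16, layout in bits 23:22, and version in bits 63:56. The helper names `bits` and `decode_bank_info` are hypothetical.

```python
def bits(value, lsb, width):
    """Extract `width` bits of `value` starting at bit position `lsb`."""
    return (value >> lsb) & ((1 << width) - 1)


def decode_bank_info(raw):
    """Split a 64-bit bank_info value into its fields (assumed LSB-first layout)."""
    return {
        "inst_id": bits(raw, 0, 16),
        "n_err_recs": bits(raw, 16, 6),
        "layout": bits(raw, 22, 2),
        "version": bits(raw, 56, 8),
    }


# Example value: version 0x01, layout 0, four error records, inst_id 42.
raw = (0x01 << 56) | (4 << 16) | 0x002A
info = decode_bank_info(raw)
# Offsets of the implemented records, using the formula 64 + 64 * i.
offsets = [64 + 64 * i for i in range(info["n_err_recs"])]
```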
The valid_summary
is a read-only register and its layout is as follows:
{reg: [ {bits: 1, name: 'sv'}, {bits: 63, name: 'valid_bitmap'}, ], config:{lanes: 4, hspace:1024}}
The sv
bit when 1 indicates that the valid_bitmap
provides a summary of the
valid
bits from the status registers of this error bank. If this bit is 0
then the error bank does not provide a summary of valid bits and the
valid_bitmap
is 0.
Note
|
If |
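The following sketch shows one way software might use valid_summary to find valid records. It rests on two assumptions that the text above does not state explicitly: that the layout lists fields from bit 0 upward (so sv is bit 0 and valid_bitmap occupies bits 63:1), and that bit i of valid_bitmap corresponds to error record i. The function name `valid_records` is hypothetical.

```python
def valid_records(valid_summary):
    """Return indices of valid records, or None if no summary is provided."""
    sv = valid_summary & 1
    if not sv:
        # No summary available: software would have to read the valid bit
        # of each status register individually.
        return None
    bitmap = valid_summary >> 1
    # Assumption: bit i of valid_bitmap reflects the valid bit of record i.
    return [i for i in range(63) if (bitmap >> i) & 1]
```

For example, under these assumptions `valid_records(0b1011)` reports records 0 and 2 as valid.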
The control_i
is a read/write WARL register used to control error reporting by
the corresponding error record in the error bank. The layout of this register
is as follows:
{reg: [ {bits: 1, name: 'else'}, {bits: 1, name: 'cece'}, {bits: 2, name: 'ces'}, {bits: 2, name: 'ueds'}, {bits: 2, name: 'uecs'}, {bits: 24, name: 'WPRI'}, {bits: 16, name: 'eid'}, {bits: 1, name: 'sinv'}, {bits: 1, name: 'srdp'}, {bits: 10, name: 'WPRI'}, {bits: 4, name: 'custom'}, ], config:{lanes: 4, hspace:1024}}
Error reporting functionality in the error record is enabled if the
error-logging-and-signaling-enable (else
) field is set to 1. The else
field
is WARL and may default to 1 or 0 at reset. When else
is 1, the hardware unit
logs and signals errors in the error record. When else
is 0, any signaling
associated with prior logged errors remains unaffected, the hardware unit does
not log and signal new errors in the error record, and it is UNSPECIFIED
whether the hardware unit continues detecting and correcting errors.
Note
|
When error reporting is disabled, the hardware unit may continue to silently correct detected errors and when correction is not possible provide corrupt data to the consumers of the data. Alternatively an implementation may disable error detection altogether when error reporting is disabled. It is recommended that implementations continue performing error correction even when error reporting is disabled. It is recommended that a hardware component continue to produce error detection and correction codes on data generated by or stored in the hardware component even when error reporting is disabled. It is recommended hardware components continue to use containment techniques like data poisoning even when error reporting is disabled. |
The ces
, ueds
, and uecs
are WARL fields used to enable signaling of CE,
UED, and UEC respectively when they are logged (i.e. when else
is 1). Enables
for unsupported classes of errors may be hardwired to 0. The encodings of these
fields are specified in Error signaling enable field encodings.
Encoding | Error signal |
---|---|
0 | Signaling is disabled. |
1 | Signal using a Low-priority RAS signal. |
2 | Signal using a High-priority RAS signal. |
3 | Signal using a platform specific RAS signal. |
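As a sketch of how these encodings might be used, the code below packs a control_i value from individual fields. It assumes the control_i layout listing enumerates fields from bit 0 upward (else in bit 0, cece in bit 1, ces in bits 3:2, ueds in bits 5:4, uecs in bits 7:6, eid in bits 47:32, sinv in bit 48, srdp in bit 49). The `Signal` enum and `pack_control` function are hypothetical names; `else_` carries a trailing underscore because `else` is a Python keyword.

```python
from enum import IntEnum


class Signal(IntEnum):
    """Error signaling enable encodings for the ces, ueds, and uecs fields."""
    DISABLED = 0
    LOW_PRIORITY = 1
    HIGH_PRIORITY = 2
    PLATFORM_SPECIFIC = 3


def pack_control(else_=0, cece=0, ces=Signal.DISABLED, ueds=Signal.DISABLED,
                 uecs=Signal.DISABLED, eid=0, sinv=0, srdp=0):
    """Build a 64-bit control_i value; WPRI and custom bits are left as 0."""
    return (
        (else_ & 0x1)
        | ((cece & 0x1) << 1)
        | ((int(ces) & 0x3) << 2)
        | ((int(ueds) & 0x3) << 4)
        | ((int(uecs) & 0x3) << 6)
        | ((eid & 0xFFFF) << 32)
        | ((sinv & 0x1) << 48)
        | ((srdp & 0x1) << 49)
    )


# Enable logging, signal CE at low priority and UEC at high priority.
ctrl = pack_control(else_=1, ces=Signal.LOW_PRIORITY, uecs=Signal.HIGH_PRIORITY)
```

Since the fields are WARL, software should read the register back after writing to learn which values the implementation actually accepted.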
The RAS signals are usually used to notify a RAS handler. The physical
manifestation of the signal is UNSPECIFIED
by this specification. The
information carried by the signal is UNSPECIFIED
by this specification.
Note
|
The error signaling enables typically default to 0 - disabled - at reset to allow a RAS handler an opportunity to initialize itself for handling RAS signals and to initialize the hardware units that generate the RAS signals before error reporting is enabled. The signal generated by the error record may, in addition to causing an interrupt/event notification, also be used to carry additional information to aid the RAS handler in the platform. The RAS handler may be implemented by a RISC-V application processor hart in the system, a dedicated RAS handling micro-controller, a Finite-State Machine (FSM), etc. The error signals may be configured, through platform specific means, to notify a RAS handler in the platform. For example, the High-priority RAS signal may be configured to cause a High-priority RAS local interrupt, an external interrupt, or a Non-Maskable Interrupt (NMI) and the Low-priority RAS signal may be configured to cause a Low-priority RAS local interrupt or an external interrupt. When error class and/or priority-specific RAS handlers are implemented,
these handlers must take into consideration the possibility that an error
record intended for a handler could be overwritten by an error of higher
severity or priority — which also triggers a signal to another RAS handler
for the new error — in the period between the first signal’s generation and
its examination of the error record by the first RAS handler. In such
instances, the first RAS handler may find an error record that is not
intended for it. This handler may choose to disregard this error record as
spurious from its perspective, and leave it to be handled by the other RAS
handler. It may also note that an error occurred that concerns it, but
information for the error is no longer available. Similarly, spurious
signals may arise if the fields controlling the type of signal generated by
an error record are modified while either the |
If the error record supports corrected-error counting then the
corrected-error-counting-enable (cece
) field, when set to 1, enables counting
corrected errors in the corrected-error-counter (cec
) field of the status
register status_i
of the error record. The cec
is a counter that holds an
unsigned integer count. When cece
is 0, the cec
does not count and retains
its value. If corrected error counting is not supported in the error record then
cece
and cec
may be hardwired to 0. An overflow of cec
is signaled using
the signal configured in the ces
field. When cece
is 1, the logging of a CE
in the error record does not cause an error signal and an error signal
configured in ces
occurs only on a cec
overflow that sets the ceco
bit.
The set-read-in-progress (srdp
) field, when written with a value of 1, causes
the rdip
(read-in-progress) bit of the associated status_i
register to be
set. The srdp
field always returns 0 on read. The rdip
field in the
status_i
register is set to 1 by hardware when an error is recorded in an
invalid error record causing the v
field to change from 0 to 1. The rdip
field is cleared to 0 by hardware when a new error updates any field of a valid
(v=1
) error record.
The status-register-invalidate (sinv
) bit, when written with a value of 1,
causes the v
(valid) field of the associated status_i
register to be
cleared if the rdip
field in the status_i
register is also 1. The sinv
field always returns 0 on read. The sinv
field enables software to read out
and invalidate an error record without needing to explicitly write the
status_i
register. Qualifying the clearing of the v
field with rdip
field
being 1 prevents losing information about an overwrite that might have occurred
while reading of the error record is in progress. If the sinv
and srdp
are
both written to 1 together then the rdip
bit is set and the v
bit is cleared
to 0.
Note
|
Software may determine if the error record was read atomically by first reading
the registers of the error record, then clearing the valid in |
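The interaction between sinv, srdp, v, and rdip described above can be illustrated with a toy software model. This is a non-normative sketch: `RecordModel` and its methods are hypothetical names, and only the v and rdip bits are modeled.

```python
class RecordModel:
    """Toy model of status_i.v and status_i.rdip (illustrative only)."""

    def __init__(self):
        self.v = 0
        self.rdip = 0

    def hw_log_error(self):
        if self.v == 0:
            # Error recorded in an invalid record: v goes 0 -> 1, rdip is set.
            self.v = 1
            self.rdip = 1
        else:
            # A new error updated a valid record: rdip is cleared.
            self.rdip = 0

    def write_control(self, sinv=0, srdp=0):
        # srdp is applied first, so writing both bits sets rdip and clears v.
        if srdp:
            self.rdip = 1
        # sinv clears v only when rdip is 1.
        if sinv and self.rdip:
            self.v = 0


# Case 1: no new error arrives while software reads the record.
clean = RecordModel()
clean.hw_log_error()
clean.write_control(sinv=1)   # invalidation succeeds

# Case 2: a second error arrives while software is reading the record.
raced = RecordModel()
raced.hw_log_error()
raced.hw_log_error()          # record updated during the read, rdip cleared
raced.write_control(sinv=1)   # invalidation is refused, v stays 1
```

In the second case software can observe that v is still 1 with rdip 0, and would typically set rdip again through srdp and repeat the read before retrying the invalidation.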
The error-injection-delay (eid) is a WARL field used to control error record
injection. When eid is written with a value greater than 0, the eid starts
counting down, at an implementation defined rate, until the value reaches a
count of 0. Writing a value of 0 disables the counter. If error injection is not
supported by the error record then the eid field may be hardwired to 0. When
eid reaches a count of 0, the status register is made valid by setting the
status_i.v bit to 1. The status_i.v transition from 0 to 1 generates a RAS
signal corresponding to the class of error (CE, UED, or UEC) set up in the
status_i register. The counter continues to count even if the status_i
register was overwritten by a hardware detected error before the eid counts
down to 0.
Note
|
Software may setup the error record registers with desired values of the error
record to be injected and then program The error record injection capability only injects an error record and not an error into the hardware itself. The error record injection capability is expected to be used to test the RAS handlers and is not intended to be used for verification of the hardware implementation itself. Other implementation specific mechanisms may be provided to generate and/or emulate hardware error conditions. When hardware error injection capabilities are implemented, the implementation should ensure that these capabilities cannot be misused to maliciously inject hardware errors that may lead to security issues. |
The status_i
is a read-write WARL register that reports errors detected by
the hardware unit.
{reg: [ {bits: 1, name: 'v'}, {bits: 1, name: 'ce'}, {bits: 1, name: 'ued'}, {bits: 1, name: 'uec'}, {bits: 2, name: 'pri'}, {bits: 1, name: 'mo'}, {bits: 1, name: 'c'}, {bits: 3, name: 'tt'}, {bits: 1, name: 'iv'}, {bits: 4, name: 'ait'}, {bits: 1, name: 'siv'}, {bits: 1, name: 'tsv'}, {bits: 2, name: 'WPRI'}, {bits: 1, name: 'scrub'}, {bits: 1, name: 'ceco'}, {bits: 1, name: 'WPRI'}, {bits: 1, name: 'rdip'}, {bits: 8, name: 'ec'}, {bits: 16, name: 'WPRI'}, {bits: 16, name: 'cec'}, ], config:{lanes: 4, hspace:1024}}
The error record holds a valid error log if the valid (v
) field is 1. The
status_i
register does not accept a software write when the v
field is 1.
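A table-driven decoder for status_i can be sketched as follows. It assumes the layout listing above enumerates fields from bit 0 upward, so that v is bit 0 and cec occupies bits 63:48. The names `STATUS_FIELDS` and `decode_status` are hypothetical.

```python
# (name, width) pairs, least-significant field first, copied from the layout.
STATUS_FIELDS = [
    ("v", 1), ("ce", 1), ("ued", 1), ("uec", 1), ("pri", 2), ("mo", 1),
    ("c", 1), ("tt", 3), ("iv", 1), ("ait", 4), ("siv", 1), ("tsv", 1),
    ("WPRI", 2), ("scrub", 1), ("ceco", 1), ("WPRI", 1), ("rdip", 1),
    ("ec", 8), ("WPRI", 16), ("cec", 16),
]


def decode_status(raw):
    """Split a 64-bit status_i value into named fields, skipping WPRI bits."""
    fields = {}
    lsb = 0
    for name, width in STATUS_FIELDS:
        if name != "WPRI":
            fields[name] = (raw >> lsb) & ((1 << width) - 1)
        lsb += width
    assert lsb == 64  # the field widths must cover the whole register
    return fields


# Example: valid UEC, priority 2, explicit read (tt=4), SPA reported (ait=1),
# rdip set, error code 5, corrected-error count 3.
raw = ((1 << 0) | (1 << 3) | (2 << 4) | (4 << 8) | (1 << 12)
       | (1 << 23) | (0x05 << 24) | (3 << 48))
status = decode_status(raw)
```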
If the detected error was corrected then ce
is set to 1. If the detected error
could not be corrected but was deferred then ued
is set to 1. If the detected
error could not be corrected or deferred and thus needs immediate handling by a
RAS handler, then the uec
bit is set to 1. If the error record does not log a
class of errors (e.g., does not support UED), then the corresponding bit may be
hardwired to 0. If the bits corresponding to more than one error class are set
to 1 then the error record holds information about the highest severity error
class among the bits set. The error record may be used to provide an
informational update by setting the v
bit to 1 and setting ce
, ued
, and
uec
bits to 0. Such informational updates are lower severity than a CE but are
signaled using the signal configured in control_i.ces
.
When v
is 1, if more errors of the same class as the error currently logged in
the error record occur then the multiple-occurrence (mo
) bit is set to indicate
the multiple occurrence of errors of the same severity. See Error Record Overwrite Rules
for rules on overwriting the error record in such cases.
Each error of an error class (CE, UED, or UEC) that may be logged in an error record may be associated with a priority which is a number between 0 and 3; priority value of 3 being the highest priority and priority value of 0 being the lowest priority. The priority values indicate relative priority among errors of the same error class and therefore represent sub-classes of errors. Among errors of different error classes the priority values are unrelated.
Note
|
Some implementations may report errors from more than one source into a
single error record. Such implementations may prioritize reporting of errors
from one source over the other using the |
The priority (pri
) field in the error record indicates the priority of the
currently logged error in the error record. The pri
is a WARL field and an
implementation may support only a subset of legal values for this field and
an implementation that does not support reporting of a priority per error may
hardwire this field to 0.
The error record overwrite rules use the error class (CE, UED, or UEC) and the
error priority (pri
) as specified in Error Record Overwrite Rules.
When a UEC occurs the containable (c
) bit may be set to 1 to indicate
that the error has not propagated beyond the boundaries of the hardware unit
that detected the error and thus may be containable through recovery actions
(e.g., terminating the computation, etc.) carried out by the RAS handler.
The c
bit is WARL. For error classes other than UEC, the interpretation of
the c
bit may be specified in a future standard extension.
For a RISC-V hart, some UEC may cause a Hardware Error exception cite:[PRIV]. A Hardware Error is a synchronous exception, triggered when corrupted or uncorrectable data is accessed, either explicitly or implicitly, by an instruction. In this context, "data" encompasses all types of information used within a RISC-V hart.
Note
|
For example, a RISC-V hart by causing the precise hardware error exception on
attempts to consume corrupted/poisoned data may contain the error to the program
currently executing on the hart. Such errors may be reported with the While the |
The address-or-info-type (ait
) is a WARL field that indicates the type of
information reported in the addr_info_i
register. An error record that does
not report information in this field may hardwire this field to 0. The encodings
of the ait
field are listed in Address-or-information type encodings.
Encoding | Description |
---|---|
0 | None. The contents of the |
1 | Supervisor Physical Address (SPA). |
2 | Guest Physical Address (GPA). |
3 | Virtual Address (VA). |
4-15 | Component-specific address or information. |
Note
|
Component-specific information types, as defined in the range 4-15 of the When a standard address type (a VA, SPA, or GPA) is reported in the
|
The transaction-type (tt
) is a WARL field to report the type of transaction
that detected the error and its encodings are listed in Transaction type encodings. An
error record that does not report transaction types may hardwire this field
to 0.
Encoding | Description |
---|---|
0 | Unspecified or not applicable. |
1 | Designated for custom use. |
2-3 | Reserved for future standard use. |
4 | Explicit read. |
5 | Explicit write. |
6 | Implicit read. |
7 | Implicit write. |
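A RAS handler might translate the tt encodings into readable descriptions when logging. The sketch below is one possible mapping; the names `TT_NAMES`, `describe_tt`, `is_implicit`, and `is_write` are hypothetical.

```python
TT_NAMES = {
    0: "unspecified or not applicable",
    1: "custom",
    4: "explicit read",
    5: "explicit write",
    6: "implicit read",
    7: "implicit write",
}


def describe_tt(tt):
    """Return a description of a 3-bit tt value; encodings 2-3 are reserved."""
    if not 0 <= tt <= 7:
        raise ValueError("tt is a 3-bit field")
    return TT_NAMES.get(tt, "reserved")


def is_implicit(tt):
    """True for the implicit read and implicit write encodings."""
    return tt in (6, 7)


def is_write(tt):
    """True for the explicit write and implicit write encodings."""
    return tt in (5, 7)
```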
For a RISC-V hart, the Unprivileged specification cite:[UPRIV] defines memory accesses by instructions as either explicit or implicit. An Implicit read or write is an access that may be implicitly performed by hardware to perform an explicit operation. For example, a load or store instruction executed by the hart may perform implicit memory accesses to page table data structures. Instruction memory accesses by a hart are termed as implicit accesses by the Unprivileged specification. However, for the purposes of error reporting, only the implicit accesses to data structures, such as the (guest) page tables that are used to determine the address of the instructions to be fetched, are termed as implicit accesses. The read to fetch the instruction bytes themselves is classified as an explicit read.
Note
|
Implementations may report additional information about the transaction (e.g.,
whether speculative, on-demand vs. prefetch, etc.) in the A non-hart component may also perform implicit accesses in order to process an explicit transaction. For example, processing a memory transaction may require a fabric component to implicitly access a routing table data structure. |
If the detected error reports additional information in the info_i register
then the information-valid (iv) field is set to 1. If the detected error
reports additional supplemental information in the suppl_info_i register then
the supplemental-information-valid (siv) field is set to 1. The iv and/or siv
fields may be hardwired to 0 if the error record does not provide information in
the info_i and/or suppl_info_i registers. When iv is 0, the value in the info_i
register is UNSPECIFIED. When siv is 0, the value in the suppl_info_i register
is UNSPECIFIED.
If the error record holds a timestamp of when the last error was logged in the
timestamp_i register then the timestamp-valid (tsv) field is set to 1. This
field may be hardwired to 0 if the error record does not report a timestamp with
the error. When the tsv field is 0, the value in the timestamp_i register is
UNSPECIFIED.
The scrub bit is valid when a CE is logged and when set to 1 indicates that
the storage location that held the data value has been updated with the
corrected value (i.e., the data has been scrubbed). An implementation that
cannot make this distinction may conservatively report this field as 0.
When the error record is not associated with storage elements (e.g., correcting
errors detected on bus transactions) this field may be hardwired to 0. If this
property is unconditionally true for a hardware unit then this field may be
hardwired to 1. For error classes other than CE, the interpretation of the
scrub bit may be specified in a future standard extension.
The error-code (ec
) is a WARL field that holds an error code that provides a
description of the detected error. Standard ec
encodings are defined in
Error code encodings. If an error record detects an error that does not correspond
to a standard ec
encoding then such errors may be reported using a custom
encoding. The custom encodings have the most significant bit set to 1 to
differentiate them from the standard encodings.
The read-in-progress (rdip
) field is set to 1 by hardware when a new error is
recorded in an invalid status register and is cleared to 0 by hardware when a
valid status register is overwritten. When the control_i.sinv
field is written
to 1, the v
field is cleared to 0 only if the rdip
field is 1. Gating the
clearing of the v
field by the rdip
field being 1 allows software to detect
an overwrite that may occur while it is in process of reading an error record.
An error record that supports the 1 setting of the cece field in control_i
implements a corrected-error-counter in the cec field. The cec is a WARL
field. When cece is 1, the cec is incremented on each CE. If an unsigned
integer overflow occurs on a cec increment then the
corrected-error-counter-overflow (ceco) field is set to 1. The cec
continues to count following an overflow. The cec and ceco fields hold valid
data and continue to count even when the v field is 0.
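The counting and overflow behavior can be sketched with a small model. It assumes a full 16-bit cec, as in the status_i layout (an implementation may support fewer legal values since the field is WARL), and treats the 0 to 1 transition of ceco as the event that raises the signal configured in control_i.ces. The function name `count_ce` is hypothetical.

```python
CEC_MAX = 0xFFFF  # assumed full 16-bit counter width


def count_ce(cec, ceco, cece):
    """Model one corrected error; returns (cec, ceco, signal_raised)."""
    if not cece:
        # Counting disabled: cec does not count and retains its value.
        return cec, ceco, False
    new_cec = (cec + 1) & CEC_MAX
    overflow = new_cec == 0            # unsigned wrap-around on increment
    signal = overflow and not ceco     # only a 0 -> 1 change of ceco signals
    new_ceco = 1 if (ceco or overflow) else 0
    return new_cec, new_ceco, signal
```

For instance, a CE arriving when cec is 0xFFFF and ceco is 0 wraps the counter to 0, sets ceco, and raises the signal; further wraps while ceco remains 1 do not raise it again in this model.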
Note
|
Some hardware units may maintain a history of CE and may report a CE and may
increment the Some hardware units may implement low pass filters (e.g., leaky buckets) that throttle the rate at which CE are reported and counted. |
Note
|
To invalidate a valid error record (presumably after having first read the error
record), software should write 1 to the If software needs to initialize the |
When a UEC or UED error is logged in an error record, the cec
and ceco
fields of the error record are not modified and retain their values.
The addr_info_i WARL register reports the address or other information
associated with the detected error when status_i.ait is not 0. If
status_i.ait is 0, the value in this register is UNSPECIFIED. An
implementation that does not report information in this register may hardwire
this register to 0. Some fields of this register may be hardwired to zero if the
field is unused to report any type of address or information.
When an address (a VA, GPA, or an SPA) is reported in this register, to the extent possible, the error record should capture all significant parts of the address. However, as a function of the type of error being logged some address fields may be zeroes. Some of the highest address bits may be fixed or may be sign-extensions or may be zero-extensions of the next lowest address bit depending on the type of address reported.
When component specific information is reported in this register, the interpretation of the information is component specific.
The info_i WARL register provides additional information about the error when
status_i.iv is 1. If status_i.iv is 0, the value in this register is
UNSPECIFIED. An implementation that does not report any additional
information may hardwire this register to 0.
The format of the register is UNSPECIFIED by this specification. This register
may be interpreted using the error code in status_i.ec along with
implementation defined format and rules.
Note
|
This register may be used to report information for guiding recovery, error nature (transient/permanent), error location (set/way, parity group, ECC syndrome), and other details (protocol FSM state, assertion failures). Components that are or monitor field replaceable units may log information in this register to identify the failing component. For example, a memory controller may log the DIMM channel, bank, column, row, rank, subRank, device ID, etc. |
The suppl_info_i WARL register provides additional information about the error
when status_i.siv is 1. This information may supplement the information
provided in the info_i register. If status_i.siv is 0, the value in this
register is UNSPECIFIED. An implementation that does not report any
supplemental information may hardwire this register to 0.
The format of the register is UNSPECIFIED by this specification. This register
may be interpreted using the error code in status_i.ec along with
implementation defined format and rules.
The timestamp_i WARL register provides a timestamp for the last error recorded
in the error record if status_i.tsv is 1. When status_i.tsv is 0, the value in
this register is UNSPECIFIED. An implementation that does not report a
timestamp may hardwire this register to 0. Some fields of the register may be
hardwired to zero if the field is unused to report the timestamp.
The nature, frequency, and resolution of the timestamp are UNSPECIFIED.
Note
|
The timestamp may be constructed by a hardware unit using mechanisms such as sampling a local cycles counter (e.g., the cycles counter of a RISC-V hart), a global counter (e.g., mtime), or other implementation specific means. |
When a hardware unit detects an error and its error record is not valid, it writes the error record with the error information and marks the record as valid. However, if the error record is already valid, owing to an earlier detected but unprocessed error, the decision to overwrite the error record with new error information is determined by the new error’s severity and/or priority.
The overwrite rules allow a higher severity error to overwrite a lower severity
error. UEC has the highest severity, followed by UED, then CE, and finally,
informational. When the two errors have the same severity the priority of the
errors (as determined by status_i.pri) is used to determine if the error
record is overwritten. Higher priority errors overwrite the lower priority
errors. When an error record is overwritten by a higher severity error (UED or
CE by UEC, or CE by UED), the status bits indicating the severity
of the older errors are retained (i.e., are sticky).
When an error writes or overwrites an error record, the status_i.cec
and
status_i.ceco
fields update from CEs and retain value for errors of other
severity. When implemented, cec
counts CE occurrences; unsigned integer
overflow on cec
increment sets ceco
to 1.
Whenever a new error writes to or overwrites an error record, the signal
configured in the control_i
register for its severity level is asserted. When
status_i.ceco
changes from 0 to 1, the signal configured in control_i.ces
is
asserted.
Let new_status be the value to be recorded in status_i register for the new error

    overwrite = FALSE
    if status_i.v == 1
        // There is a valid first error recorded
        if ( severity(new_status) > severity(status_i) )
            // Higher severity errors overwrite less severe errors and clear mo
            status_i.mo = 0
            overwrite = TRUE
        endif
        if ( severity(new_status) == severity(status_i) )
            // Second errors of the same severity set MO
            status_i.mo = 1
            // Second error of same severity overwrites previous error if it
            // has higher priority (status_i.pri).
            if ( new_status.pri > status_i.pri )
                overwrite = TRUE
            endif
        endif
        // previous error status bits are retained (sticky) but rdip bit is cleared.
        status_i.rdip = 0
        status_i.uec |= new_status.uec
        status_i.ued |= new_status.ued
        status_i.ce |= new_status.ce
    else
        // No valid error recorded; new error logged, clearing sticky history
        // and MO bit, and rdip is set.
        status_i.rdip = 1
        status_i.uec = new_status.uec
        status_i.ued = new_status.ued & ~new_status.uec
        status_i.ce = new_status.ce & ~new_status.uec & ~new_status.ued
        status_i.mo = 0
        overwrite = TRUE
    endif
    if ( overwrite == TRUE )
        status_i.pri = new_status.pri
        status_i.c = new_status.c
        status_i.tt = new_status.tt
        status_i.ait = new_status.ait
        status_i.iv = new_status.iv
        status_i.siv = new_status.siv
        status_i.tsv = new_status.tsv
        status_i.scrub = new_status.scrub
        status_i.ec = new_status.ec
        // Update addr_info_i, info_i, suppl_info_i, and timestamp_i with new
        // error information, if valid.
        status_i.v = 1
    endif
If status_i.v, status_i.mo, and status_i.uec are all 1, the RAS handler should preferably restart the system to bring it to a correct state, because a UEC record has been lost. If status_i.v and status_i.mo are 1 but status_i.uec is 0 (i.e., the logged error is a UED or a CE), the RAS handler may keep the system operational.
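The guidance above can be expressed as a small decision helper for a RAS handler. This is a minimal sketch: the function name `recommended_action` and the returned strings are hypothetical, and the function returns None for combinations that the guidance above does not address.

```python
def recommended_action(v, mo, uec):
    """Map the v, mo, and uec bits of status_i to the guidance in the text.

    Hypothetical helper; the return labels are chosen for illustration.
    """
    if v == 1 and mo == 1 and uec == 1:
        return "restart"        # a UEC record has been lost
    if v == 1 and mo == 1:
        return "may_continue"   # the logged error is a UED or a CE
    return None                 # not covered by the guidance above
```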
If multiple errors occur simultaneously, they may be recorded individually in any order; the rules outlined in Error record writing rules then lead to the highest severity error among them being retained in the error record. When the error record registers are written by an error, all registers that are written must be written with information related to that error.
Note
|
When multiple errors occur simultaneously, some implementations may choose to record each error individually following the rules outlined in Error record writing rules. Other implementations may instead record only the highest severity error or, when the errors have the same severity, the highest priority error. Yet other implementations may record one of the errors as determined by implementation-specific rules. |
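The order independence of the retained severity can be checked with a reduced model of the overwrite decision. This is a sketch under assumptions: each error is reduced to a hypothetical (severity, priority) tuple with numeric severities CE=1, UED=2, UEC=3, and the sticky bits, MO, and other status fields are not modeled.

```python
from itertools import permutations


def retain(current, new):
    """Reduced overwrite rule on (severity, priority) tuples.

    `current` is None when no valid error is recorded. A higher severity
    always overwrites; an equal severity overwrites only when its
    priority is higher.
    """
    if current is None:
        return new
    if new[0] > current[0]:
        return new
    if new[0] == current[0] and new[1] > current[1]:
        return new
    return current


def retained_after(errors):
    """Record the errors one at a time in the given order."""
    record = None
    for err in errors:
        record = retain(record, err)
    return record


# Assumed example: a CE, a UEC, and a UED detected simultaneously.
simultaneous = [(1, 7), (3, 0), (2, 5)]
outcomes = {retained_after(order) for order in permutations(simultaneous)}
```

Here `outcomes` collects the retained error for every recording order, so a single element in the set indicates that the order did not affect which error was kept for this example.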
Standards such as PCIe cite:[PCI] and CXL cite:[CXL] define standardized error reporting architectures such as the PCIe Advanced Error Reporting (AER). Specifications such as CXL define a standardized set of RAS requirements for hosts and devices.
The RISC-V RERI specification complements the error reporting architecture defined by these standards with a RISC-V standard for reporting errors for components that are not PCIe/CXL components. There may also be other error reporting mechanisms, possibly custom, that are employed alongside the RERI specification.
Note
|
RISC-V system components such as PCIe root ports or PCIe Root Complex Event Collectors may themselves implement error reporting compliant with the RISC-V RERI specification and thus provide a unified error reporting mechanism in such systems. For example, a Root Complex Event Collector may support an error record to report errors logged in the Advanced Error Reporting (AER) log registers. |
Encoding | Description |
---|---|
0 | None |
1 | Other unspecified error occurred |
2 | Corrupted data access (e.g., attempt to consume poisoned data) error |
3 | Cache block data (e.g., ECC error on cache data) error |
4 | Cache scrubbing detected (e.g., ECC error on cache data) error |
5 | Cache address/control state (e.g., parity error tag or state) error |
6 | Cache unspecified error |
7 | Snoop-filter/directory address/control state (e.g., ECC error on tag or state) error |
8 | Snoop-filter/directory unspecified error |
9 | TLB/Page-walk cache data (e.g., ECC error on TLB data) error |
10 | TLB/Page-walk cache address/control state (e.g., ECC error on TLB tag) error |
11 | TLB/Page-walk cache unspecified error |
12 | Hart state error (e.g., ECC error on CSRs or x/f/v registers) |
13 | Interrupt controller state (e.g., ECC error on interrupt pending/enable state) error |
14 | Interconnect data (e.g., ECC error on data bus) error |
15 | Interconnect other (e.g., parity error on address bus) error |
16 | Internal watchdog error |
17 | Internal datapath, memory, or execution units error (e.g., ALU datapath parity) |
18 | System memory command/address bus error |
19 | System memory unspecified error |
20 | System memory data (e.g., ECC error in SDRAM or HBM) error |
21 | System Memory scrubbing detected error |
22 | Protocol Error - illegal input/output error |
23 | Protocol Error - illegal/unexpected state error |
24 | Protocol Error - timeout error |
25 | System internal controller (power management, security, etc.) error |
26 | Deferred error pass-through (e.g., forwarding poisoned data) not supported |
27 | PCIe/CXL detected (e.g., logged into PCIe AER, CXL.mem error log, etc.) errors |
28 - 63 | Reserved for future standard use |
64 - 255 | Designated for custom use |
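Software that decodes an error code may first need to determine which range of the table above the value falls in. A minimal sketch follows; the function name `classify_ec` and the returned labels are hypothetical, and the range boundaries are taken directly from the table.

```python
def classify_ec(ec):
    """Classify an error code by the ranges in the encoding table.

    Hypothetical helper; the returned labels are chosen for illustration.
    """
    if not 0 <= ec <= 255:
        raise ValueError("error code outside the range covered by the table")
    if ec == 0:
        return "none"
    if ec <= 27:
        return "standard"   # encodings 1 to 27 have defined meanings
    if ec <= 63:
        return "reserved"   # 28 to 63: reserved for future standard use
    return "custom"         # 64 to 255: designated for custom use
```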