DRTIO: Link errors and Sequence errors #2662

JammyL · 2025-01-29T11:44:26Z

Bug Report

One-Line Summary

After a non-deterministic period of time, our experiments encounter both link errors and sequence errors. These put two of our satellite devices (the ones showing sequence errors) into a bad state that requires a restart.
I have also opened a discussion here, but an issue feels more appropriate.

Issue Details

Even after reinitialising the devices and calling core.reset, the devices on the offending satellites remain unresponsive. After a power cycle of the satellite or a reset of the master FPGA (artiq_flash start), the satellites come back online.

There is an added effect of experiments hanging. This can be patched by replacing the rtio_input_data calls for Sampler and SUServo with the timestamped alternative in the locations linked below.

artiq/artiq/coredevice/suservo.py

Line 149 in c1f2ff3

return rtio_input_data(self.channel)

artiq/artiq/coredevice/spi2.py

Line 241 in c1f2ff3

return rtio_input_data(self.channel)

The above behaviour leads me to believe that no rtio_output events are being triggered on these satellites, thus there is no input data to read, causing the experiments to hang.
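
To illustrate why the timestamped read avoids the hang, here is a minimal host-side model in plain Python. It does not use ARTIQ or touch hardware. `FakeInputChannel` and `read_or_raise` are hypothetical names made up for this sketch, and the "negative timestamp on timeout" convention is an assumption about how `rtio_input_timestamped_data(timeout_mu, channel)` reports a timeout.

```python
from collections import deque


class FakeInputChannel:
    """Hypothetical stand-in for one RTIO input FIFO (model only)."""

    def __init__(self, events=()):
        # Each event is a (timestamp_mu, data) pair, oldest first.
        self.events = deque(events)

    def input_timestamped_data(self, timeout_mu):
        """Model of a read with a deadline.

        Returns the oldest event if it is stamped at or before timeout_mu.
        Otherwise returns (-1, 0), the assumed timeout marker, instead of
        blocking forever the way a plain data read would.
        """
        if self.events and self.events[0][0] <= timeout_mu:
            return self.events.popleft()
        return (-1, 0)


def read_or_raise(channel, timeout_mu):
    """Return the data word, or raise if no event arrived in time."""
    timestamp, data = channel.input_timestamped_data(timeout_mu)
    if timestamp < 0:
        raise TimeoutError("no RTIO input event before timeout")
    return data


if __name__ == "__main__":
    ch = FakeInputChannel([(100, 42)])
    print(read_or_raise(ch, 200))   # one event is queued, so this returns
    try:
        read_or_raise(ch, 300)      # FIFO is now empty: times out
    except TimeoutError as exc:
        print("timed out:", exc)
```

With this pattern, a satellite that produces no input events surfaces as an exception in the experiment rather than a hang, which is the behaviour we get from the patch described above.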

Steps to Reproduce

We are unsure. The behaviour is non-deterministic. The same experiment may produce an error after 10 minutes or 2 hours.

Expected Behavior

Running experiments without any errors. This is usually the case for up to several hours.

Actual (undesired) Behavior

After a non-deterministic amount of experiment run time (10 minutes to 2 hours), we see the following errors.
These errors show up across experiments that have run perfectly well for a few years.

DEST 0

[  1542.604249s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] error(s) found (0x04):
[  1542.610091s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] timeout attempting to get remote buffer space
[  1542.619944s]  WARN(runtime::rtio_mgt::drtio): [LINK#1] unsolicited aux packet: TSCAck
[  1542.627026s] ERROR(runtime::rtio_mgt::drtio): [LINK#1] error(s) found (0x04):
[  1542.634141s] ERROR(runtime::rtio_mgt::drtio): [LINK#1] timeout attempting to get remote buffer space
[  1542.645159s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0006:_1762eom
[  1542.655393s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] RTIO sequence error involving channel 0x0003:_ground_dp
[  1542.867581s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1543.081299s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1545.149344s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0003:_650dp
[  1545.363290s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0008:suservo0
[  1555.827323s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0003:_650dp
[  1556.041276s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1556.255132s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1556.468229s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0004:_614dp
[  1556.682110s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0011:_493eom_11_ttl
[  1556.896094s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1557.109112s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1557.322117s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0002:_493sigma
[  1557.535115s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1557.748161s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1557.961308s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0003:_650dp
[  1558.175227s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0006:_1762eom
[  1558.389158s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1558.602279s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1560.615320s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0003:_650dp
[  1560.829287s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0008:suservo0
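
To see which channels are hit most often, the sequence errors in the master log can be tallied per destination and channel. The following is a rough sketch in plain Python. The regular expression is written against the line format shown above only, and `tally_sequence_errors` is a name made up for this example.

```python
import re
from collections import Counter

# Matches e.g. "[DEST#1] RTIO sequence error involving channel 0x0005:_1762sp"
SEQ_ERR = re.compile(
    r"\[DEST#(\d+)\] RTIO sequence error involving channel "
    r"0x([0-9a-fA-F]+):(\S+)"
)


def tally_sequence_errors(log_text):
    """Count sequence errors keyed by (destination, channel number, name)."""
    counts = Counter()
    for match in SEQ_ERR.finditer(log_text):
        dest, channel_hex, name = match.groups()
        counts[(int(dest), int(channel_hex, 16), name)] += 1
    return counts


SAMPLE = """\
[  1542.645159s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0006:_1762eom
[  1542.655393s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] RTIO sequence error involving channel 0x0003:_ground_dp
[  1542.867581s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1543.081299s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
"""

if __name__ == "__main__":
    for key, count in tally_sequence_errors(SAMPLE).most_common():
        print(key, count)
```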

DEST 1

[  2784.146092s] ERROR(satman): received packet of an unknown type
[  2784.150586s] ERROR(satman): timeout attempting to get buffer space from CRI, destination=0xc0
[  2784.159099s] ERROR(satman): write underflow, channel=0, timestamp=1545247125414, counter=1545372549488, slack=-125424074
[  2784.173638s] ERROR(satman): received packet of an unknown type
[  2784.178139s] ERROR(satman): received truncated packet
[  2784.183159s] ERROR(satman): write underflow, channel=1, timestamp=437905711215, counter=1545400101

DEST 2

[  2930.554069s] ERROR(satman): write underflow, channel=6, timestamp=1270499749083, counter=1545372648680, slack=-274872899597
[  2930.563936s]  INFO(satman): TSC loaded from uplink
[  2930.568706s] ERROR(satman): received packet of an unknown type
[  2930.574510s] ERROR(satman): received truncated packet
[  2930.579547s] ERROR(satman): timeout attempting to get buffer space from CRI, destination=0x57
[  2930.588077s] ERROR(satman): write underflow, channel=2, timestamp=1546005020011, counter=12363432125432, slack=-10817427105421
[  2943.699773s] ERROR(satman): write underflow, channel=3, timestamp=1558521121013, counter=12376550453472, slack=-10818029332459
[  3012.501525s] ERROR(satman): write underflow, channel=3, timestamp=1627322209389, counter=124453

DEST 3
Shows nothing in the log; it isn't involved in this particular experiment.
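
In the satman write underflow lines for DEST 1 and DEST 2 above, the reported slack equals timestamp minus counter wherever all three fields are present. A small sketch in plain Python to check this and to convert the slack to seconds is given below. The function names are invented for this example, and the 1 ns machine unit is an assumption about our RTIO configuration, not something read from the log. Lines that are cut off before the slack field are skipped rather than guessed at.

```python
import re

# Only matches lines that carry all four fields; truncated lines are ignored.
UNDERFLOW = re.compile(
    r"write underflow, channel=(\d+), timestamp=(\d+), "
    r"counter=(\d+), slack=(-?\d+)"
)

MU_SECONDS = 1e-9  # assumption: one machine unit is 1 ns


def check_underflows(log_text):
    """Return (channel, slack_mu, consistent) for each complete underflow line.

    consistent is True when slack == timestamp - counter.
    """
    results = []
    for match in UNDERFLOW.finditer(log_text):
        channel, timestamp, counter, slack = map(int, match.groups())
        results.append((channel, slack, timestamp - counter == slack))
    return results


def slack_seconds(slack_mu, mu_seconds=MU_SECONDS):
    """Convert a slack in machine units to seconds (under the 1 ns assumption)."""
    return slack_mu * mu_seconds


SAMPLE = """\
[  2784.159099s] ERROR(satman): write underflow, channel=0, timestamp=1545247125414, counter=1545372549488, slack=-125424074
[  2784.183159s] ERROR(satman): write underflow, channel=1, timestamp=437905711215, counter=1545400101
[  2930.588077s] ERROR(satman): write underflow, channel=2, timestamp=1546005020011, counter=12363432125432, slack=-10817427105421
"""

if __name__ == "__main__":
    for channel, slack, consistent in check_underflows(SAMPLE):
        print(channel, slack, consistent, slack_seconds(slack))
```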

Your System (omit irrelevant parts)

Master process running on Ubuntu 20.04

ARTIQ version: 9.unknown.beta. Commit: c1f2ff3
Gateware version: 9.unknown.beta. Commit: c1f2ff3

Hardware involved:

Master Kasli v2.0 (DEST # 0)
- DIO BNC v1.3
- Urukul ad9910 v1.5
- Fastino v1.1
- Fastino v1.1
Satellite Kasli v2.0 (DEST # 1)
- SUServo
  - Urukul v1.5
  - Urukul v1.5
  - Sampler v2.2
- BNC IO rev1.1
Satellite Kasli v2.0.2 (DEST # 2)
- SUServo
  - Sampler v2.3
  - Urukul v1.5.2
  - Urukul v1.5.2
- Sampler v2.3
Satellite Kasli v2.0 (DEST # 3)
- Urukul v1.5
- Fastino v1.2
- Fastino v1.2
