DRTIO: Link errors and Sequence errors #2662

Open
JammyL opened this issue Jan 29, 2025 · 0 comments

Bug Report

One-Line Summary

After a non-deterministic period of time in our experiments, we encounter both link errors and sequence errors. These put two of our satellite devices (the ones showing sequence errors) in a bad state that requires a restart.
I have also opened a discussion here, but an issue feels more appropriate.

Issue Details

Even after reinitialising the devices and calling core.reset, devices on the offending satellites remain unresponsive. After a power cycle of the satellite or a reset of the master FPGA (artiq_flash start), the satellites come back online.

There is an added effect of experiments hanging. This can be patched by replacing the rtio_input_data calls for Sampler and SUServo with the timestamped alternative in the locations linked below.

return rtio_input_data(self.channel)

return rtio_input_data(self.channel)

The above behaviour leads me to believe that no rtio_output events are being triggered on these satellites, thus there is no input data to read, causing the experiments to hang.
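
The patch described above amounts to bounding the wait for input data instead of blocking indefinitely. A minimal host-side sketch of that logic follows; it is not kernel code. The helper `read_with_timeout` and the fake readers `healthy` and `stalled` are hypothetical, and the sketch assumes the timestamped read has the shape `(timeout_mu, channel) -> (timestamp, data)` and reports a timeout with a negative timestamp.

```python
def read_with_timeout(timestamped_read, channel, timeout_mu):
    """Return the input data word, or raise instead of blocking forever.

    `timestamped_read` stands in for a call shaped like
    (timeout_mu, channel) -> (timestamp, data).
    Assumption: a negative timestamp means the timeout elapsed.
    """
    timestamp, data = timestamped_read(timeout_mu, channel)
    if timestamp < 0:
        raise TimeoutError(
            f"no input event on channel {channel:#x} before {timeout_mu} mu"
        )
    return data


# Hypothetical fake readers standing in for a healthy and a stalled satellite.
def healthy(timeout_mu, channel):
    return (timeout_mu - 1, 0x1234)


def stalled(timeout_mu, channel):
    return (-1, 0)


print(hex(read_with_timeout(healthy, 0x08, 1000)))
try:
    read_with_timeout(stalled, 0x08, 1000)
except TimeoutError as exc:
    print("timed out:", exc)
```

With a bounded wait, a satellite that produces no input events surfaces as an explicit error in the experiment rather than as a hang.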

Steps to Reproduce

We are unsure. The behaviour is non-deterministic. The same experiment may produce an error after 10 minutes or 2 hours.

Expected Behavior

Running experiments without any errors. This is usually the case for up to several hours.

Actual (undesired) Behavior

After a non-deterministic amount of experiment run time (10 minutes to 2 hours), we see the following errors.
These errors show up across experiments that have run perfectly well for a few years.

DEST 0

[  1542.604249s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] error(s) found (0x04):
[  1542.610091s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] timeout attempting to get remote buffer space
[  1542.619944s]  WARN(runtime::rtio_mgt::drtio): [LINK#1] unsolicited aux packet: TSCAck
[  1542.627026s] ERROR(runtime::rtio_mgt::drtio): [LINK#1] error(s) found (0x04):
[  1542.634141s] ERROR(runtime::rtio_mgt::drtio): [LINK#1] timeout attempting to get remote buffer space
[  1542.645159s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0006:_1762eom
[  1542.655393s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] RTIO sequence error involving channel 0x0003:_ground_dp
[  1542.867581s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1543.081299s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1545.149344s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0003:_650dp
[  1545.363290s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0008:suservo0
[  1555.827323s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0003:_650dp
[  1556.041276s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1556.255132s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1556.468229s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0004:_614dp
[  1556.682110s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0011:_493eom_11_ttl
[  1556.896094s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1557.109112s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1557.322117s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0002:_493sigma
[  1557.535115s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1557.748161s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1557.961308s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0003:_650dp
[  1558.175227s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0006:_1762eom
[  1558.389158s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1558.602279s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1560.615320s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0003:_650dp
[  1560.829287s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0008:suservo0
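
To see which destinations and channels dominate the sequence errors in a log like the one above, the lines can be tallied with a short script. This is a rough sketch: the function name `count_sequence_errors` is made up for illustration, and the regular expression assumes the exact message format shown in this log.

```python
import re
from collections import Counter

# Matches e.g. "[DEST#1] RTIO sequence error involving channel 0x0005:_1762sp"
SEQ_ERR = re.compile(
    r"\[DEST#(?P<dest>\d+)\] RTIO sequence error involving channel "
    r"0x(?P<chan>[0-9a-fA-F]+):(?P<name>\S+)"
)


def count_sequence_errors(log_text):
    """Count sequence errors per (destination, channel number, channel name)."""
    counts = Counter()
    for m in SEQ_ERR.finditer(log_text):
        key = (int(m["dest"]), int(m["chan"], 16), m["name"])
        counts[key] += 1
    return counts


# First four sequence-error lines of the DEST 0 log, copied verbatim.
sample = """\
[  1542.645159s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0006:_1762eom
[  1542.655393s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] RTIO sequence error involving channel 0x0003:_ground_dp
[  1542.867581s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
[  1543.081299s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] RTIO sequence error involving channel 0x0005:_1762sp
"""

counts = count_sequence_errors(sample)
for (dest, chan, name), n in counts.most_common():
    print(f"DEST#{dest} channel 0x{chan:04x} ({name}): {n}")
```

Feeding the full log through the same function gives a per-channel breakdown that can be compared between failures.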

DEST 1

[  2784.146092s] ERROR(satman): received packet of an unknown type
[  2784.150586s] ERROR(satman): timeout attempting to get buffer space from CRI, destination=0xc0
[  2784.159099s] ERROR(satman): write underflow, channel=0, timestamp=1545247125414, counter=1545372549488, slack=-125424074
[  2784.173638s] ERROR(satman): received packet of an unknown type
[  2784.178139s] ERROR(satman): received truncated packet
[  2784.183159s] ERROR(satman): write underflow, channel=1, timestamp=437905711215, counter=1545400101

DEST 2

[  2930.554069s] ERROR(satman): write underflow, channel=6, timestamp=1270499749083, counter=1545372648680, slack=-274872899597
[  2930.563936s]  INFO(satman): TSC loaded from uplink
[  2930.568706s] ERROR(satman): received packet of an unknown type
[  2930.574510s] ERROR(satman): received truncated packet
[  2930.579547s] ERROR(satman): timeout attempting to get buffer space from CRI, destination=0x57
[  2930.588077s] ERROR(satman): write underflow, channel=2, timestamp=1546005020011, counter=12363432125432, slack=-10817427105421
[  2943.699773s] ERROR(satman): write underflow, channel=3, timestamp=1558521121013, counter=12376550453472, slack=-10818029332459
[  3012.501525s] ERROR(satman): write underflow, channel=3, timestamp=1627322209389, counter=124453
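
In the complete underflow lines above, the logged slack equals timestamp minus counter. The sketch below checks that relation on three lines copied from the satellite logs. The helper name `parse_underflows` is hypothetical, and the regular expression assumes the message format shown here.

```python
import re

UNDERFLOW = re.compile(
    r"write underflow, channel=(?P<chan>\d+), timestamp=(?P<ts>\d+), "
    r"counter=(?P<ctr>\d+)(?:, slack=(?P<slack>-?\d+))?"
)


def parse_underflows(log_text):
    """Return (channel, timestamp, counter, computed_slack, logged_slack) rows.

    logged_slack is None when the line carries no slack field.
    """
    rows = []
    for m in UNDERFLOW.finditer(log_text):
        ts, ctr = int(m["ts"]), int(m["ctr"])
        logged = int(m["slack"]) if m["slack"] is not None else None
        rows.append((int(m["chan"]), ts, ctr, ts - ctr, logged))
    return rows


sample = """\
[  2784.159099s] ERROR(satman): write underflow, channel=0, timestamp=1545247125414, counter=1545372549488, slack=-125424074
[  2930.554069s] ERROR(satman): write underflow, channel=6, timestamp=1270499749083, counter=1545372648680, slack=-274872899597
[  2930.588077s] ERROR(satman): write underflow, channel=2, timestamp=1546005020011, counter=12363432125432, slack=-10817427105421
"""

rows = parse_underflows(sample)
for chan, ts, ctr, computed, logged in rows:
    print(f"channel={chan} computed_slack={computed} matches_log={computed == logged}")
```

Lines without a slack field may have been cut off when the log was pasted, so a slack computed from them should not be trusted.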

DEST 3
Shows nothing in the log; it isn't involved in this particular experiment.

Your System (omit irrelevant parts)

Master process running on Ubuntu 20.04

ARTIQ version: 9.unknown.beta. Commit: c1f2ff3
Gateware version: 9.unknown.beta. Commit: c1f2ff3

  • Hardware involved:
  • Master Kasli v2.0 (DEST # 0)
    • DIO BNC v1.3
    • Urukul ad9910 v1.5
    • Fastino v1.1
    • Fastino v1.1
  • Satellite Kasli v2.0 (DEST # 1)
    • SUServo
      • Urukul v1.5
      • Urukul v1.5
      • Sampler v2.2
    • BNC IO rev1.1
  • Satellite Kasli v2.0.2 (DEST # 2)
    • SUServo
      • Sampler v2.3
      • Urukul v1.5.2
      • Urukul v1.5.2
    • Sampler v2.3
  • Satellite Kasli v2.0 (DEST # 3)
    • Urukul v1.5
    • Fastino v1.2
    • Fastino v1.2