Bug Report
One-Line Summary
After a non-deterministic period of run time, our experiments encounter both link errors and sequence errors. These leave two of our satellite devices (the ones showing sequence errors) in a bad state that requires a restart.
I have also opened a discussion here, but an issue feels more appropriate.
Issue Details
Even after reinitialising the devices and calling core.reset, devices on the offending satellites remain unresponsive. The satellites only come back online after a power cycle of the satellite or a reset of the master FPGA (artiq_flash start).
A side effect is that experiments hang. This can be worked around by replacing the rtio_input_data calls for Sampler and SUServo with the timestamped alternative in the locations linked below.
The above behaviour leads me to believe that no rtio_output events are being triggered on these satellites, thus there is no input data to read, causing the experiments to hang.
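The workaround above amounts to replacing an unbounded wait for input data with a bounded one, so that a silent satellite produces an error instead of a hang. Below is a minimal pure-Python sketch of that pattern. It is not ARTIQ kernel code: `FakeInputChannel`, `read_timestamped` and `read_or_fail` are hypothetical names, and the `(-1, 0)` timeout result is an assumption modelled on how we understand ARTIQ's timestamped read to signal a timeout.

```python
import queue


class FakeInputChannel:
    """Hypothetical stand-in for an RTIO input channel (not ARTIQ code)."""

    def __init__(self):
        self._events = queue.Queue()

    def push(self, timestamp, data):
        # Simulate the gateware delivering an input event.
        self._events.put((timestamp, data))

    def read_timestamped(self, timeout_s):
        # Bounded wait: return (timestamp, data), or (-1, 0) on timeout.
        # The (-1, 0) sentinel is an assumption, see the text above.
        try:
            return self._events.get(timeout=timeout_s)
        except queue.Empty:
            return (-1, 0)


def read_or_fail(channel, timeout_s=0.01):
    # Turn a missing input event into an explicit error instead of a hang.
    timestamp, data = channel.read_timestamped(timeout_s)
    if timestamp < 0:
        raise RuntimeError("no input event before timeout")
    return data


if __name__ == "__main__":
    ch = FakeInputChannel()
    ch.push(100, 42)
    print(read_or_fail(ch))  # an event is queued, so this returns its data
    try:
        read_or_fail(ch)     # nothing queued: raises instead of blocking
    except RuntimeError as exc:
        print("read failed:", exc)
```

With a bounded read the experiment fails fast, which at least makes the unresponsive satellite visible rather than leaving the experiment stuck.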
Steps to Reproduce
We are unsure. The behaviour is non-deterministic. The same experiment may produce an error after 10 minutes or 2 hours.
Expected Behavior
Running experiments without any errors. This is usually the case for up to several hours.
Actual (undesired) Behavior
After a non-deterministic amount of experiment run time (10 minutes to 2 hours) we see the following errors.
These errors show up across experiments that have run perfectly well for a few years.
[ 2784.146092s] ERROR(satman): received packet of an unknown type
[ 2784.150586s] ERROR(satman): timeout attempting to get buffer space from CRI, destination=0xc0
[ 2784.159099s] ERROR(satman): write underflow, channel=0, timestamp=1545247125414, counter=1545372549488, slack=-125424074
[ 2784.173638s] ERROR(satman): received packet of an unknown type
[ 2784.178139s] ERROR(satman): received truncated packet
[ 2784.183159s] ERROR(satman): write underflow, channel=1, timestamp=437905711215, counter=1545400101
DEST 2
[ 2930.554069s] ERROR(satman): write underflow, channel=6, timestamp=1270499749083, counter=1545372648680, slack=-274872899597
[ 2930.563936s] INFO(satman): TSC loaded from uplink
[ 2930.568706s] ERROR(satman): received packet of an unknown type
[ 2930.574510s] ERROR(satman): received truncated packet
[ 2930.579547s] ERROR(satman): timeout attempting to get buffer space from CRI, destination=0x57
[ 2930.588077s] ERROR(satman): write underflow, channel=2, timestamp=1546005020011, counter=12363432125432, slack=-10817427105421
[ 2943.699773s] ERROR(satman): write underflow, channel=3, timestamp=1558521121013, counter=12376550453472, slack=-10818029332459
[ 3012.501525s] ERROR(satman): write underflow, channel=3, timestamp=1627322209389, counter=124453
DEST 3
Shows nothing on the log - it isn't involved in this particular experiment.
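For anyone triaging similar logs, the following is a minimal sketch of a script that extracts the write underflow records and recomputes the slack. In the complete lines above, the reported slack equals timestamp minus counter. The regular expression is written against the line format shown here only; `parse_underflows` and the dictionary keys are hypothetical names, not part of ARTIQ.

```python
import re

# Matches satman "write underflow" lines in the format shown in the logs above.
# The slack field is optional because some captured lines are cut off.
UNDERFLOW_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)s\]\s+ERROR\(satman\): write underflow, "
    r"channel=(?P<channel>\d+), timestamp=(?P<timestamp>\d+), "
    r"counter=(?P<counter>\d+)(?:, slack=(?P<slack>-?\d+))?"
)


def parse_underflows(log_text):
    """Return one dict per write underflow line found in log_text."""
    events = []
    for line in log_text.splitlines():
        m = UNDERFLOW_RE.search(line)
        if m is None:
            continue  # not an underflow record
        timestamp = int(m["timestamp"])
        counter = int(m["counter"])
        reported = m["slack"]
        events.append({
            "uptime_s": float(m["uptime"]),
            "channel": int(m["channel"]),
            "timestamp": timestamp,
            "counter": counter,
            # None means the line was cut off before the slack field;
            # the counter may then be incomplete too, so treat
            # computed_slack as unreliable for such lines.
            "reported_slack": None if reported is None else int(reported),
            "computed_slack": timestamp - counter,
        })
    return events


SAMPLE_LOG = """\
[ 2784.159099s] ERROR(satman): write underflow, channel=0, timestamp=1545247125414, counter=1545372549488, slack=-125424074
[ 2784.178139s] ERROR(satman): received truncated packet
[ 2784.183159s] ERROR(satman): write underflow, channel=1, timestamp=437905711215, counter=1545400101
"""

if __name__ == "__main__":
    for event in parse_underflows(SAMPLE_LOG):
        print(event)
```

Comparing `reported_slack` with `computed_slack` across a full log is a quick way to confirm that the records are internally consistent and to spot lines that were cut off in capture.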
Locations of the rtio_input_data calls referred to above:
artiq/artiq/coredevice/suservo.py, line 149 in c1f2ff3
artiq/artiq/coredevice/spi2.py, line 241 in c1f2ff3
Your System (omit irrelevant parts)
Master process running on Ubuntu 20.04
ARTIQ version: 9.unknown.beta. Commit: c1f2ff3
Gateware version: 9.unknown.beta. Commit: c1f2ff3