Runtime error caused by full pipe to child process when lots of L1 nodes are started at once #34

Open
cubranic opened this issue Sep 3, 2019 · 4 comments

Comments

@cubranic

cubranic commented Sep 3, 2019

From the logs (as shared in #frb-ops):

Aug 29 19:55:55 cf1n0 ch-frb-l1-dispatch.sh: ch_frb_io: listening for packets (ip_addr=10.6.201.10, udp_port=1313)
Aug 29 19:55:55 cf1n0 ch-frb-l1-dispatch.sh: [Assembler-2] Found frame0_nano: 1566915994999999960
Aug 29 19:55:55 cf1n0 ch-frb-l1-dispatch.sh: [Assembler-1] Retrieving frame0_ctime from http://carillon.chime:54321/get-frame0-time
Aug 29 19:55:55 cf1n0 ch-frb-l1-dispatch.sh: [Assembler-1] Fetching frame0_time from http://carillon.chime:54321/get-frame0-time
Aug 29 19:55:56 cf1n0 ch-frb-l1-dispatch.sh: [Assembler-1] Found frame0_nano: 1566915994999999960
Aug 29 19:56:07 cf1n0 ch-frb-l1-dispatch.sh: terminate called after throwing an instance of 'std::runtime_error'
Aug 29 19:56:07 cf1n0 ch-frb-l1-dispatch.sh: what():  bonsai::trigger_pipe::write(): pipe is full and timeout expired, looks like child process isn't reading quickly enough from the PipedDedisperser
Aug 29 19:56:08 cf1n0 ch-frb-l1-dispatch.sh: /home/l1operator/ch-frb-l1-dispatch.sh: line 104:  3722 Aborted                 ./ch-frb-l1 ${L1_ARGS} ${L1_CONFIG} ../ch_frb_rfi/json_files/rfi_16k/${RFI_CONFIG} /data/bonsai_configs/${BONSAI_CONFIG} ${L1B_CONFIG}
Aug 29 19:56:08 cf1n0 systemd: ch-frb-l1.service: main process exited, code=exited, status=134/n/a
Aug 29 19:56:09 cf1n0 systemd: Unit ch-frb-l1.service entered failed state.
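The failure mode in the log (a writer giving up because the reader is not draining the pipe in time) can be reproduced in isolation. The following is a minimal sketch, not the bonsai implementation: `write_with_timeout` and `demo` are hypothetical names, and the 0.2 s timeout and 4096-byte chunk size are arbitrary choices for illustration.

```python
import os
import select
import time


def write_with_timeout(fd, data, timeout_s):
    """Write all of `data` to a non-blocking pipe, or raise if it stays full.

    Hypothetical helper: it only mimics the behaviour described by the
    error message in the log, it is not taken from bonsai.
    """
    deadline = time.monotonic() + timeout_s
    view = memoryview(data)
    while view:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise RuntimeError("pipe is full and timeout expired")
        # Block until the kernel reports free space, or the deadline passes.
        _, writable, _ = select.select([], [fd], [], remaining)
        if not writable:
            raise RuntimeError("pipe is full and timeout expired")
        try:
            written = os.write(fd, view)
        except BlockingIOError:
            continue  # lost the race for the free space; wait again
        view = view[written:]


def demo():
    """Fill a pipe that nobody reads from and return the resulting error text."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    chunk = b"x" * 4096
    try:
        # The read end is never drained, so the pipe eventually fills up.
        for _ in range(100000):
            write_with_timeout(write_fd, chunk, timeout_s=0.2)
    except RuntimeError as exc:
        return str(exc)
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return "no error"
```

Calling `demo()` stands in for a child process that has not yet started reading. A slow child at startup would produce the same symptom, which seems consistent with the crash appearing when many nodes start at once.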
@kmsmith137
Owner

Hmm, not sure what's going on!
Can you check the value of /proc/sys/fs/pipe-max-size on the node?
Is the parameter 'l1b_buffer_nsamples' specified in the L1 config file?

@cubranic
Author

cubranic commented Sep 3, 2019

I don't think there is anything particular about this node; it's just the first one I found when checking the logs. I believe this is the reason why tsars can't start up L1 too quickly across the can, or start it everywhere at once, but have to go approximately a rack at a time.

Anyways, pipe-max-size is 16 MB:

[root@cf1n0 ~]# cat /proc/sys/fs/pipe-max-size
16777216
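Note that `pipe-max-size` is only the ceiling an unprivileged process may request (16777216 bytes = 16 MiB here). On Linux a freshly created pipe normally starts at a smaller default capacity unless it is enlarged with `F_SETPIPE_SZ`. A rough sketch for checking both values from Python follows. The helper names are hypothetical, and whether L1 resizes its pipe to the child process is not established in this thread.

```python
import fcntl
import os

# Linux-specific fcntl commands. The fcntl module exposes them from
# Python 3.10; the numeric fallbacks are the Linux values.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)


def pipe_max_size(path="/proc/sys/fs/pipe-max-size"):
    """Return the system-wide ceiling for unprivileged pipe resizing, in bytes."""
    with open(path) as f:
        return int(f.read())


def pipe_capacity(fd):
    """Return the current capacity, in bytes, of the pipe behind `fd`."""
    return fcntl.fcntl(fd, F_GETPIPE_SZ)


def try_grow_pipe(fd, nbytes):
    """Ask the kernel to resize the pipe and return the resulting capacity.

    The kernel may round the request up, or refuse it (for example when
    `nbytes` exceeds pipe-max-size for an unprivileged process).
    """
    try:
        fcntl.fcntl(fd, F_SETPIPE_SZ, nbytes)
    except OSError:
        pass  # request refused; report whatever capacity the pipe has
    return pipe_capacity(fd)
```

For example, after `r, w = os.pipe()`, comparing `pipe_capacity(w)` with `pipe_max_size()` shows how much headroom is left before the limit is reached.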

I'm not sure exactly which config is being used. I think it's "bonsai_production_noups_nbeta2_v4.hdf5", which I found in /data, but I haven't found the string 'l1b_buffer_nsamples' in the output of h5dump on it. Maybe @dstndstn or someone else more knowledgeable about L1 than I am can answer that.

@cubranic
Author

cubranic commented Sep 3, 2019

cc @chitrangpatel

@dstndstn
Collaborator

dstndstn commented Sep 3, 2019 via email


3 participants