Packet loss with multiple disk sizes, when some become full #30
Hi @varenius! I see; this can indeed be an issue.

Thank you for reminding me about the set_disks command. A strategy like you propose could indeed work, and in principle it could cover what I wished for already - thanks! However, for the particular case at hand, we'd like to avoid manual tallies of "which disk to use when" as much as possible. As it is, with one group of "small" disks full, some space becomes available when an experiment is deleted, but perhaps not enough to use all the disks for the complete duration of the next experiment. That makes for a somewhat complicated disk-surveillance strategy. If jive5ab could just "forget about a disk when full until further notice", then one would not have to plan ahead, but could simply rely on full disks being omitted. It's a simpler book-keeping problem. Still, it's some manual work to keep track of, and not using all disks all the time is in principle suboptimal for performance.

So, for this particular case, where we have two flexbuffs in this situation (two blocks of disks each, one big and one small), I'm wondering if the pragmatic solution would simply be the hardware fix of shuffling disks around so that one machine gets all the "big" disks and the other all the "small" disks. Each machine would then have disks of only one size. For now, that would fix the problem, at the expense of one flexbuff having more space than the other (but that's fine). I will ponder this a bit more.

I still think that a flag to make jive5ab "remember bad disks" until a restart, or until a manual clear (say, a set_disks= command), could be useful. But perhaps it's not the simplest answer to my current issue :).
OSO flexbuff disks are sometimes upgraded in batches, not all at once. For one machine, this means we have 12 x 20 TB and 18 x 12 TB. Because jive5ab stripes evenly across all disks, this configuration is a problem: when many disks are full, there is a risk that a record command spends so much time trying to find a free disk that we get significant packet loss. This happened yesterday in experiment vr2303_oe, where 623 scans had 5-25% loss. One example:
Thing is: we can sustain this 8 Gbps recording rate just fine with the 12 disks that do have free space. So the problem, as I understand it, is that jive5ab keeps re-checking the same full disks for every scan.
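To make the cost concrete, here is a minimal sketch (not jive5ab code; all names and the threshold are illustrative) of what naive per-scan selection looks like when every disk, full or not, is probed on every scan:

```python
# Hypothetical sketch: naive per-scan disk selection that re-probes
# every disk each time, including ones already known to be full.

MIN_FREE_GB = 100  # illustrative threshold to accept a scan on a disk

def usable_disks(free_gb):
    """Probe every disk; full disks are re-checked in vain each scan."""
    probes = 0
    usable = []
    for disk, free in free_gb.items():
        probes += 1                      # each probe costs time mid-recording
        if free >= MIN_FREE_GB:
            usable.append(disk)
    return usable, probes

# Mirrors the mixed box: 12 disks with space, 18 already full.
free = {f"big{i}": 5000 for i in range(12)}
free.update({f"small{i}": 0 for i in range(18)})

disks, probes = usable_disks(free)
print(len(disks), probes)  # 12 usable disks, yet 30 probes on every scan
```

The 18 wasted probes repeat for every single scan, which is exactly the per-scan overhead described above.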
During an experiment, disks are unlikely to gain free space. So one could imagine that when jive5ab encounters a full disk, that disk is taken out of the pool until some "clear allowed disks" (or similar) command is sent. One could then send a "clear" command at the start of every experiment, and when disks fill up, jive5ab would only fail once per disk during the experiment, instead of repeating the same search in vain for every scan.
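The proposed behaviour could be sketched like this (a hypothetical illustration, not jive5ab internals; `DiskPool`, `mark_full`, and `clear` are invented names standing in for the suggested "remember bad disks until cleared" flag):

```python
# Hypothetical sketch of the proposal: once a disk reports full, drop it
# from the candidate pool until an explicit clear (e.g. something in the
# spirit of a set_disks= command) re-admits it.

class DiskPool:
    def __init__(self, disks):
        self.disks = list(disks)
        self.banned = set()          # disks seen full since the last clear

    def candidates(self):
        """Disks still worth probing for the next scan."""
        return [d for d in self.disks if d not in self.banned]

    def mark_full(self, disk):
        self.banned.add(disk)        # fail once, then stop probing it

    def clear(self):
        self.banned.clear()          # e.g. at the start of each experiment

pool = DiskPool(["d0", "d1", "d2"])
pool.mark_full("d1")
print(pool.candidates())        # ['d0', 'd2'] until clear() is called
pool.clear()
print(len(pool.candidates()))   # 3
```

With this, each full disk costs exactly one failed attempt per experiment rather than one per scan.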