Unexpected behaviour change after download worker refactor #302
Comments
@MattUnderscoreZhang could you please take a look? Indeed this is definitely not expected.
Reverting commit 83afef0 restored the old behavior for me. There still seems to be a much higher number of threads than processes_count × thread_count (I hadn't checked that before), but at least it creates the correct number of shards at a time. Oh, BTW, the setting is called processes_count, not process_count, even though the latter might sound more consistent with thread_count; noting this for anyone grepping the source tree. It confused me at first, since the only occurrence was in the no_distributor() function.
Yeah, I might have merged that a bit too fast. Will look more, and we can probably revert if there's no easy fix.
I’ll take a look as soon as I get home. It’s likely a change I made to the semaphores. I’ll commit a fix in a new pull request and see if I can add a relevant unit test.
@rom1504 I created a pull request that should limit processing to the correct number of shards at a time (as tested in my local environment). However, I see the same thing that @pabl0 does: the number of threads created is higher than expected from processes_count × thread_count, going far back in the commit history. This seems to be a bug that has existed for a long time. Testing with a 5k-sample dataset, I set number_sample_per_shard=100, processes_count=5, thread_count=2. As a result, my config requires 50 shards, but I only download 5 at a time. As you mentioned, the number of threads is higher than expected.
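For reference, the shard and thread budget implied by that configuration works out as follows. This is a small standalone sketch; the numbers are taken from the comment above, and nothing here invokes video2dataset itself:

```python
# Shard and thread arithmetic for the configuration described above.
# Values come from the comment; nothing here calls video2dataset itself.
import math

total_samples = 5_000
number_sample_per_shard = 100
processes_count = 5
thread_count = 2

total_shards = math.ceil(total_samples / number_sample_per_shard)  # 50 shards overall
shards_in_flight = processes_count                                 # 5 shards processed at a time
download_threads = processes_count * thread_count                  # 10 download threads expected

print(total_shards, shards_in_flight, download_threads)  # 50 5 10
```

So 5 shards in flight matches the fix, while hundreds of observed threads would clearly exceed the 10-thread budget above.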
Thanks @rom1504 @MattUnderscoreZhang. Your fix does indeed seem to improve things in my quick testing. However, the current behavior seems to be that it creates batches of N tar files but only writes to one at a time: only one process appears to be active, the rest do no work, and the other tar files remain empty. BTW, is the output ".../_tmp/X.fether" caused by the print statement added by the worker refactor PR? It feels to me like unnecessary debug output that should be dropped.
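One way to make that kind of stall easier to spot is to poll the sizes of the shard tar files while a download is running. This is only a rough diagnostic sketch; the output directory name is a placeholder for whatever output_folder you actually use:

```python
# Diagnostic sketch: print the size of every shard tar in the output folder
# every few seconds, so it is easy to see whether one or several files grow.
# "dataset_output" is a placeholder; point it at your actual output folder.
import time
from pathlib import Path

output_dir = Path("dataset_output")

for _ in range(10):  # poll ten times, then stop
    sizes = {p.name: p.stat().st_size for p in sorted(output_dir.glob("*.tar"))}
    print(sizes)
    time.sleep(5)
```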
@pabl0 I can't seem to replicate your observation. What environment are you running on, and what configs are you using? For reference, here's what I see: with five processes, I see five tar files being written simultaneously.
Also, I removed the print statement on the threading_fix path. Nice catch.
I see. IIRC, the WebVid dataset consists of normal HTTP links, which use a different download backend than YouTube datasets. Is it possible they behave differently?
Previously, video2dataset would create and download distribution.process_count shards at a time. So if I had 10 processes and, say, 32 threads configured, it would create ten .tar and .parquet files; after finishing, the _stats.json file would be created (triggering my automation to upload the downloaded data elsewhere), and video2dataset would move on to processing new shards.
After the recent changes, it seems to be processing hundreds of shards at a time. Also, with process_count 4 and thread_count 16, I see some 300-400 threads running instead of the expected 64 or so. I don't think it works as designed, but if it does, how can I limit the number of shards processed at a time so that my download pipeline keeps working smoothly?
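One way to check the observed thread count against the expected processes_count × thread_count budget is to sum the OS-level thread counts of the worker processes, for example with psutil. The matching rule below is an assumption and may need adjusting to how the workers actually show up on a given machine:

```python
# Diagnostic sketch: add up the thread counts of all processes whose command
# line mentions video2dataset, then compare against processes_count * thread_count.
import psutil

total = 0
for proc in psutil.process_iter(["pid", "cmdline", "num_threads"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "video2dataset" in cmdline and proc.info["num_threads"]:
        print(proc.info["pid"], proc.info["num_threads"])
        total += proc.info["num_threads"]

print("total threads:", total)
```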