Batch record retrieval #742
Comments
Multithreading breakdown performance

I ran a few benchmarks with different chunk sizes and thread counts on my 8-core and 32-core machines, mostly because I was curious how much these impacted the results I showed above. In terms of the number of records to fetch per thread, the sweet spot seems to be 5-10 records. In terms of the number of concurrent threads, 8-10 threads per core.

Fetching 1000 records on an 8-core machine:
A few notes: Getting records one-by-one will always be very slow because of various overheads. Datasets have an `iterate_records` method for fetching records in bulk rather than one at a time.

Possibly related, but I did some investigation yesterday. We are indeed having some networking hardware issues that we are trying to pinpoint, so that might be adding some slowdown and variability to your requests as well.
I'll note the Bad Gateway error would occur regardless of whether I used a threaded or non-threaded implementation. Even when using `iterate_records`, I am still getting a Bad Gateway error roughly 1 out of every 3 times I try to fetch the dataset.
FYI, the networking issues here at VT seem to have been resolved, so hopefully things are generally faster. I will look more deeply into the Bad Gateway errors, but at least at the moment I couldn't reproduce them.
I just ran a bunch of "tests" (essentially just running a few loops of grabbing the same dataset above) with no gateway errors.
Ok that's funny, because now I am able to reproduce the Bad Gateway error.
I think this has been resolved, but if you run into problems again let me know!
I'm running into this same problem. I've been trying all day to download data with the script at https://github.com/openmm/spice-dataset/tree/main/downloader. It invariably fails with the exception
though only after working for anywhere from 20 minutes to 2.5 hours. I've tried from two different computers. One is at my home with a not great internet connection. The other is in a data center at Stanford with a very high speed connection. Both of them fail.
The downloads are also going ridiculously slowly, sometimes taking over an hour for a single dataset. Almost all the time is spent in just two lines: `recs = list(dataset.iterate_records(specification_names=specifications))` and `all_molecules = client.get_molecules([r.molecule_id for e, s, r in recs])`.
The 502 errors are hard to debug, and are something I've occasionally seen before. I see them in the logs but I don't see much more detail than that. I think it's related to some interaction between Traefik and gunicorn (where no workers are available), but I need to dig deeper. Unfortunately it is hard to reproduce. I'm running the downloader script now and it seems to be chugging along. It could just be the amount of data being downloaded, limited by cross-country bandwidth. If you run the following (which just fetches all the records of the first dataset), how long does it take for you? For me it takes about 2 minutes.
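Something along these lines, perhaps (a sketch only; the server address and dataset choice are placeholders, not the exact snippet):

```python
# Time how long it takes to fetch every record of one dataset.
# The address and dataset name below are placeholders.
import time
from qcportal import PortalClient

client = PortalClient("https://ml.qcarchive.molssi.org")
ds = client.get_dataset("singlepoint", "QM9")  # stand-in for "the first dataset"

start = time.time()
recs = list(ds.iterate_records())  # yields (entry_name, spec_name, record) tuples
print(f"Fetched {len(recs)} records in {time.time() - start:.1f} s")
```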
The good-ish news is that I have been working on the local caching stuff, and that is almost ready. So this kind of thing might become much easier in the future. I do see that when getting records, there is a needlessly complicated query happening, although I doubt it affects things too much.
If I also retrieve the molecules it takes much longer.
Oh my, this is not what I expected. I added some print statements for each part of fetching molecules, and the requests are being handled just fine (100-200 ms). But parsing the JSON into the molecule objects is taking far too long. I need to look at this ASAP.
Almost 2 minutes to convert JSON to 250 molecules is definitely not right.
Ok, I have a PR (#798) for the server that should fix this, but I need to test a bit to make sure it actually fixes the issue.
Thanks for the quick fix! Let me know when it's ready to test.
New release is out and the ML server has been upgraded. Here's what I get now:
Much better!
I'm still getting this error. I've been trying all day to download dataset 348. It invariably fails after anything from a few minutes to a couple of hours, usually with the Bad Gateway error, occasionally with a read timeout.
Looking through the logs, I do see a hint that the instance is running out of memory. I've increased the memory allocated to it through Docker, and also reconfigured the number of processes/threads handling requests for each container. Let's see if that helps.

Ok, let me see if I can reproduce this tomorrow on a different instance, but I think it is something subtle like this. This is annoying, but thanks for being patient!
Thanks! I'm trying again.
Success, thanks!
Yes, I haven't seen any additional errors. But the increasing memory usage is unsettling. Something is happening in SQLAlchemy that is causing increasing memory use as time goes on. I can reproduce it with QCFractal, even outside of Flask. And I can see what data structures are being held onto too long, but it's hard to reproduce it with a small self-contained script.
This issue relates to downloading large batches of records, such as fetching an entire dataset.
There are two main hurdles I've run into with the batch download:
1- Efficiency of the process
2- Loss of connection to the database during retrieval
Note, this issue relates to issues #740 and #741.
Here, I will outline some benchmarks and my attempted solutions.
Serial retrieval of records benchmarks
Currently, if I want to fetch records, I would do something similar to:
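For concreteness, a minimal sketch of this kind of serial fetch (the server address, dataset, and specification name are placeholders, not the exact code used):

```python
# Serial record retrieval: one request per record.
from qcportal import PortalClient

client = PortalClient("https://ml.qcarchive.molssi.org")  # placeholder address
ds = client.get_dataset("singlepoint", "QM9")             # placeholder dataset

def get_records(entry_names, specification="spec_1"):
    # Fetch the record for each entry name, one at a time
    return [ds.get_record(name, specification) for name in entry_names]

records = get_records(list(ds.entry_names)[:100])
print(f"Fetched {len(records)} records")
```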
The performance for fetching a small number of records is good:
Since this is a serial process, as might be expected, timing scales with the number of records.
Based on this scaling, fetching the whole QM9 dataset (with ~133K records) would take about 13 hours (in practice this is about right).
Multithreaded retrieval benchmarks
I used `concurrent.futures` to wrap the `get_records` function above, to allow for concurrent multithreading. I'll note I chose `concurrent.futures` because it works in Jupyter notebooks without needing to do any sort of "tricks". The code below is a pretty bare-bones implementation of multithreaded retrieval. I'll note this code will chunk together 25 records per thread (i.e., `chunk_size`) and allow for 48 concurrent threads (i.e., `n_threads`); these values seem to be pretty reasonable on my 8-core MacBook.
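A minimal sketch of such an implementation (reusing the placeholder `get_records` and `ds` from the serial sketch above; not necessarily the exact code):

```python
# Multithreaded retrieval: split the work into chunks and fetch them concurrently.
from concurrent.futures import ThreadPoolExecutor

chunk_size = 25   # records fetched per thread
n_threads = 48    # concurrent threads

def get_records_threaded(entry_names, specification="spec_1"):
    # Break the entry names into chunks of chunk_size
    chunks = [entry_names[i:i + chunk_size]
              for i in range(0, len(entry_names), chunk_size)]

    records = []
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        # Each chunk is fetched with the serial get_records() defined earlier
        for result in executor.map(lambda c: get_records(c, specification), chunks):
            records.extend(result)
    return records

records = get_records_threaded(list(ds.entry_names)[:1000])
print(f"Fetched {len(records)} records")
```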
Benchmarks for fetching records:
Extrapolating to the whole QM9 dataset, it would take under 15 minutes to fetch the 133K records, which seems quite reasonable compared to 13 hours.
Loss of database connectivity
If I up the number of records to, say, 10000, at some point during the process I get the following error:
Note, if I just turn off my wifi (to simulate an internet hiccup), I get the following error:
I modified the code above (shown below) to simply put the record retrieval in a try/except structure (inside a while loop that retries a maximum of 1000 times). I'll note that in the code below, instead of recording the actual data fetched, I just keep track of the number of "failures" to connect to the server.
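A sketch of what that retry wrapper might look like (again building on the placeholder `get_records`, `ds`, `chunk_size`, and `n_threads` from the earlier sketches):

```python
# Retry wrapper: each chunk is retried up to max_retries times, and the number
# of failed attempts is counted instead of keeping the fetched data.
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk_with_retry(entry_names, specification="spec_1", max_retries=1000):
    failures = 0
    while failures < max_retries:
        try:
            get_records(entry_names, specification)
            return failures  # number of failed attempts before success
        except Exception:
            failures += 1
    return failures

def count_failures(entry_names):
    chunks = [entry_names[i:i + chunk_size]
              for i in range(0, len(entry_names), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        per_chunk_failures = list(executor.map(fetch_chunk_with_retry, chunks))
    return sum(per_chunk_failures)

print(count_failures(list(ds.entry_names)[:10000]))
```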
I had this code fetch 10000 records 3 times; the timing was very consistent at about 100 s.
Interestingly, the number of failures to connect didn't change much between the 3 runs: 405, 407, and 408. If it were just my internet being a bit flaky, I would have expected those to be less consistent. Since the data is being broken up into 25-record chunks, that means 400 total threads, so the ~400 failures seem a little suspect. Digging a little into these numbers, each thread, on its first call to the portal, ends up in the except statement. This confuses me because, if every single thread needs to reconnect initially, how did the first implementation of the threading, without the try/except, even work? (There must be something odd about the try/except statement that is eluding me.) Regardless, this means that 400 of those "failures" aren't real, so we are really dealing with 5, 7, and 8, i.e., fewer than 1 failure to connect for every 1000 records fetched.
Issue #741 suggests using retries behind the scenes, which seems to be a more user-friendly approach and would allow for a bit smarter handling of connection retries.
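For illustration only, retries of that kind are often implemented at the HTTP layer, e.g. with `urllib3`'s `Retry` mounted on a `requests` session (a generic sketch, not how qcportal does or necessarily should implement it):

```python
# Generic "behind the scenes" retry: the session transparently retries failed
# requests with exponential backoff before raising an error to the caller.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=5,                           # retry each request up to 5 times
    backoff_factor=0.5,                # exponential backoff: 0.5 s, 1 s, 2 s, ...
    status_forcelist=[502, 503, 504],  # retry on gateway/availability errors
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# Placeholder request; a client built on this session would retry automatically
resp = session.get("https://example.org/api/records")
```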
Creating multithreaded `get_records()` and `get_entries()` functions in qcportal would be quite beneficial.