Memory leak in daemon runners? #4603
Comments
Question for the engine-savvy people (@sphuber @muhrin @unkcpz) - I guess when a workflow starts, submits a subprocess, and then waits, the whole state is kept in memory. In this way, one can always keep the memory usage very low, and therefore it becomes very easy to greatly increase the number of slots per worker without risking overloading the machine. |
It would definitely be possible but it would require a significant redesign and rewrite. The current idea is that the To implement what you suggest, we would have to change this |
I see. Still, this leaves us with a lot of slots taken when running many workflows, as we know, and potentially a large memory usage (even if this leak is fixed - this issue is indeed in the memory left over after the processes should not be in memory anymore). There should be a LightProcess superclass that implements only the methods to listen/schedule callbacks/..., from which Process inherits, and LightProcess has a method to return the corresponding "full" Process when needed? Does it make sense? |
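To make the LightProcess idea above more concrete, here is a minimal sketch. Everything in it is hypothetical: `LightProcess`, `Process`, `load_full` and the `storage` dictionary are illustrative names only and do not correspond to existing plumpy or aiida-core classes. It only shows the shape of the proposal: a small object that holds an identifier and callbacks, and that can produce the "full" object on demand.

```python
from typing import Any, Callable, Dict, List


class LightProcess:
    """Hypothetical lightweight handle: keeps only an id and its callbacks."""

    def __init__(self, pk: int) -> None:
        self.pk = pk
        self._callbacks: List[Callable[[str], None]] = []

    def add_callback(self, callback: Callable[[str], None]) -> None:
        """Register a callback to be invoked when an event arrives."""
        self._callbacks.append(callback)

    def notify(self, event: str) -> None:
        """Forward an event to all registered callbacks."""
        for callback in self._callbacks:
            callback(event)

    def load_full(self, storage: Dict[int, Any]) -> "Process":
        """Build the full process from persisted state (here: a plain dict)."""
        return Process(self.pk, storage[self.pk])


class Process(LightProcess):
    """Hypothetical full process: additionally carries its (large) state."""

    def __init__(self, pk: int, state: Any) -> None:
        super().__init__(pk)
        self.state = state


# Stand-in for a checkpoint store; in this sketch it is just a dictionary.
storage = {1: {"step": 2, "inputs": "large payload"}}

events: List[str] = []
light = LightProcess(pk=1)
light.add_callback(events.append)
light.notify("child_finished")      # cheap: no state is loaded for this
full = light.load_full(storage)     # state is loaded only when needed
print(events, full.state["step"])
```

Whether such a split is feasible in the real engine depends on how much of the waiting logic needs the full state, which this sketch does not address.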
I confirm that the behaviour of #4598 (AiiDA starts having DuplicateSubscribers) happens when too much memory is used, the system starts putting programs in the disk swap, and everything becomes very slow (the whole machine, and then indirectly AiiDA itself starts choking). This is not even an issue of a memory leak (or maybe it's partially related to it), but even just of memory consumption, with each worker quickly getting to >4GB of RAM needed, easily filling my (already powerful) workstation. I think that it would be good at least to understand why so much memory is needed by the daemons, and fix it. |
Since I had to run a bit more, I injected some tracemalloc code at the top of my PwBaseWorkChains.
(This is at the beginning of the run, and after submitting some hundreds of WorkChains.) I then had to stop because my hard disk filled up and I ran out of memory; the computer started swapping and everything got slow... Anyway the output is this:
The first 30 entries already account for almost all memory of this worker (>1.5GB). If I do

tot_stats = after.compare_to(before, 'traceback')
stat = tot_stats[0]
print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
for line in stat.traceback.format():
    print(line)

for the first one, I get:
Can anyone with a bit of knowledge of the engine (@sphuber @unkcpz @muhrin) figure out if there is something obvious where we can spare some memory? Most of the places seems to be code coming from the engine, and not the nodes of AiiDA, if I'm not wrong. |
Out of interest, could you clarify what you did here? What exactly do you mean by "top", where exactly in the code does it go? From an uneducated glimpse, it looks like tornado introduces a significant memory overhead, when wrapping a callback into a future. |
What @chrisjsewell said about the fact that |
Hi both. I'm aware that tornado is gone now; I fear that (even if the code has changed) some of these "leaks" might still be in and might also be an indirect cause of the issues in #4595. I am running in production and since develop still has some issues for me (#4595) I had to continue working with 1.5, and so I started this investigation - I thought it was due to other parts of AiiDA that didn't change; this last report of mine above confirms that indeed it's better to investigate on develop. I have some runs to finish in the next few weeks that will fill my RAM, and also have to free up some disk space before I can do more tests, so I won't be able to test this super soon.

NOTE: this will create a lot of files, possibly tens of GBs, and the memory usage will also increase by 2x or more because of tracemalloc - so maybe better to start conservative, with e.g. 50 work chains only, and make sure there is enough disk space.

What I did was to insert this code at the end of the

# Injecting memory monitoring code
import os
import datetime
import tracemalloc
from aiida.manage.configuration.settings import AIIDA_CONFIG_FOLDER
from aiida import get_profile
if not tracemalloc.is_tracing():
    tracemalloc.start(10)
snapshot = tracemalloc.take_snapshot()
snapshot_fname = f"{AIIDA_CONFIG_FOLDER}/daemon/log/tracemalloc-{get_profile().name}-{os.getpid()}-{datetime.datetime.now().strftime('%Y%m%d-%H%M%S.%f')}.dump"
snapshot.dump(snapshot_fname)

(I'm not sure if the parameter 10 is needed in As you see, is creating a dump file at each run of a new workflow, in the daemon/log folder, with an appropriate - hopefully unique - name. This also includes the PID so one can distinguish different workflows.

To analyze the dumps, I use something like this.
This gives the top 30 differences. If one wants the traceback of e.g. the very first entry, one could do instead:

tot_stats = after.compare_to(before, 'traceback')
stat = tot_stats[0]
print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
for line in stat.traceback.format():
    print(line)
|
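For readers who want to try the dump-and-compare approach described above without a running daemon, the following is a minimal self-contained sketch. It is not the script used in this thread: the file names `before.dump` and `after.dump` and the `payload` allocation are made up for illustration. It only relies on the standard-library calls `Snapshot.dump`, `tracemalloc.Snapshot.load` and `Snapshot.compare_to`.

```python
import os
import tempfile
import tracemalloc

# Keep up to 10 frames per allocation, as in the snippet above.
tracemalloc.start(10)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file names; the daemon snippet uses PID/timestamp-based names.
    before_path = os.path.join(tmpdir, "before.dump")
    after_path = os.path.join(tmpdir, "after.dump")

    tracemalloc.take_snapshot().dump(before_path)
    # Stand-in for the memory that a worker accumulates between two dumps.
    payload = [bytes(1000) for _ in range(1000)]
    tracemalloc.take_snapshot().dump(after_path)

    # Load the dumps back from disk, as one would do in a separate analysis script.
    before = tracemalloc.Snapshot.load(before_path)
    after = tracemalloc.Snapshot.load(after_path)

tracemalloc.stop()

# Group the differences by source line and show the 30 largest ones.
top_stats = after.compare_to(before, "lineno")
for stat in top_stats[:30]:
    print(stat)
```

Replacing `"lineno"` with `"traceback"` gives entries whose `traceback.format()` can be printed as in the snippet above.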
We confirm the issue of excessive memory use by the engine in AiiDA v1.5.2 (not yet on
That is three daemon workers together using ~74% of 32GB, i.e. 24GB of RAM! I attach the output of In this state AiiDA is essentially unusable for our high-throughput work.
|
Some short answers:
|
Please ignore my comment below if I misunderstand this issue. Probably this issue hit me. I ran a supercell phonon (phono3py) calculation with 780 supercells using aiida-vasp and aiida-phonopy. Submission of the supercell vasp calculations is done at this point, https://github.com/aiida-phonopy/aiida-phonopy/blob/13efc549647f595b4f76ab455691cd95dc06d05d/aiida_phonopy/workflows/phono3py.py#L165-L198. With the latest develop, a few aiida-vasp calculations could not be retrieved (probably due to lack of memory for postgresql etc.). I reran the same phono3py calculation three times, but all failed similarly. With v1.5.2 (from git tag), memory usage is still high, but all the calculations went well, though maybe I was just lucky. |
I think I might have narrowed down the place where this issue is happening.

How to easily reproduce the issue

By running the script which is modified from

The crux

I roll the version back to The two commits are tightly related since the second depends on the first implement and the In my test, the However, I have no idea how to fix this, manually calling the garbage collector to free the memory in the callback of

Sorry to bother you with this. Happy Christmas! |
Thanks a lot for the investigation @unkcpz ! Looks to me like you are on the right track.
My suspicion would be that after these two commits the runner starts keeping references to objects it actually no longer needs, which prevents the garbage collector from freeing the memory. I would also mention that (without me knowing more details) it is possible that the cause is not in the aiida-core code but in the updated dependencies (plumpy,...) |
Thanks @unkcpz very useful!!! And I agree with Leo's comment. From my memory reports above, it seems that (at least part of) the memory that does not get freed up happens here: My guess is that indeed something (loop? Communicator? Callback?) gets stored, bound to some variable via functools.partial, but does not get released. The next steps would be
Not sure how helpful this is, but maybe it helps in moving one step closer to the solution? |
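The mechanism guessed at above (a callback bound via `functools.partial` and stored somewhere long-lived) can be demonstrated in isolation. This is only a sketch of the general Python behaviour, not of the engine code: `FakeProcess` and `registry` are made-up stand-ins for a process object and for whatever list of subscribers or callbacks might be involved.

```python
import functools
import gc
import weakref


class FakeProcess:
    """Stand-in for a process object; not an AiiDA or plumpy class."""

    def on_broadcast(self, message):
        return message


# Stand-in for a long-lived container of subscribers/callbacks.
registry = []

process = FakeProcess()
process_ref = weakref.ref(process)  # lets us observe when the object is freed

# The partial wraps a bound method, which holds a strong reference to `process`.
registry.append(functools.partial(process.on_broadcast, "done"))

del process
gc.collect()
alive_while_subscribed = process_ref() is not None

# Only once the callback is removed can the object be garbage collected.
registry.clear()
gc.collect()
alive_after_unsubscribe = process_ref() is not None

print(alive_while_subscribed, alive_after_unsubscribe)  # True False
```

So as long as such a partial stays registered, everything reachable from the bound object stays in memory, regardless of explicit `gc.collect()` calls.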
By the way, I report here the next 5 tracebacks from tracemalloc: they happen from different places, but they all go back to the same routine. This is good, in the sense that if we fix that problem, we fix probably all memory leaks.
|
@giovannipizzi @ltalirz Thanks for the inputs! Here is another information that might be related to the issue. It appeared many times during I migrate
|
I did more experiments on this issue and I think the leak comes from both new added The point is memory is leaking not only by the subscribers but from the implement of |
I've now tested (so far only using toy examples) where the memory references are accumulating. P.S. According to my tests, at least the basic fact of this accumulation does not seem to be affected by the presence or absence of the LoopCommunicator (tested on a branch of @unkcpz where the LoopCommunicator was removed from 1.5.2 and the results were unaffected). |
FYI in the benchmark engine tests that same warning is appearing: https://github.com/aiidateam/aiida-core/runs/1677369011?check_suite_focus=true

tests/benchmark/test_engine.py::test_workchain_local[serial-wc-loop]
/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/asyncio/base_events.py:641: RuntimeWarning: coroutine 'create_task.<locals>.run_task' was never awaited
Coroutine created at (most recent call last)
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/nest_asyncio.py", line 132, in _run_once
    handle._run()
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/nest_asyncio.py", line 201, in run
    ctx.run(self._callback, *self._args)
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/nest_asyncio.py", line 159, in step
    step_orig(task, exc)
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/asyncio/tasks.py", line 280, in __step
    result = coro.send(None)
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/kiwipy/rmq/communicator.py", line 217, in _on_broadcast
    await receiver(
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/kiwipy/rmq/utils.py", line 118, in wrap
    return coro_or_fn(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/plumpy/communications.py", line 61, in converted
    task_future = futures.create_task(msg_fn, loop)
  File "/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/site-packages/plumpy/futures.py", line 75, in create_task
    asyncio.run_coroutine_threadsafe(run_task(), loop)
  self._ready.clear()
|
Hi @ltalirz , nice work. I'd be tempted to try and pare down the problem further now, given that you're seeing that a complete round trip launch via the runner is clearly now cleaning up after itself. For example, does running a workfunction from a simple script leave references? One of the things I'd be hunting for is an object that is hanging around that shouldn't be (for example the If you find such a beast in the wild (don't be scared to just hack the runner to intercept the process instance after it has finished for example) then I've found the (the blog entries for this library are also quite interesting and informative e.g. https://mg.pov.lt/blog/hunting-python-memleaks.html) |
Ok so over the past few days I have created: https://github.com/aiidateam/aiida-integration-tests. In investigating this issue, here is what I did:

First spin up the docker network:

$ docker-compose up --build -d

Then in a separate terminal you can get a live reporting of the stats:

$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
2940cef2c10d aiida-int-core 0.00% 22.86MiB / 1.942GiB 1.15% 1.1kB / 412B 17.6MB / 9.26MB 13
e8b56bc0f9c7 aiida-int-slurm 7.69% 20.46MiB / 1.942GiB 1.03% 1.06kB / 0B 22.3MB / 4.1kB 20
90e2543dcd25 aiida-int-rmq 4.25% 103.5MiB / 1.942GiB 5.21% 1.06kB / 0B 43.5MB / 590kB 92
9313daf2fbf9 aiida-int-postgres 0.10% 49.57MiB / 1.942GiB 2.49% 1.15kB / 0B 42.1MB / 49.7MB 7

After configuring an AiiDA profile, connected to the other three containers:

$ docker exec -it aiida-int-core /bin/bash
root@2940cef2c10d:~# aiida_config/run_all.sh

If you start a daemon, this is the baseline memory usage:

root@2940cef2c10d:~# verdi daemon start 1

If you start an ipython shell, this is the baseline memory usage:

root@2940cef2c10d:~# verdi daemon stop
root@2940cef2c10d:~# verdi shell

Now here is the important part: If you run a calculation in ipython (which I think should mirror what is happening in a daemon?) you can see that the memory usage increases and does not release:

In [1]: import gc
...: from aiida_sleep.cli import run_calc
...: from aiida_sleep.sleep_job import SleepCalculation
...: from pympler import muppy, refbrowser
In [2]: run_calc(payload=int(1e6), output=int(1e6))
Out[2]: <CalcJobNode: uuid: 4b1a2c94-e93d-427a-9ec4-380566eae038 (pk: 5) (aiida.calculations:sleep)>

Using pympler to analyze the memory, you can find that the Process (and its connected nodes) are still in memory.

In [3]: gc.collect()
...: all_objects = muppy.get_objects()
...: calcs = [o for o in all_objects if hasattr(o, "__class__") and isinstance(o, SleepCalculation)]
...: calcs
Out[3]: [<aiida_sleep.sleep_job.SleepCalculation at 0x7fe0ab3169d0>]
In [4]: cb = refbrowser.ConsoleBrowser(calcs[0], maxdepth=3)
...: cb.print_tree()
aiida_sleep.sleep_job.SleepCalculation-+-list--dict-+-list
| +-dict
| +-dict
| +-dict
| +-dict
| +-dict
| +-IPython.core.interactiveshell.DummyMod
| +-dict
| +-dict
|
+-list-+-dict-+-list
| | +-dict
| | +-dict
| | +-dict
| | +-dict
| | +-dict
| | +-IPython.core.interactiveshell.DummyMod
| | +-dict
| | +-dict
| |
| +-dict-+-list
| | +-dict
| |
| +-dict-+-list
| | +-IPython.terminal.prompts.RichPromptDisplayHook
| |
| +-dict-+-list
| | +-IPython.core.interactiveshell.ExecutionResult
| |
| +-dict-+-list
| +-dict
| +-dict
| +-dict
|
|
+-dict-+-list--dict
| +-dict-+-list
| +-aiida.plugins.utils.PluginVersionProvider
|
|
+-list-+-list--dict
| +-hamt_bitmap_node-+-list
| +-hamt
|
|
+-cell-+-list--dict
| +-tuple-+-list
| +-function (kill_process)
|
|
+-dict-+-list--dict
| +-plumpy.process_states.Finished-+-list
| +-dict
|
|
+-cell-+-list--dict
+-tuple-+-list
| +-function (<lambda>)
|
+-tuple-+-list
| +-function (<lambda>)
|
+-tuple-+-list
+-function (<lambda>)

Related and somewhat troubling is that, if you run an ipython cell, then press CTRL-C before it finishes, you get:

In [5]: import time; time.sleep(60)
^C01/26/2021 03:42:51 PM <243> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 5

So it appears that the process is not correctly finalising!?

(final note, if you exit the verdi shell, the memory is released)

This may or may not also be related to #3776 (comment) |
@chrisjsewell Thanks a lot for checking! The aiida-core/aiida/engine/runners.py Lines 227 to 239 in 02248cf
There are at least in principle two ways one could fix this:
I think @muhrin may know the answer to this; he pointed to a similar suspicious case (I believe it was a different one, but I'm not sure). The memory reported to be held by |
Opened an issue on plumpy aiidateam/plumpy#197 |
Yes, I think you're probably right. Ok, how 'bout just doing what you say and putting a check in the |
👍 opened aiidateam/plumpy#198 |
Ok I've found the next memory leak, and this is a big one. For the slurm server code only (I assume this may extend to all ssh transport codes), but not for a local code:
This referencing object is

<TimerHandle when=721531.841911332 TransportQueue.request_transport.<locals>.do_open() at /root/aiida-core/aiida/engine/transports.py:84>

So it appears that the transport is not being closed properly 😬

As an example, submitting with a local code:
and after it has ended:
Whereas submitting with a SLURM code:
|
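To illustrate how a `TimerHandle` like the one reported above can keep an object in memory, here is a small standalone asyncio sketch. It is not the `TransportQueue` code: `FakeProcess` and this `do_open` are made-up stand-ins, and cancelling the handle is shown only as one way to drop the reference, not as the fix that was adopted.

```python
import asyncio
import gc
import weakref


class FakeProcess:
    """Stand-in for a process object; not an AiiDA class."""


loop = asyncio.new_event_loop()
process = FakeProcess()
process_ref = weakref.ref(process)


def do_open(proc=process):
    """Made-up callback that captures the process (here via a default argument)."""
    return proc


# call_later returns a TimerHandle that stores the callback until it runs.
handle = loop.call_later(3600, do_open)

del process, do_open
gc.collect()
# The handle references the callback, which references the process.
alive_while_scheduled = process_ref() is not None

# Cancelling the handle drops its reference to the callback.
handle.cancel()
gc.collect()
alive_after_cancel = process_ref() is not None

loop.close()
print(alive_while_scheduled, alive_after_cancel)  # True False
```

A scheduled callback that is never run and never cancelled therefore pins whatever its closure captured for as long as the loop exists.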
A quick |
indeed 👍 As I have already noted to @ltalirz this does not look to yet release all memory; there still seems to be reference(s) to the input/output nodes, but it is getting there Edit (Leo): We checked together and for running on |
Not exactly sure if the transport is holding on to an instance of the process, or if you are suggesting that the transport should be closed, but I just wanted to point out that we should be careful with the latter. The idea of the |
thanks for the clarification @sphuber, yeh I haven't really looked in any detail at that part of the code yet |
I guess then that the |
Note for this transport queue issue, if you run multiple calculations it is only keeping a reference to the first one you run:

In [1]: import gc
...:
...: from pympler import muppy, refbrowser
...:
...: from aiida_sleep.cli import run_calc
...: from aiida_sleep.sleep_job import SleepCalculation
In [2]: run_calc(code="sleep@slurm")
Out[2]: <CalcJobNode: uuid: ad7338c7-3c7f-49db-9468-d3866a95bdda (pk: 192) (aiida.calculations:sleep)>
In [3]: run_calc(code="sleep@slurm")
Out[3]: <CalcJobNode: uuid: 7485d8cb-ab0e-46c9-9312-5852d92291f1 (pk: 200) (aiida.calculations:sleep)>
In [4]: run_calc(code="sleep@slurm")
Out[4]: <CalcJobNode: uuid: b2c11303-a607-48a7-8929-3376e05bd902 (pk: 208) (aiida.calculations:sleep)>
In [5]: gc.collect()
...: all_objects = muppy.get_objects()
...: calcs = [
...: o
...: for o in all_objects
...: if hasattr(o, "__class__") and isinstance(o, SleepCalculation)
...: ]
In [6]: calcs
Out[6]: [<aiida_sleep.sleep_job.SleepCalculation at 0x7fadfa6ad340>]
In [7]: calcs[0].node.pk
Out[7]: 192

I guess because for subsequent calculations, it checks if the transport already exists and does not have to re-open |
Just to keep track of a curiosity we encountered: it seems like certain references go out of scope from one ipython cell execution to the next. E.g. when using the local code

In [1]: from pympler import muppy, refbrowser, summary
...: import gc
...:
...: from aiida_sleep.cli import run_calc
...: from aiida_sleep.sleep_job import SleepCalculation
...:
...: run_calc(code="sleep@local", payload=int(1e5), output=int(1e5))
Out[1]: <CalcJobNode: uuid: 55ebc26f-0225-489c-a1a0-1de0710dfabb (pk: 194) (aiida.calculations:sleep)>
In [2]: gc.collect()
...: objects_1 = muppy.get_objects()
...: calcs = [o for o in objects_1 if hasattr(o, "__class__") and isinstance(o, SleepCalculation)]
...: calcs
Out[2]: []

However, if one runs the same code inside a single cell (or copies it into a python script and runs the script using

In [1]: from pympler import muppy, refbrowser, summary
...: import gc
...:
...: from aiida_sleep.cli import run_calc
...: from aiida_sleep.sleep_job import SleepCalculation
...:
...: run_calc(code="sleep@local", payload=int(1e5), output=int(1e5))
...:
...: gc.collect()
...: objects_1 = muppy.get_objects()
...:
...: calcs = [o for o in objects_1 if hasattr(o, "__class__") and isinstance(o, SleepCalculation)]
...: calcs
Out[1]: [<aiida_sleep.sleep_job.SleepCalculation at 0x7ff52a2edd90>]

In the second case, there are two extra references from

In [2]: cb = refbrowser.ConsoleBrowser(calcs[0], maxdepth=3)
...: cb.print_tree()
aiida_sleep.sleep_job.SleepCalculation-+-list--dict-+-list
| +-dict
| +-dict
| +-dict
| +-dict
| +-dict
| +-IPython.core.interactiveshell.DummyMod
| +-dict
| +-dict
|
+-list-+-dict-+-list
| | +-dict
| | +-dict
| | +-dict
| |
| +-dict-+-list
| | +-dict
| | +-dict
| | +-dict
| | +-dict
| | +-dict
| | +-IPython.core.interactiveshell.DummyMod
| | +-dict
| | +-dict
| |
| +-dict-+-list
| | +-dict
| |
| +-dict-+-list
| | +-IPython.terminal.prompts.RichPromptDisplayHook
| |
| +-dict-+-list
| +-IPython.core.interactiveshell.ExecutionResult
|
|
+-cell-+-list--dict
| +-tuple-+-list
| +-function (try_killing)
|
|
+-method-+-list--dict
| +-dict-+-list
| | +-function (broadcast_receive)
| |
| +-cell-+-list
| +-tuple
|
|
+-dict-+-list--dict
| +-plumpy.process_states.Finished-+-list
| +-dict
|
|
+-cell-+-list--dict
+-tuple-+-list
| +-function (<lambda>)
|
+-tuple-+-list
| +-function (<lambda>)
|
+-tuple-+-list
+-function (<lambda>)

It looks somehow like certain cpython cell objects (two of which hold a reference to the Calculation) go out of scope in between two ipython cell executions.

P.S. We originally thought the reason for this result may be that after

P.P.S. Aha! The issue was the time delay after all, but

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.sleep(2))

I.e. when working in ipython, we just need to make sure to let a bit of time pass after
|
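The effect described above (references disappearing once the event loop has had time to run) can be reproduced with plain asyncio. The sketch below is an assumption-laden toy, not the engine code: `FakeProcess` and its `cleanup` method are invented, and a fresh loop from `asyncio.new_event_loop()` is used instead of the runner's loop.

```python
import asyncio
import gc
import weakref


class FakeProcess:
    """Stand-in for a process object; not an AiiDA class."""

    def cleanup(self):
        pass


loop = asyncio.new_event_loop()
process = FakeProcess()
process_ref = weakref.ref(process)

# A pending callback holds a bound method, and hence the process itself.
loop.call_soon(process.cleanup)

del process
gc.collect()
# The loop has not run yet, so the queued callback still pins the object.
alive_before_loop_runs = process_ref() is not None

# Give the loop a moment to process its queue, as in the snippet above.
loop.run_until_complete(asyncio.sleep(0.01))
gc.collect()
alive_after_loop_runs = process_ref() is not None

loop.close()
print(alive_before_loop_runs, alive_after_loop_runs)  # True False
```

When checking for leftover objects interactively, it therefore seems safer to let the loop run briefly before calling `gc.collect()` and inspecting the heap.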
Note, I have created an alpha plumpy release (https://pypi.org/project/plumpy/0.18.5a0/) from aiidateam/plumpy#205, which should be the final change required for daemons to automatically garbage collect terminated processes. @giovannipizzi do you want to give this a try (aiida-core develop + plumpy==0.18.5a0) to see if you see any difference in memory for your original workchain runs (#4603 (comment))? |
I launched ~1000 work chains last night; they are running. I guess tomorrow we'll have an answer. Currently (400 work chains running, ~1200 active processes) the memory usage is still 2.5-2.8%/worker (with 8 workers), which totals ~13GB RAM. We'll see if this goes back to zero at the end of all jobs! And then we can investigate if the memory consumption is OK (we have on average ~11 MB/active process I think; I think/hope this can be reduced, but that is for a different issue) |
thanks @giovannipizzi, yes indeed it will not reduce the memory during running, that's the next step 😬 |
Unfortunately I have bad news. After all workflows are completed and there are no more running (e.g. in
Just to show the "correct" baseline for memory consumption I run
(and indeed at the beginning, or if I restart the daemon, all were using only ~0.143% memory). |
I suggest anyway to postpone this to after the release of 1.6.0 since this is not blocking and the issue was already there before. Instead I opened #4745 that occurred during the runs that is IMO more urgent to solve before the release, as it makes some jobs fail |
Just to add another data point: This is
followed by
100% = 32GB, i.e. the daemon workers are using ~210 MB each, which is perfectly acceptable to me. If Giovanni's issues still persist with AiiDA 1.6.0, that suggests to me that his issue is either related to
[1] I.e. the 3rd daemon worker still got the chance to pick up a handful of calculations. This makes for a fairer comparison in my view, since some memory increase is to be expected due to things like python imports. Edit: Update after having run ~250-300k calculations with 2-3 workers (added one a bit later), i.e. roughly 100k per worker
Again, 100%=32GB, i.e. 0.869%=278MB, so this worker accumulated ~170MB of extra memory over the course of running these 100k processes |
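The percentage-to-megabyte conversion used above can be written out explicitly. This is just the arithmetic, under the assumption (mine, inferred from the quoted 278MB) that "32GB" is taken as 32,000 MB; `percent_to_mb` is a made-up helper name.

```python
# Assumption: "32GB" is read as 32,000 MB, which reproduces the 278MB quoted above.
TOTAL_RAM_MB = 32_000


def percent_to_mb(percent: float, total_mb: float = TOTAL_RAM_MB) -> float:
    """Convert a memory percentage (as reported per worker) to megabytes."""
    return percent / 100 * total_mb


# 0.869% of 32GB, the figure reported for one worker after ~100k processes.
print(round(percent_to_mb(0.869)))  # 278
```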
Hi all, At the beginning the memory usage of each of the 8 workers was 0.146%. After running 4 processes (some of which triggered the exponential-backoff mechanism), the memory usage went up a bit, to 0.22-0.25%. I then ran 400 WorkChains concurrently (resulting in ~1200 processes): at the end the memory was only 0.33% (and I think it never went much higher). So, for at least my original use case, I think that this issue is fixed and can be closed (thanks! :-) ) - @chrisjsewell feel free to close unless other people have more issues. |
wahoo 🎉 So yes I will close this then, but feel free to re-open if anyone else still encounters the issue |
That's great news, thanks for all the hard work! |
Also to further confirm, also usage while running is quite nice/low, now - I've 8 daemon workers, with slots increased from the default to 500 per worker, and with ~3600 running processes (from |
I have been running ~2000 (QE Relax) WorkChains with AiiDA 1.5.0.
Now, everything is finished, verdi process list is empty, and I have zero messages also in RabbitMQ (as a double check, see below).
However, my 8 workers are still using a lot of memory:
Note that this means ~1.8GB RAM/worker, so a total of ~15GB used!
I initially reported this already in #4598 but I thought it was due to the overload described there.
Instead, this time everything went smoothly with no excepted jobs.
Therefore, I am assuming that this is a memory leak, with some resources not properly released.
Considering the size of the data, this is similar to the size of the corresponding file repository. Indeed, ArrayData should still have some 'caching' of the arrays in memory, so maybe an ArrayData node might keep in memory all arrays? Maybe this is the cause?
We might want to remove that caching, but still I think the daemon should explicitly delete or remove nodes from memory once they are not used, I believe. Maybe they remain because they stay in some DB session? (This is Django)
It would be good if someone could investigate this (maybe the next task for @chrisjsewell, but also feedback from @muhrin @sphuber @unkcpz is appreciated)
For completeness: as expected, if I stop the daemon and then restart it with verdi daemon start 8, I get a low memory usage:
This is to show that there are no more messages (also before restarting the daemon)