Identifying a spike in memory usage #12
The memory usage will be on the higher end given the large number of insertions that will be present in wild-derived strains vs mm10, but the usage should be consistent. There are a few places where random sampling is used, which could introduce some run-to-run variation.
I've tracked the memory usage with mprof (https://pypi.org/project/memory_profiler/) for three samples that showed similar memory spikes, and I couldn't reproduce the excess memory usage in a controlled environment. So it seems to me that this is a cluster issue rather than a software issue. Thanks for putting in a new option for further controlling the memory usage.
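For reference, this kind of sampling can also be done from Python rather than the mprof command line; a minimal sketch (the tebreak arguments below are placeholders, not my exact invocation):

```python
import subprocess
from memory_profiler import memory_usage  # pip install memory_profiler

# Placeholder command line, not the exact invocation from this thread.
cmd = ["tebreak", "-b", "sample.bam", "-r", "mm10.fa"]
proc = subprocess.Popen(cmd)

# Sample RSS every 5 seconds until the process exits, including any
# child processes spawned via multiprocessing.
samples = memory_usage(proc, interval=5, include_children=True)

print("peak RSS: %.1f MiB" % max(samples))
```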
Sorry for reopening this issue, but I ran tebreak on a second cluster and still a third of all jobs die because of spikes in memory usage. I have researched a bit further and found that Python's multiprocessing package seems to be prone to memory leaks. Maybe the attached tebreak log helps pin down the underlying problem?
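One generic mitigation I've seen suggested for leaks inside pool workers is to recycle each worker after a fixed number of tasks; a minimal sketch (plain Python, not a patch to tebreak itself):

```python
import multiprocessing as mp

def process_chunk(chunk):
    # Stand-in for per-chunk work; the real leak would accumulate in here.
    return len(chunk)

if __name__ == "__main__":
    chunks = [list(range(1000)) for _ in range(1000)]

    # maxtasksperchild=10 retires each worker after 10 chunks and forks a
    # fresh one, so anything a worker fails to free goes back to the OS.
    pool = mp.Pool(processes=10, maxtasksperchild=10)
    results = pool.map(process_chunk, chunks)
    pool.close()
    pool.join()

    print("processed %d chunks" % len(results))
```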
Hi Adam,

Have you had a chance to look into this issue in a bit more detail?
Hi, sorry for taking a while getting back to this - I haven't seen memory issues on my end as long as I allow 4 GB of RAM for each process. Were you using a similar amount of memory per process?
Currently I am down to assigning 80 GB of memory to 10 processes, so this shouldn't be an issue. I read that this might happen due to faulty garbage collection in child processes, which could lead to an accumulation of allocated but unused memory. I have forked tebreak and have been playing around with controlling the child processes, so far with only limited effect. I can prevent some memory leaks, but a good number of jobs still die. What I have done so far:
The last two points seem to reduce the number of dying jobs a little. One last comment: this issue became really severe after we decided to align our CAST and CAROLI samples to mm10 in order to make inter-strain comparisons easier, so each tebreak job now has to cycle through up to 100k genome chunks. Is this information of any use to you?
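For illustration, the kind of child-process control I've been experimenting with looks roughly like the sketch below (the names and the 8 GiB cap are placeholders, not actual tebreak code):

```python
import gc
import multiprocessing as mp
import resource

PER_WORKER_CAP = 8 * 1024 ** 3  # assumed 8 GiB budget per worker process

def init_worker():
    # Hard-cap the worker's address space: a leaking process gets a
    # MemoryError instead of pushing the whole job over the cluster limit.
    resource.setrlimit(resource.RLIMIT_AS, (PER_WORKER_CAP, PER_WORKER_CAP))

def process_chunk(chunk):
    try:
        return sum(chunk)  # stand-in for the real per-chunk work
    finally:
        gc.collect()       # force a collection pass before the next chunk

if __name__ == "__main__":
    chunks = [range(1000)] * 100
    pool = mp.Pool(processes=4, initializer=init_worker)
    print(sum(pool.map(process_chunk, chunks)))
    pool.close()
    pool.join()
```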
Hi Adam,

Sorry to bother you again. Do you have any ideas for avoiding the memory leaks?
It's a bit hard for me to debug as I haven't run into an example that clearly causes a memory leak in my hands (I ran the Sanger Mouse Genomes Project M. m. castaneus genome aligned to mm10 a while back and I don't recall memory issues - could try again I suppose). Maybe debugging via the gc module (https://docs.python.org/2/library/gc.html) would be worth playing around with if you haven't already. You could try adding a call to gc.collect() at a few strategic points and watching what accumulates.
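A minimal sketch of the kind of gc instrumentation I mean (illustrative only, not code that's already in tebreak):

```python
import gc

gc.set_debug(gc.DEBUG_STATS)   # log every collection pass to stderr

def do_chunk(i):
    return [i] * 1000          # stand-in workload

for i in range(5):
    do_chunk(i)
    gc.collect()
    # Anything the collector found but could not free lands in gc.garbage;
    # a steadily growing count here points at a real leak.
    print("gen counts %s, uncollectable %d" % (gc.get_count(), len(gc.garbage)))
```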
I have had a go at the gc module.

About the chunks: I honestly don't know. The log files grew so large that I stopped printing them, as I thought this might speed up the process. I will rerun them tonight and can report tomorrow on both issues.
Sorry for the delayed answer.
The last lines before crashing are:
This sample is also a normal CAROLI sample aligned to mm10.
I had a closer look at the log files.
So, I have tried a couple more things. Re-installing everything: no change in the memory leaks. Running multiple samples in one tebreak batch: now tebreak cannot even load the reference fasta into shared memory before the memory usage explodes to over 100 GB. This leads me to believe that this might be a Python or cluster issue rather than a tebreak issue?
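For what it's worth, my understanding of the intended pattern is roughly the following (a generic sketch, not tebreak's actual loading code): read the reference once in the parent so that forked workers share it copy-on-write.

```python
import multiprocessing as mp

REFERENCE = {}  # filled once in the parent, only read in the workers

def load_fasta(path):
    # Illustrative FASTA reader, not a library call.
    seqs, name = {}, None
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line.strip())
    return {k: "".join(v) for k, v in seqs.items()}

def chrom_length(chrom):
    # Workers only read REFERENCE; on Linux fork the pages stay shared
    # copy-on-write instead of being duplicated per process.
    return chrom, len(REFERENCE[chrom])

if __name__ == "__main__":
    REFERENCE = load_fasta("mm10.fa")  # placeholder path
    pool = mp.Pool(processes=4)
    for chrom, length in pool.map(chrom_length, sorted(REFERENCE)):
        print("%s\t%d" % (chrom, length))
    pool.close()
    pool.join()
```

One caveat I've read about: CPython's reference counting writes to shared pages, so the copy-on-write savings can erode over a long run, which might be part of what I'm seeing.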
I had another go at this by reverting to the earlier approach. Upon looking at the code, I see there are computational differences, but are there conceptual differences between the two approaches?
Hi,

Have you tried running tebreak without it?

--Adam
Alright, I will give this approach a shot. Thanks! Does it make sense to additionally use reduced BAM files?
Not unless you need to save disk space - that's the motivation behind the reduced BAMs.
I have tried running tebreak in full mode with the elaborate sample-specific mask and your proposed command-line options, but now all jobs die. Even the ones that used to work before are getting terminated by the cluster for consuming absurd amounts of memory.
I played a bit with other command-line options and there seems to be something there to circumvent the memory spikes.

@adamewing Is there a way to feed in the …?
Now I am creating a sample-specific mask file and running tebreak with these options:
And for resolve:
These options seem rather stringent to me. What would your recommendation for the options be, @adamewing? Especially in light of your recent paper (Schauer et al.)?
Hi, sorry you're still having this problem. Are you able to provide any details about your cluster environment? E.g. which scheduler is used, what the memory vs. CPUs per node is, etc. The memory issues you described, with this being the (at least partial) resolution, seem bizarre.

Those are probably rather stringent options; whether that's appropriate depends on your expectations - for germline insertions that you expect to be heterozygous or homozygous, with reasonably deep coverage, these might be OK. For cancer samples, and where you're willing and able to do validation experiments, I'd drop the stringency a bit (e.g. as in the paper you mentioned).
I'm working on an LSF cluster.

I'm not able to see the memory load per CPU during a job, but I can see the maximum RSS and swap per job. I have attached a screenshot of a successful tebreak run and a failed one.
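In case it helps, I can also log each process's own high-water mark from inside Python (standard library only):

```python
import resource

def report_peak_rss(label):
    # ru_maxrss is reported in KiB on Linux (bytes on macOS).
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("%s peak RSS: %.1f MiB" % (label, peak_kb / 1024.0))

report_peak_rss("this process")
```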
I am having this issue as well. All of my jobs exceed memory and die with 80 GB for 10 processes, for both 30X and 80X normal/tumour WGS samples. I am running the script on reduced BAMs with the hg19 reference and 100,000 chunks, providing disc_target, masking centromeres and telomeres, and using max_ins_reads 1000.
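For completeness, this is roughly how such a centromere/telomere mask can be built (a sketch assuming the UCSC hg19 gap table layout of bin, chrom, chromStart, chromEnd, ix, n, size, type, bridge; my actual file may differ):

```python
import gzip

KEEP = {"centromere", "telomere"}

# gap.txt.gz as downloaded from the UCSC goldenPath hg19 database dumps.
with gzip.open("gap.txt.gz", "rt") as fh, open("mask.bed", "w") as out:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, gap_type = fields[1], fields[2], fields[3], fields[7]
        if gap_type in KEEP:
            out.write("%s\t%s\t%s\t%s\n" % (chrom, start, end, gap_type))
```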
Hi! |
Hi,
I'm running tebreak on 40x WGS mouse samples from multiple strains (CAST, CAROLI, BL6). Recently I switched from strain-specific reference genomes to the mm10 reference genome in order to allow easier comparisons between strains; also, more information and feature tracks are available for mm10.
Since the switch, tebreak needs far more RAM (>60 or even >80 GB), leading to the cluster killing my jobs. I needed to limit the number of cores to 4 instead of 12, resulting in an unacceptable extension of the runtime from hours to multiple days per job. The weird thing is that the average memory usage apparently doesn't change much and is quite low; somewhere the memory usage of tebreak spikes.
What makes this more complicated is that sometimes when I run the same job twice, the needed resources change dramatically. Like in this example:
This is the code I use to run tebreak and resolve right after one another:
I will try to profile the memory usage for a sample to get a clearer picture of where exactly it is spiking.
If you all can offer any help or have some ideas, that would be appreciated!
Thank you!