
Identifying a spike in memory usage #12

Open
Martingales opened this issue May 9, 2018 · 23 comments

@Martingales

Hi,

I'm running tebreak on 40x WGS mouse samples from multiple strains (CAST, CAROLI, BL6). Recently I switched from strain-specific reference genomes to the mm10 reference genome to allow easier comparisons between strains; mm10 also has more annotation and feature tracks available.

Since the switch, tebreak needs far more RAM (>60 or even >80 GB), which leads to the cluster killing my jobs. I had to limit the number of cores to 4 instead of 12, which stretched the runtime from hours to multiple days per job, which is unacceptable. The odd thing is that the average memory usage barely changes and stays quite low; somewhere tebreak's memory usage spikes.

What makes this more complicated is that the required resources sometimes change dramatically when I run the same job twice, as in this example:

Run 1:
Max Memory : 56962 MB
Average Memory : 4866.31 MB
Total Requested Memory : 50000.00 MB
Max Swap : 66314 MB

Run 2:
Max Memory : 13446 MB
Average Memory : 3554.79 MB
Total Requested Memory : 80000.00 MB
Max Swap : 56173 MB

This is the code I use to run tebreak and resolve, one right after the other:

$PYTHONVERSION $TB/tebreak/tebreak.py \
        --disco_target $REF_TES \
        --bam $BAM \
        --bwaref $REF \
        --processes $n \
        --mask $REF_MASK \
        --map_tabix $UMAPK100 \
        --min_mappability $MINMAP \
        --pickle $OUTPUT.1.pickle \
        --detail_out $OUTPUT.1.tb.detail \
        --tmpdir $TMPDIRTB \
        --debug > $OUTPUT.1.out 2>&1

$PYTHONVERSION $TB/tebreak/resolve.py \
    -p $OUTPUT.1.pickle \
    -t $n \
    -i $REPBASE \
    -v \
    --detail_out $OUTPUT.2.rs.detail \
    --refoutdir $TMPDIRRF \
    --tmpdir $TMPDIRRS \
    --callmuts > $OUTPUT.2.out 2>&1

I will try to profile the memory usage for one sample to get a clearer picture of where exactly it spikes.

If you all can offer any help or have some ideas, that would be appreciated!

Thank you!

@adamewing
Owner

The memory usage will be on the higher end given the large number of insertions present in wild-derived strains versus mm10, but it should be consistent. There are a few places where random sampling is used in tebreak.py, which could lead to differences in memory usage between runs. I've added random seeding in commit f08a2d4, so if you pull the latest you should hopefully see stable memory usage between runs.
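
Not tebreak's actual code, but a minimal sketch of why unseeded sampling can change the memory footprint between runs and how a fixed seed makes it repeatable:

    import random

    reads = ["read_%d" % i for i in range(1000000)]  # stand-in for a large read set

    # Without a fixed seed, each run draws a different subsample, so the amount of
    # downstream work (and memory) can differ; seeding once makes it reproducible.
    random.seed(1234)
    subsample = random.sample(reads, 10000)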

@Martingales
Author

I've tracked the memory usage with mprof (https://pypi.org/project/memory_profiler/) for three samples that had shown similar memory spikes, and I couldn't reproduce the excess memory usage in a controlled environment. So it seems to me that this is a cluster issue rather than a software issue.

Thanks for putting in a new option for further controlling the memory usage.
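
For reference, a minimal sketch of the same kind of sampling done from Python with the memory_profiler package (the mprof command does this externally); process_chunk here is a hypothetical stand-in for whatever function is being profiled:

    from memory_profiler import memory_usage

    def process_chunk(n):
        data = list(range(n))  # allocate something measurable
        return sum(data)

    # Sample RSS every 0.1 s while the call runs and report the peak in MiB.
    samples = memory_usage((process_chunk, (5000000,), {}), interval=0.1)
    print("peak RSS during call: %.1f MiB" % max(samples))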

@Martingales
Author

Sorry for reopening this issue, but I ran tebreak on a second cluster and a third of all jobs still die because of spikes in memory usage.

I have done some further research and found that Python's multiprocessing package seems to be prone to memory leaks. Maybe the attached tebreak log helps pin down the underlying problem?

tebreak_error.log

@Martingales Martingales reopened this Jul 8, 2018
@Martingales
Author

Hi Adam,

Have you had a chance to look into this issue in more detail?

@adamewing
Owner

Hi, sorry for taking a while to get back to this - I haven't seen memory issues on my end as long as I allow 4 GB of RAM for each process. Were you using a similar amount of memory per process?

@Martingales
Author

Currently I am down to assigning 80 GB of memory to 10 processes, so that shouldn't be the issue.

I read that this might happen due to faulty garbage collection in child processes, which can lead to an accumulation of unused but still-allocated memory. I have forked tebreak and am playing around with how the child processes are controlled, so far with only limited effect: I can prevent some memory leaks, but a good number of jobs still die.

What I have done so far:

  1. Started the pool of workers before loading the reference genome, hoping to avoid internal copying.
  2. Tried to close and join the pool after apply_async (the code didn't run).
  3. Assigned a maximum number of tasks per child:
    pool = mp.Pool(processes=procs, maxtasksperchild=1)
  4. Manually set the number of genome chunks higher (10 times the number of processes).

The last two points seem to reduce the number of dying jobs a little (a sketch of that combination is below).
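
A rough sketch of points 3 and 4 combined, not tebreak's actual worker code: maxtasksperchild=1 recycles each worker process after a single task, returning its memory to the OS, and submitting many more chunks than processes keeps each task's peak allocation small.

    import multiprocessing as mp

    def run_chunk(chunk_id):  # schematic stand-in for tebreak's per-chunk work
        data = [chunk_id] * 100000
        return len(data)

    if __name__ == "__main__":
        procs = 4
        chunk_ids = range(procs * 10)  # roughly 10x more chunks than workers
        pool = mp.Pool(processes=procs, maxtasksperchild=1)  # recycle worker after each task
        results = [pool.apply_async(run_chunk, (c,)) for c in chunk_ids]
        pool.close()
        pool.join()
        print(sum(r.get() for r in results))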

One last comment: this issue became really severe after we decided to align our CAST and CAROLI samples to mm10 to make inter-strain comparisons easier. Currently, each tebreak job has to cycle through up to 100k genome chunks.

Is any of this information useful to you?

@Martingales
Author

Hi Adam,

Sorry to bother you again. Do you have any ideas for avoiding the memory leaks?

@adamewing
Owner

It's a bit hard for me to debug as I haven't run into an example that clearly causes a memory leak in my hands (I ran the Sanger Mouse Genomes Project M. m. castaneus genome aligned to mm10 a while back and don't recall memory issues - I could try again, I suppose). Maybe debugging via the gc module (https://docs.python.org/2/library/gc.html) would be worth playing around with if you haven't already. You could try adding a gc.collect() at the end of the run_chunk function. Do you have a chunk of mm10 that's specifically blowing up on the castaneus genome, or does it only happen when you run the whole thing?
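
A minimal sketch of that suggestion, using a schematic worker rather than tebreak's real run_chunk; the try/finally ensures a collection pass runs even if the chunk fails:

    import gc

    def run_chunk(chunk):  # schematic stand-in, not tebreak's actual function
        try:
            intermediate = {i: [i] * 100 for i in range(len(chunk))}  # per-chunk allocations
            return len(intermediate)
        finally:
            gc.collect()  # free cyclic garbage before the worker takes the next task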

@Martingales
Author

I have had a go at gc.collect(), but not within the function. I'll try that right now.

About the chunks: I honestly don't know. The log files grew so large that I stopped writing them, thinking this might speed up the process. I will rerun tonight and report back tomorrow on both issues.

@Martingales
Author

Sorry for the delayed answer.

gc.collect() made no difference.
I added --skipbig and --debug to the run call, but again there was no change in failure rates. I have attached one of the debug files, but I cannot see anything obvious in it: do13208_16_38603.1.txt

The last lines before crashing are:

2018-09-10 11:23:18,307 parsed 0 reads, last position: JH584300.1:1378
2018-09-10 11:23:18,329 JH584300.1:0-182347: found 0 anchored reads
2018-09-10 11:23:21,446 2:0-182113224: found 2546159 anchored reads
2018-09-10 11:23:22,477 14:0-124902244: found 1656818 anchored reads
2018-09-10 11:23:23,671 17:0-94987271: found 1496889 anchored reads
2018-09-10 11:23:26,231 parsed 5028299 reads, last position: 18:29599439
2018-09-10 11:23:29,026 parsed 15084897 reads, last position: 16:75669381
2018-09-10 11:23:43,253 building interval trees for /PATH/TO/FILES/source/tebreak/lib/mm10.te.disctgt.nochr.txt
2018-09-10 11:23:46,107 16:0-98207768: found 1422162 anchored reads
2018-09-10 11:23:51,530 parsed 10056598 reads, last position: 18:57646204
2018-09-10 11:23:52,036 CHR_MG3561_PATCH:0-156464115: fetching coordinates from /PATH/TO/FILES/bams_reduced/bams_controls_normals/do13208_16_38603.dist10k.minclip10.reduced.bam
2018-09-10 11:23:52,036 outputting status every 5028299 reads (1 pct)
2018-09-10 11:23:52,052 parsed 0 reads, last position: CHR_MG3561_PATCH:60250689
2018-09-10 11:23:52,123 CHR_MG3561_PATCH:0-156464115: found 6 anchored reads
2018-09-10 11:24:04,601 building interval trees for /PATH/TO/FILES/source/tebreak/lib/mm10.te.disctgt.nochr.txt
2018-09-10 11:24:04,992 building interval trees for /PATH/TO/FILES/source/tebreak/lib/mm10.te.disctgt.nochr.txt
2018-09-10 11:24:05,358 building interval trees for /PATH/TO/FILES/source/tebreak/lib/mm10.te.disctgt.nochr.txt
2018-09-10 11:24:06,421 CHR_MG4213_PATCH:0-91736668: fetching coordinates from /PATH/TO/FILES/bams_reduced/bams_controls_normals/do13208_16_38603.dist10k.minclip10.reduced.bam
2018-09-10 11:24:06,421 outputting status every 5028299 reads (1 pct)
2018-09-10 11:24:06,429 parsed 0 reads, last position: CHR_MG4213_PATCH:49220474
2018-09-10 11:24:06,461 CHR_MG4213_PATCH:0-91736668: found 0 anchored reads
2018-09-10 11:24:06,547 CHR_MG3836_PATCH:0-46591103: fetching coordinates from /PATH/TO/FILES/bams_reduced/bams_controls_normals/do13208_16_38603.dist10k.minclip10.reduced.bam
2018-09-10 11:24:06,547 outputting status every 5028299 reads (1 pct)
2018-09-10 11:24:06,547 CHR_MG3836_PATCH:0-46591103: found 0 anchored reads
2018-09-10 11:24:06,892 CHR_MG3836_PATCH:46581103-182124663: fetching coordinates from /PATH/TO/FILES/bams_reduced/bams_controls_normals/do13208_16_38603.dist10k.minclip10.reduced.bam
2018-09-10 11:24:06,892 outputting status every 5028299 reads (1 pct)
2018-09-10 11:24:06,893 parsed 0 reads, last position: CHR_MG3836_PATCH:122566301
2018-09-10 11:24:06,955 CHR_MG3836_PATCH:46581103-182124663: found 13 anchored reads
2018-09-10 11:24:11,144 parsed 15084897 reads, last position: 18:84981680
2018-09-10 11:24:15,795 18:0-90702639: found 1328244 anchored reads
2018-09-10 11:24:27,641 building interval trees for /PATH/TO/FILES/source/tebreak/lib/mm10.te.disctgt.nochr.txt
2018-09-10 11:24:29,163 CHR_MG4211_PATCH:0-91797447: fetching coordinates from /PATH/TO/FILES/bams_reduced/bams_controls_normals/do13208_16_38603.dist10k.minclip10.reduced.bam
2018-09-10 11:24:29,163 outputting status every 5028299 reads (1 pct)
2018-09-10 11:24:29,185 parsed 0 reads, last position: CHR_MG4211_PATCH:20710999
2018-09-10 11:24:29,218 CHR_MG4211_PATCH:0-91797447: found 0 anchored reads
2018-09-10 11:24:46,526 building interval trees for /PATH/TO/FILES/source/tebreak/lib/mm10.te.disctgt.nochr.txt
Process PoolWorker-105:
Process PoolWorker-93:
Process PoolWorker-99:
Process PoolWorker-108:
Process PoolWorker-109:
Process PoolWorker-92:
Process PoolWorker-100:
Process PoolWorker-98:
Process PoolWorker-106:
Process PoolWorker-104:
Process PoolWorker-107:
Process PoolWorker-103:
Traceback (most recent call last):
...

This sample is also a normal CAROLI sample aligned to mm10.

@Martingales
Author

I had a closer look at the log files with --debug on.
Most of the samples seem to abort in a similar genomic region. The following are the last lines before Python multiprocessing throws an error, probably because the LSF job manager kills the job.

Caroli
Sample1
2018-09-10 11:24:18,216 parsed 0 reads, last position: CHR_MG171_PATCH:75358687
2018-09-10 11:24:18,603 CHR_MG171_PATCH:0-151834685: found 1 anchored reads
Sample2
2018-09-10 11:24:29,185 parsed 0 reads, last position: CHR_MG4211_PATCH:20710999
2018-09-10 11:24:29,218 CHR_MG4211_PATCH:0-91797447: found 0 anchored reads
2018-09-10 11:24:46,526 building interval trees for /PATHS/tebreak/lib/mm10.te.disctgt.nochr.txt
Sample3
2018-09-10 11:24:05,883 parsed 0 reads, last position: CHR_MG4211_PATCH:66020767
2018-09-10 11:24:05,932 CHR_MG4211_PATCH:0-91797447: found 0 anchored reads
2018-09-10 11:24:21,904 building interval trees for /PATHS/tebreak/lib/mm10.te.disctgt.nochr.txt
Sample4
2018-09-10 11:24:18,202 outputting status every 5228639 reads (1 pct)
2018-09-10 11:24:18,216 parsed 0 reads, last position: CHR_MG171_PATCH:75358687
2018-09-10 11:24:18,603 CHR_MG171_PATCH:0-151834685: found 1 anchored reads
Sample5
2018-09-10 11:25:59,286 1:112552968-195471971: fetching coordinates from /PATHS/bams_reduced/bams_controls
2018-09-10 11:25:59,287 outputting status every 4809817 reads (1 pct)
2018-09-10 11:25:59,333 parsed 0 reads, last position: 1:112552994
Sample6
2018-09-10 11:24:31,090 parsed 0 reads, last position: CHR_MG171_PATCH:9767691
2018-09-10 11:24:31,419 CHR_MG171_PATCH:0-151834685: found 1 anchored reads
2018-09-10 11:24:49,539 building interval trees for /PATHS/tebreak/lib/mm10.te.disctgt.nochr.txt
Sample7
2018-09-10 11:24:23,973 parsed 0 reads, last position: GL456368.1:3
2018-09-10 11:24:24,023 GL456368.1:0-20208: found 77 anchored reads
2018-09-10 11:24:47,395 building interval trees for /PATHS/tebreak/lib/mm10.te.disctgt.nochr.txt
2018-09-10 11:24:47,902 building interval trees for /PATHS/tebreak/lib/mm10.te.disctgt.nochr.txt
Sample8
2018-09-10 11:23:50,504 outputting status every 4213888 reads (1 pct)
2018-09-10 11:23:50,526 parsed 0 reads, last position: CHR_MG3835_PATCH:11238894
2018-09-10 11:23:50,740 CHR_MG3835_PATCH:0-90835696: found 2070 anchored reads
Sample9
2018-09-10 11:24:25,649 outputting status every 4928998 reads (1 pct)
2018-09-10 11:24:25,650 parsed 0 reads, last position: CHR_MG4254_PATCH:29001049
2018-09-10 11:24:25,681 CHR_MG4254_PATCH:0-52220912: found 1 anchored reads

C3HHeJ
Sample1
2018-09-10 11:23:22,128 Chunk 1:3042620-3044579: masked 0 reads due to -m/--mask
2018-09-10 11:23:22,129 Chunk 1:3042620-3044579: Building clusters from 845 split reads ...
2018-09-10 11:23:22,222 Chunk 1:3042620-3044579: Building breakends...
2018-09-10 11:23:41,111 Chunk 1:3042620-3044579: Mapping 140 breakends ...

CAST
Sample1
2018-09-10 11:24:52,674 Chunk 1:3042432-3044554: Building clusters from 1299 split reads ...
2018-09-10 11:24:52,877 Chunk 1:3042432-3044554: Building breakends...
2018-09-10 11:24:55,578 Chunk 1:3042432-3044554: Mapping 211 breakends ...
Sample2
2018-09-10 11:35:14,578 Processing chunk: 1:3063692-3066152 ...
2018-09-10 11:35:14,609 Chunk 1:3063692-3066152: Parsing split reads from bam(s): /PATHS/bams_reduced/bams

@Martingales
Author

So, I have tried a couple more things.

Re-installing everything: no change in the memory leaks.

Running multiple samples in one tebreak batch: now tebreak cannot even load the reference fasta into shared memory before the memory usage explodes to over 100 GB. This leads me to believe that it might be a Python or cluster issue rather than a tebreak issue?

@Martingales
Author

I had another go at this by reverting to the discocluster.py script, even though it is outdated. When I first call discocluster.py, filter the results down to a list of candidates, and supply this list to tebreak via the --interval_bed option, the whole run goes smoothly without any memory hiccups.

Looking at the code, I can see there are computational differences, but are there conceptual differences between the two approaches?
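
For what it's worth, the filtering step looks roughly like the sketch below. The column layout (chrom, start, end, supporting read count in the first four tab-separated fields) and the threshold are assumptions about the discocluster.py output rather than its documented format; the result is a plain three-column BED for --interval_bed.

    MIN_READS = 8  # hypothetical support threshold

    with open("discocluster.out") as clusters, open("candidates.bed", "w") as bed:
        for line in clusters:
            fields = line.rstrip("\n").split("\t")
            chrom, start, end, reads = fields[0], fields[1], fields[2], fields[3]
            if int(reads) >= MIN_READS:
                bed.write("%s\t%s\t%s\n" % (chrom, start, end))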

@adamewing
Owner

Hi,

The --disco_target option basically incorporates the discocluster.py script into tebreak.py, so there isn't really a conceptual difference. The main difference is that --disco_target uses relaxed criteria for discordant clustering to reduce false negatives.

Have you tried running tebreak without --disco_target? To make this feasible, the way I would approach it is to make a mask via scripts/bam_cpm_mask.py, add all of the contigs, unplaced chromosomes, chrM, and everything in lib/mm10.centromere_telomere.bed to the mask, and pass this via -m. Also set --max_ins_reads to 500 and --min_sr_per_break to 2. If you try this, let me know how it goes.

--Adam
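
A sketch of how the combined mask might be assembled, assuming a .fai index is available for contig lengths; the file paths and the test for patch/unplaced contigs are illustrative, and the bam_cpm_mask.py output is simply concatenated in:

    bed_inputs = ["sample_cpm_mask.bed", "lib/mm10.centromere_telomere.bed"]  # existing BEDs

    with open("combined_mask.bed", "w") as mask:
        for bed in bed_inputs:
            with open(bed) as fh:
                mask.writelines(fh)

        # Add chrM/MT and every patch/unplaced contig as a whole-contig interval.
        with open("mm10.fa.fai") as fai:
            for line in fai:
                name, length = line.split("\t")[:2]
                if name in ("chrM", "MT") or "PATCH" in name or name.startswith(("GL", "JH")):
                    mask.write("%s\t0\t%s\n" % (name, length))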

@Martingales
Author

Alright, I will give this approach a shot. Thanks!

Does it make sense to additionally use BAM files reduced by reduce_bam.py?

@adamewing
Owner

Not unless you need to save disk space - that's the motivation behind reduce_bam.py. It probably doesn't affect runtime or memory usage much if at all.

@Martingales
Author

I have tried running tebreak in full mode with the elaborate sample-specific mask and your proposed command-line options, but now all jobs die. Even the ones that used to work before are getting terminated by the cluster for consuming absurd amounts of memory.

@Martingales
Author

I played around with other command-line options, and there does seem to be a way to circumvent the memory spikes:
--disco_target failed for half the samples, even though the jobs had more than 60 GB of RAM and only 5 nodes.
--disc_only without --disco_target failed for all jobs with the same specs and a sample-specific mask file generated with scripts/bam_cpm_mask.py.
--disc_only with --disco_target worked for all tested samples, with no job needing more than 6 GB of RAM.

@adamewing Is there a way to feed the --disc_only output back into tebreak as input?

@Martingales
Author

Now I create a sample-specific mask file and run tebreak in --disc_only mode, filter the output down to a list of candidates, then call tebreak again and feed the candidate list in via --interval_bed. With the following additional options, I lose <5% of the samples, but I can live with that for now.

tebreak options:

--map_tabix k100.umap.bed
--min_mappability 0.9
--skipbig
--bigcluster 500
--max_ins_reads 500
--min_disc_reads 8
--min_split_reads 8

And for resolve:

--min_split 8
--min_discord 8
--min_ins_match 0.9
--min_ref_match 0.99

These options seem rather stringent to me. What would you recommend for these options, @adamewing, especially in light of your recent paper (Schauer et al.)?

@adamewing
Owner

Hi, sorry you're still having this problem. Are you able to provide any details about your cluster environment, e.g. which scheduler is used, the memory and CPUs per node, etc.? The memory issues you described, and this being the resolution (at least partially), seem bizarre to me.

Those are probably rather stringent options; whether that's appropriate depends on your expectations. For germline insertions that you expect to be heterozygous or homozygous, with reasonably deep coverage, these might be OK. For cancer samples, or where you're willing and able to do validation experiments, I'd drop the stringency a bit (e.g. as in the paper you mentioned).

@Martingales
Author

I'm working on an LSF cluster (IBM Platform LSF Standard 9.1.3.0, October 21 2015) with 40 CPUs and 127.8 GB of memory per node:

HOST_NAME      type     model     cpuf   ncpus  maxmem   maxswp  server  RESOURCES
hh-yoda-02-    X86_64   Intel_EM  60.0   40     127.8G   7.8G    Yes     (mg)

I'm not able to see the memory load per CPU during a job, but I can see the maximum RSS and swap per job. I have attached screenshots of a successful tebreak run and a failed one.

[screenshot: tebreak_failure]
[screenshot: tebreak_success]

@asalcedo31

I am having this issue as well. All of my jobs exceed memory and die with 80 GB for 10 processes, for both 30X and 80X normal/tumour WGS samples. I am running the script on reduced BAMs with the hg19 reference, 100000 chunks, --disco_target provided, centromeres and telomeres masked, and --max_ins_reads 1000.

@adamewing
Owner

Hi!
Thanks for the report. What did the rest of your command line look like? You shouldn't specify both -d (--disco_target) and -c (--chunks); I would recommend just the former (-d) to minimise runtime. Let me know if that helps. I'm working on reducing runtime and memory usage, but in the meantime, if this isn't working for you, I'd suggest MELT as an alternative.
