Bacterial contamination documentation #88

Open

adthrasher wants to merge 6 commits into master from bacterial_contamination

Member

adthrasher commented Dec 6, 2022

Adding a page to talk about the bacterial contamination in CCSS WGS data.

adthrasher added 3 commits

December 6, 2022 10:26


          Initial draft of bacterial contamination documentation

5d94f95


          Update plot colors

1cd9b35


          Finishing initial draft of bacterial contamination text

de1fe79

adthrasher self-assigned this

adthrasher requested review from claymcleod and mcrusch

December 6, 2022 19:43

Member Author

adthrasher commented Dec 6, 2022

@mcrusch & @claymcleod - I've tried to write up documentation for the bacterial contamination which describes the problem and the resolution. I wanted to communicate the issue and our solution without bogging the page down in all of the analysis and results that we have. Let me know if there is anything you feel is missing. Feel free to provide other feedback as well, if you have any.

claymcleod approved these changes

docs/genomics-platform/about-our-data/CCSS-and-bacterial-contamination/index.md Outdated


		## Overview

		Childhood Cancer Survivor Study (CCSS) is a germline-only Data Set consisting of whole genome sequencing of childhood cancer survivors. CCSS is a multi-institutional, multi-disciplinary, NCI-funded collaborative resource established to evaluate long-term outcomes among survivors of childhood cancer. It is a retrospective cohort consisting of >24,000 five-year survivors of childhood cancer who were diagnosed between 1970-1999 at one of 31 participating centers in the U.S. and Canada. The primary purpose of this sequencing of CCSS participants is to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy.

Member

claymcleod Dec 11, 2022

Suggested change

		- Childhood Cancer Survivor Study (CCSS) is a germline-only Data Set consisting of whole genome sequencing of childhood cancer survivors. CCSS is a multi-institutional, multi-disciplinary, NCI-funded collaborative resource established to evaluate long-term outcomes among survivors of childhood cancer. It is a retrospective cohort consisting of >24,000 five-year survivors of childhood cancer who were diagnosed between 1970-1999 at one of 31 participating centers in the U.S. and Canada. The primary purpose of this sequencing of CCSS participants is to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy.
		+ Childhood Cancer Survivor Study (CCSS) is a germline-only dataset consisting of whole genome sequencing of childhood cancer survivors. CCSS is a multi-institutional, multi-disciplinary, NCI-funded collaborative resource established to evaluate long-term outcomes among survivors of childhood cancer. It is a retrospective cohort consisting of >24,000 five-year survivors of childhood cancer who were diagnosed between 1970-1999 at one of 31 participating centers in the U.S. and Canada. The primary purpose of this sequencing of CCSS participants is to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy.

Member Author

adthrasher Dec 12, 2022

@claymcleod "Data Set" / "data set" etc. is used all throughout the dau-and-datasets section. In fact, this paragraph is just a reproduction of the description under that section. If we change it here, we should change it across that page for consistency.

Member Author

adthrasher Dec 15, 2022

@claymcleod Looking into this a little further, the APA (https://blog.apastyle.org/apastyle/2012/07/data-is-or-data-are.html) suggests "data set" is the correct term, though that guidance is 10 years old now. "Dataset" now appears in some dictionaries as a simplification. Either way, we should ensure consistency across the pages.

docs/genomics-platform/about-our-data/CCSS-and-bacterial-contamination/index.md


		Samples for the Childhood Cancer Survivorship Study were collected by sending out Buccal swab kits to enrolled participants and having them complete the kits at home. This mechanism of collecting saliva and buccal cells for sequencing is highly desirable because of its non-invasive nature and ease of execution. However, collection of samples in this manner also has higher probability of contamination from external sources (as compared to, say, samples collected using blood). We have observed samples in this cohort which suffer from bacterial contamination. To address this issue, we have taken the following steps:

		1. We have estimated the bacterial contamination rate and annotated each of the samples in the CCSS cohort. For each sample, you will find the estimated contamination rate in the `Description` field of the `SAMPLE_INFO.txt` file that is vended with your data (and as a property on the DNAnexus file). For information on this field, see the [Metadata specification](../metadata-and-clinical#metadata).

Member

claymcleod Dec 11, 2022

Is the warning in the `Description` field going to be relevant once we vend the aln files? I don't think so, right?

Member Author

adthrasher Dec 12, 2022

I think that is up to the team. In my opinion, there is still some usefulness to the warning. There still exist bacterial reads in the files. We also have not quantitatively measured the mapping rate of bacterial reads in aln data. We have just observed that the mapping rates are lower and that we have not previously noticed the issue in aln aligned data.
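
For readers who want to see the annotation discussed in this thread, here is a minimal sketch of reading the `Description` field from `SAMPLE_INFO.txt`. It assumes the file is tab-delimited with a header row and that samples are identified by a column named `sample_name`; both are assumptions for illustration, not details confirmed in this PR. The sketch returns the raw `Description` text and does not try to parse the contamination estimate, because the exact wording of that field is not shown here.

```python
import csv
import io


def read_descriptions(sample_info_text, sample_column="sample_name"):
    """Map each sample to its raw Description value.

    Assumes tab-delimited text with a header row. The `sample_column`
    name is hypothetical; adjust it to match the real file.
    """
    reader = csv.DictReader(io.StringIO(sample_info_text), delimiter="\t")
    return {row[sample_column]: row["Description"] for row in reader}


# Hypothetical file content, for illustration only.
example = "sample_name\tDescription\nS1\texample note\nS2\tanother note\n"
descriptions = read_descriptions(example)
```

With a real file, the text could be loaded with `open("SAMPLE_INFO.txt").read()` and passed to the same function.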

docs/genomics-platform/about-our-data/CCSS-and-bacterial-contamination/index.md


		![](./kraken_non_human_fraction_with_mapping_aln_mem.png)

		We elected to use `bwa-aln` for this dataset as, currently, there is no clear scalable solution to address contaminated sequencing data. A read identification strategy, such as filtering with Kraken2, would still contain atypical aligning reads that sufficiently diverge from known genomes. [Prior work](https://doi.org/10.1038/s41598-020-76022-4) showed that greater than 30% of these reads showed no similarity to an species in RefSeq. Our investigation supports this conclusion.

Member

claymcleod Dec 11, 2022

I would mention here that we tried many alternative approaches to try to clean the bacterial reads from the data. Ultimately, we ended up choosing bwa aln because... etc.

Member Author

adthrasher Dec 12, 2022

I'm not sure that it is accurate to state that we tried alternative approaches. We considered and discarded a number of strategies. We did pilot work for one or two alternative strategies, which I allude to in the final sentence. Unless there is additional work of which I am unaware.

Member

claymcleod commented Dec 11, 2022

Approved after these changes are made and questions answered.


          Feedback from review

04ed3a9

adthrasher requested a review from bgcurran

December 16, 2022 17:52

adthrasher requested review from claymcleod and agout

January 23, 2023 16:51


          Remove sample name based on review feedback.

5e15b05

mcrusch requested changes

docs/genomics-platform/about-our-data/CCSS-and-bacterial-contamination/index.md Outdated


		## Bacterial Contamination

		Samples for the Childhood Cancer Survivorship Study were collected by sending out Buccal swab kits to enrolled participants and having them complete the kits at home. This mechanism of collecting saliva and buccal cells for sequencing is highly desirable because of its non-invasive nature and ease of execution. However, collection of samples in this manner also has higher probability of contamination from external sources (as compared to, say, samples collected using blood). We have observed samples in this cohort which suffer from bacterial contamination. To address this issue, we have taken the following steps:

mcrusch Jan 26, 2023

Buccal -> buccal

docs/genomics-platform/about-our-data/CCSS-and-bacterial-contamination/index.md Outdated


		## Conclusion

		We have provided the `BAM` file as aligned with `bwa aln` with default parameters. This departure from our standard harmonization pipeline that utilizes `bwa mem` is the best currently available approach to prevent bacterial contamination from impacting variant calling. This is likely a result of `bwa aln` using a minimum seed length of `32` while `bwa-mem` uses a minimum seed length of `19`. In practice, the bacterial reads all appear to have alignments between 19-25bp. The provided BAM files from `bwa-aln` are significantly less impacted by bacterial contamination than those produced by `bwa-mem`. These BAM files should be sufficient for downstream use with requiring additional post-processing.

mcrusch Jan 26, 2023

I feel like the seed length discussion should probably go higher up. It seems too detailed for the conclusion, and it's new information, which doesn't fit well in the conclusion.

What does "with requiring additional post-processing" mean? I assume that should be "without" correct?
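
The seed length argument in the draft can be illustrated with a deliberately simplified model: treat a read as alignable only if its longest match to the human reference is at least as long as the aligner's seed length. The two constants are the values quoted in the draft; the function names are hypothetical, and real seeding in `bwa aln` and `bwa mem` is more involved than a single length comparison.

```python
BWA_ALN_SEED_LENGTH = 32      # seed length quoted in the draft for `bwa aln`
BWA_MEM_MIN_SEED_LENGTH = 19  # minimum seed length quoted in the draft for `bwa mem`


def can_seed(match_length, min_seed_length):
    """Simplified model: a match can seed an alignment only if it is long enough."""
    return match_length >= min_seed_length


def seeded_by(match_length):
    """Report which aligner could seed a match of this length under the model."""
    return {
        "bwa aln": can_seed(match_length, BWA_ALN_SEED_LENGTH),
        "bwa mem": can_seed(match_length, BWA_MEM_MIN_SEED_LENGTH),
    }


# The draft reports bacterial reads with alignments of 19-25 bp.
for length in range(19, 26):
    print(length, seeded_by(length))
```

Under this model, every match in the 19-25 bp range clears the `bwa mem` threshold and none clears the `bwa aln` threshold, which is consistent with the explanation the draft offers.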

docs/genomics-platform/about-our-data/CCSS-and-bacterial-contamination/index.md Outdated


		![](./ccss_aln_vs_mem_mapping_rates.png)

		In the figure below, the `bwa-mem` and `bwa-aln` alignment rates are plotted for each WGS sample in CCSS. The fraction of reads not labeled as of human origin from [kraken2](https://doi.org/10.1186/s13059-019-1891-0) is also plotted. Both aligners show diminished alignment rates in the presence of bacterial contamination, however `bwa-mem` consistently aligns a higher proportion of reads of dubious origin.

mcrusch Jan 26, 2023

I don't think this plot is sufficient to draw the conclusion that mem aligns a higher proportion of reads of dubious origin. We don't know for sure that the dubious reads are the ones mem is aligning. I think we still have evidence of this (like manually inspecting the pillars of doom in the bams), but this makes it seem like you can conclude that from the plot. I would update the wording to make that clear.
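
One way to test the point raised here more directly than the plot does would be to cross-reference per-read Kraken2 labels with the reads each aligner mapped. The sketch below is not the analysis performed for this PR. It assumes Kraken2's standard per-read output (tab-separated: `C`/`U` status, read ID, taxonomy ID, and so on) with bare taxonomy IDs in the third column, and it counts any read not assigned exactly to human (taxid 9606) as non-human, so reads assigned to ranks above human are counted as non-human too. The function names, and the idea that mapped read IDs have already been extracted from each BAM, are assumptions.

```python
HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens


def non_human_read_ids(kraken_lines, human_taxid=HUMAN_TAXID):
    """Collect IDs of reads that Kraken2 did not assign to the human taxid."""
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "U" or taxid != human_taxid:
            ids.add(read_id)
    return ids


def non_human_fraction_of_mapped(mapped_read_ids, non_human_ids):
    """Fraction of an aligner's mapped reads that are labeled non-human."""
    mapped = set(mapped_read_ids)
    if not mapped:
        return 0.0
    return len(mapped & non_human_ids) / len(mapped)


# Hypothetical toy input, for illustration only.
kraken = ["C\tr1\t9606\t150\t-", "C\tr2\t562\t150\t-", "U\tr3\t0\t150\t-"]
non_human = non_human_read_ids(kraken)
mem_fraction = non_human_fraction_of_mapped(["r1", "r2"], non_human)
aln_fraction = non_human_fraction_of_mapped(["r1"], non_human)
```

A consistently higher fraction for `bwa mem` than for `bwa aln` across samples would support the claim at the read level, which the per-sample mapping rates alone cannot do.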


          Applying suggestions from Mike's review

9817ae8

mcrusch approved these changes
