updates dockerfiles and amends projects accordingly
Geert van Geest committed Mar 3, 2023
1 parent 133b2d1 commit d766b9b
Showing 4 changed files with 35 additions and 55 deletions.
37 changes: 17 additions & 20 deletions Docker/jupyter/Dockerfile
@@ -1,34 +1,31 @@
FROM jupyter/minimal-notebook:9fe5186aba96
FROM jupyter/minimal-notebook:e407f93c8dcc

USER root
RUN cat /etc/skel/.bashrc >> /etc/bash.bashrc
USER $NB_UID

RUN rm -r /home/jovyan/work

RUN /opt/conda/bin/conda install -y --quiet -c bioconda \
samtools=1.10 \
minimap2=2.17 \
fastqc=0.11.9 \
pbmm2=1.4.0 \
parallel=20170422
RUN /opt/conda/bin/conda config --add channels bioconda
RUN /opt/conda/bin/conda config --add channels conda-forge

RUN parallel --citation <<< 'will cite'
RUN /opt/conda/bin/conda install mamba -y -n base -c conda-forge

RUN mamba install -y --quiet \
"samtools>=1.10" \
minimap2 \
fastqc \
pbmm2 \
parallel

# install nanoplot with pip (with conda gives errors)
RUN pip install NanoPlot

# install flair environment from .yml file
COPY flair_env.yml ./
RUN /opt/conda/bin/conda env create -f flair_env.yml
RUN rm flair_env.yml
RUN mamba create -y -n flair -c conda-forge -c bioconda flair

# install pacbio environment from .yml file
COPY pacbio_env.yml ./
RUN /opt/conda/bin/conda env create -f pacbio_env.yml
RUN rm pacbio_env.yml
RUN mamba create -y -n pacbio \
pbccs lima pbmm2 matplotlib numpy \
pandas seaborn mappy pysam scikit-learn pbcore

# install assembly environment from .yml file
COPY assembly_env.yml ./
RUN /opt/conda/bin/conda env create -f assembly_env.yml
RUN rm assembly_env.yml
RUN mamba create -y -n assembly \
prokka flye busco
2 changes: 1 addition & 1 deletion Docker/rstudio/Dockerfile
@@ -1,4 +1,4 @@
FROM rocker/rstudio:4.0.3
FROM rocker/rstudio:4.2.0

# non interactive for installing dependencies
ENV DEBIAN_FRONTEND=noninteractive
8 changes: 4 additions & 4 deletions docs/course_material/group_work/project1.md
@@ -102,12 +102,12 @@ You can start this project with dividing initial tasks. Because some intermediat

* Clone the [FLAIR repository](https://github.com/BrooksLabUCSC/flair) to the server, and check out the documentation. All FLAIR dependencies are in the pre-installed conda environment named `flair`. You can activate it with `conda activate flair`.
* Merge the separate alignments with `samtools merge`, index the merged bam file, and generate a `bed12` file with the script `flair/bin/bam2Bed12.py`
* Run `flair.py correct` on the `bed12` file. Add the `gtf` to the options to improve the alignments.
* Run `flair.py collapse` to generate isoforms from corrected reads. This steps takes ~1 hour to run.
* Generate a count matrix with `flair.py quantify` by using the isoforms fasta and `reads_manifest.tsv`.
* Run `flair correct` on the `bed12` file. Add the `gtf` to the options to improve the alignments.
* Run `flair collapse` to generate isoforms from corrected reads. This step takes ~1 hour to run.
* Generate a count matrix with `flair quantify` by using the isoforms fasta and `reads_manifest.tsv`.

!!! danger "Paths in `reads_manifest.tsv`"
The paths in `reads_manifest.tsv` are relative, e.g. `reads/striatum-5238-batch2.fastq.gz` points to a file relative to the directory from which you are running `flair.py quantify`. So the directory from which you are running the command should contain the directory `reads`. If not, modify the paths in the file accordingly (use full paths if you are not sure).
The paths in `reads_manifest.tsv` are relative, e.g. `reads/striatum-5238-batch2.fastq.gz` points to a file relative to the directory from which you are running `flair quantify`. So the directory from which you are running the command should contain the directory `reads`. If not, modify the paths in the file accordingly (use full paths if you are not sure).

* Now you can do several things:
* Do a differential expression analysis. In `scripts/` there's a basic R script to do the analysis. Go to your specified IP and port to login to RStudio server (the username is `rstudio`).
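
The relative-path pitfall described in the admonition above can be checked before starting `flair quantify`. The following is a minimal sketch, not part of FLAIR: the helper name `check_manifest_paths` is hypothetical, and it assumes that `reads_manifest.tsv` is tab-separated with the reads path in the fourth column (sample, condition, batch, path). Verify that layout against your own manifest first.

```sh
# Report every reads path in a manifest that cannot be found from the
# current working directory.
# ASSUMPTION: tab-separated manifest, reads path in column 4.
check_manifest_paths() {
    manifest="$1"
    missing=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS="$(printf '\t')" read -r sample condition batch path \
            || [ -n "$sample" ]; do
        [ -z "$path" ] && continue      # skip empty or incomplete lines
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done < "$manifest"
    return "$missing"                   # 0 if every path exists, 1 otherwise
}
```

Run `check_manifest_paths reads_manifest.tsv` from the directory in which you plan to run `flair quantify`. If it prints any `missing:` lines, either change directory or edit the paths in the manifest.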
43 changes: 13 additions & 30 deletions docs/course_material/group_work/project3.md
@@ -1,46 +1,39 @@

## :material-bacteria: Project 3: Assembly and annotation of bacterial genomes

You will be working with PacBio sequencing data of five different bacterial strains. Divide the strains over the members of the group and generate an assembly and annotation.
You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species.

!!! info "Project aim"
Generate and evaluate an assembly of a bacterial genome out of PacBio reads.

There are five different strains:
There are eight different species: `sample_[1-8].fastq.gz`



Each strain has a tarfile available. Download only the data for the strains that you will require:
Each species has a fastq file available. Download only the data for the species that you will require:

```sh
mkdir -p ~/workdir/groupwork_assembly
cd ~/workdir/groupwork_assembly

# change this to your strain:
STRAIN="LWX12"
# change this to your species:
species="sample_1"

wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/group_work_assembly/"$STRAIN".tar.gz
tar -xvf "$STRAIN".tar.gz
rm "$STRAIN".tar.gz
wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/group_work_assembly/"$species".fastq.gz
# the download is a gzip-compressed fastq file, not a tar archive,
# so there is nothing to extract or remove
```
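
Each download is a single gzip-compressed fastq file, so a quick integrity check and read count can confirm that the transfer completed. This is a minimal sketch: the helper name `count_fastq_reads` is hypothetical, and it assumes standard four-line fastq records (no wrapped sequence lines).

```sh
# Count the reads in a gzip-compressed fastq file.
# ASSUMPTION: every record spans exactly four lines.
count_fastq_reads() {
    gzip -t "$1" || return 1    # fail early on a truncated or corrupt file
    gzip -cd "$1" | awk 'END { printf "%d\n", NR / 4 }'
}
```

For example, `count_fastq_reads "$species".fastq.gz` prints the number of reads, or returns a non-zero exit status if the file is not valid gzip.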

The downloaded directory has the following structure (here's an example for LWH7):

```
```

### Before you start

You can start this project with dividing the strains over the different group members. In principle, each group member will go through all the steps of assembly and annotation:
You can start this project by dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation:

1. Quality control with `NanoPlot`
2. Assembly with `flye`
3. Assembly QC with `BUSCO`
4. Annotation with `prokka`
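
The four steps above can be laid out as a command plan before anything is run. The sketch below only prints the commands. The helper name `print_assembly_plan`, the output directory names and the thread count are arbitrary choices for illustration, and the flye read-type option is left as an argument because choosing it is part of the exercise. Check each option against the tool's own `--help` before running the printed commands.

```sh
# Print (do not run) one command per step for a single sample.
# $1 = sample name, e.g. sample_1
# $2 = the flye read-type option you picked from `flye --help`
# ASSUMPTION: output directory names and thread count are placeholders.
print_assembly_plan() {
    sample="$1"
    flye_mode="$2"
    echo "NanoPlot --fastq ${sample}.fastq.gz --outdir nanoplot_${sample}"
    echo "flye ${flye_mode} ${sample}.fastq.gz --out-dir flye_${sample} --threads 4"
    echo "busco -i flye_${sample}/assembly.fasta -m genome --auto-lineage-prok -o busco_${sample}"
    echo "prokka --outdir prokka_${sample} --prefix ${sample} flye_${sample}/assembly.fasta"
}
```

Calling `print_assembly_plan sample_1 '--pacbio-??'` prints the plan with the placeholder still in place. Replace it with the mode you selected, then copy the commands one by one so that you can inspect the output of each step before starting the next.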

You can do this for both the CLR reads and HiFi reads and compare the results.

### Tasks and questions

!!! note
@@ -53,27 +46,17 @@ You can do this for both the CLR reads and HiFi reads and compare the results.
conda activate assembly
```

**Before you run `prokka`**

The `conda` installation misses a perl module. Install it in the `assembly` environment like this:

```sh
cpanm Bio::SearchIO::hmmer --force
```

* Perform a quality control with `NanoPlot`.
* How is the read quality? Is this quality expected?
* How is the read length?
* Perform an assembly with `flye`.
* Have a look at the helper first with `flye --help`. Make sure you pick the correct mode (i.e. `--pacbio-??`).
* Check out the output. Where is the assembly? How is the quality? For that, check out `assembly_info.txt`.
* Did flye assemble any plasmid sequences?
* Check the completeness with `BUSCO`. Have a good look at the manual first. You can use automated lineage selection by specifying `--auto-lineage-prok`. After you have run `BUSCO`, you can generate a nice completeness plot with `generate_plot.py`. You can check its usage with `generate_plot.py --help`.
* How is the completeness? Is this expected?
* Perform an annotation with `prokka`. Again, check the manual first. After the run, have a look at for example the statistics in `PROKKA_[date].txt`. For a nice table of annotated genes have a look in `PROKKA_[data].tsv`.


* Perform an annotation with `prokka`. Again, check the manual first. After the run, have a look at for example the statistics in `PROKKA_[date].txt`. For a nice table of annotated genes have a look in `PROKKA_[date].tsv`.
* Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why?

* Compare the assembly and annotation between the Illumina, CLR and HiFi reads. Do you see any differences?
* Compare the assemblies of the different strains. Are assembly qualities similar? Can you think of reasons why?
* **BONUS**: Polish the CLR assembly with the Illumina reads by using `pilon`. For this you will need to align the Illumina reads to the assembly first. Use `minimap2` for that while setting `-x` to `sr`. For pilon, specify the resulting bam file by using the option `--frags`.
* Does the polishing improve the assembly? Why (not)?
> This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/
