updates dockerfiles and amends projects accordingly

sib-swiss · Mar 3, 2023 · d766b9b
1 parent 133b2d1
Showing 4 changed files with 35 additions and 55 deletions.
diff --git a/Docker/jupyter/Dockerfile b/Docker/jupyter/Dockerfile
@@ -1,34 +1,31 @@
-FROM jupyter/minimal-notebook:9fe5186aba96
+FROM jupyter/minimal-notebook:e407f93c8dcc
 
 USER root
 RUN cat /etc/skel/.bashrc >> /etc/bash.bashrc
 USER $NB_UID
 
 RUN rm -r /home/jovyan/work
 
-RUN /opt/conda/bin/conda install -y --quiet -c bioconda \
-samtools=1.10 \
-minimap2=2.17 \
-fastqc=0.11.9 \
-pbmm2=1.4.0 \
-parallel=20170422
+RUN /opt/conda/bin/conda config --add channels bioconda
+RUN /opt/conda/bin/conda config --add channels conda-forge
 
-RUN parallel --citation <<< 'will cite'
+RUN /opt/conda/bin/conda install mamba -y -n base -c conda-forge
+
+RUN mamba install -y --quiet \
+"samtools>=1.10" \
+minimap2 \
+fastqc \
+pbmm2 \
+parallel
 
 # install nanoplot with pip (with conda gives errors)
 RUN pip install NanoPlot
 
-# install flair environment from .yml file
-COPY flair_env.yml ./
-RUN /opt/conda/bin/conda env create -f flair_env.yml
-RUN rm flair_env.yml
+RUN mamba create -y -n flair -c conda-forge -c bioconda flair
 
-# install pacbio environment from .yml file
-COPY pacbio_env.yml ./
-RUN /opt/conda/bin/conda env create -f pacbio_env.yml
-RUN rm pacbio_env.yml
+RUN mamba create -y -n pacbio \
+    pbccs lima pbmm2 matplotlib numpy \
+    pandas seaborn mappy pysam scikit-learn pbcore
 
-# install assembly environment from .yml file
-COPY assembly_env.yml ./
-RUN /opt/conda/bin/conda env create -f assembly_env.yml
-RUN rm assembly_env.yml
+RUN mamba create -y -n assembly \
+    prokka flye busco
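
A note on version specs in `RUN` lines: a spec such as `samtools>=1.10` contains `>`, which `/bin/sh` treats as an output redirection unless the spec is quoted. The sketch below illustrates this with `echo` as a hypothetical stand-in for the install command, so it runs without conda or mamba; the scratch directory and variable names are only for illustration.

```sh
# Work in a scratch directory so the stray file is easy to spot.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Unquoted: '>' starts a redirection, so echo only receives "samtools"
# and its output is written to a file named "=1.10".
echo samtools>=1.10
unquoted_result=$(cat ./=1.10)

# Quoted: the full version spec is passed through as one argument.
quoted_result=$(echo "samtools>=1.10")

echo "unquoted argument: $unquoted_result"
echo "quoted argument:   $quoted_result"
```

In a Dockerfile the same parsing applies to the shell form of `RUN`, so quoting the spec is the safer habit whenever it contains `>` or `<`.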
diff --git a/Docker/rstudio/Dockerfile b/Docker/rstudio/Dockerfile
@@ -1,4 +1,4 @@
-FROM rocker/rstudio:4.0.3
+FROM rocker/rstudio:4.2.0
 
 # non interactive for installing dependencies
 ENV DEBIAN_FRONTEND=noninteractive

diff --git a/docs/course_material/group_work/project1.md b/docs/course_material/group_work/project1.md
@@ -102,12 +102,12 @@ You can start this project with dividing initial tasks. Because some intermediat
 
 * Clone the [FLAIR repository](https://github.com/BrooksLabUCSC/flair) to the server, and check out the documentation. All FLAIR dependencies are in the pre-installed conda environment named `flair`. You can activate it with `conda activate flair`.
 * Merge the separate alignments with `samtools merge`, index the merged bam file, and generate a `bed12` file with the script `flair/bin/bam2Bed12.py`
-* Run `flair.py correct` on the `bed12` file. Add the `gtf` to the options to improve the alignments.
-* Run `flair.py collapse` to generate isoforms from corrected reads. This steps takes ~1 hour to run.
-* Generate a count matrix with `flair.py quantify` by using the isoforms fasta and `reads_manifest.tsv`.
+* Run `flair correct` on the `bed12` file. Add the `gtf` to the options to improve the alignments.
+* Run `flair collapse` to generate isoforms from corrected reads. This step takes ~1 hour to run.
+* Generate a count matrix with `flair quantify` by using the isoforms fasta and `reads_manifest.tsv`.
 
 !!! danger "Paths in `reads_manifest.tsv`"
-    The paths in `reads_manifest.tsv` are relative, e.g. `reads/striatum-5238-batch2.fastq.gz` points to a file relative to the directory from which you are running `flair.py quantify`. So the directory from which you are running the command should contain the directory `reads`. If not, modify the paths in the file accordingly (use full paths if you are not sure).
+    The paths in `reads_manifest.tsv` are relative, e.g. `reads/striatum-5238-batch2.fastq.gz` points to a file relative to the directory from which you are running `flair quantify`. So the directory from which you are running the command should contain the directory `reads`. If not, modify the paths in the file accordingly (use full paths if you are not sure).
 
 * Now you can do several things:
     * Do a differential expression analysis. In `scripts/` there's a basic R script to do the analysis. Go to your specified IP and port to login to RStudio server (the username is `rstudio`).
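
The warning about relative paths in `reads_manifest.tsv` can be checked before starting `flair quantify`. The following is a minimal sketch, assuming the read path sits in the last tab-separated column of the manifest; the sample names, conditions, and files are hypothetical stand-ins created in a scratch directory so that the example runs anywhere.

```sh
# Hypothetical stand-in data: one read file exists, the other does not.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir reads
touch reads/sample_a.fastq.gz
printf 'sample_a\tcond1\tbatch1\treads/sample_a.fastq.gz\n' > reads_manifest.tsv
printf 'sample_b\tcond2\tbatch1\treads/sample_b.fastq.gz\n' >> reads_manifest.tsv

# Assumption: the read path is the last tab-separated column.
awk -F'\t' '{print $NF}' reads_manifest.tsv > manifest_paths.txt

# Report every path that does not resolve from the current directory.
missing=0
while IFS= read -r path; do
    if [ ! -e "$path" ]; then
        echo "missing: $path"
        missing=$((missing + 1))
    fi
done < manifest_paths.txt

echo "$missing path(s) not found from $(pwd)"
```

Run from the directory where you intend to start the quantification, a non-zero count suggests that the manifest paths need to be adjusted or replaced with full paths.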

diff --git a/docs/course_material/group_work/project3.md b/docs/course_material/group_work/project3.md
@@ -1,46 +1,39 @@
 
 ## :material-bacteria: Project 3: Assembly and annotation of bacterial genomes
 
-You will be working with PacBio sequencing data of five different bacterial strains. Divide the strains over the members of the group and generate an assembly and annotation.
+You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species. 
 
 !!! info "Project aim"
     Generate and evaluate an assembly of a bacterial genome out of PacBio reads. 
 
-There are five different strains: 
+There are eight different species: `sample_[1-8].fastq.gz` 
 
 
 
-Each strain has a tarfile available. Download only the data for the strains that you will require: 
+Each species has a fastq file available. Download only the data for the species that you will require: 
 
 ```sh
 mkdir -p ~/workdir/groupwork_assembly
 cd ~/workdir/groupwork_assembly
 
-# change this to your strain:
-STRAIN="LWX12"
+# change this to your species:
+species="sample_1"
 
-wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/group_work_assembly/"$STRAIN".tar.gz
-tar -xvf "$STRAIN".tar.gz
-rm "$STRAIN".tar.gz
+wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/group_work_assembly/"$species".fastq.gz
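+# The download is a gzipped fastq file, not a tar archive, so there is nothing to extract.
+# Keep the .fastq.gz file: NanoPlot and flye read gzipped fastq directly.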
 ```
 
-The downloaded directory has the following structure (here's an example for LWH7):
-
-```
-
-```
 
 ### Before you start
 
-You can start this project with dividing the strains over the different group members. In principle, each group member will go through all the steps of assembly and annotation:
+You can start this project by dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation:
 
 1. Quality control with `NanoPlot`
 2. Assembly with `flye`
 3. Assembly QC with `BUSCO`
 4. Annotation with `prokka`
 
-You can do this for both the CLR reads and HiFi reads and compare the results. 
-
 ### Tasks and questions
 
 !!! note
@@ -53,27 +46,17 @@ You can do this for both the CLR reads and HiFi reads and compare the results.
     conda activate assembly
     ```
 
-    **Before you run `prokka`**
-
-    The `conda` installation misses a perl module. Install it in the `assembly` environment like this:
-
-    ```sh
-    cpanm Bio::SearchIO::hmmer --force
-    ```
 
 * Perform a quality control with `NanoPlot`.
     * How is the read quality? Is this quality expected?
     * How is the read length?
 * Perform an assembly with `flye`. 
     * Have a look at the help first with `flye --help`. Make sure you pick the correct mode (i.e. `--pacbio-??`). 
     * Check out the output. Where is the assembly? How is the quality? For that, check out `assembly_info.txt`. 
+    * Did flye assemble any plasmid sequences?
 * Check the completeness with `BUSCO`. Have a good look at the manual first. You can use automated lineage selection by specifying `--auto-lineage-prok`. After you have run `BUSCO`, you can generate a nice completeness plot with `generate_plot.py`. You can check its usage with `generate_plot.py --help`. 
     * How is the completeness? Is this expected?
-* Perform an annotation with `prokka`. Again, check the manual first. After the run, have a look at for example the statistics in `PROKKA_[date].txt`. For a nice table of annotated genes have a look in `PROKKA_[data].tsv`. 
-
-
+* Perform an annotation with `prokka`. Again, check the manual first. After the run, have a look at, for example, the statistics in `PROKKA_[date].txt`. For a nice table of annotated genes, have a look in `PROKKA_[date].tsv`.
+* Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why?
 
-* Compare the assembly and annotation between the Illumina, CLR and HiFi reads. Do you see any differences? 
-* Compare the assemblies of the different strains. Are assembly qualities similar? Can you think of reasons why?
-* **BONUS**: Polish the CLR assembly with the Illumina reads by using `pilon`. For this you will need to align the Illumina reads to the assembly first. Use `minimap2` for that while setting `-x` to `sr`. For pilon, specify the resulting bam file by using the option `--frags`. 
-    * Does the polishing improve the assembly? Why (not)?
+> This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/
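
Before the quality control step, a quick sanity check of a downloaded `fastq.gz` file can be sketched in shell. The example below is an illustration, not part of the course material: it builds a hypothetical two-read file as a stand-in for a real download, tests that it is valid gzip, and counts reads under the assumption that each fastq record spans exactly four lines.

```sh
# Hypothetical two-read fastq file as a stand-in for a real download.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf '@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n' | gzip > sample_demo.fastq.gz

# gzip -t exits non-zero if the file is truncated or not gzip-compressed.
if gzip -t sample_demo.fastq.gz; then
    gzip_ok=yes
else
    gzip_ok=no
fi

# Assumption: four lines per record, so the read count is lines / 4.
n_lines=$(gzip -cd sample_demo.fastq.gz | wc -l)
n_reads=$((n_lines / 4))

echo "gzip ok: $gzip_ok, reads: $n_reads"
```

For a real sample, replace `sample_demo.fastq.gz` with the downloaded file name; `NanoPlot` then gives the full picture of read quality and length.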