diff --git a/Docker/jupyter/Dockerfile b/Docker/jupyter/Dockerfile
index e16cfa7..68b4c7f 100644
--- a/Docker/jupyter/Dockerfile
+++ b/Docker/jupyter/Dockerfile
@@ -1,4 +1,4 @@
-FROM jupyter/minimal-notebook:9fe5186aba96
+FROM jupyter/minimal-notebook:e407f93c8dcc
 USER root
 RUN cat /etc/skel/.bashrc >> /etc/bash.bashrc
@@ -6,29 +6,26 @@ USER $NB_UID
 RUN rm -r /home/jovyan/work
-RUN /opt/conda/bin/conda install -y --quiet -c bioconda \
-samtools=1.10 \
-minimap2=2.17 \
-fastqc=0.11.9 \
-pbmm2=1.4.0 \
-parallel=20170422
+RUN /opt/conda/bin/conda config --add channels bioconda
+RUN /opt/conda/bin/conda config --add channels conda-forge
-RUN parallel --citation <<< 'will cite'
+RUN /opt/conda/bin/conda install mamba -y -n base -c conda-forge
+
+RUN mamba install -y --quiet \
+samtools>=1.10 \
+minimap2 \
+fastqc \
+pbmm2 \
+parallel
 # install nanoplot with pip (with conda gives errors)
 RUN pip install NanoPlot
-# install flair environment from .yml file
-COPY flair_env.yml ./
-RUN /opt/conda/bin/conda env create -f flair_env.yml
-RUN rm flair_env.yml
+RUN mamba create -y -n flair -c conda-forge -c bioconda flair
-# install pacbio environment from .yml file
-COPY pacbio_env.yml ./
-RUN /opt/conda/bin/conda env create -f pacbio_env.yml
-RUN rm pacbio_env.yml
+RUN mamba create -y -n pacbio \
+ pbccs lima pbmm2 matplotlib numpy \
+ pandas seaborn mappy pysam scikit-learn pbcore
-# install assembly environment from .yml file
-COPY assembly_env.yml ./
-RUN /opt/conda/bin/conda env create -f assembly_env.yml
-RUN rm assembly_env.yml
+RUN mamba create -y -n assembly \
+ prokka flye busco
diff --git a/Docker/rstudio/Dockerfile b/Docker/rstudio/Dockerfile
index a588638..08c4be6 100644
--- a/Docker/rstudio/Dockerfile
+++ b/Docker/rstudio/Dockerfile
@@ -1,4 +1,4 @@
-FROM rocker/rstudio:4.0.3
+FROM rocker/rstudio:4.2.0
 # non interactive for installing dependencies
 ENV DEBIAN_FRONTEND=noninteractive
diff --git a/docs/course_material/group_work/project1.md b/docs/course_material/group_work/project1.md
index 91baaf2..eed6709 100644
--- a/docs/course_material/group_work/project1.md
+++ b/docs/course_material/group_work/project1.md
@@ -102,12 +102,12 @@ You can start this project with dividing initial tasks. Because some intermediat
 * Clone the [FLAIR repository](https://github.com/BrooksLabUCSC/flair) to the server, and check out the documentation. All FLAIR dependencies are in the the pre-installed conda environment named `flair`. You can activate it with `conda activate flair`.
 * Merge the separate alignments with `samtools merge`, index the merged bam file, and generate a `bed12` file with the script `flair/bin/bam2Bed12.py`
-* Run `flair.py correct` on the `bed12` file. Add the `gtf` to the options to improve the alignments.
-* Run `flair.py collapse` to generate isoforms from corrected reads. This steps takes ~1 hour to run.
-* Generate a count matrix with `flair.py quantify` by using the isoforms fasta and `reads_manifest.tsv`.
+* Run `flair correct` on the `bed12` file. Add the `gtf` to the options to improve the alignments.
+* Run `flair collapse` to generate isoforms from corrected reads. This steps takes ~1 hour to run.
+* Generate a count matrix with `flair quantify` by using the isoforms fasta and `reads_manifest.tsv`.
 !!! danger "Paths in `reads_manifest.tsv`"
- The paths in `reads_manifest.tsv` are relative, e.g. `reads/striatum-5238-batch2.fastq.gz` points to a file relative to the directory from which you are running `flair.py quantify`. So the directory from which you are running the command should contain the directory `reads`. If not, modify the paths in the file accordingly (use full paths if you are not sure).
+ The paths in `reads_manifest.tsv` are relative, e.g. `reads/striatum-5238-batch2.fastq.gz` points to a file relative to the directory from which you are running `flair quantify`. So the directory from which you are running the command should contain the directory `reads`. If not, modify the paths in the file accordingly (use full paths if you are not sure).
 * Now you can do several things:
 * Do a differential expression analysis. In `scripts/` there's a basic R script to do the analysis. Go to your specified IP and port to login to RStudio server (the username is `rstudio`).
diff --git a/docs/course_material/group_work/project3.md b/docs/course_material/group_work/project3.md
index b7f6b38..f81d5b6 100644
--- a/docs/course_material/group_work/project3.md
+++ b/docs/course_material/group_work/project3.md
@@ -1,46 +1,39 @@
 ## :material-bacteria: Project 3: Assembly and annotation of bacterial genomes
-You will be working with PacBio sequencing data of five different bacterial strains. Divide the strains over the members of the group and generate an assembly and annotation.
+You will be working with PacBio sequencing data of eight different bacterial species. Divide the species over the members of the group and generate an assembly and annotation. After that, guess the species.
 !!! info "Project aim"
 Generate and evaluate an assembly of a bacterial genome out of PacBio reads.
-There are five different strains:
+There are eight different species: `sample_[1-8].fastq.gz`
-Each strain has a tarfile available. Download only the data for the strains that you will require:
+Each species has a fastq file available. Download only the data for the species that you will require:
 ```sh
 mkdir -p ~/workdir/groupwork_assembly
 cd ~/workdir/groupwork_assembly
-# change this to your strain:
-STRAIN="LWX12"
+# change this to your species:
+species="sample_1"
-wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/group_work_assembly/"$STRAIN".tar.gz
-tar -xvf "$STRAIN".tar.gz
-rm "$STRAIN".tar.gz
+wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/group_work_assembly/"$species".fastq.gz
+tar -xvf "$species".fastq.gz
+rm "$species".fastq.gz
 ```
-The downloaded directory has the following structure (here's an example for LWH7):
-
-```
-
-```
 ### Before you start
-You can start this project with dividing the strains over the different group members. In principle, each group member will go through all the steps of assembly and annotation:
+You can start this project with dividing the species over the different group members. In principle, each group member will go through all the steps of assembly and annotation:
 1. Quality control with `NanoPlot`
 2. Assembly with `flye`
 3. Assembly QC with `BUSCO`
 4. Annotation with `prokka`
-You can do this for both the CLR reads and HiFi reads and compare the results.
-
 ### Tasks and questions
 !!! note
@@ -53,13 +46,6 @@ You can do this for both the CLR reads and HiFi reads and compare the results.
 conda activate assembly
 ```
- **Before you run `prokka`**
-
- The `conda` installation misses a perl module. Install it in the `assembly` environment like this:
-
- ```sh
- cpanm Bio::SearchIO::hmmer --force
- ```
 * Perform a quality control with `NanoPlot`.
 * How is the read quality? Is this quality expected?
@@ -67,13 +53,10 @@ You can do this for both the CLR reads and HiFi reads and compare the results.
 * Perform an assembly with `flye`.
 * Have a look at the helper first with `flye --help`. Make sure you pick the correct mode (i.e. `--pacbio-??`).
 * Check out the output. Where is the assembly? How is the quality? For that, check out `assembly_info.txt`.
+ * Did flye assemble any plasmid sequences?
 * Check the completeness with `BUSCO`. Have a good look at the manual first. You can use automated lineage selecton by specifying `--auto-lineage-prok`. After you have run `BUSCO`, you can generate a nice completeness plot with `generate_plot.py`. You can check its usage with `generate_plot.py --help`.
 * How is the completeness? Is this expected?
-* Perform an annotation with `prokka`. Again, check the manual first. After the run, have a look at for example the statistics in `PROKKA_[date].txt`. For a nice table of annotated genes have a look in `PROKKA_[data].tsv`.
-
-
+* Perform an annotation with `prokka`. Again, check the manual first. After the run, have a look at for example the statistics in `PROKKA_[date].txt`. For a nice table of annotated genes have a look in `PROKKA_[date].tsv`.
+* Compare the assemblies of the different species. Are assembly qualities similar? Can you think of reasons why?
-* Compare the assembly and annotation between the Illumina, CLR and HiFi reads. Do you see any differences?
-* Compare the assemblies of the different strains. Are assembly qualities similar? Can you think of reasons why?
-* **BONUS**: Polish the CLR assembly with the Illumina reads by using `pilon`. For this you will need to align the Illumina reads to the assembly first. Use `minimap2` for that while setting `-x` to `sr`. For pilon, specify the resulting bam file by using the option `--frags`.
- * Does the polishing improve the assembly? Why (not)?
+> This tutorial is based on data provided by Pacific Biosciences at https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/
\ No newline at end of file
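
The updated download step in `project3.md` fetches a `.fastq.gz` file per sample. A minimal sketch of how such a file could be sanity-checked after download, assuming the samples are plain gzip-compressed FASTQ files (the extension suggests this, but the diff does not confirm it). In that case `gzip -t` would be the natural integrity check and the reads can be streamed without unpacking. The helper name `count_reads` and the fixture `sample_demo.fastq.gz` are hypothetical and only stand in for a real `sample_N.fastq.gz`:

```sh
#!/bin/sh
# Sketch: verify a gzip-compressed FASTQ file and count its reads.
# Assumption: the file is plain gzip (not a tar archive) with 4 lines per record.
set -eu

# Hypothetical helper: print the number of reads in a gzipped FASTQ file.
count_reads() {
    gzip -t "$1"                       # non-zero exit if the file is not valid gzip
    lines=$(gzip -dc "$1" | wc -l)     # stream-decompress and count lines
    echo $((lines / 4))                # one FASTQ record spans 4 lines
}

# Tiny local fixture standing in for a downloaded sample (hypothetical name).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' \
    | gzip -c > "$workdir/sample_demo.fastq.gz"

n_reads=$(count_reads "$workdir/sample_demo.fastq.gz")
echo "reads: $n_reads"
```

For a real sample, the fixture path would be replaced by the downloaded file, e.g. `count_reads "$species".fastq.gz`. Tools such as `NanoPlot` and `flye` generally accept gzipped FASTQ input directly, so the file would not need to be unpacked first.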