diff --git a/docs/course_material/group_work/project3.md b/docs/course_material/group_work/project3.md
index f81d5b6..171eb1a 100644
--- a/docs/course_material/group_work/project3.md
+++ b/docs/course_material/group_work/project3.md
@@ -10,20 +10,32 @@ There are eight different species:
 
 `sample_[1-8].fastq.gz`
 
-Each species has a fastq file available. Download only the data for the species that you will require:
+Each species has a fastq file available. You can download all fastq files like this:
 
 ```sh
-mkdir -p ~/workdir/groupwork_assembly
-cd ~/workdir/groupwork_assembly
+wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/project3.tar.gz
+tar -xvf project3.tar.gz
+rm project3.tar.gz
+```
+
+!!! note
+    Download the data file package in your shared working directory, i.e. : `/group_work/` or `~/`. Only one group member has to do this.
 
-# change this to your species:
-species="sample_1"
+This will create a directory `project3` with the following structure:
 
-wget https://ngs-longreads-training.s3.eu-central-1.amazonaws.com/group_work_assembly/"$species".fastq.gz
-tar -xvf "$species".fastq.gz
-rm "$species".fastq.gz
 ```
-
+project3
+|-- sample_1.fastq.gz
+|-- sample_2.fastq.gz
+|-- sample_3.fastq.gz
+|-- sample_4.fastq.gz
+|-- sample_5.fastq.gz
+|-- sample_6.fastq.gz
+|-- sample_7.fastq.gz
+`-- sample_8.fastq.gz
+
+0 directories, 8 files
+```
 
 ### Before you start
 
@@ -46,13 +58,34 @@ You can start this project with dividing the species over the different group me
 conda activate assembly
 ```
 
-
 * Perform a quality control with `NanoPlot`.
     * How is the read quality? Is this quality expected?
     * How is the read length?
 * Perform an assembly with `flye`.
     * Have a look at the helper first with `flye --help`. Make sure you pick the correct mode (i.e. `--pacbio-??`).
     * Check out the output. Where is the assembly? How is the quality? For that, check out `assembly_info.txt`.
+    * What species did you assemble? Choose from this list:
+    ```
+    Acinetobacter baumannii
+    Bacillus cereus
+    Bacillus subtilis
+    Burkholderia cepacia
+    Burkholderia multivorans
+    Enterococcus faecalis
+    Escherichia coli
+    Helicobacter pylori
+    Klebsiella pneumoniae
+    Listeria monocytogenes
+    Methanocorpusculum labreanum
+    Neisseria meningitidis
+    Rhodopseudomonas palustris
+    Salmonella enterica
+    Staphylococcus aureus
+    Streptococcus pyogenes
+    Thermanaerovibrio acidaminovorans
+    Treponema denticola
+    Vibrio parahaemolyticus
+    ```
     * Did flye assemble any plasmid sequences?
 * Check the completeness with `BUSCO`. Have a good look at the manual first. You can use automated lineage selecton by specifying `--auto-lineage-prok`. After you have run `BUSCO`, you can generate a nice completeness plot with `generate_plot.py`. You can check its usage with `generate_plot.py --help`.
     * How is the completeness? Is this expected?
diff --git a/scripts/generate_data_project3/download_reads.sh b/scripts/generate_data_project3/download_reads.sh
new file mode 100644
index 0000000..3db8af2
--- /dev/null
+++ b/scripts/generate_data_project3/download_reads.sh
@@ -0,0 +1,7 @@
+for BC in bc2001 bc2002 bc2004 bc2007 bc2011 bc2019 bc2022 bc2015
+do
+    wget -O "$BC".bam https://downloads.pacbcloud.com/public/dataset/2021-11-Microbial-96plex/demultiplexed-reads/m64004_210929_143746."${BC}".bam
+    samtools fastq -0 "$BC".fastq "$BC".bam
+    gzip "$BC".fastq
+done
+
diff --git a/scripts/generate_data_project3/lookup_bc_organism.csv b/scripts/generate_data_project3/lookup_bc_organism.csv
new file mode 100644
index 0000000..e455da3
--- /dev/null
+++ b/scripts/generate_data_project3/lookup_bc_organism.csv
@@ -0,0 +1,8 @@
+bc2001,sample_6
+bc2002,sample_7
+bc2004,sample_3
+bc2007,sample_1
+bc2011,sample_5
+bc2019,sample_2
+bc2022,sample_8
+bc2015,sample_4
diff --git a/scripts/generate_data_project3/rename_fastq.sh b/scripts/generate_data_project3/rename_fastq.sh
new file mode 100644
index 0000000..5380408
--- /dev/null
+++ b/scripts/generate_data_project3/rename_fastq.sh
@@ -0,0 +1,4 @@
+sed 's/,/ /g' lookup_bc_organism.csv | while read BC NAME
+do
+    mv "$BC".fastq.gz "$NAME.fastq.gz"
+done
\ No newline at end of file
diff --git a/scripts/project2_commands.sh b/scripts/project2_commands.sh
index d15d020..47edd1f 100755
--- a/scripts/project2_commands.sh
+++ b/scripts/project2_commands.sh
@@ -1,6 +1,6 @@
 #!/usr/bin/env bash
 
-cd ~/workdir/groupwork_pacbio/
+cd ~/workdir/project2/
 
 # generate reference for minimap2
 minimap2 \