diff --git a/_site/develop/02_DMP.html b/_site/develop/02_DMP.html
index d1f82f5..5b7667c 100644
--- a/_site/develop/02_DMP.html
+++ b/_site/develop/02_DMP.html
@@ -258,7 +258,7 @@

2. Data Management Plan

Modified
-

May 22, 2024

+

July 30, 2024

@@ -289,7 +289,7 @@

2. Data Management Plan

The process of data management involves implementing tailored best practices for your data, but how do you ensure comprehensive coverage of these decisions and that data is well managed throughout its life cycle? To achieve this, a Data Management Plan (DMP) is essential.

-

A DMP serves as a comprehensive document detailing strategies for handling project data, code, and documentation across its life cycle. It includes plans for data collection, documentation, organization, and preservation.

+

DMPs are required for grant applications to ensure that research data are FAIR. A DMP serves as a comprehensive document detailing strategies for handling project data, code, and documentation across its life cycle. It includes plans for data collection, documentation, organization, and preservation.

Benefits of writing a DMP

A DMP serves as the initial step toward achieving FAIR principles in a project.

diff --git a/_site/develop/03_DOD.html b/_site/develop/03_DOD.html
index e99ae98..1172978 100644
--- a/_site/develop/03_DOD.html
+++ b/_site/develop/03_DOD.html
@@ -308,7 +308,7 @@

3. Data organization and storage

Modified
-

May 22, 2024

+

July 25, 2024

@@ -416,7 +416,7 @@

Folder organization
-

Ensure that the person downloading the files employs checksums or cryptographic hash functions to verify the integrity and ascertain that files are neither corrupted nor tampered with.

+

Ensure that the person downloading the files employs checksums (MD5, SHA1, SHA256) or cryptographic hash functions to verify the integrity and ascertain that files are neither corrupted nor tampered with.

  • MD5 Checksum: Files with names ending in “.md5” contain MD5 checksums. For instance, “filename.txt.md5” holds the MD5 checksum of “filename.txt”.
@@ -615,7 +615,7 @@

Optimizing Fo ├── .fastq.gz └── samplesheet.csv

@@ -749,7 +809,7 @@

Quick tutor

3. Resources and databases folder

Health databases are utilized for storing, organizing, and providing access to diverse health-related data, including genomic data, clinical records, imaging data, and more. These resources are regularly updated and released under different versions from various sources. To ensure data reproducibility, it’s crucial to manage and specify the versions and sources of data within these databases.

- -
+

For example, preprocessing NGS data involves utilizing various genomic resources for tasks like aligning and annotating fastq files. Essential resources include reference genomes in FASTA format (e.g., human and mouse), indexed fasta files for alignment tools like STAR and Bowtie, and GTF or GFF files for quantifying reads into genomic regions. One of the latest human reference genomes is GRCh38; however, many past studies are based on GRCh37.

How can you keep track of your resources? Name the folder using the version, or use a reference genome manager such as refgenie.

@@ -776,12 +836,12 @@

Manual Download

  • Organizing data structure: Create a data structure that allows storing all versions in the same parent directory, and ensure that all lab members follow these practices.
  • Documentation and metadata preservation: Before downloading, carefully review the documentation provided by the database. Download files containing the data version and any associated metadata.
  • README.md: Record the version of the data in the README.md file.
  • -
  • Checksums: Check for and use checksums provided by the database to verify the integrity of the downloaded data, ensuring that it hasn’t been corrupted during transfer. Do the exercise below.
  • +
  • Checksums: Check for and use checksums (MD5, SHA1, SHA256, …) provided by the database to verify the integrity of the downloaded data, ensuring that it hasn’t been corrupted during transfer. Do the exercise below to get more familiar with these files.
  • Verify File size: Check the file size provided by the source. It is not as secure as checksum verification but discrepancies could indicate corruption.
  • Automated Processes: whenever possible, automate the download process to reduce the likelihood of errors and ensure consistency (e.g. use bash script or pipeline).
  • - -
    +
    -

    We recommend the use of md5sum to verify data integrity, especially if you are downloading large datasets. In this example, we use data from the HLA FTP Directory.

    +

    We recommend the use of md5sum to verify data integrity, especially if you are downloading large datasets, as it is commonly used. In this example, we use data from the HLA FTP Directory.

    1. Install md5sum (from coreutils package)
    -
    #!/bin/bash
    -# On Ubuntu/Debian
    -apt-get install coreutils
    -# On macOS
    -brew install coreutils
    +
    #!/bin/bash
    +# On Ubuntu/Debian
    +apt-get install coreutils
    +# On macOS
    +brew install coreutils
    1. Create a bash script to download the target files (named “dw_resources.sh” in the data structure).
    -
    #!/bin/bash
    -# Important: go through the README before downloading! Check if a checksums file is included. 
    -
    -# 1. Create or change the directory to the resources dir. 
    -
    -# Check for checksums (e.g.: md5checksum.txt), download, and modify it so that it only contains the checksums of the target files. The file will look like this:
    -1a3d12e4e6cc089388d88e3509e41cb3  hla_gen.fasta
    -# Finally, save it: 
    -md5file="md5checksum.txt"
    -
    -# Define the URL of the files to download
    -url="ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_gen.fasta"
    -# 
    -filename=$(basename "$url")
    -
    -# (Optional) Define a different filename to save the downloaded file (`wget -O $out_filename`)
    -# out_filename = "imgt_hla_gen.fasta"
    -
    -# Download the file
    -wget $url && \
    -md5sum --status --check $md5file
    +
    #!/bin/bash
    +# Important: go through the README before downloading! Check if a checksums file is included. 
    +
    +# 1. Create or change the directory to the resources dir. 
    +
    +# Check for checksums (e.g.: md5checksum.txt), download, and modify it so that it only contains the checksums of the target files. The file will look like this:
    +# 7348fbef5ab204f3aca67e91f6c59ed2  hla_prot.fasta
    +# Finally, save it: 
    +md5file="md5checksum.txt"
    +
    +# Define the URL of the files to download
    +url="ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_prot.fasta"
    +
    +# (Optional 1) Save the original file name: filename=$(basename "$url")
    +# (Optional 2) Define a different filename to save the downloaded file (`wget -O $out_filename`)
    +# out_filename="imgt_hla_prot.fasta"
    +
    +# Download the file
    +wget "$url" && \
    +md5sum --status --check "$md5file"
    +
    +We recommend using the argument `--status` **only** when you incorporate this sanity check as part of your pipeline: it suppresses all output and reports success or failure through the exit code alone.
    1. Folder structure
    -
    genomic_resources/
    -├── specie1/
    -  └── version/
    -     ├── files.txt
    -     └── indexes/
    -└── dw_resources.sh
    +
    genomic_resources/
    +├── specie1/
    +  └── version/
    +     ├── files.txt
    +     └── indexes/
    +└── dw_resources.sh
    1. Create an md5sum file and share it with collaborators before sharing the data. This allows others to check the integrity of the files.
    -
    md5sum <data>
    +
    md5sum <data>
    -
    +
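The last step above (create a checksum file, share it, let collaborators verify) can be sketched end to end as follows. This is a minimal sketch under stated assumptions: `example.fasta` is a hypothetical stand-in file created in a temporary directory, not real data, and GNU `md5sum` from coreutils is assumed to be installed.

```shell
#!/bin/bash
set -euo pipefail

# Work in a throwaway directory so the sketch does not touch real data.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for a real data file.
printf 'ACGT\n' > example.fasta

# Sender: record the checksum in a file that travels with the data.
md5sum example.fasta > md5checksum.txt

# Receiver: verify the data against the shared checksum file.
# --quiet prints only failures; the exit code signals the overall result.
if md5sum --quiet --check md5checksum.txt; then
    verify_status="ok"
else
    verify_status="failed"
fi
echo "verification: $verify_status"
```

The same pattern should apply to `sha256sum`, which accepts the same `--check` option, if a stronger hash is preferred.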
    @@ -848,7 +909,7 @@

    Manual Download

    -
    +
    @@ -898,7 +959,7 @@

    Naming conventions

    -
    +
    @@ -907,7 +968,7 @@

    Naming conventions

    -
    +
    @@ -917,18 +978,174 @@

    Naming conventions

    -

    To learn more about naming conventions for NGS analysis and see additional examples, click here.

    +
    +
    +
    + +
    +
    +Which naming conventions should not be used and why? +
    +
    +
    +
    +
    +
    +
    +
    A. data_processing_carlo's.py
    +B. raw_sequences_V#20241111.fasta
    +C. differential_expression_results_clara.csv
    +D. Grant proposal final.doc
    +E. sequence_alignment$v1.py
    +F. data/gene_annotations_20201107.gff
    +G. alpha~1.0/beta~2.0/reg_2024-05-98.tsv
    +H. alpha=1.0/beta=2.0/reg_2024-05-98.tsv
    +I. run_pipeline:20241203.sh
    +
    + +
    +
    +
    +
    +

    A, B, D, E, H, I
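One way to check such names automatically is to flag any character outside a small allowed set. The sketch below is illustrative only: the allowed set (letters, digits, `.`, `_`, `-`, `~`, and `/` as a path separator) is an assumption inferred from the answer above, not an official rule, and `is_safe_name` is a hypothetical helper name.

```shell
#!/bin/bash
# Use the C locale so that A-Z and a-z mean exactly the ASCII letters.
export LC_ALL=C

# Hypothetical helper: succeeds only if the name contains nothing but
# letters, digits, '.', '_', '~', '-' and '/' (assumed allowed set).
is_safe_name() {
    case "$1" in
        *[!A-Za-z0-9._~/-]*) return 1 ;;
        *) return 0 ;;
    esac
}

flagged=""
while IFS='|' read -r label name; do
    if ! is_safe_name "$name"; then
        flagged="${flagged:+$flagged }$label"
    fi
done <<'EOF'
A|data_processing_carlo's.py
B|raw_sequences_V#20241111.fasta
C|differential_expression_results_clara.csv
D|Grant proposal final.doc
E|sequence_alignment$v1.py
F|data/gene_annotations_20201107.gff
G|alpha~1.0/beta~2.0/reg_2024-05-98.tsv
H|alpha=1.0/beta=2.0/reg_2024-05-98.tsv
I|run_pipeline:20241203.sh
EOF

echo "Names to avoid: $flagged"
```

A character check like this only catches spaces and special characters; it says nothing about whether a name is descriptive or consistently structured.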

    +
    +
    +
    +
    +
    +
    +
    +
    +
    +
    +
    +
    +
    + +
    +
    +Which file name is more readable? +
    +
    +
    +
    +
    +
    +
    +
    1a. forecast2000122420240724.tsv
    +1b. forecast_2000-12-24_2024-07-24.tsv
    +1c. forecast_2000_12_24_2024_07_24.tsv
    +2a. 01_data_preprocessing.R
    +2b. 1_data_preProcessing.R
    +2c. 01_d4t4_pr3processing.R
    +3a. B1_2024-12-12_cond~pH7_temp~37C.fastq
    +3b. B1.20241212.pH7.37C.fastq
    +3c. b1_2024-12-12_c0nd~pH7_t3mp~37C.fastq
    +
    + +
    +
    +
    +
    +

    1b: easier for both humans and machines to read. _ separates the two dates, and - separates the components within each date (year/month/day). This is important, for example, when using wildcards in Snakemake for building pipelines.

    +

    2a: starts with a leading 0 for correct sorting, and is consistent in its use of upper/lower case and separators (_ separates metadata).

    +

    3a: ~ indicates that the variable temperature is set to 37 Celsius (a temperature could be negative, so - is better reserved for separating date components).
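To illustrate why consistent separators matter, the sketch below splits file name 3a back into its metadata using only bash built-ins. The variable names (`batch`, `run_date`, `cond_value`, `temp_value`) are hypothetical labels chosen for this example; the original does not name these fields.

```shell
#!/bin/bash
name="B1_2024-12-12_cond~pH7_temp~37C.fastq"

# Drop the extension, then split the metadata fields on "_".
stem="${name%.fastq}"
IFS=_ read -r batch run_date cond temp <<< "$stem"

# Within a field, "~" separates the variable name from its value.
cond_value="${cond#*"~"}"
temp_value="${temp#*"~"}"

# Within the date, "-" separates year, month and day.
IFS=- read -r year month day <<< "$run_date"

echo "batch=$batch date=$year/$month/$day cond=$cond_value temp=$temp_value"
```

With names such as 3b, where "." is used both as a separator and before the extension, the same split would need extra rules to work reliably.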

    +
    +
    +
    +
    +
    +
    +
    +
    +
    +
    +

    Regular expressions are an incredibly powerful tool for string manipulation. We recommend checking out RegexOne to learn how to create smart file names that will help you parse them more efficiently. To learn more about naming conventions for NGS analysis and see additional examples, click here.

    +
    +
    +
    + +
    +
    +Which of the following regexps match the following filenames? +
    +
    +
    +
    +
    +
    +
    +

    (in bold filenames that SHOULD be matched):

    +
      +
    • rna_seq/2021/03/results/Sample_A123_gene_expression.tsv
    • +
    • proteomics/2020/11/Sample_B234_protein_abundance.tsv
    • +
    • rna_seq/2021/03/results/Sample_C345_normalized_counts.tsv
    • +
    • rna_seq/2021/03/results/Sample_D456_quality_report.log
    • +
    • metabolomics/2019/05/Sample_E567_metabolite_levels.tsv
    • +
    • rna_seq/2019/12/Sample_F678_raw_reads.fastq
    • +
    • rna_seq/2021/03/results/Sample_G789_transcript_counts.tsv
    • +
    • proteomics/2021/02/Sample_H890_protein_quantification.TSV
    • +
    +

    Regular Expressions:

    +
    rna_seq.*\.tsv
    +.*\.csv
    +.*/2021/03/.*\.tsv
    +.*Sample_.*_gene_expression.tsv
    +rna_seq/2021/03/results/Sample_.*_.*\.tsv
    +
    + +
    +
    +
    +
    +

    .*rna_seq.*\.tsv and rna_seq/2021/03/results/Sample_.*_.*\.tsv match the exact same files
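The exercise can be checked from the command line. The sketch below is one possible way to do it, assuming that a pattern must match the whole path (`grep -E -x`) and that matching is case-sensitive; the file list and patterns are copied from the exercise, while `count_matches` is a hypothetical helper name.

```shell
#!/bin/bash
# File paths from the exercise, one per line.
files='rna_seq/2021/03/results/Sample_A123_gene_expression.tsv
proteomics/2020/11/Sample_B234_protein_abundance.tsv
rna_seq/2021/03/results/Sample_C345_normalized_counts.tsv
rna_seq/2021/03/results/Sample_D456_quality_report.log
metabolomics/2019/05/Sample_E567_metabolite_levels.tsv
rna_seq/2019/12/Sample_F678_raw_reads.fastq
rna_seq/2021/03/results/Sample_G789_transcript_counts.tsv
proteomics/2021/02/Sample_H890_protein_quantification.TSV'

# Hypothetical helper: count the paths that a pattern matches in full.
# grep exits with status 1 when nothing matches, hence the "|| true".
count_matches() {
    printf '%s\n' "$files" | grep -E -x -c -- "$1" || true
}

for pattern in \
    'rna_seq.*\.tsv' \
    '.*\.csv' \
    '.*/2021/03/.*\.tsv' \
    '.*Sample_.*_gene_expression.tsv' \
    'rna_seq/2021/03/results/Sample_.*_.*\.tsv'
do
    printf '%s -> %s file(s)\n' "$pattern" "$(count_matches "$pattern")"
done
```

Under these assumptions, the `.TSV` file is never selected because the patterns are case-sensitive, and `.*/2021/03/.*\.tsv` also selects the same three files on this particular list.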

    +
    +
    +
    +
    +
    +
    +
    +
    +
    +

    Wrap up

    -

    In this lesson, we have learned some practical tips and examples about how to organize your data and bring some order to chaos! Complete the practical tutorial on using cookiecutter as a template engine to be able to create your own templates and reuse them as much as you need.

    +

    In this lesson, we have learned some practical tips and examples about how to organize your data and bring some order to chaos! It is now your responsibility to use and implement them in a reasonable way. Complete the practical tutorial on using cookiecutter as a template engine to be able to create your own templates and reuse them as much as you need.

    Sources

    +
    @@ -983,6 +1021,31 @@

    Wrap up

    });
    +
diff --git a/_site/develop/images/longwood_repos.png b/_site/develop/images/longwood_repos.png
new file mode 100644
index 0000000..345fb30
Binary files /dev/null and b/_site/develop/images/longwood_repos.png differ
diff --git a/_site/develop/practical_workshop.html b/_site/develop/practical_workshop.html
index 412f2e1..5d45cf7 100644
--- a/_site/develop/practical_workshop.html
+++ b/_site/develop/practical_workshop.html
@@ -219,7 +219,7 @@

    Practical material

    Modified
    -

    June 4, 2024

    +

    August 20, 2024

    @@ -275,9 +275,8 @@

    Practical material

    Two more tools will be required, choose the one you are familiar with or the first option:

    +
  • Option a) Install Quarto. We recommend Quarto as it is easy to use and provides native support for notebooks (both R Markdown and Jupyter Notebooks). It requires no additional extensions or dependencies.

  • +
  • Option b) Install MkDocs and MkDocs extensions using the command line. Additional extensions are optional but can be useful if you choose this approach.

    pip install mkdocs # create webpages
     pip install mkdocs-material # customize webpages
     pip install mkdocs-video # add videos or embed videos from other sources
    @@ -286,7 +285,8 @@ 

    Practical material

    pip install mkdocs-jupyter # include Jupyter notebooks
     pip install mkdocs-bibtex # add references in your text (`.bib`)
     pip install neoteroi-mkdocs # create author cards
    -pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
    +pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
  • +
    @@ -466,7 +466,8 @@

    Template engine

    Here are some templates that you can use to get started; adapt and modify them to your own needs:

    @@ -522,16 +523,16 @@
    Step 3: Use Cookie
    Step 4: Review the Generated Project

    After the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.

    -
    +
    -Exercise 1: Create your own template
    +Exercise 1: Create your own template.
    -
    +
    @@ -543,26 +544,47 @@
    Step 4
  • Go to our Cookiecutter template and click on the Fork button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization. fork_repo_example

  • Open a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):

    git clone <your URL to the template>
    -

    If you have a GitHub Desktop, click Add and select “Clone repository” from the options

  • -
  • Open the repository and navigate through the different directories

  • -
  • Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones. remove existing one or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory and add the ‘requirements.txt’ file. Consider creating it, along with a subdirectory named ‘reports/figures’.

    +

    If you have GitHub Desktop, click Add and select “Clone repository” from the options.

  • +
  • Open the repository and navigate through the different directories.

  • +
  • Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones, remove existing ones, or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. Our Cookiecutter template is missing both the ‘reports’ directory and the ‘requirements.txt’ file. Consider creating them, along with a subdirectory named ‘reports/figures’.

    ├── results/
     │   ├── figures/
    -├── requirements.txt
    +├── requirements.txt
  • + +
    + +
    +
    +
    +

    Here’s an example of how to do it:

    # Open your terminal and navigate to your template directory. Then: 
     cd \{\{\ cookiecutter.project_name\ \}\}/  
     mkdir reports 
    -touch requirements.txt
    -
  • Commit and push changes when you are done with your modifications

  • +touch requirements.txt
    +
    +
    +
    +
    +
    +
      +
    1. Commit and push changes when you are done with your modifications.
      -
    • Stage the changes with git add
    • -
    • Commit the changes with a meaningful commit message git commit -m "update cookicutter template"
    • -
    • Push the changes to your forked repository on Github git push origin main (or the appropriate branch name)
    • +
    • Stage the changes with git add.
    • +
    • Commit the changes with a meaningful commit message git commit -m "update cookicutter template".
    • +
    • Push the changes to your forked repository on GitHub git push origin main (or the appropriate branch name).
      -
    1. Test your template by using cookiecutter <URL to your GitHub repository "cookicutter-template">

      +
    2. Test your template by using cookiecutter <URL to your GitHub repository "cookicutter-template">.

      Fill in the variables and verify that the new structure (and folders) looks as you would expect. Have any new folders been added, or have some been removed?

    @@ -571,7 +593,7 @@
    Step 4
    -
    +
    @@ -580,7 +602,7 @@
    Step 4
    -
    +
    @@ -624,7 +646,7 @@

    Metadata

    Choose the format that best suits the project’s needs. In this workshop, we will focus on YAML as it is widely used for configuration files (e.g., in conda or pipelines).

    - -
    +
    @@ -691,7 +713,7 @@

    README file

    Choose the format that best suits the project’s needs. In this workshop, we will focus on Markdown as it is the most used format due to its balance of simplicity and expressive formatting options.

    - -
    +
    @@ -737,7 +759,7 @@

    README file

    # OVERVIEW

    Introduction to the project including its aims, and its significance. Describe the main purpose and the biological questions being addressed.

    - -
    +
    @@ -765,7 +787,7 @@

    README file

    # DATASETS

    Describe the data, including its sources, format, and how to access it. If the data has undergone preprocessing, provide a description of the processes applied or the pipeline used.

    - -
    +
    @@ -788,7 +810,7 @@

    README file

    # RESULTS

    Summarize the results and key findings or outputs.

    - -
    +
    @@ -815,7 +837,7 @@

    README file

    -
    +
    @@ -824,7 +846,7 @@

    README file

    -
    +
    @@ -907,7 +929,7 @@

    README file

    3. Naming conventions

    As discussed in lesson 3, consistent naming conventions are key for interpreting, comparing, and reproducing findings in scientific research. Standardized naming helps organize and retrieve data or results, allowing researchers to locate and compare similar types of data within or across large datasets.

    -
    +
    @@ -916,7 +938,7 @@

    3. Naming conventions

    -
    +
    @@ -927,7 +949,7 @@

    3. Naming conventions

    Consider the most common file types you work with, such as visualizations, figures, tables, etc., and create logical and clear file names
    -