diff --git a/_site/develop/02_DMP.html b/_site/develop/02_DMP.html index d1f82f5..5b7667c 100644 --- a/_site/develop/02_DMP.html +++ b/_site/develop/02_DMP.html @@ -258,7 +258,7 @@
May 22, 2024
+July 30, 2024
The process of data management involves implementing tailored best practices for your data, but how do you ensure comprehensive coverage of these decisions and that your data is well managed throughout its life cycle? To achieve this, a Data Management Plan (DMP) is essential.
-A DMP serves as a comprehensive document detailing strategies for handling project data, code, and documentation across its life cycle. It includes plans for data collection, documentation, organization, and preservation.
+DMPs are required for grant applications to ensure that research data is FAIR. A DMP serves as a comprehensive document detailing strategies for handling project data, code, and documentation across its life cycle. It includes plans for data collection, documentation, organization, and preservation.
A DMP serves as the initial step toward achieving FAIR principles in a project.
diff --git a/_site/develop/03_DOD.html b/_site/develop/03_DOD.html index e99ae98..1172978 100644 --- a/_site/develop/03_DOD.html +++ b/_site/develop/03_DOD.html @@ -308,7 +308,7 @@
May 22, 2024
+July 25, 2024
Ensure that the person downloading the files employs checksums or cryptographic hash functions to verify the integrity and ascertain that files are neither corrupted nor tampered with.
+Ensure that the person downloading the files employs checksums (MD5, SHA1, SHA256) or cryptographic hash functions to verify the integrity and ascertain that files are neither corrupted nor tampered with.
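For instance, a checksum can be generated next to the file when it is shared and re-checked after download; a minimal sketch (the file name below is hypothetical):
# Generate a checksum alongside the file you plan to share
md5sum my_data.fastq.gz > my_data.fastq.gz.md5
# On the receiving side, verify that the download is intact
md5sum --check my_data.fastq.gz.md5
# Stronger hash functions work the same way, e.g.:
sha256sum my_data.fastq.gz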
`src`, `source` and `code`: pick one! For good project management practice, version control everything with git and git-annex!
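As a minimal sketch of that setup (paths are hypothetical), large raw files can be annexed while code stays in plain git:
# Keep large raw data under git-annex and code under plain git
git init
git annex init "analysis laptop"           # free-text description of this clone
git annex add data/raw/my_data.fastq.gz    # large file content goes into the annex
git add analysis/my_script.R README.md     # small text files stay in regular git
git commit -m "Add annexed raw data and analysis script"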
If you want to get inspired, here are two other templates, proposed by A. The Turing Way and B. CodeRefinery:
+Project Folder/
+├── docs               <- documentation
+│   ├── codelist.txt
+│   ├── project_plan.txt
+│   ├── ...
+│   └── deliverables.txt
+├── data
+│   ├── raw/
+│   │   └── my_data.csv
+│   └── clean/
+│       └── data_clean.csv
+├── analysis           <- scripts
+│   └── my_script.R
+├── results            <- analysis output
+│   └── figures
+├── .gitignore         <- files excluded from git version control
+├── install.R          <- environment setup
+├── CODE_OF_CONDUCT    <- Code of Conduct for community projects
+├── CONTRIBUTING       <- Contribution guideline for collaborators
+├── LICENSE            <- software license
+├── README.md          <- information about the repo
+└── report.md          <- report of project
+project_name/
+├── README.md # overview of the project
+├── data/ # data files used in the project
+│ ├── README.md # describes where data came from
+│ └── sub-folder/ # may contain subdirectories
+├── processed_data/ # intermediate files from the analysis
+├── manuscript/ # manuscript describing the results
+├── results/ # results of the analysis (data, tables, figures)
+├── src/ # contains all code in the project
+│ ├── LICENSE # license for your code
+│ ├── requirements.txt # software requirements and dependencies
+│ └── ...
+└── doc/               # documentation for your project
+    ├── index.rst
+    └── ...
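Either layout can be bootstrapped in a single command; here is a minimal sketch for the second template, using the folder names shown above:
# Create the skeleton of the Coderefinery-style layout in one go
mkdir -p project_name/{data,processed_data,manuscript,results,src,doc}
touch project_name/README.md project_name/data/README.md \
      project_name/src/requirements.txt project_name/doc/index.rst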
+Health databases are utilized for storing, organizing, and providing access to diverse health-related data, including genomic data, clinical records, imaging data, and more. These resources are regularly updated and released under different versions from various sources. To ensure data reproducibility, it’s crucial to manage and specify the versions and sources of data within these databases.
For example, preprocessing NGS data involves utilizing various genomic resources for tasks like aligning and annotating fastq files. Essential resources include reference genomes in FASTA format (e.g., human and mouse), indexed FASTA files for alignment tools like STAR and Bowtie, and GTF or GFF files for quantifying reads into genomic regions. The latest human reference genome is GRCh38; however, many past studies are based on GRCh37.
How can you keep track of your resources? Name the folder using the version, or use a reference genome manager such as refgenie.
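Both approaches could look like the sketch below; the refgenie commands follow its documented init/pull/seek workflow, but the paths and config file name here are assumptions:
# Option 1: encode the genome build in the folder name (paths are hypothetical)
mkdir -p genomic_resources/homo_sapiens/GRCh38/indexes

# Option 2: let refgenie manage versioned reference assets
pip install refgenie
refgenie init -c genome_config.yaml             # create a local asset configuration
refgenie pull hg38/fasta -c genome_config.yaml  # download a managed reference asset
refgenie seek hg38/fasta -c genome_config.yaml  # print the local path to use in pipelines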
@@ -776,12 +836,12 @@
We recommend the use of md5sum to verify data integrity, especially if you are downloading large datasets. In this example, we use data from the HLA FTP Directory.
+We recommend the use of md5sum, as it is widely used, to verify data integrity, especially if you are downloading large datasets. In this example, we use data from the HLA FTP Directory.
#!/bin/bash
-# Important: go through the README before downloading! Check if a checksums file is included.
-
-# 1. Create or change the directory to the resources dir.
-
-# Check for checksums (e.g.: md5checksum.txt), download, and modify it so that it only contains the checksums of the target files. The file will look like this:
-1a3d12e4e6cc089388d88e3509e41cb3 hla_gen.fasta
-# Finally, save it:
-md5file="md5checksum.txt"
-
-# Define the URL of the files to download
-url="ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_gen.fasta"
-#
-filename=$(basename "$url")
-
-# (Optional) Define a different filename to save the downloaded file (`wget -O $out_filename`)
-# out_filename = "imgt_hla_gen.fasta"
-
-# Download the file
-wget $url && \
-md5sum --status --check $md5file
#!/bin/bash
+# Important: go through the README before downloading! Check if a checksums file is included.
+
+# 1. Create or change to the resources directory.
+
+# 2. Check for a checksums file (e.g. md5checksum.txt), download it, and edit it so that it only
+#    contains the checksums of the target files. The file will look like this:
+#    7348fbef5ab204f3aca67e91f6c59ed2  hla_prot.fasta
+# Finally, save it:
+md5file="md5checksum.txt"
+
+# Define the URL of the file to download
+url="ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_prot.fasta"
+
+# (Optional 1) Save the original file name: filename=$(basename "$url")
+# (Optional 2) Define a different filename to save the downloaded file (`wget -O "$out_filename"`)
+# out_filename="imgt_hla_prot.fasta"
+
+# Download the file and verify its checksum
+wget "$url" && \
+md5sum --status --check "$md5file"
+
+We recommend using the `--status` argument **only** when you incorporate this sanity check into a pipeline, so that only errors are printed (on success it prints nothing).
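For instance, a pipeline step could abort early when verification fails; a minimal sketch reusing the variables defined above:
# Abort the pipeline if the checksum does not match
if ! md5sum --status --check "$md5file"; then
    echo "ERROR: checksum verification failed for files listed in $md5file" >&2
    exit 1
fi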
genomic_resources/
-├── specie1/
-│ └── version/
-│ ├── files.txt
-│ └── indexes/
-└── dw_resources.sh
genomic_resources/
+├── specie1/
+│   └── version/
+│       ├── files.txt
+│       └── indexes/
+└── dw_resources.sh
To learn more about naming conventions for NGS analysis and see additional examples, click here.
+A. data_processing_carlo's.py
+B. raw_sequences_V#20241111.fasta
+C. differential_expression_results_clara.csv
+D. Grant proposal final.doc
+E. sequence_alignment$v1.py
+F. data/gene_annotations_20201107.gff
+G. alpha~1.0/beta~2.0/reg_2024-05-98.tsv
+H. alpha=1.0/beta=2.0/reg_2024-05-98.tsv
+I. run_pipeline:20241203.sh
+A, B, D, E, H, I
+1a. forecast2000122420240724.tsv
+1b. forecast_2000-12-24_2024-07-24.tsv
+1c. forecast_2000_12_24_2024_07_24.tsv
+2a. 01_data_preprocessing.R
+2b. 1_data_preProcessing.R
+2c. 01_d4t4_pr3processing.R
+3a. B1_2024-12-12_cond~pH7_temp~37C.fastq
+3b. B1.20241212.pH7.37C.fastq
+3c. b1_2024-12-12_c0nd~pH7_t3mp~37C.fastq
+1b: easier for human & machine; `_` separates the two dates, and `-` separates the elements within a date (year/month/day). This is important, for example, when using wildcards in Snakemake for building pipelines.
2a: starts with 0 for correct sorting, and is consistent in upper/lower case and in the use of separators (`_` separates metadata).
3a: indicates that the variable temperature is set to 37 Celsius (temperatures can be negative, so `-` is better reserved for separating values within dates).
Regular expressions are an incredibly powerful tool for string manipulation. We recommend checking out RegexOne to learn how to create smart file names that will help you parse them more efficiently. To learn more about naming conventions for NGS analysis and see additional examples, click here.
+(filenames that SHOULD be matched are shown in bold):
+Regular Expressions:
+rna_seq.*\.tsv
+.*\.csv
+.*/2021/03/.*\.tsv
+.*Sample_.*_gene_expression.tsv
+rna_seq/2021/03/results/Sample_.*_.*\.tsv
+`.*rna_seq.*\.tsv` and `rna_seq/2021/03/results/Sample_.*_.*\.tsv` match the exact same files.
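A quick way to experiment with these patterns is to pipe candidate paths through grep -E; the paths below are made up for illustration:
# Test a pattern against candidate file paths
printf '%s\n' \
  "rna_seq/2021/03/results/Sample_A_gene_expression.tsv" \
  "rna_seq/2021/03/results/Sample_B_counts.tsv" \
  "chip_seq/2021/03/results/Sample_A_peaks.csv" |
  grep -E 'rna_seq/2021/03/results/Sample_.*_.*\.tsv'
# -> prints only the two rna_seq .tsv paths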
In this lesson, we have learned some practical tips and examples about how to organize your data and bring some order to chaos! Complete the practical tutorial on using cookiecutter
as a template engine to be able to create your own templates and reuse them as much as you need.
In this lesson, we have learned some practical tips and examples about how to organize your data and bring some order to chaos! It is now your responsibility to use and implement them in a reasonable way. Complete the practical tutorial on using cookiecutter
as a template engine to be able to create your own templates and reuse them as much as you need.
June 4, 2024
+August 20, 2024
You will also need one of the following two tools; choose the one you are familiar with, or go with the first option:
Option a) Install Quarto. We recommend Quarto as it is easy to use and provides native support for notebooks (both R Markdown and Jupyter Notebooks). It requires no additional extensions or dependencies.
Option b) Install MkDocs and MkDocs extensions using the command line. Additional extensions are optional but can be useful if you choose this approach.
pip install mkdocs # create webpages
pip install mkdocs-material # customize webpages
pip install mkdocs-video # add videos or embed videos from other sources
@@ -286,7 +285,8 @@ Practical material
pip install mkdocs-jupyter # include Jupyter notebooks
pip install mkdocs-bibtex # add references in your text (`.bib`)
pip install neoteroi-mkdocs # create author cards
-pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
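If you go with MkDocs, the installed packages are then enabled in the site's mkdocs.yml; a minimal sketch written from the command line (only the theme and the Jupyter plugin shown, keys taken from the package names above):
# Write a minimal mkdocs.yml and preview the site locally
cat > mkdocs.yml <<'EOF'
site_name: My Project Documentation
theme:
  name: material
plugins:
  - search
  - mkdocs-jupyter
EOF
mkdocs serve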
Here are some templates that you can use to get started; adapt and modify them to your own needs:
After the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.
Go to our Cookiecutter template and click on the Fork button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization.
Open a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):
-If you have a GitHub Desktop, click Add and select “Clone repository” from the options
Open the repository and navigate through the different directories
Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones. remove existing one or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory and add the ‘requirements.txt’ file. Consider creating it, along with a subdirectory named ‘reports/figures’.
+If you have GitHub Desktop, click Add and select “Clone repository” from the options.
Open the repository and navigate through the different directories.
+Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones, remove existing ones, or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. Our Cookiecutter template is missing the ‘reports’ directory and the ‘requirements.txt’ file. Consider creating them, along with a subdirectory named ‘reports/figures’.
├── results/
│   └── figures/
├── requirements.txt
git add
git commit -m "update cookiecutter template"
git push origin main (or the appropriate branch name).
Test your template by using cookiecutter <URL to your GitHub repository "cookiecutter-template">.
Fill up the variables and verify that the new structure (and folders) looks like you would expect. Have any new folders been added, or have some been removed?
Choose the format that best suits the project’s needs. In this workshop, we will focus on YAML, as it is widely used for configuration files (e.g., in conda or pipelines).
Choose the format that best suits the project’s needs. In this workshop, we will focus on Markdown, as it is the most widely used format due to its balance of simplicity and expressive formatting options.
# OVERVIEW
Introduction to the project, including its aims and significance. Describe the main purpose and the biological questions being addressed.
# DATASETS
Describe the data, including its sources, format, and how to access it. If the data has undergone preprocessing, provide a description of the processes applied or the pipeline used.
# RESULTS
Summarize the results and key findings or outputs.
As discussed in lesson 3, consistent naming conventions are key for interpreting, comparing, and reproducing findings in scientific research. Standardized naming helps organize and retrieve data or results, allowing researchers to locate and compare similar types of data within or across large datasets.
The next step is to collect all the datasets that you have created in the manner explained above. Since all your folders should contain the metadata.yml file in the same place and with the same metadata, it should be very easy to iterate through all the folders and merge all the metadata.yml files into one single table. The table can be easily viewed in your terminal or even with Microsoft Excel.
If you need more assistance, take a look at the code below (Hint).
If you need some assistance, take a look at the code below (Hint).
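As an illustration only (not the course's own hint), here is one way such a merge could look, assuming every dataset folder sits under datasets/ and its metadata.yml holds simple "key: value" lines:
# Merge flat key/value metadata.yml files from every dataset folder into one TSV table
for f in datasets/*/metadata.yml; do
    awk -v dir="$(dirname "$f")" -F': ' 'NF == 2 {print dir "\t" $1 "\t" $2}' "$f"
done > all_metadata.tsv

# Quick look in the terminal; the TSV also opens fine in Microsoft Excel
column -t -s $'\t' all_metadata.tsv | less -S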
Version controlling your data analysis folders becomes straightforward once you’ve established your Cookiecutter templates. After you’ve created several folder structures and metadata using your Cookiecutter template, you can manage version control by either converting those folders into Git repositories or copying a folder into an existing Git repository. Both approaches are explained in Lesson 5.
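For the first approach, turning a freshly generated folder into its own repository takes only a few commands (the folder name is hypothetical):
# Convert a cookiecutter-generated folder into a git repository
cd my_new_project/
git init
git add .
git commit -m "Initial commit: structure created from cookiecutter template"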
Zenodo is an open-access digital repository that supports the archiving of scientific research outputs, including datasets, papers, software, and multimedia files. Affiliated with CERN and backed by the European Commission, Zenodo promotes transparency, collaboration, and the advancement of knowledge globally. Researchers can easily upload, share, and preserve their data on its user-friendly platform. Each deposit receives a unique DOI for citability and long-term accessibility. Zenodo also offers robust metadata options and allows linking your GitHub account to archive a specific release of your GitHub repository directly to Zenodo. This integration streamlines the process of preserving a snapshot of your project’s progress.
June 4, 2024
+July 25, 2024
The course “Research Data Management (RDM) for biological data” is designed to provide participants with foundational knowledge and practical skills in handling the extensive data generated by modern studies, with a focus on Next Generation Sequencing (NGS) data. It emphasizes the importance of Open Science and FAIR principles in managing data effectively. This course covers essential principles and best practices guidelines in data organization, metadata annotation, version control, and data preservation. These principles are explored from a computational perspective, ensuring participants gain hands-on experience in applying them to real-world scenarios in their research labs. Additionally, the course delves into FAIR principles and Open Science, promoting collaboration and reproducibility in research endeavors. By the course’s conclusion, attendees will possess essential tools and techniques to address the data challenges prevalent in today’s NGS research landscape, as well as in other related fields to health and bioinformatics.
+The course “Research Data Management (RDM) for biological data” is designed to provide participants with foundational knowledge and practical skills in handling the extensive data generated by modern studies. It emphasizes the importance of Open Science and FAIR principles in managing data effectively. This course covers essential principles and best-practice guidelines in data organization, metadata annotation, version control, and data preservation. These principles are explored from a computational perspective, ensuring participants gain hands-on experience in applying them to real-world scenarios in their research labs, thereby helping them in their daily data analysis work. Additionally, the course delves into FAIR principles and Open Science, promoting collaboration and reproducibility in research endeavors. By the course’s conclusion, attendees will possess essential tools and techniques to address the data challenges prevalent in today’s research landscape, with a focus on fields related to omics, health, and bioinformatics.
This course offers participants an in-depth introduction to effectively managing the vast amounts of data generated in modern studies. Throughout the program, emphasis is placed on a practical understanding of RDM principles and the importance of efficient handling of large datasets. In this context, participants will learn the necessity of adopting Open Science and FAIR principles for enhancing data accessibility and reusability. Special attention is given to the development of Data Management Plans (DMPs), with examples tailored to omics data, ensuring compliance with institutional and funding agency requirements while maintaining data integrity.
+Despite DMPs being essential, they are often too general and lack specific guidelines for practical implementation. That is why we have designed this course to cover practical aspects in detail. Participants will acquire practical skills for organizing data, including the creation of folder and file structures, and the implementation of metadata to facilitate data discoverability and interpretation. Attendees will also gain insights into the establishment of simple databases and the use of version control systems to track changes in data analysis, thereby promoting collaboration and reproducibility. The course concludes with a focus on archiving and data repositories, enabling participants to learn strategies for preserving and sharing data for long-term scientific usage. By the end of the course, attendees will be equipped with essential tools and techniques to effectively navigate the challenges prevalent in today’s research landscape. This will not only foster successful data management practices but also enhance collaboration within the scientific community.
This course offers participants an in-depth introduction to effectively managing the vast amounts of data generated in modern studies. Throughout the program, emphasis is placed on a practical understanding of RDM principles and the importance of efficient handling of large datasets. In this context, participants will learn the necessity of adopting Open Science and FAIR principles for enhancing data accessibility and reusability.
-Participants will acquire practical skills for organizing data, including the creation of folder and file structures, and the implementation of metadata to facilitate data discoverability and interpretation. Special attention is given to the development of Data Management Plans (DMPs) with examples tailored to omics data, ensuring compliance with institutional and funding agency requirements while maintaining data integrity. Attendees will also gain insights into the establishment of simple databases and the use of version control systems to track changes in data analysis, thereby promoting collaboration and reproducibility.
-The course concludes with a focus on archiving and data repositories, enabling participants to learn strategies for preserving and sharing data for long-term scientific usage. By the end of the course, attendees will be equipped with essential tools and techniques to effectively navigate the challenges prevalent in today’s research landscape. This will not only foster successful data management practices but also enhance collaboration within the scientific community.