diff --git a/_site/develop/02_DMP.html b/_site/develop/02_DMP.html index d1f82f5..5b7667c 100644 --- a/_site/develop/02_DMP.html +++ b/_site/develop/02_DMP.html @@ -258,7 +258,7 @@
May 22, 2024
+July 30, 2024
The process of data management involves implementing tailored best practices for your data, but how do you ensure comprehensive coverage of these decisions and that your data is well managed throughout its life cycle? To achieve this, a Data Management Plan (DMP) is essential.
-A DMP serves as a comprehensive document detailing strategies for handling project data, code, and documentation across its life cycle. It includes plans for data collection, documentation, organization, and preservation.
+DMPs are required for grant applications to ensure that research data is FAIR. A DMP serves as a comprehensive document detailing strategies for handling project data, code, and documentation across its life cycle. It includes plans for data collection, documentation, organization, and preservation.
A DMP serves as the initial step toward achieving FAIR principles in a project.
diff --git a/_site/develop/03_DOD.html b/_site/develop/03_DOD.html index e99ae98..1172978 100644 --- a/_site/develop/03_DOD.html +++ b/_site/develop/03_DOD.html @@ -308,7 +308,7 @@
May 22, 2024
+July 25, 2024
Ensure that the person downloading the files employs checksums or cryptographic hash functions to verify the integrity and ascertain that files are neither corrupted nor tampered with.
+Ensure that the person downloading the files employs checksums (MD5, SHA1, SHA256) or cryptographic hash functions to verify the integrity and ascertain that files are neither corrupted nor tampered with.
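For instance, a checksum can be generated next to the file when it is shared and re-checked after download; a minimal sketch (the file name below is hypothetical):
# Generate a checksum alongside the file you plan to share
md5sum my_data.fastq.gz > my_data.fastq.gz.md5
# On the receiving side, verify that the download is intact
md5sum --check my_data.fastq.gz.md5
# Stronger hash functions work the same way, e.g.:
sha256sum my_data.fastq.gz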
`src`, `source` and `code`: pick one! For good project management practice, version control everything with git and git-annex!
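As a minimal sketch of that setup (paths are hypothetical), large raw files can be annexed while code stays in plain git:
# Keep large raw data under git-annex and code under plain git
git init
git annex init "analysis laptop"           # free-text description of this clone
git annex add data/raw/my_data.fastq.gz    # large file content goes into the annex
git add analysis/my_script.R README.md     # small text files stay in regular git
git commit -m "Add annexed raw data and analysis script"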
If you want to get inspired, here are two other templates, proposed by A. The Turing Way and B. CodeRefinery:
+Project Folder/
+├── docs               <- documentation
+│   ├── codelist.txt
+│   ├── project_plan.txt
+│   ├── ...
+│   └── deliverables.txt
+├── data
+│   ├── raw/
+│   │   └── my_data.csv
+│   └── clean/
+│       └── data_clean.csv
+├── analysis           <- scripts
+│   └── my_script.R
+├── results            <- analysis output
+│   └── figures
+├── .gitignore         <- files excluded from git version control
+├── install.R          <- environment setup
+├── CODE_OF_CONDUCT    <- Code of Conduct for community projects
+├── CONTRIBUTING       <- Contribution guideline for collaborators
+├── LICENSE            <- software license
+├── README.md          <- information about the repo
+└── report.md          <- report of project
+project_name/
+├── README.md # overview of the project
+├── data/ # data files used in the project
+│ ├── README.md # describes where data came from
+│ └── sub-folder/ # may contain subdirectories
+├── processed_data/ # intermediate files from the analysis
+├── manuscript/ # manuscript describing the results
+├── results/ # results of the analysis (data, tables, figures)
+├── src/ # contains all code in the project
+│ ├── LICENSE # license for your code
+│ ├── requirements.txt # software requirements and dependencies
+│ └── ...
+└── doc/               # documentation for your project
+    ├── index.rst
+    └── ...
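Either layout can be bootstrapped in a single command; here is a minimal sketch for the second template, using the folder names shown above:
# Create the skeleton of the Coderefinery-style layout in one go
mkdir -p project_name/{data,processed_data,manuscript,results,src,doc}
touch project_name/README.md project_name/data/README.md \
      project_name/src/requirements.txt project_name/doc/index.rst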
+Health databases are utilized for storing, organizing, and providing access to diverse health-related data, including genomic data, clinical records, imaging data, and more. These resources are regularly updated and released under different versions from various sources. To ensure data reproducibility, it’s crucial to manage and specify the versions and sources of data within these databases.
For example, preprocessing NGS data involves utilizing various genomic resources for tasks like aligning and annotating fastq files. Essential resources include reference genomes in FASTA format (e.g., human and mouse), indexed FASTA files for alignment tools like STAR and Bowtie, and GTF or GFF files for quantifying reads into genomic regions. The latest human reference genome is GRCh38; however, many past studies are based on GRCh37.
How can you keep track of your resources? Name the folder using the version, or use a reference genome manager such as refgenie.
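Both approaches could look like the sketch below; the refgenie commands follow its documented init/pull/seek workflow, but the paths and config file name here are assumptions:
# Option 1: encode the genome build in the folder name (paths are hypothetical)
mkdir -p genomic_resources/homo_sapiens/GRCh38/indexes

# Option 2: let refgenie manage versioned reference assets
pip install refgenie
refgenie init -c genome_config.yaml             # create a local asset configuration
refgenie pull hg38/fasta -c genome_config.yaml  # download a managed reference asset
refgenie seek hg38/fasta -c genome_config.yaml  # print the local path to use in pipelines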
@@ -776,12 +836,12 @@
We recommend the use of md5sum to verify data integrity, especially if you are downloading large datasets. In this example, we use data from the HLA FTP Directory.
+We recommend the use of md5sum, as it is widely used, to verify data integrity, especially if you are downloading large datasets. In this example, we use data from the HLA FTP Directory.
#!/bin/bash
-# Important: go through the README before downloading! Check if a checksums file is included.
-
-# 1. Create or change the directory to the resources dir.
-
-# Check for checksums (e.g.: md5checksum.txt), download, and modify it so that it only contains the checksums of the target files. The file will look like this:
-1a3d12e4e6cc089388d88e3509e41cb3 hla_gen.fasta
-# Finally, save it:
-md5file="md5checksum.txt"
-
-# Define the URL of the files to download
-url="ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_gen.fasta"
-#
-filename=$(basename "$url")
-
-# (Optional) Define a different filename to save the downloaded file (`wget -O $out_filename`)
-# out_filename = "imgt_hla_gen.fasta"
-
-# Download the file
-wget $url && \
-md5sum --status --check $md5file
#!/bin/bash
+# Important: go through the README before downloading! Check if a checksums file is included.
+
+# 1. Create or change to the resources directory.
+
+# 2. Check for a checksums file (e.g. md5checksum.txt), download it, and edit it so that it only
+#    contains the checksums of the target files. The file will look like this:
+#    7348fbef5ab204f3aca67e91f6c59ed2  hla_prot.fasta
+# Finally, save it:
+md5file="md5checksum.txt"
+
+# Define the URL of the file to download
+url="ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_prot.fasta"
+
+# (Optional 1) Save the original file name: filename=$(basename "$url")
+# (Optional 2) Define a different filename to save the downloaded file (`wget -O "$out_filename"`)
+# out_filename="imgt_hla_prot.fasta"
+
+# Download the file and verify its checksum
+wget "$url" && \
+md5sum --status --check "$md5file"
+
+We recommend using the `--status` argument **only** when you incorporate this sanity check into a pipeline, so that only errors are printed (on success it prints nothing).
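For instance, a pipeline step could abort early when verification fails; a minimal sketch reusing the variables defined above:
# Abort the pipeline if the checksum does not match
if ! md5sum --status --check "$md5file"; then
    echo "ERROR: checksum verification failed for files listed in $md5file" >&2
    exit 1
fi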
genomic_resources/
-├── specie1/
-│ └── version/
-│ ├── files.txt
-│ └── indexes/
-└── dw_resources.sh
genomic_resources/
+├── specie1/
+│   └── version/
+│       ├── files.txt
+│       └── indexes/
+└── dw_resources.sh
To learn more about naming conventions for NGS analysis and see additional examples, click here.
+A. data_processing_carlo's.py
+B. raw_sequences_V#20241111.fasta
+C. differential_expression_results_clara.csv
+D. Grant proposal final.doc
+E. sequence_alignment$v1.py
+F. data/gene_annotations_20201107.gff
+G. alpha~1.0/beta~2.0/reg_2024-05-98.tsv
+H. alpha=1.0/beta=2.0/reg_2024-05-98.tsv
+I. run_pipeline:20241203.sh
+A, B, D, E, H, I
+1a. forecast2000122420240724.tsv
+1b. forecast_2000-12-24_2024-07-24.tsv
+1c. forecast_2000_12_24_2024_07_24.tsv
+2a. 01_data_preprocessing.R
+2b. 1_data_preProcessing.R
+2c. 01_d4t4_pr3processing.R
+3a. B1_2024-12-12_cond~pH7_temp~37C.fastq
+3b. B1.20241212.pH7.37C.fastq
+3c. b1_2024-12-12_c0nd~pH7_t3mp~37C.fastq
+1b: easier for human & machine; `_` separates the two dates, and `-` separates the elements within a date (year/month/day). This is important, for example, when using wildcards in Snakemake for building pipelines.
2a: starts with 0 for correct sorting, and is consistent in upper/lower case and in the use of separators (`_` separates metadata).
3a: indicates that the variable temperature is set to 37 Celsius (temperatures can be negative, so `-` is better reserved for separating values within dates).
Regular expressions are an incredibly powerful tool for string manipulation. We recommend checking out RegexOne to learn how to create smart file names that will help you parse them more efficiently. To learn more about naming conventions for NGS analysis and see additional examples, click here.
+(filenames that SHOULD be matched are shown in bold):
+Regular Expressions:
+rna_seq.*\.tsv
+.*\.csv
+.*/2021/03/.*\.tsv
+.*Sample_.*_gene_expression.tsv
+rna_seq/2021/03/results/Sample_.*_.*\.tsv
+`.*rna_seq.*\.tsv` and `rna_seq/2021/03/results/Sample_.*_.*\.tsv` match the exact same files.
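A quick way to experiment with these patterns is to pipe candidate paths through grep -E; the paths below are made up for illustration:
# Test a pattern against candidate file paths
printf '%s\n' \
  "rna_seq/2021/03/results/Sample_A_gene_expression.tsv" \
  "rna_seq/2021/03/results/Sample_B_counts.tsv" \
  "chip_seq/2021/03/results/Sample_A_peaks.csv" |
  grep -E 'rna_seq/2021/03/results/Sample_.*_.*\.tsv'
# -> prints only the two rna_seq .tsv paths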
In this lesson, we have learned some practical tips and examples about how to organize your data and bring some order to chaos! Complete the practical tutorial on using cookiecutter
as a template engine to be able to create your own templates and reuse them as much as you need.
In this lesson, we have learned some practical tips and examples about how to organize your data and bring some order to chaos! It is now your responsibility to use and implement them in a reasonable way. Complete the practical tutorial on using cookiecutter
as a template engine to be able to create your own templates and reuse them as much as you need.
June 4, 2024
+August 20, 2024
You will also need one of the following two tools; choose the one you are familiar with, or go with the first option:
Option a) Install Quarto. We recommend Quarto as it is easy to use and provides native support for notebooks (both R Markdown and Jupyter Notebooks). It requires no additional extensions or dependencies.
Option b) Install MkDocs and MkDocs extensions using the command line. Additional extensions are optional but can be useful if you choose this approach.
pip install mkdocs # create webpages
pip install mkdocs-material # customize webpages
pip install mkdocs-video # add videos or embed videos from other sources
@@ -286,7 +285,8 @@ Practical material
pip install mkdocs-jupyter # include Jupyter notebooks
pip install mkdocs-bibtex # add references in your text (`.bib`)
pip install neoteroi-mkdocs # create author cards
-pip install mkdocs-table-reader-plugin # embed tabular format files (`.tsv`)
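If you go with MkDocs, the installed packages are then enabled in the site's mkdocs.yml; a minimal sketch written from the command line (only the theme and the Jupyter plugin shown, keys taken from the package names above):
# Write a minimal mkdocs.yml and preview the site locally
cat > mkdocs.yml <<'EOF'
site_name: My Project Documentation
theme:
  name: material
plugins:
  - search
  - mkdocs-jupyter
EOF
mkdocs serve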
Here are some templates that you can use to get started; adapt and modify them to your own needs:
After the generation process is complete, navigate to the directory where Cookiecutter created the new project. You will find a project structure with the placeholders replaced by the values you provided.
Go to our Cookiecutter template and click on the Fork button at the top-right corner of the repository page to create a copy of the repository on your own GitHub account or organization.
Open a terminal on your computer, copy the URL of your fork and clone the repository to your local machine (the URL should look something like https://github.com/your_username/cookiecutter-template):
-If you have a GitHub Desktop, click Add and select “Clone repository” from the options
Open the repository and navigate through the different directories
Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones. remove existing one or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. For instance, this template is missing the ‘reports’ directory and add the ‘requirements.txt’ file. Consider creating it, along with a subdirectory named ‘reports/figures’.
+If you have GitHub Desktop, click Add and select “Clone repository” from the options.
Open the repository and navigate through the different directories.
+Modify the contents of the repository as needed to fit your project’s requirements. You can change files, add new ones, remove existing ones, or adjust the folder structure. For inspiration, review the data structure above under ‘Project folder’. Our Cookiecutter template is missing the ‘reports’ directory and the ‘requirements.txt’ file. Consider creating them, along with a subdirectory named ‘reports/figures’.
├── results/
│   └── figures/
├── requirements.txt
git add
git commit -m "update cookiecutter template"
git push origin main (or the appropriate branch name).
Test your template by using cookiecutter <URL to your GitHub repository "cookiecutter-template">.
Fill up the variables and verify that the new structure (and folders) looks like you would expect. Have any new folders been added, or have some been removed?
Choose the format that best suits the project’s needs. In this workshop, we will focus on YAML, as it is widely used for configuration files (e.g., in conda or pipelines).
Choose the format that best suits the project’s needs. In this workshop, we will focus on Markdown, as it is the most widely used format due to its balance of simplicity and expressive formatting options.
# OVERVIEW
Introduction to the project, including its aims and significance. Describe the main purpose and the biological questions being addressed.
# DATASETS
Describe the data, including its sources, format, and how to access it. If the data has undergone preprocessing, provide a description of the processes applied or the pipeline used.
# RESULTS
Summarize the results and key findings or outputs.
As discussed in lesson 3, consistent naming conventions are key for interpreting, comparing, and reproducing findings in scientific research. Standardized naming helps organize and retrieve data or results, allowing researchers to locate and compare similar types of data within or across large datasets.
The next step is to collect all the datasets that you have created in the manner explained above. Since all your folders should contain the metadata.yml file in the same place and with the same metadata, it should be very easy to iterate through all the folders and merge all the metadata.yml files into one single table. The table can be easily viewed in your terminal or even with Microsoft Excel.
If you need more assistance, take a look at the code below (Hint).
If you need some assistance, take a look at the code below (Hint).
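As an illustration only (not the course's own hint), here is one way such a merge could look, assuming every dataset folder sits under datasets/ and its metadata.yml holds simple "key: value" lines:
# Merge flat key/value metadata.yml files from every dataset folder into one TSV table
for f in datasets/*/metadata.yml; do
    awk -v dir="$(dirname "$f")" -F': ' 'NF == 2 {print dir "\t" $1 "\t" $2}' "$f"
done > all_metadata.tsv

# Quick look in the terminal; the TSV also opens fine in Microsoft Excel
column -t -s $'\t' all_metadata.tsv | less -S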
Version controlling your data analysis folders becomes straightforward once you’ve established your Cookiecutter templates. After you’ve created several folder structures and metadata using your Cookiecutter template, you can manage version control by either converting those folders into Git repositories or copying a folder into an existing Git repository. Both approaches are explained in Lesson 5.
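For the first approach, turning a freshly generated folder into its own repository takes only a few commands (the folder name is hypothetical):
# Convert a cookiecutter-generated folder into a git repository
cd my_new_project/
git init
git add .
git commit -m "Initial commit: structure created from cookiecutter template"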
Zenodo is an open-access digital repository that supports the archiving of scientific research outputs, including datasets, papers, software, and multimedia files. Affiliated with CERN and backed by the European Commission, Zenodo promotes transparency, collaboration, and the advancement of knowledge globally. Researchers can easily upload, share, and preserve their data on its user-friendly platform. Each deposit receives a unique DOI for citability and long-term accessibility. Zenodo also offers robust metadata options and allows linking your GitHub account to archive a specific release of your GitHub repository directly to Zenodo. This integration streamlines the process of preserving a snapshot of your project’s progress.
June 4, 2024
+July 25, 2024
The course “Research Data Management (RDM) for biological data” is designed to provide participants with foundational knowledge and practical skills in handling the extensive data generated by modern studies, with a focus on Next Generation Sequencing (NGS) data. It emphasizes the importance of Open Science and FAIR principles in managing data effectively. This course covers essential principles and best practices guidelines in data organization, metadata annotation, version control, and data preservation. These principles are explored from a computational perspective, ensuring participants gain hands-on experience in applying them to real-world scenarios in their research labs. Additionally, the course delves into FAIR principles and Open Science, promoting collaboration and reproducibility in research endeavors. By the course’s conclusion, attendees will possess essential tools and techniques to address the data challenges prevalent in today’s NGS research landscape, as well as in other related fields to health and bioinformatics.
+The course “Research Data Management (RDM) for biological data” is designed to provide participants with foundational knowledge and practical skills in handling the extensive data generated by modern studies. It emphasizes the importance of Open Science and FAIR principles in managing data effectively. This course covers essential principles and best-practice guidelines in data organization, metadata annotation, version control, and data preservation. These principles are explored from a computational perspective, ensuring participants gain hands-on experience in applying them to real-world scenarios in their research labs, thereby helping them in their daily data analysis work. Additionally, the course delves into FAIR principles and Open Science, promoting collaboration and reproducibility in research endeavors. By the course’s conclusion, attendees will possess essential tools and techniques to address the data challenges prevalent in today’s research landscape, with a focus on fields related to omics, health, and bioinformatics.
This course offers participants an in-depth introduction to effectively managing the vast amounts of data generated in modern studies. Throughout the program, emphasis is placed on a practical understanding of RDM principles and the importance of efficient handling of large datasets. In this context, participants will learn the necessity of adopting Open Science and FAIR principles for enhancing data accessibility and reusability. Special attention is given to the development of Data Management Plans (DMPs), with examples tailored to omics data, ensuring compliance with institutional and funding agency requirements while maintaining data integrity.
+Despite DMPs being essential, they are often too general and lack specific guidelines for practical implementation. That is why we have designed this course to cover practical aspects in detail. Participants will acquire practical skills for organizing data, including the creation of folder and file structures, and the implementation of metadata to facilitate data discoverability and interpretation. Attendees will also gain insights into the establishment of simple databases and the use of version control systems to track changes in data analysis, thereby promoting collaboration and reproducibility. The course concludes with a focus on archiving and data repositories, enabling participants to learn strategies for preserving and sharing data for long-term scientific usage. By the end of the course, attendees will be equipped with essential tools and techniques to effectively navigate the challenges prevalent in today’s research landscape. This will not only foster successful data management practices but also enhance collaboration within the scientific community.
This course offers participants an in-depth introduction to effectively managing the vast amounts of data generated in modern studies. Throughout the program, emphasis is placed on a practical understanding of RDM principles and the importance of efficient handling of large datasets. In this context, participants will learn the necessity of adopting Open Science and FAIR principles for enhancing data accessibility and reusability.
-Participants will acquire practical skills for organizing data, including the creation of folder and file structures, and the implementation of metadata to facilitate data discoverability and interpretation. Special attention is given to the development of Data Management Plans (DMPs) with examples tailored to omics data, ensuring compliance with institutional and funding agency requirements while maintaining data integrity. Attendees will also gain insights into the establishment of simple databases and the use of version control systems to track changes in data analysis, thereby promoting collaboration and reproducibility.
-The course concludes with a focus on archiving and data repositories, enabling participants to learn strategies for preserving and sharing data for long-term scientific usage. By the end of the course, attendees will be equipped with essential tools and techniques to effectively navigate the challenges prevalent in today’s research landscape. This will not only foster successful data management practices but also enhance collaboration within the scientific community.