Add notebook Merge data and harmonized annotations (icbi-lab#19)
* Add notebook Merge data and harmonized annotations

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add options for gencode and Ensembl gtf file conversion

* Update gtf to table script

* Include gtf download

* Move gtf download to before you begin section

* Add metadata check

* Wrap up

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use python type hints in helper functions

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update ensembl id function

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update functions in src

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove gtf support for ancient versions

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make patient ids unique

* Fix ruff in check_metadata

* Update 01_merge.py

Change custom admonitions

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update before-you-begin.md

* Update references.bib

* Add validate_obs fun

* Fix docstrings

* Update _check_genes.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update _check_genes.py

* Fix docstrings

* Update 01_merge.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update 01_merge.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update _check_genes.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Gregor Sturm <[email protected]>
Co-authored-by: Valentin Marteau <[email protected]>
Co-authored-by: Valentin Marteau <[email protected]>
5 people authored Aug 24, 2023
1 parent ada5984 commit 270b58d
Showing 11 changed files with 1,347 additions and 12 deletions.
38 changes: 38 additions & 0 deletions bin/gtf_to_table.sh
@@ -0,0 +1,38 @@
#!/bin/bash
# Modified from https://www.biostars.org/p/140471/#140798

# Check for correct number of arguments
if [ "$#" -ne 3 ]; then
    echo "Usage: $0 <input_file> <output_file> <gtf_source>"
    exit 1
fi

# Input and output file names
file="$1"
out="$2"
gtf_source="$3"

# Perform the zcat, awk, and sed commands based on annotation gtf source, and save the output to the specified output file
if [ "$gtf_source" == "gencode" ]; then
    zcat "$file" | \
    awk 'BEGIN{FS="\t"; OFS=","}{split($9,a,";"); if($3~"gene") print a[1], gensub(/^ +| +$/,"", "g", a[3]), $1, $4, $5, gensub(/^ +| +$/,"", "g", a[2]), $7, $5-$4;}' | \
    sed 's/gene_id "//' | \
    sed 's/gene_id "//' | \
    sed 's/gene_type "//'| \
    sed 's/gene_name "//' | \
    sed 's/"//g' | \
    sed "1i\Geneid,GeneSymbol,Chromosome,Start,End,Class,Strand,Length" > "$out"
elif [ "$gtf_source" == "ensembl" ]; then
    zcat "$file" | \
    grep -v '^#' | \
    awk '$3 == "gene"' | \
    cut -f 1,4-5,7,9 | \
    awk -F'\t' -v OFS='\t' '{ split($NF, a, /[;\"]+/); $NF=""; for (i=2; i<=length(a); i+=2) $(NF+1) = a[i]; if ($6 == "havana" || $6 == "ensembl") $6 = ""; $6 = $6 "." $7; $7 = ""; } 1' | \
    awk -F'\t' -v OFS='\t' '{ print }' | \
    cut -f 1-4,6,8,9,10 | \
    sed 's/\t/,/g' | \
    awk -F',' -v OFS=',' 'BEGIN{print "gene_id","gene_name","chromosome","start","end","gene_biotype","gene_source","strand","length"} { gsub("havana", "", $6); gsub("ensembl", "", $6); print $5, $6, $1, $2, $3, $8, $7, $4, $3-$2+1}' > "$out"
else
    echo "Invalid gtf source. Please specify 'gencode' or 'ensembl'."
    exit 1
fi
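
The gencode branch of the script picks attributes out of column 9 by position (`a[1]`, `a[2]`, `a[3]`) and uses `gensub`, which is a gawk extension. As a minimal sketch of the same extraction idea, and not the script's actual method, the snippet below looks attributes up by key using only POSIX awk functions. The function name `gtf_gene_fields` and the toy GTF record (IDs and coordinates) are made up for illustration.

```bash
# Toy GTF gene record (made-up IDs and coordinates, tab-separated)
toy_gtf_line="$(printf 'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG_TOY_1"; gene_type "protein_coding"; gene_name "GENEA";')"

# Hypothetical helper: print gene_id,gene_name,chromosome,start,end
# for every "gene" record read from stdin.
gtf_gene_fields() {
    awk 'BEGIN { FS = "\t"; OFS = "," }
         $3 == "gene" {
             id = ""; name = ""
             n = split($9, attrs, ";")             # one element per attribute
             for (i = 1; i <= n; i++) {
                 attr = attrs[i]
                 gsub(/^ +| +$/, "", attr)         # trim surrounding spaces
                 if (attr ~ /^gene_id /)   { id = attr;   sub(/^gene_id /, "", id) }
                 if (attr ~ /^gene_name /) { name = attr; sub(/^gene_name /, "", name) }
             }
             gsub(/"/, "", id); gsub(/"/, "", name) # strip the quotes
             print id, name, $1, $4, $5
         }'
}

printf '%s\n' "$toy_gtf_line" | gtf_gene_fields
```

Looking attributes up by key means the output does not depend on the order in which a given GTF release lists them, at the cost of a slightly longer awk program.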
1 change: 1 addition & 0 deletions docs/api.md
@@ -22,6 +22,7 @@
:toctree: generated
pp.is_outlier
pp.validate_obs
```

## Tools
14 changes: 11 additions & 3 deletions docs/before-you-begin.md
@@ -56,6 +56,7 @@ Raw data preprocessing is beyond the scope of this tutorial. Please refer to the

- The [nf-core/scrnaseq](https://nf-co.re/scrnaseq) workflow for single-cell RNA-seq preprocessing
- The [Raw data processing](https://www.sc-best-practices.org/introduction/raw_data_processing.html) chapter of the single-cell best practice book {cite}`heumosBestPracticesSinglecell2023`.
- A useful resource for browsing available datasets: [Single cell studies database](https://docs.google.com/spreadsheets/d/1En7-UV0k0laDiIfjFkdn7dggyR7jIk3WH8QgXaMOZF0/edit#gid=0) {cite}`svensson2020curated`

:::

@@ -78,10 +79,17 @@ For this protocol, we provide both the TPM matrix and the clinical annotation ta
curl TODO
```

## Obtain reference genome GTF file
## Obtain reference genome GTF files

For remapping gene symbols, you need a GTF file with genome annotations.
To facilitate integration of the four datasets, it is important to standardize the provided gene IDs. In this tutorial, we will download the GTF files from [GENCODE](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human) or [Ensembl](https://ftp.ensembl.org/pub/) that were originally used to annotate the genes in each dataset, which lets us remap the provided gene symbols. This remapping resolves ambiguity in gene symbols and ensures that only counts mapped to the same genomic location are merged, with unique Ensembl IDs as identifiers.

```bash
curl TODO
cd ./tables
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.primary_assembly.annotation.gtf.gz
../bin/gtf_to_table.sh gencode.v32.primary_assembly.annotation.gtf.gz gencode.v32_gene_annotation_table.csv gencode
rm gencode.v32.primary_assembly.annotation.gtf.gz

wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
../bin/gtf_to_table.sh Homo_sapiens.GRCh38.109.gtf.gz Homo_sapiens.GRCh38.109_gene_annotation_table.csv ensembl
rm Homo_sapiens.GRCh38.109.gtf.gz
```
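
The ambiguity in gene symbols mentioned above can be inspected directly in a generated table. The following is a minimal sketch, assuming the gencode column layout written by `bin/gtf_to_table.sh` (`Geneid` in column 1, `GeneSymbol` in column 2). The helper name `ambiguous_symbols` is hypothetical, and the toy table contains made-up IDs rather than real annotations.

```bash
# Hypothetical helper: list gene symbols that map to more than one gene ID.
# Assumes a CSV with a header row, gene ID in column 1, symbol in column 2.
ambiguous_symbols() {
    # Skip the header, count each distinct (symbol, ID) pair once per symbol,
    # then print the symbols that have more than one distinct ID.
    awk -F',' 'NR > 1 && !seen[$2 FS $1]++ { n[$2]++ }
               END { for (s in n) if (n[s] > 1) print s }' "$1" | sort
}

# Toy table for illustration only (made-up IDs, not real annotations)
toy_table="$(mktemp)"
cat > "$toy_table" <<'EOF'
Geneid,GeneSymbol,Chromosome,Start,End,Class,Strand,Length
ENSG_TOY_1,GENEA,chr1,100,200,protein_coding,+,100
ENSG_TOY_2,GENEA,chrX,300,450,protein_coding,-,150
ENSG_TOY_3,GENEB,chr2,10,60,lncRNA,+,50
EOF

ambiguous_symbols "$toy_table"
```

On a real table the same call would take the CSV path produced in the commands above, for example `ambiguous_symbols gencode.v32_gene_annotation_table.csv`.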