changes to chapter 3. Raw data processing #327

Open
wants to merge 26 commits into
base: main
Commits
d9b409d
clean sentences until summary of 3.1(toggle)
seohyonkim Jan 25, 2025
8897042
change until 3.2.1 and in glossary
seohyonkim Jan 27, 2025
b9d813d
add newlines
seohyonkim Jan 28, 2025
00a8b55
clean up to cell barcode correction
seohyonkim Jan 31, 2025
2ea159a
clean the rest of the chapter
seohyonkim Jan 31, 2025
6055e0b
reviewing until 3.2.2.1. Mapping to the full genome (exclusive)
Feb 1, 2025
2968e03
reviewing until 3.3.2. Future challenges (inclusive)
Feb 1, 2025
e19bbbd
reviewing until 3.8. A real-world example (exclusive)
Feb 2, 2025
a417280
reviewing done
Feb 3, 2025
917e40f
final touch, ready for PR
seohyonkim Feb 3, 2025
d354d30
Merge branch 'main' into feature/raw-data-processing
seohyonkim Feb 6, 2025
e9acdfc
Update jupyter-book/introduction/raw_data_processing.md
seohyonkim Feb 6, 2025
03aa631
Merge branch 'feature/raw-data-processing' of github.com:theislab/sin…
seohyonkim Feb 6, 2025
715cf55
Revert "Update jupyter-book/introduction/raw_data_processing.md"
seohyonkim Feb 6, 2025
a01a046
Update jupyter-book/glossary.md
seohyonkim Feb 7, 2025
cb85bb0
fix a bit of glossary and chapter 3
seohyonkim Feb 7, 2025
aa2d3dd
Update jupyter-book/glossary.md
seohyonkim Feb 7, 2025
2082448
Update jupyter-book/introduction/raw_data_processing.md
seohyonkim Feb 7, 2025
1f7af20
Update jupyter-book/introduction/raw_data_processing.md
seohyonkim Feb 7, 2025
e0314d3
Update jupyter-book/introduction/raw_data_processing.md
seohyonkim Feb 7, 2025
c85dd4c
Update jupyter-book/introduction/raw_data_processing.md
seohyonkim Feb 7, 2025
275228c
Update jupyter-book/introduction/raw_data_processing.md
seohyonkim Feb 7, 2025
7f5886f
Update jupyter-book/introduction/raw_data_processing.md
seohyonkim Feb 7, 2025
823bc01
Update jupyter-book/introduction/raw_data_processing.md
seohyonkim Feb 7, 2025
81a55f2
Merge branch 'feature/raw-data-processing' of github.com:theislab/sin…
seohyonkim Feb 7, 2025
e4e41a3
changes from feedbacks
seohyonkim Feb 7, 2025
137 changes: 108 additions & 29 deletions jupyter-book/glossary.md
@@ -1,41 +1,50 @@
# Glossary

```{glossary}
Adapter sequences
adapter sequences
Short, synthetic DNA or RNA sequences that are ligated to the ends of DNA or RNA fragments during library preparation for sequencing.
These adapters are essential for binding the fragments to the flowcell and enabling amplification and sequencing.
However, if adapters are not properly removed or trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses.

Algorithm
Algorithms
A pre-defined set of instructions to solve a problem.

AnnData
AnnDatas
A Python package for annotated data matrices. The primary data structure used in the scverse ecosystem.
A Python package for handling annotated data matrices, commonly used in single-cell and other omics analyses.
It provides an efficient way to store data as a matrix where rows (observations) and columns (features) can have associated metadata.
[AnnData](https://anndata.readthedocs.io/en/latest/index.html) supports slicing, subsetting, and saving to disk in formats like H5AD and Zarr.

Barcode
Barcodes
Bar code
Bar codes
Short DNA barcode fragments ("tags") that are used to identify reads originating from the same cell. Reads are later grouped by their barcode during raw data processing steps.
Short DNA barcode fragments ("tags") that are used to identify reads originating from the same cell.
Reads are later grouped by their barcode during raw data processing steps.

Batch effect
Technical confounding factors in an experiment that cause dataset distribution shifts. Batch effects usually lead to inaccurate conclusions if their causes are correlated with outcomes of interest in an experiment, so they should be accounted for (usually removed).
Technical confounding factors in an experiment that cause dataset distribution shifts.
Batch effects usually lead to inaccurate conclusions if their causes are correlated with outcomes of interest in an experiment, so they should be accounted for (usually removed).

Benchmark
An (independent) comparison of performance of several tools with respect to pre-defined metrics.

Bulk RNA sequencing
Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells. Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze.
Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells.
Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze.

Cell
The fundamental unit of life. Consists of cytoplasm enclosed within a membrane that contains many biomolecules such as proteins and nucleic acids. Cells acquire specific functions, transition to cell types, divide, communicate and keep the organism going. Learning about the structure, activity and communication of cells helps deciphering biology.

The fundamental unit of life, consisting of cytoplasm enclosed within a membrane, containing biomolecules such as proteins and nucleic acids.
Cells acquire specific functions, transition into different types, divide, and communicate to sustain an organism.
Studying cell structure, activity, and interactions enables insights into gene expression dynamics, cellular trajectories, developmental lineages, and disease mechanisms.
Cell barcode
See {term}`barcode`

Cluster
Clusters
A group of a population or data points that share similarities. In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`).

Cell type annotation
The process of labeling groups of {term}`clusters` of cells by {term}`cell type`. Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference.
The process of labeling groups of {term}`clusters` of cells by {term}`cell type`.
Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference.

Cell type
Cells that share common morphological or phenotypic features.
@@ -46,27 +55,64 @@ Cell state
Chromatin
The complex of DNA and proteins efficiently packaging the DNA inside the nucleus and involved in regulating gene expression.

Cluster
Clusters
A group of a population or data points that share similarities.
In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`).

Complementary DNA (cDNA)
cDNA
DNA synthesized from an RNA template by the enzyme reverse transcriptase.
cDNA is commonly used in RNA-seq library preparation because it is more stable than RNA and allows the captured transcripts to be amplified and sequenced for gene expression analysis.

Demultiplexing
The process of determining which sequencing reads belong to which cell using {term}`barcodes`.

directed graph
A directed graph (or digraph) is a graph consisting of a set of nodes (vertices) connected by edges (arcs), where each edge has a direction indicating a one-way relationship between nodes.

DNA
DNA is the acronym of Deoxyribonucleic acid. It is the organic chemical storing hereditary information and instructions for protein synthesis. DNA gets transcribed into {term}`RNA`.
DNA is the acronym of Deoxyribonucleic acid.
It is the organic chemical storing hereditary information and instructions for protein synthesis.
DNA gets transcribed into {term}`RNA`.

Doublets
Reads obtained from droplet-based assays might be mistakenly associated with a single cell while the RNA expression originates from two or more cells (a doublet).

Downstream analysis
downstream analyses
A phase of data analysis that follows the initial processing of raw data.
In the context of scRNA-seq, this includes tasks such as normalization, integration, filtering, cell type identification, trajectory inference, and studying expression dynamics.

Dropout
A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`. Dropouts are commonly caused by low amounts of mRNA expression in cells and the general stochasticity of mRNA expression. Dropouts are one of the reasons why scRNA-seq data is sparse.
A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`.
Dropouts are commonly caused by low amounts of mRNA expression in cells and the general stochasticity of mRNA expression.
Dropouts are one of the reasons why scRNA-seq data is sparse.

Drop-seq
A protocol for scRNA-seq that separates cells into nano-liter sized aqueous droplets enabling large-scale profiling.

Edit distance
Edit distance (often referred to as Levenshtein distance) measures the minimum number of single-character operations (substitution, insertion, deletion) required to transform one string into another.

FASTQ reads
Sequencing reads that are saved in the FASTQ format. The reads in FASTQ files are then mapped against the reference genome of interest to obtain gene counts for cells.
Sequencing reads that are saved in the FASTQ format.
The reads in FASTQ files are then mapped against the reference genome of interest to obtain gene counts for cells.

Flowcell
flowcell
A consumable device used in sequencing platforms where DNA or RNA fragments are sequenced.
It consists of a glass or polymer surface with lanes or channels coated with oligonucleotides, which capture and anchor DNA or RNA fragments.
During sequencing, these fragments are amplified into clusters, and their sequences are determined by detecting fluorescent signals emitted during nucleotide incorporation.
The flowcell enables high-throughput sequencing by allowing millions of fragments to be sequenced simultaneously.

Gene expression matrix
A cell (barcode) by gene (scverse ecosystem) or gene by cell (barcode) matrix storing counts in the matrix entries.

Hamming distance
A measure of the number of positions at which two strings of equal length differ.
It is commonly used in error detection and correction, including barcode correction in sequencing data.

Imputation
The replacement of missing values with usually artificial values.

@@ -76,26 +122,31 @@ Indrop
Library
Also known as sequencing library. A pool of DNA fragments with attached sequencing adapters.

Locus
Loci
loci
Specific position or region on a genome or transcriptome where a particular sequence or genetic feature is located.
In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region.
Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data.

MuData
A Python package for multimodal annotated data matrices. The primary data structure in the scverse ecosystem for multimodal data.
A Python package for multimodal annotated data matrices.
The primary data structure in the scverse ecosystem for multimodal data.

Muon
muon
A Python package for multimodal single-cell analysis by scverse.

Negative binomial distribution
A discrete probability distribution that models the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified number of failures.

Pipeline
Also often referred to as a workflow. A pre-specified selection of steps that are commonly executed in order.

RNA
Ribonucleic acid. Single-stranded nucleic acid present in all living cells that encodes and regulates gene expression.

RT-qPCR
Quantitative reverse transcription PCR (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.

PCR
Polymerase chain reaction (PCR) is a method to amplify sequences to create billions of copies. PCR requires primers, which are short synthetic {term}`DNA` fragments, to select the genome segments to be amplified and subsequently multiple rounds of {term}`DNA` synthesis to amplify the targeted segments.
Polymerase chain reaction (PCR) is a method to amplify sequences to create billions of copies.
PCR requires primers, which are short synthetic {term}`DNA` fragments, to select the genome segments to be amplified and subsequently multiple rounds of {term}`DNA` synthesis to amplify the targeted segments.

Pipeline
Also often referred to as a workflow.
A pre-specified selection of steps that are commonly executed in order.

Poisson distribution
Discrete probability distribution denoting the probability of a specified number of events occurring in a fixed interval of time or space with the events occurring independently at a known constant mean rate.
@@ -104,20 +155,48 @@ Promoter
Sequence of DNA to which proteins bind to initiate and control transcription.

Pseudotime
Latent and therefore unobserved dimension reflecting cells' progression through transitions. Pseudotime is usually related to real time, but is not necessarily the same.
Latent and therefore unobserved dimension reflecting cells' progression through transitions.
Pseudotime is usually related to real time, but is not necessarily the same.

RNA
Ribonucleic acid (RNA) is a single-stranded nucleic acid present in all living cells that encodes and regulates gene expression.
Unlike DNA, RNA can be highly dynamic, acting as a messenger (mRNA) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA).
RNA plays a central role in transcription, translation, and cellular responses, making it essential for understanding gene regulation, development, and disease.

RT-qPCR
Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.

scanpy
A Python package for single-cell analysis by scverse.

scverse
A consortium for fundamental single-cell tools in the life sciences that maintains computational analysis tools like scanpy, muon and scvi-tools. See: https://scverse.org/
A consortium for fundamental single-cell tools in the life sciences that maintains computational analysis tools like scanpy, muon and scvi-tools.
See: https://scverse.org/

signal-to-noise ratio
A measure of the clarity of a signal relative to background noise.
In sequencing, the signal represents the detectable information derived from the DNA or RNA molecules being sequenced, while the noise includes random errors or unwanted signals that can obscure or distort the true data.
A high signal-to-noise ratio (SNR) indicates that the signal is strong and reliable compared to the noise, resulting in better data quality.
Conversely, a low SNR means the noise may interfere with or reduce the accuracy of the sequencing results.

Spike-in RNA
RNA transcripts of known sequence and quantity to calibrate measurements in RNA hybridization steps for RNA-seq.

Splice Junctions
splice junctions
Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing. These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA.

Trajectory inference
Also known as pseudotemporal ordering. The computational recovery of dynamic processes by ordering cells by similarity or other means.
Also known as pseudotemporal ordering.
The computational recovery of dynamic processes by ordering cells by similarity or other means.

Unique Molecular Identifier (UMI)
Specific type of molecular barcodes aiding with error correction and increased accuracy during sequencing. UMIs uniquely tag molecules in sample libraries, enabling estimation of PCR duplication rates.
unique molecular identifiers (UMIs)
Specific type of molecular barcodes aiding with error correction and increased accuracy during sequencing.
UMIs uniquely tag molecules in sample libraries, enabling estimation of PCR duplication rates.

Untranslated Region (UTR)
UTR
A segment of an mRNA transcript that is transcribed but not translated into protein.
UTRs are located at both ends of the coding sequence.
```
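
The glossary entries for Hamming distance and edit distance could be illustrated with a small sketch like the one below. It is not part of this PR or of any scverse tool: the function names `hamming_distance`, `edit_distance`, and `correct_barcode`, the `allowlist` argument, and the `max_distance=1` default are all hypothetical choices made for illustration. The barcode correction shown is only one simple strategy (accept an observed barcode if exactly one allowlisted barcode lies within the given Hamming distance); real tools may use different rules.

```python
from typing import Iterable, Optional


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion from a
                    curr[j - 1] + 1,  # insertion into a
                    prev[j - 1] + substitution_cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def correct_barcode(
    observed: str, allowlist: Iterable[str], max_distance: int = 1
) -> Optional[str]:
    """Hypothetical correction rule: return the unique allowlisted barcode
    within max_distance (Hamming) of the observed one, or None if the
    match is missing or ambiguous."""
    allowlist = list(allowlist)
    if observed in allowlist:
        return observed  # exact matches need no correction
    candidates = [
        barcode
        for barcode in allowlist
        if len(barcode) == len(observed)
        and hamming_distance(observed, barcode) <= max_distance
    ]
    return candidates[0] if len(candidates) == 1 else None
```

Hamming distance only counts substitutions, so it is cheap but requires equal-length strings; edit distance also tolerates insertions and deletions at the cost of a quadratic-time computation.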