theislab · seohyonkim · Jan 25, 2025 · Jan 27, 2025 · Jan 28, 2025 · Jan 31, 2025
diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md
@@ -1,41 +1,50 @@
 # Glossary
 
 ```{glossary}
+Adapter sequences
+adapter sequences
+    Short, synthetic DNA or RNA sequences that are ligated to the ends of DNA or RNA fragments during library preparation for sequencing.
+    These adapters are essential for binding the fragments to the flowcell and enabling amplification and sequencing.
+    However, if adapters are not properly removed or trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses.
+
 Algorithm
 Algorithms
     A pre-defined set of instructions to solve a problem.
 
 AnnData
 AnnDatas
-    A Python package for annotated data matrices. The primary data structure used in the scverse ecosystem.
+    A Python package for handling annotated data matrices, commonly used in single-cell and other omics analyses.
+    It provides an efficient way to store data as a matrix where rows (observations) and columns (features) can have associated metadata.
+    [AnnData](https://anndata.readthedocs.io/en/latest/index.html) supports slicing, subsetting, and saving to disk in formats like H5AD and Zarr.
 
 Barcode
 Barcodes
 Bar code
 Bar codes
-    Short DNA barcode fragments ("tags") that are used to identify reads originating from the same cell. Reads are later grouped by their barcode during raw data processing steps.
+    Short DNA sequences ("tags") that are used to identify reads originating from the same cell.
+    Reads are later grouped by their barcode during raw data processing steps.
 
 Batch effect
-    Technical confounding factors in an experiment that cause dataset distribution shifts. Usually lead to inaccurate conclusions if the causes of the batch effects are correlated with outcomes of interest in an experiment and should be accounted for (usually removed).
+    Technical confounding factors in an experiment that cause dataset distribution shifts.
+    Batch effects usually lead to inaccurate conclusions if their causes are correlated with outcomes of interest in an experiment, so they should be accounted for (usually removed).
 
 Benchmark
     An (independent) comparison of performance of several tools with respect to pre-defined metrics.
 
 Bulk RNA sequencing
-    Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells. Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze.
+    In contrast to single-cell sequencing, bulk sequencing measures the average expression values across many cells.
+    Single-cell resolution is therefore lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze.
 
 Cell
-    The fundamental unit of life. Consists of cytoplasm enclosed within a membrane that contains many biomolecules such as proteins and nucleic acids. Cells acquire specific functions, transition to cell types, divide, communicate and keep the organism going. Learning about the structure, activity and communication of cells helps deciphering biology.
-
+    The fundamental unit of life, consisting of cytoplasm enclosed within a membrane, containing biomolecules such as proteins and nucleic acids.
+    Cells acquire specific functions, transition into different types, divide, and communicate to sustain an organism.
+    Studying cell structure, activity, and interactions enables insights into gene expression dynamics, cellular trajectories, developmental lineages, and disease mechanisms.
 Cell barcode
     See {term}`barcode`
 
-Cluster
-Clusters
-    A group of a population or data points that share similarities. In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`).
-
 Cell type annotation
-    The process of labeling groups of {term}`clusters` of cells by {term}`cell type`. Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference.
+    The process of labeling groups of {term}`clusters` of cells by {term}`cell type`.
+    Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference.
 
 Cell type
     Cells that share common morphological or phenotypic features.
@@ -46,27 +55,64 @@ Cell state
 Chromatin
     The complex of DNA and proteins efficiently packaging the DNA inside the nucleus and involved in regulating gene expression.
 
+Cluster
+Clusters
+    A group of a population or data points that share similarities.
+    In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`).
+
+Complementary DNA (cDNA)
+cDNA
+    DNA synthesized from an RNA template by the enzyme reverse transcriptase.
+    cDNA is commonly used in RNA-seq library preparation because it is more stable than RNA and allows the captured transcripts to be amplified and sequenced for gene expression analysis.
+
 Demultiplexing
     The process of determining which sequencing reads belong to which cell using {term}`barcodes`.
 
+directed graph
+    A directed graph (or digraph) is a graph consisting of a set of nodes (vertices) connected by edges (arcs), where each edge has a direction indicating a one-way relationship between nodes.
+
 DNA
-    DNA is the acronym of Deoxyribonucleic acid. It is the organic chemical storing hereditary information and instructions for protein synthesis. DNA gets transcribed into {term}`RNA`.
+    DNA stands for deoxyribonucleic acid.
+    It is the organic chemical storing hereditary information and instructions for protein synthesis.
+    DNA gets transcribed into {term}`RNA`.
 
 Doublets
     Reads obtained from droplet based assays might be mistakenly associated to a single cell while the RNA expression origins from two or more cells (a doublet).
 
+Downstream analysis
+downstream analyses
+    A phase of data analysis that follows the initial processing of raw data.
+    In the context of scRNA-seq, this includes tasks such as normalization, integration, filtering, cell type identification, trajectory inference, and studying expression dynamics.
+
 Dropout
-    A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`. The reason for dropouts are commonly low amounts of mRNA expression in cells and the general stochasticity of mRNA expression. Dropouts are one of the reasons why scRNA-seq data is sparse.
+    A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`.
+    Dropouts are commonly caused by low amounts of mRNA in cells and the general stochasticity of mRNA expression.
+    Dropouts are one of the reasons why scRNA-seq data is sparse.
 
 Drop-seq
     A protocol for scRNA-seq that separates cells into nano-liter sized aqueous droplets enabling large-scale profiling.
 
+Edit distance
+    Edit distance (often referred to as Levenshtein distance) measures the minimum number of single-character operations (substitution, insertion, deletion) required to transform one string into another.
+
 FASTQ reads
-    Sequencing reads that are saved in the FASTQ format. FASTQ files are then used to map against the reference genome of interest to obtain gene counts for cells.
+    Sequencing reads that are saved in the FASTQ format.
+    The reads in FASTQ files are then mapped against the reference genome of interest to obtain gene counts for cells.
+
+Flowcell
+flowcell
+    A consumable device used in sequencing platforms where DNA or RNA fragments are sequenced.
+    It consists of a glass or polymer surface with lanes or channels coated with oligonucleotides, which capture and anchor DNA or RNA fragments.
+    During sequencing, these fragments are amplified into clusters, and their sequences are determined by detecting fluorescent signals emitted during nucleotide incorporation.
+    The flowcell enables high-throughput sequencing by allowing millions of fragments to be sequenced simultaneously.
 
 Gene expression matrix
     A cell (barcode) by gene (scverse ecosystem) or gene by cell (barcode) matrix storing counts in the cell values.
 
+Hamming distance
+    A measure of the number of positions at which two strings of equal length differ.
+    It is commonly used in error detection and correction, including barcode correction in sequencing data.
+
 Imputation
     The replacement of missing values with usually artificial values.
 
@@ -76,26 +122,31 @@ Indrop
 Library
     Also known as sequencing library. A pool of DNA fragments with attached sequencing adapters.
 
+Locus
+Loci
+loci
+    A specific position or region on a genome or transcriptome where a particular sequence or genetic feature is located.
+    In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region.
+    Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data.
+
 MuData
-    A Python package for multimodal annotated data matrices. The primary data structure in the scverse ecosystem for multimodal data.
+    A Python package for multimodal annotated data matrices.
+    It is the primary data structure in the scverse ecosystem for multimodal data.
 
+Muon
 muon
     A Python package for multi-modal single-cell analysis in Python by scverse.
 
 Negative binomial distribution
     A discrete probability distribution that models the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified number of failures.
 
-Pipeline
-    Also often times denoted as workflow. A pre-specified selection of steps that are commonly executed in order.
-
-RNA
-    Ribonucleic acid. Single-stranded nucleic acid present in all living cells that encodes and regulates gene expression.
-
-RT-qPCR
-    Quantitative reverse transcription PCR (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.
-
 PCR
-    Polymercase chain reaction (PCR) is a method to amplify sequences to create billions of copies. PCR requires primers, which are short synthetic {term}`DNA` fragments, to select the genome segments to be amplified and subsequently multiple rounds of {term}`DNA` synthesis to amplify the targeted segments.
+    Polymerase chain reaction (PCR) is a method to amplify sequences to create billions of copies.
+    PCR requires primers, which are short synthetic {term}`DNA` fragments that select the genome segments to be amplified, followed by multiple rounds of {term}`DNA` synthesis to amplify the targeted segments.
+
+Pipeline
+    Often also called a workflow.
+    A pre-specified selection of steps that are commonly executed in order.
 
 Poisson distribution
     Discrete probability distribution denoting the probability of a specified number of events occurring in a fixed interval of time or space with the events occurring independently at a known constant mean rate.
@@ -104,20 +155,48 @@ Promoter
     Sequence of DNA to which proteins bind to initiate and control transcription.
 
 Pseudotime
-    Latent and therefore unobserved dimension reflecting cells' progression through transitions. Pseudotime is usually related to real time events, but not necessarily the same.
+    Latent and therefore unobserved dimension reflecting cells' progression through transitions.
+    Pseudotime is usually related to real-time events, but it is not necessarily the same as real time.
+
+RNA
+    Ribonucleic acid (RNA) is a single-stranded nucleic acid present in all living cells that encodes and regulates gene expression.
+    Unlike DNA, RNA can be highly dynamic, acting as a messenger (mRNA) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA).
+    RNA plays a central role in transcription, translation, and cellular responses, making it essential for understanding gene regulation, development, and disease.
+
+RT-qPCR
+    Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR.
 
 scanpy
     A Python package for single-cell analysis in Python by scverse.
 
 scverse
-    A consortium for fundamental single-cell tools in the life sciences that are maintaining computational analysis tools like scanpy, muon and scvi-tools. See: https://scverse.org/
+    A consortium for fundamental single-cell tools in the life sciences that maintains computational analysis tools like scanpy, muon and scvi-tools.
+    See: https://scverse.org/
+
+signal-to-noise ratio
+    A measure of the clarity of a signal relative to background noise.
+    In sequencing, the signal represents the detectable information derived from the DNA or RNA molecules being sequenced, while the noise includes random errors or unwanted signals that can obscure or distort the true data.
+    A high signal-to-noise ratio (SNR) indicates that the signal is strong and reliable compared to the noise, resulting in better data quality.
+    Conversely, a low SNR means the noise may interfere with or reduce the accuracy of the sequencing results.
 
 Spike-in RNA
     RNA transcripts of known sequence and quantity to calibrate measurements in RNA hybridization steps for RNA-seq.
 
+Splice Junctions
+splice junctions
+    Locations where introns are removed and exons are joined together in a mature RNA transcript during RNA splicing. These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA.
+
 Trajectory inference
-    Also known as pseudotemporal ordering. The computational recovery of dynamic processes by ordering cells by similarity or other means.
+    Also known as pseudotemporal ordering.
+    The computational recovery of dynamic processes by ordering cells by similarity or other means.
 
 Unique Molecular Identifier (UMI)
-    Specific type of molecular barcodes aiding with error correction and increased accuracy during sequencing. UMIs unique tag molecules in sample libraries enabling estimation of PCR duplication rates.
+unique molecular identifiers (UMIs)
+    Specific type of molecular barcodes aiding with error correction and increased accuracy during sequencing.
+    UMIs uniquely tag molecules in sample libraries, enabling estimation of PCR duplication rates.
+
+Untranslated Region (UTR)
+UTR
+    A segment of an mRNA transcript that is transcribed but not translated into protein.
+    UTRs are located at both ends of the coding sequence.
 ```
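
The two string metrics defined in this glossary, Hamming distance and edit distance, can be illustrated with a minimal Python sketch. It is not part of the diff and not taken from any scverse package. The function names, the `correct_barcode` helper, its `max_mismatches` parameter, and the toy whitelist are all hypothetical and chosen only for illustration. Real barcode-correction tools may use different rules.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + substitution_cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def correct_barcode(observed, whitelist, max_mismatches=1):
    """Hypothetical helper: return the single whitelist barcode within
    max_mismatches of the observed barcode, or None if the match is
    missing or ambiguous."""
    hits = [
        barcode
        for barcode in whitelist
        if len(barcode) == len(observed)
        and hamming_distance(observed, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    toy_whitelist = ["AAAA", "CCCC"]  # toy data, not a real barcode list
    print(hamming_distance("ACGT", "ACGA"))
    print(edit_distance("kitten", "sitting"))
    print(correct_barcode("AAAT", toy_whitelist))
```

Hamming distance only counts substitutions, so it requires strings of equal length. Edit distance also allows insertions and deletions, so it can compare strings of different lengths.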