Yale University, Spring 2019
E&EB 723
Time: Thursday 1-3
Location: ESC 100
Instructor: Casey Dunn
Prepend the subject line of all course related emails with "genomics: "
Office hours: Thursdays 10AM-11:30AM
The field of evolutionary biology is increasingly drawing on genomic data. The field of genomic biology is becoming more evolutionary as genomes are sequenced for a broader diversity of organisms. This course focuses on the evolution of genome sequence and function at macroevolutionary timescales, with an emphasis on building practical computational skills for genomic and phylogenetic comparative analyses. There will be more focus on using phylogenies to understand genome evolution than on using genomes to build phylogenies.
This course is organized with GitHub education tools. The course will make heavy use of git for sharing, communication, and collaboration. All students need to have a GitHub account, preferably one that is registered to their Yale email address so that they get the full academic features.
The class will be given student cluster accounts on one of the Yale High Performance Compute (HPC) clusters. Access to both compute and storage resources will last the course of the semester, after which data need to be copied elsewhere if they are important enough to be saved.
Classes will consist of lectures, student led discussions, and computational labs.
All materials for the course, including the syllabus, are available at the course site. The syllabus will be updated as the course progresses, so please check it weekly. Please submit suggestions and corrections for the class via the issue tracker.
All assignments will be distributed and submitted via GitHub. Here are the basic steps:
-
Go to the class repository for the assignment, for example https://github.com/Yale-EEB723/finalproject . Click the "Fork" button to create a fork. This creates your own copy of the repository with its own url, for example https://github.com/YourGitHubUsername/finalproject .
-
Clone the repository to your laptop. On your repository website, click on the green "Clone or download" button and copy the link there. It will be something like
https://github.com/YourGitHubUsername/finalproject.git
-
Open a Terminal window on your laptop and
cd
to a directory where you keep your git repositories. Then clone your fork of the repository (substitute the link copied above):
git clone https://github.com/YourGitHubUsername/finalproject.git
-
Edit the files and save changes. Use
git add
to stage any new files or changes.
-
Commit the changes with
git commit -am "my message"
where "my message" describes the changes you made. Commit often as you work.
-
Push the changes back to GitHub with
git push
Once the assignment is complete, return to the repo page at GitHub, for example https://github.com/YourGitHubUsername/finalproject. Make sure all your changes are reflected there. Save, commit, and push again if not.
-
Click the "New pull request" button. Then click the "Create pull request" button and submit the pull request. This will notify the instructor that you have submitted the assignment. Some assignments will be submitted multiple times as sequential tasks are completed, which will require multiple pull requests.
Each student will work on a project, either in collaboration or individually. The project must relate to one or more themes covered in the course of the class. Final project plans will be presented in week 3 of the course. After the team and topic are set, fork the repository at https://github.com/Yale-EEB723/finalproject to create a repository for your project. Submit the project as a pull request. Your forked repository can be private if it includes unpublished original data, but if private all course members should be granted access so they can view it and provide feedback.
The final project can consider a research project already in progress (eg something that is part of thesis research), analysis of publicly available data, analysis of simulated data, development and testing of statistical methods or software, etc. Ideally each project will advance the existing research goals of each student, or advance an interesting topic identified in the course.
Here are some suggested final project ideas:
-
A deep dive on a specific technical challenge of de novo genome sequencing and assembly, eg repeats or heterozygosity
-
Assembly and annotation of an original or publicly available de novo genome
-
Examine the evolution of genome structure (eg synteny, size, intron distribution, etc) with phylogenetic methods
-
Explore the fit of models of evolution to genomic or functional genomic data
-
Test phylogenetic hypotheses with genomic data
-
Analyze one or more categories of functional genomic data in a phylogenetic context to test hypotheses about the evolution of genome function
-
Use comparative functional genomic and/or genomic data to identify genes that may relate to specific phenotypes
-
Compare within population genome variation to variation at broader phylogenetic scales
Several exercises will be assigned. They will usually be started in class, and then due by the following week.
Reading includes manuscripts, book chapters, online resources, and videos to be watched ahead of class. The readings will be posted by the Monday before each class. In most weeks, the 15-20 minute discussion of the reading will be led by a group of students. All students will get a chance to participate in these groups. A bibliography at the end of this document includes a variety of references that readings can be selected from.
In addition to reading assigned for each class, the following will be used as references throughout the course:
-
Haddock, SHD and CW Dunn (2011). Practical Computing for Biologists. Available on Amazon. I wrote this book with my colleague Steve Haddock as an introduction to general computing skills for biologists. If you are not already comfortable at the command line then you should get this book as a reference.
-
Wickham, H (2017). R for Data Science. http://r4ds.had.co.nz This book is free online at the provided link. It is an excellent introduction to data analysis in R, and more broadly how to think about data structure and analysis. It presents a coherent introduction to the Tidyverse, a set of R packages for general data manipulation and analysis. Our R coding will follow conventions in this book.
In this course, you will perform some exercises and analyses on your own laptop in class, and some on the cluster. Below are instructions on how to set up your laptop.
Set up an account at GitHub using your educational email address.
Install git.
Install the Atom text editor.
Install Docker.
Discussion leader: Casey Dunn
- Felsenstein, J. 1985. Phylogenies and the Comparative Method. American Naturalist, 125:1–15. https://www.jstor.org/stable/2461605
- Dunn CW, Zapata F, Munro C, Siebert S, Hejnol A. 2018 Pairwise comparisons across species are problematic when analyzing functional genomic data. Proc. Natl. Acad. Sci., 115:E409–E417. https://www.doi.org/10.1073/pnas.1707515115
- Introductions
- Discussion of course goals and structure
- Readings
- Projects
- Class formats
- Course logistics
- Bring laptop to each class
- github account required
- YCRC account setup description, needed prior to week 3
- Review readings
- Overview of computational framework and tools
First, confirm that docker is working by running a container:
docker run -it rocker/rstudio /bin/bash
Next, we will walk through regular expressions in the exercises at https://github.com/Yale-EEB723/syllabus/blob/master/regular_expressions.txt .
Discussion leader: Ian Gilman
-
Goodwin et al. 2016. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics. https://doi.org/10.1038/nrg.2016.49 This review covers a lot of ground. Focus on the bits about Illumina, PacBio, and Oxford Nanopore Technologies (ONT).
-
Practical Computing for Biologists Chapters 2-3. This optional reading provides background for, and builds on, the regular expressions exercises.
-
Sequencing technology and instruments
- Conceptual overview
- Single molecule vs. populations of molecules
- Multiplexing
- Sequencing overview
- Sample preparation
- Data acquisition
- Data preprocessing
- Base calling
- Read processing (trimming, binning, etc) and export
- Downstream analysis (application specific)
- Tradeoffs
- Cost (initial and realtime)
- Read length
- Error rate and error profile (base miscalls, phasing noise, homopolymer length, etc)
- Throughput
- Hands-on limitations (sample prep cost, instrument portability, ease of use, run time, etc)
- Current sequencing technologies
- Illumina
- https://www.youtube.com/watch?v=fCd6B5HRaZ8
- The recent shift to reduced colors
- PacBio
- https://www.youtube.com/watch?v=NHCJ8PtYCFc
- Very long molecules can be sequenced or the same molecule can be sequenced repeatedly with Circular Consensus Sequencing
- Oxford Nanopore
-
Genome sequencing
- Challenging factors
- Large size
- Repeats
- Heterozygosity
- Tissue limitation
-
Take homes
- Focus on inputs and outputs, not intermediates. For example, assembly quality is usually much more important than read quality.
- Take a holistic perspective on costs, including your time. Saving a bit of money on sequencing can sometimes incur large data analysis costs, for example.
- Focus your time and resources on what differentiates your project from others.
- Always, always be thinking about the end goal and evaluate intermediate decisions in terms of these final objectives.
We will walk through regular expressions in the exercises at https://github.com/Yale-EEB723/syllabus/blob/master/regular_expressions.txt .
To get the files for the exercises, make a local clone of the syllabus repository:
git clone https://github.com/Yale-EEB723/syllabus.git
- Hand out git chapter
- Ask students to prepare to present preliminary ideas on final projects next week
-
Jain et al. 2018. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotech. https://doi.org/10.1038/nbt.4060 . (Discussion leaders: Edgar Benavides and Vincent Dimassa)
-
Practical Computing for Biologists git chapter draft (to be provided as hard copy in previous week)
-
Practical Computing for Biologists Chapters 4-6, 20. This optional reading provides background on working in bash and remote access to computers.
-
Quick (<1 minute) description of final project plan
-
Working on your laptop
-
Working on the cluster via your account
-
Getting started with git
- Sedlazeck et al. 2018. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0003-4 (Discussion leader: Ava Ghezlayagh)
The agenda:
- Review paper
- Discuss final projects
- Walk through alignment exercise
- Discuss items below about assembly
We often want to know the full sequence of a genome, but data are fragmented and redundant because:
- DNA isolation leads to fragmentation by mechanical and chemical processes
- Sample preparation leads to fragmentation by mechanical and chemical processes, often deliberately to adjust the length of input molecules
- Sequencing often doesn’t span the full length of a molecule, due to technical limitations, damaged template, damaged sequencing
- The same regions of the genome are sequenced multiple times because:
- It is far easier to randomly sequence regions of the genome than to systematically tile sequencing effort across the genomes. To sequence with enough depth to ensure coverage everywhere, some places will have quite deep coverage
- Sequencing is error prone. Sequencing each spot multiple times enables error detection and correction.
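The link between random sampling and uneven coverage can be illustrated with a simple Poisson model. This is an idealization added here for illustration (it ignores sequencing bias and repeats), and the function names are hypothetical. Under this assumption, at mean depth c the expected fraction of bases never sequenced is exp(-c), so driving that fraction down requires a mean depth at which many positions are covered far more often than needed.

```python
import math


def fraction_uncovered(mean_depth):
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-mean_depth)


def prob_depth_at_least(mean_depth, k):
    """Probability that a base is covered by at least k reads (Poisson)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


# Compare how quickly gaps vanish as mean depth increases.
for depth in (1, 5, 10, 30):
    print(depth, fraction_uncovered(depth), prob_depth_at_least(depth, 5))
```

The second function gives the fraction of positions deep enough for error detection at a chosen threshold, here 5 reads.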
In the broad sense,
- Sequencing takes large biological molecules in and generates character strings in computer memory that are redundant overlapping estimates (reads) of subsequences of the original molecules. Underlying this is a generative model, ie an idea of how molecular structure impacts observed read sequences.
- Assembly is in some respects the reverse process - it starts with reads and generates an estimate of the sequences of the input biological molecules. It runs the generative model in reverse, and generates descriptions of molecules in computer memory rather than the physical molecules themselves.
Common tasks:
- de novo assembly: going from reads (and sometimes additional new structural data) to a genome assembly without reference to an existing assembly.
- mapping: Tiling reads onto an existing reference genome sequence. Used to assess how the reads cover the reference sequence (more on this later when we discuss functional genomics), or to identify how the genome from which the reads are derived differs from the reference genome.
- reference based assembly: assembling a genome by reference to an existing genome, usually by mapping new reads to an existing reference genome sequence. Does not require as much sequencing depth as de novo assembly, also much easier computationally.
The challenge:
- Find similar sequences
- Categorize differences between similar sequences according to whether they are
- Sample prep errors
- Sequencing errors
- Different regions of the genome (eg paralogy, repeats)
- Different alleles
- Mixtures (eg somatic variations)
- Estimate the sequence of the original molecules
The general assembly process (see Figure 1 from https://doi.org/10.1038/s41576-018-0003-4)
- Identify overlaps between reads
- Construct a string graph
- Nodes are unambiguous sequences
- Edges are possible connections between those sequences
- Each path through the string graph is a possible assembly
- Contig construction
- modify and traverse the string graph to derive contig sequences
- Errors create bubbles with low coverage that can be popped
- Some nodes cannot be combined with neighboring nodes because there isn't enough information to know which alternative path to take, and those nodes are emitted as contigs
- Adjacent nodes (ie nodes connected directly by edges) can often be combined
- Ends of contigs are usually due to ambiguity of some sort: there are multiple paths and the assembler doesn't know which to take, so it chops them all off
- Sources of ambiguity include error, repeats not spanned by reads, or heterozygosity
- As read quality and length improve, the extent of contigs is determined in larger part by heterozygosity
- Makes heterozygosity-aware assembly even more important
- Scaffolding
- Physical ordering of contigs, sometimes introducing gaps
- Usually based on additional structural information
- HiC
- Optical mapping
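The first step of the process above, identifying overlaps between reads, can be sketched with a toy example. This is not how production assemblers work: it assumes error-free reads and exact suffix-prefix matches, and the function names and reads are made up for illustration. The result is a simple overlap graph in which reads are nodes and overlaps are edges, a precursor to the string graph described above.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that exactly matches a prefix of b.

    Returns 0 if no overlap of at least min_length exists.
    """
    start = 0
    while True:
        # Look for the first min_length bases of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Check whether the rest of a from that point matches the start of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_length=3):
    """Map each ordered read pair (a, b) to its overlap length, if any."""
    graph = {}
    for a, b in permutations(reads, 2):
        length = overlap(a, b, min_length)
        if length > 0:
            graph[(a, b)] = length
    return graph


reads = ["ATGGCGT", "GCGTGCA", "TGCACCC"]
print(overlap_graph(reads))
```

Comparing every read to every other read, as done here, is the expensive all-against-all step that seed-based methods are designed to speed up.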
Error correcting can occur at multiple steps
- Improvements to base calling
- Use short high quality reads to correct lower quality long reads
- Risks introducing errors, eg by mistaking one instance of a repeat for another similar instance of the same repeat
- Use short high quality reads to correct contigs derived from lower quality long reads
- Same risks as above
- Use lower quality long reads to correct each other (requires greater depth)
- In the last year this is where things have started to head
Characterizing an assembly
- contiguity (eg N50)
- completeness (eg BUSCO)
- correctness (eg base level, structural, phasing)
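As a concrete example of a contiguity statistic, N50 can be computed directly from a list of contig lengths. The sketch below uses the common definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembled bases). The function name is hypothetical.

```python
def n50(contig_lengths):
    """Contig length at which the cumulative sum of lengths, sorted from
    longest to shortest, first reaches half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


print(n50([80, 70, 50, 40, 30, 20]))
```

N50 says nothing about completeness or correctness, so a misassembly that joins contigs incorrectly will raise it.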
Genome assembly challenging factors
- Large size
- Repeats
- Heterozygosity
- There are tools for assessing all of these before attempting a full assembly, eg https://github.com/schatzlab/genomescope
Phasing
- Collapse haplotypes into a single consensus. Can introduce many errors and fragment the assembly
- Assemble into regions that are collapsed and unzipped
- Can arbitrarily resolve into primary assembly and alternate haplotigs, or pseudophased diploid genome
- Phase into two haploid genomes
Core algorithmic concepts
- Similarity and extension
- Identify similarity by identification of similar seed sequences in different reads
- Expensive because each read needs to be compared to every other read
- Exact matches are really fast, but often need to allow for variation due to errors
- A few methods
- Canu uses MHAP, a kmer method
- Falcon etc use DALIGNER, dynamic programming and kmers
- Extension searches for regions of similarity beyond the seed.
- It essentially sees if it is possible to zip the reads together starting at the seed
- Extension is generally not as expensive as initial identification of similarity, because extension is
- k-mers
- Short sequences of length k (often 15-70 nucleotides)
- Very cheap to work with
- Defined memory footprint
- If short relative to frequency of errors, can focus on exact matches
- Easy to code
- Hash tables map each k-mer sequence to a value, often a count of how many times the sequence occurs
- de Bruijn assembly
- identification of similarity seeds
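The k-mer ideas above can be sketched in a few lines. Python's Counter is a hash table, so it serves as the k-mer count table here. The second function is only a minimal illustration of de Bruijn graph construction (each k-mer becomes an edge between its two (k-1)-mers), not a working assembler, and both function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(kmer_counts):
    """Each distinct k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    return sorted((kmer[:-1], kmer[1:]) for kmer in kmer_counts)


counts = count_kmers(["ATGCG", "GCGTA"], k=3)
print(counts)
print(de_bruijn_edges(counts))
```

K-mers shared between reads (count greater than 1 here) are the exact-match seeds that indicate candidate overlaps.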
Yandell and Ence. 2012. A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics. https://doi.org/10.1038/nrg3174 (Discussion leaders: Diego and Jasmine)
The agenda:
- Discuss assembly notes from last class
- Go over today's paper
- Identify two papers for next week
- Walk through forking, cloning, and editing final project git repository. Make preliminary edits and push them.
- Add a few sentences for Goal and Data sections. Make a list of tools you plan to use in the Methods section. Send a pull request when you have added these stubs.
- Explain how to finish alignment exercise from last class as an assignment
- Discuss genome annotation notes
A genome assembly is just a big fasta file. Much more interesting with annotations.
Annotation goals include:
- Identifying repeats
- Biologically interesting
- Technically important for understanding genome assembly and so that repeats can be masked for some downstream analyses that they can negatively impact
- Identifying protein coding genes
- Build inventory of protein coding genes
- Identify introns, exons, promoters
- Predict mRNA structure
- Identifying other regions
- Noncoding RNA sequences
- Promoters
- etc...
- Understanding genome structure
- Centromeres, telomeres, how scaffolds map to chromosomes
The set of annotations can be encoded in standard formats and loaded into graphical browsers or interrogated computationally (as when comparing genome features across species)
Annotations can be made
- With ab initio methods based on the understanding of what these structures look like or particular properties they have
- Based on new evidence, like mRNA-seq
- Based on comparison to a reference of similar sequences
- Looking for known repetitive DNA elements
- Blasting known protein coding genes
Assessing annotations
- Sensitivity = (True positives) / ( True positives + False negatives )
- BUSCO is a common tool for this
- Specificity = (True positives) / ( True positives + False positives )
- Accuracy = (Sensitivity + Specificity) / 2. If accuracy is high you know you have good sensitivity and specificity, but if it is low could be problems with sensitivity or specificity or both.
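The three measures above can be computed from counts of true positives, false positives, and false negatives. The sketch below follows the definitions exactly as given in these notes. Note that specificity as defined here, TP / (TP + FP), is the quantity usually called precision in other fields. The function name and example counts are hypothetical.

```python
def annotation_metrics(true_pos, false_pos, false_neg):
    """Return (sensitivity, specificity, accuracy) as defined in the notes."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_pos / (true_pos + false_pos)
    accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy


# Example: 90 correctly predicted features, 10 spurious, 30 missed.
print(annotation_metrics(true_pos=90, false_pos=10, false_neg=30))
```

In this example the predictions are mostly correct (high specificity) but miss a quarter of the real features (lower sensitivity), which accuracy alone would not reveal.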
Repeats
- The extent and composition of repeats varies widely across species
- Many repeats are well-known conserved sequences
- LINEs Long Interspersed Nuclear Elements. Retrotransposons about 7kb in length. They are transcribed and encode a reverse transcriptase that facilitates integration at new sites.
- SINEs Short Interspersed Nuclear Elements. 100-700 bp retrotransposons. They do not have their own reverse transcriptase, so they are dependent on those of other elements
- Repeat annotation is complicated by the fact that different instances of the same repeat are not identical, and may be quite different
- Tools for identifying and masking repeats
- RepeatMasker
Protein coding genes
- Multiple nested annotation steps
- Gene finding
- What is the entire region needed for the gene to function, including upstream regions, transcribed region, and downstream regions
- What region is transcribed
- Identify promoter where transcription is initiated
- For eukaryotic protein coding genes, the binding site of RNA Polymerase II
- Identify the location where transcription is terminated and mRNA is polyadenylated
- Introns and exons
- Mature mRNA sequence prediction
- Protein sequence prediction
- Commonly used tools
- AUGUSTUS http://bioinf.uni-greifswald.de/augustus/ . Developed for human genome.
- MAKER http://www.yandell-lab.org/software/maker.html . Annotation workflow that integrates multiple tools.
- Funannotate https://funannotate.readthedocs.io/en/latest/ . Initially made for fungal genomes.
Discussion leaders: Elyse Parker and Dan MacGuigan
Holt and Yandell. 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. https://doi.org/10.1186/1471-2105-12-491
Kim et al. 2018. The genome of common long-arm octopus Octopus minor. GigaScience. https://doi.org/10.1093/gigascience/giy119
MarkDown guide - https://guides.github.com/features/mastering-markdown/
Success of ab initio methods
- Tests of ab initio methods work well in species with well annotated genomes, because the gene training datasets are so good. It is easy to find genes when you know what they look like.
- Ab initio methods work less well alone in poorly studied species without good existing gene models to serve as training datasets. Additional evidence, like RNAseq, greatly improves outcome in these projects.
Annotation Edit Distance (AED)
- A measure of the distance between intron and exon coordinates of two annotations for the same gene, where 0 is identical and 1 is completely different
- This distance indicates incongruence between annotation methods and is interpreted as uncertainty about an annotation
- Genes with higher AED tend to
- Have fewer PFAM domains
- Change more in subsequent annotations that include additional information
- Have less evidence of orthologs in closely related species
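A rough sketch of how such a distance can be computed is given below. It assumes a base-overlap formulation, AED = 1 - (SN + SP) / 2, where SN and SP are the fractions of each annotation's exonic bases shared with the other. That formula is my reading of how MAKER defines AED and should be checked against the MAKER2 paper. The function names and coordinates are hypothetical, and exons are given as (start, end) pairs with the end excluded.

```python
def covered_bases(exons):
    """Set of base positions covered by (start, end) exons, end exclusive."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end))
    return bases


def annotation_edit_distance(exons_a, exons_b):
    """0.0 for identical annotations, 1.0 for annotations sharing no bases.

    Assumes both annotations contain at least one base.
    """
    a = covered_bases(exons_a)
    b = covered_bases(exons_b)
    shared = len(a & b)
    sensitivity = shared / len(b)  # fraction of b recovered by a
    specificity = shared / len(a)  # fraction of a supported by b
    return 1 - (sensitivity + specificity) / 2


print(annotation_edit_distance([(0, 100), (200, 300)], [(0, 100), (250, 300)]))
```

Storing every base position in a set is simple but memory hungry; an interval-based overlap calculation would be preferable for real gene models.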
MAKER is now part of the GMOD family of genomic tools - http://gmod.org/wiki/Main_Page
Ian notes that conda install -c bioconda maker
works well for installation
A detailed guide on how to use MAKER - https://gist.github.com/darencard/bb1001ac1532dd4225b030cf0cd61ce2
- Identify and mask repeats
- Build gene models based on RNAseq data and homology to genes of related species
- Train ab initio programs based on these evidence-based models
- Rerun MAKER with ab initio prediction
- Rerun MAKER again to refine everything
Mapping
-
Mapping is a general task that comes up in many types of genomic analyses.
-
It is a highly asymmetric sequence comparison, usually between reads and reference sequences
-
Burrows-Wheeler transform
- Transforms the sequence into a permutation of its characters, derived from its sorted rotations, that is easy to compress
- Can work with compressed strings, which is more computationally efficient
- Has the very special property that the original string can be recovered from the transformed string
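The transform and its reversibility can be demonstrated with a naive implementation. This sketch sorts all rotations of the text, which is far too slow and memory hungry for genomes; real read mappers build the transform from suffix arrays and search it with an FM-index. The "$" sentinel and the function names are illustrative choices.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    text += sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row ending in the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


encoded = bwt("GATTACA")
print(encoded, inverse_bwt(encoded))
```

Runs of identical characters in the transformed string are what make it compress well.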
Discussion leader: Andrew Verdegaal
Primary paper:
Zhang et al. 2019. Comparative Analysis of Droplet-Based Ultra-High-Throughput Single-Cell RNA-Seq Systems. Molecular Cell. https://doi.org/10.1016/j.molcel.2018.10.020
Background (optional paper that covers many technical concepts in the above paper):
Hrdlickova et al. 2016. RNA-Seq methods for transcriptome analysis. WIREs RNA. https://doi.org/10.1002/wrna.1364
RNAseq
-
Randomly sequence RNA from a cell or collection of cells
-
In nearly all cases mRNA is reverse transcribed into complementary DNA (cDNA), which is analyzed with short read DNA sequencing (eg Illumina)
-
The frequency of reads for each gene is used as a proxy for abundance of transcripts of each gene (often referred to as expression)
-
If a suitable high quality reference is available, it can also be used as a proxy for abundance of splice variants.
-
Replaced Expressed Sequence Tags (ESTs), which were generated by shotgun sequencing of cDNA with Sanger sequencing
-
Full length sequencing of cDNA (eg PacBio isoseq, Oxford Nanopore cDNA sequencing) provides a better understanding of splice variants than short reads, but lower throughput provides less statistical power.
-
What it does and doesn't do
- It does not measure absolute transcript count
- It does not measure the relative expression of different genes
- It does come close to measuring the differential expression of the same gene in different tissues.
-
Analysis steps
- Map to reference, which can be gene models from genome or transcriptome assemblies. Many short read aligners can do this, eg Bowtie
- Process mapping data to derive the counts of reads for transcripts and genes.
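The counting step can be sketched as a simple tally. The example below assumes the mapper has already assigned each read to a single transcript and that a transcript-to-gene table is available. Real pipelines must also handle multi-mapping and unmapped reads, which are ignored here. All identifiers are made up for illustration.

```python
from collections import Counter


def count_reads(read_to_transcript, transcript_to_gene):
    """Tally mapped reads per transcript and per gene."""
    transcript_counts = Counter(read_to_transcript.values())
    gene_counts = Counter()
    for transcript, n_reads in transcript_counts.items():
        gene_counts[transcript_to_gene[transcript]] += n_reads
    return transcript_counts, gene_counts


# Hypothetical mapping results: read ID -> transcript ID.
read_to_transcript = {"r1": "t1", "r2": "t1", "r3": "t2", "r4": "t3"}
transcript_to_gene = {"t1": "geneA", "t2": "geneA", "t3": "geneB"}
print(count_reads(read_to_transcript, transcript_to_gene))
```

These raw counts depend on sequencing depth and transcript length, which is one reason they do not measure absolute transcript number.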
GRO-seq
- Actively transcribed mRNA is labeled with BrdU, isolated, reverse transcribed, and sequenced.
- Similar to RNA-seq, but only sequences mRNA that is transcribed during BrdU pulse.
- Gives a way to observe active transcription rather than just transcript abundance.
Please post two commits during the week before this class.
Discussion leader: Nick Fisk
Rowley and Victor. 2018. Organizational principles of 3D genome architecture. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0060-8
Definitions
- Topologically Associating Domains (TADs)
- Nucleosome - A set of 8 histone proteins (two each of H2A, H2B, H3, and H4) wrapped by about 147bp of DNA, along with about 80bp of linker DNA that connects to the next nucleosome. If not packed further, they look like beads on a string. Addition of H1 packs nucleosomes into a further coiled 30 nm fiber
- Compartmental domains - a linear domain containing one or more genes that is in the same transcriptional or chromatin state.
- Transcription complex - the association of DNA bound proteins, including polymerase and transcription factors, that is responsible for the initiation of transcription.
- Promoter - the site where the transcription complex assembles. Includes the TATA box.
- Enhancer - a sequence that enhances the transcription rate of a gene. May be upstream, within, or downstream of a gene. Can be quite far.
- Conserved Noncoding Elements (CNE) - regions outside of coding regions that show high sequence conservation between species
- Genomic Regulatory Blocks (GRB) - clusters of syntenic CNEs.
Several components of genome consideration
- 1D linear structure
- Linkage, physical proximity as originally detected by low recombination rate
- Synteny, the conservation of linear gene location (ie linkage)
- Chromatin accessibility and association with proteins such as transcription factors
- 3D packing of chromatin in nucleus
Packing of DNA in space and time
- https://en.wikipedia.org/wiki/Chromatin#/media/File:Chromatin_Structures.png
- Interphase packing is most relevant to most functional genomics questions
- Loop extrusion
- https://www.youtube.com/watch?v=Tn5qgEqWgW8
- Acts on beads-on-a-string packed DNA
- Loop extent controlled by CTCF
Sample prep / sequencing tools relevant to genome structure
-
Innovations in sample prep and enrichment allow sequencers to be adapted into instruments that measure all kinds of structural and functional features.
-
Hi-C
- Crosslink chromatin
- Digest, label, and ligate
- Fragment, isolate, and sequence
- Results in read pairs that come from regions that were in close physical proximity
- Uses
- Identify sequences that were in same cell together, eg to assist with metagenomes
- Identify linked sequences, eg to assist with assembly scaffolding
- Identify how chromosomes are packed in nucleus
- HiC related methods https://www.nature.com/articles/nrg3454 Box 1
- 3C, chromosome conformation capture. Provides many interactions for sites throughout the genome.
- 4C, circular 3C. Observe the regions that interact with a particular locus. Allows deeper data for one spot.
- 5C, investigate how associations correspond to other processes, like transcription.
- ChIA-PET, interrogate long range interactions facilitated by particular proteins.
- Interpreting HiC data - https://www.nature.com/articles/nrg3454/figures/1
-
ATACseq
- Assay for Transposase Accessible Chromatin using sequencing
- Transposase preferentially targets regions that are free of nucleosomes
- Can also map nucleosome composition
-
ChIPseq
- CHromatin ImmunopreciPitation Sequencing
- Use antibodies to enrich chromatin with specific proteins, eg transcription factors
- Sequence isolated Chromatin
- Identify transcription factor binding sites etc...
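For the Hi-C read pairs described above, a common first analysis step is to bin the mapped positions of each pair and count contacts between bins. The sketch below assumes all positions are on a single chromosome and already mapped. The function name, bin size, and coordinates are hypothetical.

```python
from collections import Counter


def contact_counts(read_pairs, bin_size):
    """Count Hi-C contacts between genomic bins.

    read_pairs: iterable of (position_a, position_b) mapped coordinates.
    Returns a Counter keyed by (bin_i, bin_j) with bin_i <= bin_j.
    """
    counts = Counter()
    for pos_a, pos_b in read_pairs:
        bin_a, bin_b = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(bin_a, bin_b)] += 1
    return counts


pairs = [(100, 2500), (2600, 150), (900, 950)]
print(contact_counts(pairs, bin_size=1000))
```

Plotting these counts as a matrix gives the familiar Hi-C heat map, in which most contacts fall near the diagonal because nearby sequences interact most often.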
The relationship between genome structure and genome function
- Structure may regulate transcription
- Harmston https://doi.org/10.1038/s41467-017-00524-5
- "GRB boundaries coincide with the boundaries of TADs"
- TADs may be highly conserved regulatory domains
- Presents a cautionary tale
- An intronic SNP in FTO is strongly associated with obesity.
- Many theses and drug companies tried to figure out what FTO does to target it
- But it turns out that the intronic FTO SNP is in a CNE (conserved non-coding element) that physically associates with a distant gene, IRX3, and regulates it
- Structure may be a result of transcriptional state
- Compartmental domains may be more highly conserved than specific sequences
Please read the following material from Harmon 2018 https://lukejharmon.github.io/pcm/ :
- Chapter 3 Brownian Models (a detailed reading of section 3.3 is not required, though it is quite interesting).
- Chapter 7
Trait evolution
-
Background from first Paul Lewis Likelihood lecture, through slide 40.
- Exercise: Make a Newick tree, for example
(A,(B,(C,(D,E))))
or (A:4,(B:3,(C:2,(D:2,E:1))))
, and view it at http://etetoolkit.org/treeview/ .
-
Lecture on the evolution of discrete traits in the context of molecular sequence data. Slides at https://github.com/Phylogenetics-Brown-BIOL1425/phylogeneticbiology/blob/master/lectures/Lecture_3.pdf .
-
Discussion of discrete traits Chapter 7 from Harmon reading.
-
Discussion of continuous traits Chapter 3 from Harmon reading - https://lukejharmon.github.io/pcm/chapter3_bmintro/.
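To make the Newick exercise above more concrete, here is a minimal parser for simple Newick strings like the two examples. It is a teaching sketch, not a replacement for a phylogenetics library: it handles only unquoted names and optional branch lengths, treats missing branch lengths as 0, and does not validate malformed input. The function names are hypothetical.

```python
import re

PUNCTUATION = set("(),:;")


def parse_newick(newick):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    tokens = re.findall(r"[(),:;]|[^(),:;\s]+", newick)
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while tokens[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if tokens[pos] != ")":
                raise ValueError("expected ')'")
            pos += 1  # consume ")"
        name = ""
        if pos < len(tokens) and tokens[pos] not in PUNCTUATION:
            name = tokens[pos]
            pos += 1
        length = 0.0
        if pos < len(tokens) and tokens[pos] == ":":
            length = float(tokens[pos + 1])
            pos += 2
        return (name, length, children)

    return parse_node()


def tip_depths(node, depth=0.0):
    """Return {tip name: root-to-tip distance} for a parsed tree."""
    name, length, children = node
    depth += length
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(tip_depths(child, depth))
    return depths


tree = parse_newick("(A:4,(B:3,(C:2,(D:2,E:1))))")
print(tip_depths(tree))
```

Because the internal branches in the second example have no lengths, each tip's distance from the root equals its own branch length, and the tips end at different distances (the tree is not ultrametric).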
Application of model based approaches to genomes: beyond sequence evolution
Discussion leader: Arianna Lord
Zhao and Schranz. 2019. Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. PNAS. https://www.pnas.org/content/116/6/2165
- Paper discussion
- Everyone provides summary of goal and state of their project
- Identify overlap between projects
- Time to work on projects, including discussion of shared aspects of projects
Make at least 4 commits to your project before today's class.
Discussion leaders: Jessica Glass and Spencer Irvine
Bhattacharya et al. 2016. Comparative genomics explains the evolutionary success of reef-forming corals. eLife. https://doi.org/10.7554/eLife.13288
- Paper discussion
- Pair-review projects
- Work on projects
Project presentations:
- Nick Fisk
- Ian Gilman
- Arianna Lord
You can suggest references to add to this list via a pull request or the issue tracker. The intent of this bibliography is to serve as a resource for class participants in their own work and as a list of potential readings for class.
Goodwin et al. 2016. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics. https://doi.org/10.1038/nrg.2016.49
Heather and Chain. 2016. The sequence of sequencers: The history of sequencing DNA. Genomics. https://doi.org/10.1016/j.ygeno.2015.11.003
Reuter et al. 2015. High-Throughput Sequencing Technologies. Molecular Cell. https://doi.org/10.1016/j.molcel.2015.05.004
Shendure et al. 2017. DNA sequencing at 40: past, present and future. Nature. https://doi.org/10.1038/nature24286
Vurture et al. 2017. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. https://doi.org/10.1093/bioinformatics/btx153
Alkan et al. 2011. Genome structural variation discovery and genotyping. Nature Reviews Genetics. https://www.nature.com/articles/nrg2958
Bradnam et al. 2013. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. https://doi.org/10.1186/2047-217X-2-10
Gurevich et al. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics. https://doi.org/10.1093/bioinformatics/btt086
Koren et al. 2018. De novo assembly of haplotype-resolved genomes with trio binning. Nature Biotechnology. https://doi.org/10.1038/nbt.4277 . Good overview of phasing.
Paajanen et al. 2019. A critical comparison of technologies for a plant genome sequencing project. GigaScience. https://doi.org/10.1093/gigascience/giy163
Rice and Green. 2019. New Approaches for Genome Assembly and Scaffolding. Annual Review of Animal Biosciences. https://doi.org/10.1146/annurev-animal-020518-115344
Schatz et al. 2010. Assembly of large genomes using second-generation sequencing. Genome Research. https://doi.org/10.1101/gr.101360.109
Sedlazeck et al. 2018. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0003-4
Sohn and Nam. 2016. The present and future of de novo whole-genome assembly. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbw096
Edgar et al. 2018. Single-molecule sequencing and optical mapping yields an improved genome of woodland strawberry (Fragaria vesca) with chromosome-scale contiguity. GigaScience. https://doi.org/10.1093/gigascience/gix124
Jain et al. 2018. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology. https://doi.org/10.1038/nbt.4060 . (For an interesting set of followup analyses see https://genomeinformatics.github.io/na12878update/ )
Kim et al. 2018. The genome of common long-arm octopus Octopus minor. GigaScience. https://doi.org/10.1093/gigascience/giy119
Mohr et al. 2017. Improved de novo Genome Assembly: Linked-Read Sequencing Combined with Optical Mapping Produce a High Quality Mammalian Genome at Relatively Low Cost. BioRxiv. https://doi.org/10.1101/128348
Wenger et al. 2018. Highly-accurate long-read sequencing improves variant detection and assembly of a human genome. BioRxiv. https://www.biorxiv.org/content/10.1101/519025v2
Jiang et al. 2018. A Hybrid de novo Assembly of the Sea Pansy (Renilla muelleri) Genome. BioRxiv. https://doi.org/10.1101/424614
Vertebrate Genomes Project - https://vgp.github.io/genomeark/
Aken et al. 2016. The Ensembl gene annotation system. Database. https://doi.org/10.1093/database/baw093
Holt and Yandell. 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. https://doi.org/10.1186/1471-2105-12-491
Mudge and Harrow. 2016. The state of play in higher eukaryote gene annotation. Nature Reviews Genetics. https://doi.org/10.1038/nrg.2016.119
Yandell and Ence. 2012. A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics. https://doi.org/10.1038/nrg3174
A collection of papers on 3D genome structure in Nature: https://www.nature.com/collections/rsxlmsyslk
Li et al. 2011. Chromosome Size in Diploid Eukaryotic Species Centers on the Average Length with a Conserved Boundary. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msr011
Harmston et al. 2017. Topologically associating domains are ancient features that coincide with Metazoan clusters of extreme noncoding conservation. Nature Communications. https://doi.org/10.1038/s41467-017-00524-5
Rowley and Corces. 2018. Organizational principles of 3D genome architecture. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0060-8
Spielmann et al. 2018. Structural variation in the 3D genome. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0007-0
Hrdlickova et al. 2016. RNA-Seq methods for transcriptome analysis. WIREs RNA. https://doi.org/10.1002/wrna.1364
Klemm et al. 2019. Chromatin accessibility and the regulatory epigenome. Nature Reviews Genetics. https://doi.org/10.1038/s41576-018-0089-8
La Manno et al. 2018. RNA velocity of single cells. Nature. https://doi.org/10.1038/s41586-018-0414-6
Zhang et al. 2019. Comparative Analysis of Droplet-Based Ultra-High-Throughput Single-Cell RNA-Seq Systems. Molecular Cell. https://doi.org/10.1016/j.molcel.2018.10.020
Cheifet 2019. Where is genomics going next? Genome Biology. https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-019-1626-2
Pomerantz et al. 2018. Real-time DNA barcoding in a rainforest using nanopore sequencing: opportunities for rapid biodiversity assessments and local capacity building. GigaScience. https://academic.oup.com/gigascience/article/7/4/giy033/4958980
Harmon 2018. Phylogenetic Comparative Methods. https://lukejharmon.github.io/pcm/
My previous course on phylogenetic biology - https://github.com/Phylogenetics-Brown-BIOL1425/phylogeneticbiology
Kim et al. 2017. Reconstruction and evolutionary history of eutherian chromosomes. PNAS. https://doi.org/10.1073/pnas.1702012114
Damas et al. 2018. Reconstruction of avian ancestral karyotypes reveals differences in the evolutionary history of macro- and microchromosomes. Genome Biology. https://doi.org/10.1186/s13059-018-1544-8
O'Connor et al. 2018. Reconstruction of the diapsid ancestral genome permits chromosome evolution tracing in avian and non-avian dinosaurs. Nature Communications. https://www.nature.com/articles/s41467-018-04267-9
Compara - pre-built comparative genomics analyses. https://useast.ensembl.org/info/genome/compara/index.html
Alföldi and Lindblad-Toh. 2013. Comparative genomics as a tool to understand evolution and disease. Genome Research. https://doi.org/10.1101/gr.157503.113
Roy and Gilbert. 2005. Rates of intron loss and gain: Implications for early eukaryotic evolution. PNAS. https://doi.org/10.1073/pnas.0500383102
Simakov and Kawashima. 2017. Independent evolution of genomic characters during major metazoan transitions. Developmental Biology. https://doi.org/10.1016/j.ydbio.2016.11.012
Zhao and Schranz. 2019. Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. PNAS. https://www.pnas.org/content/116/6/2165
Hiller et al. 2012. A "Forward Genomics" Approach Links Genotype to Phenotype using Independent Phenotypic Losses among Related Species. Cell Reports. https://doi.org/10.1016/j.celrep.2012.08.032
Sharma et al. 2018. A genomics approach reveals insights into the importance of gene losses for mammalian adaptations. Nature Communications. https://doi.org/10.1038/s41467-018-03667-1
Markdown guide - https://guides.github.com/features/mastering-markdown/ . Useful for writing text that will be pushed to GitHub, among many other things.