index.Rmd

--- 
title: "Computational Genomics with R"
author: "Altuna Akalin"
date: "`r Sys.Date()`"
knit: "bookdown::render_book"
documentclass: krantz
bibliography: [book.bib]
biblio-style: apalike
link-citations: yes
colorlinks: yes
lot: yes
lof: yes
fontsize: 12pt
monofont: "Source Code Pro"
monofontoptions: "Scale=0.7"
site: bookdown::bookdown_site
description: "A guide to computationa genomics using R. The book covers fundemental topics with practical examples for an interdisciplinery audience"
url: 'https\://compmgenomr.github.io/book/'
github-repo: compgenomr/bookdown
cover-image: images/cover.jpg
---

```{r setup, include=FALSE}
options(
  htmltools.dir.version = FALSE, formatR.indent = 2,
  width = 55, digits = 4, warnPartialMatchAttr = FALSE, warnPartialMatchDollar = FALSE
)

knitr::opts_chunk$set(echo = TRUE,fig.align = "center", cache=TRUE,warning=FALSE,message=FALSE)
                      
local({
  r = getOption('repos')
  if (!length(r) || identical(unname(r['CRAN']), '@CRAN@'))
    r['CRAN'] = 'https://cran.rstudio.com' 
  options(repos = r)
})

lapply(c('citr', 'formatR', 'svglite'), function(pkg) {
  if (system.file(package = pkg) == '') install.packages(pkg)
})
```

# Preface {-}


```{r fig.align='center', echo=FALSE, include=knitr::is_html_output(), fig.link=''}
knitr::include_graphics('images/cover.jpg', dpi = NA)
```

The aim of this book is to provide the fundamentals for data analysis for genomics. We developed this book based on the computational genomics courses we are giving every year. We have had invariably an interdisicplinary audience with backgrounds from physics, biology, medicine, math, computer science or other quantitative fields. We want this book to be a starting point for computational genomics students and a guide for further data analysis in more specific topics in genomics. This is why we tried to cover a large variety of topics from programming to basic genome biology. As the field is interdisciplinary, it requires different starting points for people with different backgrounds. A biologist might skip sections on basic genome biology and start with R programming whereas a computer scientist might want to start with genome biology. In the same manner, a more experienced person might want to refer to this book when s/he needs to do a certain type of analysis where s/he does not have prior experience.


![Creative Commons License](images/by-nc-sa.png)  
The online version of this book is licensed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/). 

## Who is this book for? {-}
The book contains practical and theoretical aspects for computational genomics. Biology and 
medicine generate more data than ever before. Therefore, we need to educate more people with data 
analysis skills and understanding of computational genomics.
Since computational genomics is interdisciplinary, this book aims to be accessible for
biologists, medical scientists, computer scientists and people from other quantitative backgrounds. We wrote this book for the following audiences:

 - Biologists and medical scientists who generate the data and are keen on analyzing it themselves.
 - Students and researchers who are formally starting to do research on or using computational genomics but do not have extensive domain specific knowledge but have at least a beginner level understanding in a quantitative field: math, stats
 - Experienced researchers looking for recipes or quick how-tos to get started in specific data analysis tasks related to computational genomics. 


### What will you get out of this?  {-}
This resource describes the skills and provides how-tos that will help readers 
analyze their own genomics data.

After reading:

- If you are not familiar with R, you will get the basics of R and dive right in to specialized uses of R for computational genomics.
- You will understand genomic intervals and operations on them, such as overlap
- You will be able to use R and its vast package library to do sequence analysis: Such as calculating GC content for given segments of a genome or find transcription factor binding sites
- You will be familiar with visualization techniques used in genomics, such as heatmaps, meta-gene plots and genomic track visualization
- You will be familiar with supervised and unsupervised learning techniques which are important in data modelling and exploratory analysis of high-dimensional data
- You will be familiar with analysis of different high-throughput sequencing data
sets mostly using R based tools.


## Structure of the book {-}
The book is designed with the idea that practical and conceptual
understanding of data analysis methods is as important, if not more important, than the theoretical understanding, such as detailed derivation of equations in statistics or machine learning. That is why we first try to give a conceptual explanation of the concepts then we try to give essential parts of the mathematical formulas for more detailed understanding. In this sprit, we always show the code and 
explain the code for a particular data analysis task. In addition, we give additional references such as books, websites , video lectures and scientific papers for readers who desire to gain deeper theoretical understanding of data analysis related methods or concepts.

"Introduction to Genomics" chapter introduces the basic concepts in genome biology and genomics. Understanding these concepts are important for computational genomics. 

"Introduction to R for Genomic Data Analysis" chapter provides basic R skills necessary to follow the book in addition to common data analysis paradigms we observe in genomic data analysis.  The "Statistics for Genomics", "Exploratory Data Analysis with Unsupervised Machine Learning" and "Predictive Modeling with Supervised Machine Learning" chapters introduces the necessary quantitative skills that one might need when analyzing high-dimensional genomics data.

The "Operations on Genomic Intervals and Genome Arithmetic" chapter introduces the fundamental tools for dealing with genomic intervals and their relationship to each other over the genome. In addition, the chapter introduces a variety of ways used in genomic data visualization. The skills introduced in this chapter are key skills that are needed to work with processed genomic data which are available through public databases such as Ensembl and UCSC browser.

The next chapters deals with specific analysis of high-throughput sequencing data and integrating different kinds of data sets. The "Quality Check, Processing and Alignment of High-throughput Sequencing Reads" chapter introduces quality checks that need to be done on sequencing reads and different ways to process them further. The chapters "RNA-seq Analysis", "ChIP-seq Analysis" and "BS-seq Analysis" deals 
with the analysis techniques for popular high-throughput sequencing techniques. The last chapter "Multi-omics Analysis" deals with methods for integrating multiple omics data sets.


To sum it up, this book is a comprehensive guide for computational genomics. Some sections are there for the sake of the wide interdisciplinary audience and completeness, and not all sections will be equally useful to all readers of this broad audience.

## Software information and conventions {-}

Package names are in bold text (e.g., **methylKit**), and inline code and filenames are formatted in a typewriter font. Function names are followed by parentheses (e.g., `genomation::ScoreMatrix()`). The double-colon operator `::` means accessing an object from a package.

### Assignment operator convention {-}
Traditionally, `<-` is the preffered assignment operator. However, throughout the book we use `=` and `<-` as the assignment operator interchangeably.

### Packages needed to run the book code {-}
This book is primarily about using R packages to analyze genomics data, therefore if you want to reproduce the analysis in this book you need to install the relevant packages in each chapter using `install.packages` or `BiocManager::install` functions. In each chapter, we load the necessary packages with the `library()` or `require()` function when we use the needed functions from respective packages. By looking at that calls, you can see which packages are needed for that code chunk or chapter. If you need to install all the package dependencies for the book, you can run the following command and have a cup of tea while waiting.
```{r,installAllPackages,eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c('qvalue','plot3D','ggplot2','pheatmap',
                      'cluster', 'NbClust', 'fastICA', 'NMF',
                      'Rtsne', 'mosaic', 'knitr', 'genomation',
                      'ggbio', 'Gviz', 'DESeq2', 'RUVSeq',
                      'gProfileR', 'ggfortify', 'corrplot',
                      'gage', 'EDASeq', 'citr', 'formatR',
                      'svglite', 'Rqc', 'ShortRead', 'QuasR',
                      'methylKit','FactoMineR', 'iClusterPlus',
                      'enrichR','caret','xgboost','glmnet',
                      'DALEX','kernlab','pROC','nnet','RANN',
                      'ranger','GenomeInfoDb', 'GenomicRanges',
                      'GenomicAlignments', 'ComplexHeatmap', 'circlize', 
                      'rtracklayer', 'BSgenome.Hsapiens.UCSC.hg38', 'tidyr',
                      'AnnotationHub', 'GenomicFeatures', 'normr',
                      'MotifDb', 'TFBSTools', 'motifRG', 'JASPAR2018'
                     )
```


## Data for the book {-}
We rely on data from different R and Bioconductor packages through out the book. For the datasets that do not ship with those packages, we created our own package [**compGenomRData**](https://github.com/compgenomr/compGenomRData). You can install this package via `devtools::install_github("compgenomr/compGenomRData")`. We use `system.file()` function to get the path to the files. We noticed many inexperienced users are confused about this function. This function just outputs full path to the file that is installed with the data package. 


## Acknowledgements {-}

We wish to thank R and Bioconductor community for developing and maintaining libraries for genomic data analysis. Without their constant work and dedication, writing such a book will not be possible. 

The following people kindly contributed fixes for typos: Thomas Schalch, Alex Gosdschan and Sarvesh Nikumbh.


```{r contributions,child = if (knitr::is_html_output()) 'addons/how-to-contribute.Rmd'}
```

```{r trainings,child = if (knitr::is_html_output()) 'addons/training.Rmd'}
```