edits v2 in

compgenomr · Sep 30, 2020 · 819225e
1 parent b699522
commit 819225e
Showing 15 changed files with 162 additions and 161 deletions.
diff --git a/00-author.Rmd b/00-author.Rmd
@@ -1,18 +1,18 @@
 # About the Authors {-}
 
-[*Dr. Altuna Akalin*](https://github.com/al2na) organized the book structure, wrote most of the book and edited the rest. Altuna is a bioinformatics scientist and the head of Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center in Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He is interested in using machine learning and statistics to uncover patterns related to important biological variables such as disease state and type. He lived in the USA, Norway, Turkey, Japan, and Switzerland in order to pursue research work and education related to computational genomics. The underlying aim of his current work is utilizing complex molecular signatures to provide decision support systems for disease diagnostics and biomarker discovery. In addition to the research efforts and managing a scientific lab, since 2015, he has been organizing and teaching at computational genomics courses in Berlin with participants from across the world. This book is mostly a result of material developed for those and previous teaching efforts at Weill Cornell Medical College in New York and Friedrich Miescher Institute in Basel, Switzerland.
+[*Dr. Altuna Akalin*](https://github.com/al2na) organized the book structure, wrote most of the book and edited the rest. He is a bioinformatics scientist and the head of Bioinformatics and Omics Data Science Platform at the Berlin Institute for Medical Systems Biology, Max Delbrück Center in Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002. He is interested in using machine learning and statistics to uncover patterns related to important biological variables such as disease state and type. He lived in the USA, Norway, Turkey, Japan and Switzerland in order to pursue research work and education related to computational genomics. The underlying aim of his current work is utilizing complex molecular signatures to provide decision support systems for disease diagnostics and biomarker discovery. In addition to the research efforts and the managing of a scientific lab, since 2015, he has been organizing and teaching at computational genomics courses in Berlin with participants from across the world. This book is mostly a result of material developed for those and previous teaching efforts at Weill Cornell Medical College in New York and Friedrich Miescher Institute in Basel, Switzerland.
 
-Altuna Akalin and the following contributing authors have decades of combined experience in data analysis for genomics. They are developers of Bioconductor packages such as [**methylKit**](https://bioconductor.org/packages/release/bioc/html/methylKit.html), [**genomation**](https://bioconductor.org/packages/release/bioc/html/genomation.html), [**RCAS**](https://bioconductor.org/packages/release/bioc/html/RCAS.html) and [**netSmooth**](https://bioconductor.org/packages/release/bioc/html/netSmooth.html). In addition, they have played key roles in developing end-to-end genomics data analysis pipelines for RNA-seq, ChIP-seq, Bisulfite-seq, and single cell RNA-seq called [PiGx](http://bioinformatics.mdc-berlin.de/pigx/).
+Dr. Akalin and the following contributing authors have decades of combined experience in data analysis for genomics. They are developers of Bioconductor packages such as [**methylKit**](https://bioconductor.org/packages/release/bioc/html/methylKit.html), [**genomation**](https://bioconductor.org/packages/release/bioc/html/genomation.html), [**RCAS**](https://bioconductor.org/packages/release/bioc/html/RCAS.html) and [**netSmooth**](https://bioconductor.org/packages/release/bioc/html/netSmooth.html). In addition, they have played key roles in developing end-to-end genomics data analysis pipelines for RNA-seq, ChIP-seq, Bisulfite-seq, and single cell RNA-seq called [PiGx](http://bioinformatics.mdc-berlin.de/pigx/).
 
 **Contributing authors**
 
-[*Dr. Bora Uyar*](https://github.com/borauyar) contributed Chapter 8, "RNA-seq Analysis". Bora started his bioinformatics training in Sabanci University (Istanbul/Turkey), from which he got his undergraduate degree. Later, he obtained an MSc from Simon Fraser University (Vancouver/Canada), then a PhD from the European Molecular Biology Laboratory in Heidelberg/Germany. Since 2015, he has been working as a bioinformatics scientist at the Bioinformatics Platform and Omics Data Science Platform at the Berlin Institute for Medical Systems Biology. He has been contributing to the bioinformatics platform through research, collaborations, services and data analysis method development. His current primary research interest is the integration of multiple types of omics datasets to discover prognostic/diagnostic biomarkers of cancers. 
+[*Dr. Bora Uyar*](https://github.com/borauyar) contributed Chapter 8, "RNA-seq Analysis". He started his bioinformatics training in Sabanci University (Istanbul/Turkey), from which he got his undergraduate degree. Later, he obtained an MSc from Simon Fraser University (Vancouver/Canada), then a PhD from the European Molecular Biology Laboratory in Heidelberg/Germany. Since 2015, he has been working as a bioinformatics scientist at the Bioinformatics Platform and Omics Data Science Platform at the Berlin Institute for Medical Systems Biology. He has been contributing to the bioinformatics platform through research, collaborations, services and data analysis method development. His current primary research interest is the integration of multiple types of omics datasets to discover prognostic/diagnostic biomarkers of cancers. 
 
 
-[*Dr. Vedran Franke*](https://github.com/frenkiboy) contributed Chapter 9, "ChIP-seq Analysis". Vedran Franke received his PhD from the University of Zagreb. His work focused on the biogenesis and function of small RNA molecules during early embryogenesis, and establishment of pluripotency. Prior to his PhD, he worked as a scientific researcher under Boris
+[*Dr. Vedran Franke*](https://github.com/frenkiboy) contributed Chapter 9, "ChIP-seq Analysis". He received his PhD from the University of Zagreb. His work focused on the biogenesis and function of small RNA molecules during early embryogenesis, and establishment of pluripotency. Prior to his PhD, he worked as a scientific researcher under Boris
 Lenhard at the University of Bergen, Norway, focusing on principles of gene enhancer functions. He continues his research in the
-Bioinformatics and Omics Data Science Platform at the Berlin Institute for Medical System Biology. He develops tools for multi-omics data integration, focusing on single-cell RNA sequencing, and epigenomics. His integrated knowledge of cellular physiology along with his proficiency in data analysis enables him to find creative solutions to difficult biological problems.
+Bioinformatics and Omics Data Science Platform at the Berlin Institute for Medical System Biology. He develops tools for multi-omics data integration, focusing on single-cell RNA sequencing, and epigenomics. His integrated knowledge of cellular physiology along with his proficiency in data analysis enable him to find creative solutions to difficult biological problems.
 
-[*Dr. Jonathan Ronen*](https://github.com/jonathanronen) contributed Chapter 11, "Multi-omics Analysis". Jonathan got his MSc in control engineering from the Norwegian University of Science and Technology in 2010. He then worked as a software developer in Oslo, Brussels, and Munich. During that time, he was also on the founding team of www.holderdeord.no, a website that links votes in the Norwegian parliament to pledges made in party manifestos. In 2014--2015, Jonathan worked as a data scientist in New York University's Social Media and Political Participation lab. During that time, he also launched www.lahadam.co.il, a website which tracked Israeli politicians' facebook posts. Jonathan obtained a PhD in computational biology in 2020, where he has published tools for imputation for single cell RNA-seq using priors, and integrative analysis of multi-omics data using deep learning.
+[*Dr. Jonathan Ronen*](https://github.com/jonathanronen) contributed Chapter 11, "Multi-omics Analysis". Dr. Ronen got his MSc in control engineering from the Norwegian University of Science and Technology in 2010. He then worked as a software developer in Oslo, Brussels, and Munich. During that time, he was also on the founding team of www.holderdeord.no, a website that links votes in the Norwegian parliament to pledges made in party manifestos. In 2014--2015, He worked as a data scientist in New York University's Social Media and Political Participation lab. During that time, he also launched www.lahadam.co.il, a website which tracked Israeli politicians' Facebook posts. He obtained a PhD in computational biology in 2020, where he has published tools for imputation for single cell RNA-seq using priors, and integrative analysis of multi-omics data using deep learning.
 
 
diff --git a/01-intro2Genomics.Rmd b/01-intro2Genomics.Rmd
@@ -724,7 +724,7 @@ __Want to know more?__
 
 - Current and the future high-throughput sequencing technologies: http://www.sciencedirect.com/science/article/pii/S1097276515003408
 
-- Illumina repository for different library preparation protocols for sequencingL http://www.illumina.com/techniques/sequencing/ngs-library-prep/library-prep-methods.html
+- Illumina repository for different library preparation protocols for sequencing: http://www.illumina.com/techniques/sequencing/ngs-library-prep/library-prep-methods.html
 
 
 ```
@@ -747,7 +747,7 @@ known gene sequences are mapped on to them, and you can also
 have conservation to other species to filter functional elements. If you
 are working with a model organism or human, you will also have a lot of
 auxiliary information to help demarcate the functional regions such
-as regulatory regions, ncRNAs,and SNPs that are common in the population.
+as regulatory regions, ncRNAs, and SNPs that are common in the population.
 Or you might have disease- or tissue-specific data available.
 The more the organism is worked on, the more auxiliary data you will have.
 

diff --git a/02-intro2R.Rmd b/02-intro2R.Rmd
@@ -80,7 +80,7 @@ On top of these, genomic data-specific processing and quality check can be achie
 
 #### General data analysis and exploration
 
-Most genomics data sets are suitable for application of general data analysis tools. In some cases, you may need to preprocess the data to get it to a state that is suitable for application of such tools. Here is a non-exhaustive list of what kind of things can be done via R. You will see popular data analysis methods in Chapters \@ref(stats),\@ref(unsupervisedLearning) and \@ref(supervisedLearning).
+Most genomics data sets are suitable for application of general data analysis tools. In some cases, you may need to preprocess the data to get it to a state that is suitable for application of such tools. Here is a non-exhaustive list of what kind of things can be done via R. You will see popular data analysis methods in Chapters \@ref(stats), \@ref(unsupervisedLearning) and \@ref(supervisedLearning).
 
  - Unsupervised data analysis: clustering (k-means, hierarchical), matrix factorization 
 (PCA, ICA, etc.)
@@ -448,7 +448,7 @@ In R, there are other plotting systems besides “base graphics”, which is wha
 Next we will see how this works in practice. Let’s start with a simple scatter plot using `ggplot2`. In order to make basic plots in `ggplot2`, one needs to combine different components. First, we need the data and its transformation to a geometric object; for a scatter plot this would be mapping data to points, for histograms it would be binning the data and making bars. Second, we need the scales and coordinate system, which generates axes and legends so that we can see the values on the plot. And the last component is the plot annotation such as plot title and the background.
 
 
-The main `ggplot2` function, called `ggplot()`, requires a data frame to work with, and this data frame is its first argument as shown in the code snippet below. The second thing you will notice is the `aes()` function in the `ggplot()` function. This function defines which columns in the data frame map to x and y coordinates and if they should be colored or have different shapes based on the values in a different column. These elements are the “aesthetic” elements, this is what we observe in the plot. The last line in the code represents the geometric object to be plotted. These geometric objects defines the type of the plot. In this case, the object is a point, indicated by the `geom_point()`function. Another, peculiar thing in the code is the `+` operation. In `ggplot2`, this operation is used to add layers and modify the plot. The resulting scatter plot from the code snippet below can be seen in Figure \@ref(fig:ggScatterchp3). 
+The main `ggplot2` function, called `ggplot()`, requires a data frame to work with, and this data frame is its first argument as shown in the code snippet below. The second thing you will notice is the `aes()` function in the `ggplot()` function. This function defines which columns in the data frame map to x and y coordinates and if they should be colored or have different shapes based on the values in a different column. These elements are the “aesthetic” elements, this is what we observe in the plot. The last line in the code represents the geometric object to be plotted. These geometric objects define the type of the plot. In this case, the object is a point, indicated by the `geom_point()`function. Another, peculiar thing in the code is the `+` operation. In `ggplot2`, this operation is used to add layers and modify the plot. The resulting scatter plot from the code snippet below can be seen in Figure \@ref(fig:ggScatterchp3). 
 
 ```{r ggScatterchp3,fig.cap="Scatter plot with ggplot2"}
 
@@ -756,7 +756,7 @@ Ex: `matrix(1:6,nrow=3,ncol=2)` makes a 3x2 matrix using numbers between 1 and 6
 21. What happens when you use `byrow = TRUE` in your matrix() as an additional argument?
 Ex: `mat=matrix(1:6,nrow=3,ncol=2,byrow = TRUE)`. [Difficulty: **Beginner**]
 
-22. Extract the first 3 columns and first 3 rows of your matrix using `[]` notation.[Difficulty: **Beginner**]
+22. Extract the first 3 columns and first 3 rows of your matrix using `[]` notation. [Difficulty: **Beginner**]
 
 23. Extract the last two rows of the matrix you created earlier.
 Ex: `mat[2:3,]` or `mat[c(2,3),]` extracts the 2nd and 3rd rows.
@@ -842,7 +842,7 @@ to get the file path within the installed `compGenomRData` package. [Difficulty:
 
 2. Use `head()` on CpGi to see the first few rows. [Difficulty: **Beginner**]
 
-3. Why doesn't the following work? see `sep` argument at `help(read.table)`. [Difficulty: **Beginner**]
+3. Why doesn't the following work? See `sep` argument at `help(read.table)`. [Difficulty: **Beginner**]
 
 ```{r readCpGex, eval=FALSE}
 cpgtFilePath=system.file("extdata",
@@ -863,7 +863,8 @@ cpgiHF=read.table("intro2R_data/data/CpGi.table.hg18.txt",
 
 5. Read only the first 10 rows of the CpGi table. [Difficulty: **Beginner/Intermediate**]  
 
-6. Use `cpgFilePath=system.file("extdata","CpGi.table.hg18.txt",package="compGenomRData")` to get the file path, then use
+6. Use `cpgFilePath=system.file("extdata","CpGi.table.hg18.txt",`
+`package="compGenomRData")` to get the file path, then use
 `read.table()` with argument `header=FALSE`. Use `head()` to see the results. [Difficulty: **Beginner**]  
 
 
@@ -876,7 +877,7 @@ home folder.[Difficulty: **Beginner**]
 
 
 9. Write out the first 10 rows of the `cpgi` data frame.
-**HINT:** use subsetting for data frames we learned before. [Difficulty: **Beginner**]  
+**HINT:** Use subsetting for data frames we learned before. [Difficulty: **Beginner**]  
 
 
 
@@ -1011,7 +1012,7 @@ The 'perGc' column in the data stands for GC percent => percentage of C+G nucleo
 
 3. Use if/else structure to decide if the given GC percent is high, low or medium.
 If it is low, high, or medium: low < 60, high>75, medium is between 60 and 75;
-use greater or less than operators, `<`  or ` >`. Fill in the values in the in code below, where it is written 'YOU_FILL_IN'. [Difficulty: **Intermediate**]
+use greater or less than operators, `<`  or ` >`. Fill in the values in the code below, where it is written 'YOU_FILL_IN'. [Difficulty: **Intermediate**]
 ```{r functionEvExchp2,echo=TRUE,eval=FALSE}
 
 GCper=65