
Commit

Update for XeLaTeX and OTF fonts. Add index entries. Update for versions of packages as of 2016-10-22.
aphalo committed Oct 22, 2016
1 parent f4f0ab4 commit b94ad29
Showing 31 changed files with 121,182 additions and 75,069 deletions.
20 changes: 10 additions & 10 deletions R.as.calculator.Rnw
@@ -11,13 +11,13 @@ opts_knit$set(concordance=TRUE)

\section{Aims of this chapter}

In my experience, for those not familiar with computing programming or scripting languages, and who have mostly used computer programs through visual interfaces making heavy use of menus and icons, the best first step in learning \R is to learn the basics of the language through its use at the R command prompt. This will teach not only the syntax and grammar rules, but also give a glimpse at the advantages and flexibility of this approach to data analysis.
In my experience, for those not familiar with computer programming or scripting languages, and who have mostly used computer programs through visual interfaces that make heavy use of menus and icons, the best first step in learning \Rlang is to learn the basics of the language through its use at the R command prompt. This will not only teach the syntax and grammar rules, but also give a glimpse of the advantages and flexibility of this approach to data analysis.

Menu-driven programs are not necessarily bad; they are just unsuitable when there is a need to set very many options and to choose from many different actions. They are also difficult to maintain when extensibility is desired, and when independently developed modules of very different characteristics need to be integrated. Textual languages also have the advantage, to be dealt with in the next chapter, that command sequences can be stored as a human- and computer-readable text file that keeps a record of all the steps used and that in most cases makes it trivial to reproduce the same steps at a later time. Such scripts are also a very simple and handy way of communicating to others how to do a given data analysis.

\section{Working at the R console}

I assume here that you have installed or have had installed by someone else \R and \RStudio and that you are already familiar enough with \RStudio to find your way around its user interface. The examples in this chapter use only the console window, and results are printed to the console. The values stored in the different variables are visible in the Environment tab in \RStudio.
I\index{console} assume here that you have installed, or have had someone else install, \Rpgrm and \RStudio, and that you are already familiar enough with \RStudio to find your way around its user interface. The examples in this chapter use only the console window, and results are printed to the console. The values stored in the different variables are visible in the Environment tab in \RStudio.

In the console you can type commands at the \texttt{>} prompt.
When you end a line by pressing the return key, if the line can be interpreted as an R command, the result will be printed in the console, followed by a new \texttt{>} prompt.
@@ -27,7 +27,7 @@ When working at the command prompt, results are printed by default, but in other

The idea with these examples is that you learn by working out how different commands work based on the results of the example calculations listed. The examples are designed so that they allow the rules, and also a few quirks, to be found by `detective work'. This should hopefully lead to better understanding than just studying rules.

\section{Examples with numbers}
\section{Examples with numbers}\index{math operators}\index{math functions}

When working with arithmetic expressions the normal mathematical precedence rules are respected, but parentheses can be used to alter this order. Parentheses can be nested, and at all nesting levels the normal rounded parentheses are used. The number of opening (left side) and closing (right side) parentheses must be balanced, and they must be located so that each enclosed term is a valid mathematical expression. For example, while \code{(1 + 2) * 3} is valid, \code{(1 +) 2 * 3} is a syntax error, as \code{1 +} is incomplete and cannot be calculated.
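
A minimal sketch illustrating these rules (the chunk label \code{precedence-demo} is an arbitrary choice, not part of the original set of examples):

<<precedence-demo>>=
1 + 2 * 3         # '*' is evaluated before '+'
(1 + 2) * 3       # parentheses change the order of evaluation
((1 + 2) * 3) - 1 # parentheses can be nested
@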

@@ -47,7 +47,7 @@ log2(8)
exp(1)
@

One can use variables to store values. Variable names and all other names in R are case sensitive. Variables \texttt{a} and \texttt{A} are two different variables. Variable names can be quite long, but usually it is not a good idea to use very long names. Here I am using very short names, that is usually a very bad idea. However, in cases like these examples where the stored values have no real connection to the real world and are used just once or twice, these names emphasize the abstract nature.
One can use variables\index{variables}\index{assignment} to store values. The `usual' assignment operator is \Roperator{<-}. Variable names and all other names in R are case sensitive: variables \texttt{a} and \texttt{A} are two different variables. Variable names can be quite long, but usually it is not a good idea to use very long names. Here I am using very short names, which is usually also a bad idea. However, in cases like these examples, where the stored values have no connection to the real world and are used just once or twice, such short names emphasize their abstract nature.


<<numbers-2>>=
@@ -60,7 +60,7 @@ b
3e-2 * 2.0
@

There are some syntactically legal statements that are not very frequently used, but you should be aware that they are valid, as they will not trigger error messages, and may surprise you. The important thing is that you write commands consistently. \texttt{1 -> a} is valid but rarely used. The use of the equals sign (\code{=}) for assignment although valid is generally discouraged as it is seldom used as this use has not earlier been part of the \R language. Chaining assignments as in the first line below is sometimes used, and signals to the human reader that \code{a}, \code{b} and \code{c} are being assigned the same value.
There are some syntactically legal statements that are not very frequently used, but you should be aware that they are valid, as they will not trigger error messages, and may surprise you. The important thing is that you write commands consistently. The `backwards' assignment operator \Roperator{->}, giving code like \texttt{1 -> a}\index{assignment!leftwise}, is valid but rarely used. The use of the equals sign (\Roperator{=}) for assignment, although valid, is generally discouraged: it is seldom used, as this meaning of \Roperator{=} was not originally part of the \R language. Chaining\index{assignment!chaining} assignments, as in the first line below, is sometimes used, and signals to the human reader that \code{a}, \code{b} and \code{c} are being assigned the same value.

<<numbers-3, tidy=FALSE>>=
a <- b <- c <- 0.0
@@ -73,9 +73,9 @@ a = 3
a
@

Numeric variables can contain more than one value. Even single numbers are \code{vector}s of length one. We will later see why this is important. As you have seen above the results of calculations were printed preceded with \texttt{[1]}. This is the index or position in the vector of the first number (or other value) displayed at the head of the current line.
Numeric variables can contain more than one value. Even single numbers are \Rclass{vector}s of length one. We will later see why this is important. As you have seen above, the results of calculations were printed preceded by \texttt{[1]}. This is the index, or position in the vector, of the first number (or other value) displayed at the head of the current line.
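
A small added sketch (the chunk label \code{vector-printing} is an arbitrary choice) makes these indexes easier to see: printing a vector long enough to span several lines shows, at the start of each line, the index of the first value on that line. The exact indexes displayed depend on the width of your console.

<<vector-printing>>=
x <- 1:30 # long enough to be printed on more than one line
x
@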

One can use \texttt{c} `concatenate' to create a vector of numbers from individual numbers.
One can use \Rmethod{c} (`concatenate') to create a vector of numbers from individual numbers.

<<numbers-4>>=
a <- c(3,1,2)
@@ -88,7 +88,7 @@ d <- c(b, a)
d
@

One can also create sequences, or repeat values. In this case I leave to the reader to work out the rules by running these and his/her own examples.
One can also create sequences\index{sequence} using \Rmethod{seq}, or repeat values. In this case I leave it to the reader to work out the rules by running these examples and his/her own.

<<numbers-5>>=
a <- -1:5
@@ -101,7 +101,7 @@ d <- rep(-5, 4)
d
@

Now something that makes \R different from most other programming languages: vectorized arithmetic.
Now for something that makes \Rlang different from most other programming languages: vectorized arithmetic\index{vectorised arithmetic}.

<<numbers-6>>=
a + 1 # we add one to vector a defined above
@@ -110,7 +110,7 @@ a + b
a - a
@

It can be seen in first line above, another peculiarity of \R, that is frequently called ``recycling'': as vector \texttt{a} is of length 6, but the constant 1 is a vector of length 1, this 1 is extended by recycling into a vector of the same length as the longest vector in the statement.
The first line above shows another peculiarity of \R, frequently called ``recycling'':\index{recycling@``recycling''} vector \texttt{a} is of length 6, but the constant 1 is a vector of length 1, so this 1 is extended by recycling into a vector of the same length as the longest vector in the statement.
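
The added sketch below (using freshly defined vectors, so that it does not depend on values stored earlier; the chunk label is arbitrary) shows recycling explicitly, including the warning issued when the longer length is not a multiple of the shorter one.

<<recycling-demo>>=
long <- 1:6
short <- c(10, 20)
long + short             # 'short' is recycled three times
long + c(10, 20, 30, 40) # recycled with a warning: 6 is not a multiple of 4
@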

Make sure you understand what calculations are taking place in the chunks above, and also in the one below.

57 changes: 57 additions & 0 deletions R.data.Rnw
@@ -7,3 +7,60 @@ opts_knit$set(concordance=TRUE)

\chapter{Storing and manipulating data with R}\label{chap:R:data}

\section{Introduction}

By reading the previous chapters, you have already become familiar with base R's classes, methods, functions and operators for storing and manipulating data. Several recently developed packages provide somewhat different, and in my view easier, ways of working with data in R without compromising performance to a level that would matter outside the realm of `big data'. Some other recent packages emphasize computation speed, at some cost with respect to simplicity of use, and in particular intuitiveness. Of course, as with any user interface, much depends on one's own preferences and attitudes to data analysis. However, a package designed for maximum efficiency, like \pkgname{data.table}, requires the user to have a good understanding of computers in order to appreciate the compromises made and the unusual behavior compared to the rest of R. I will base this chapter on what I mostly use myself for everyday data analysis and scripting, and exclude the complexities of R programming and package development.

The chapter is divided into three sections. The first deals with reading data from files produced by other programs or instruments, or typed by users outside of R, with querying databases, and, very briefly, with reading data from the internet. The second section deals with transformations of the data that do not combine different observations, although they may combine different variables from a single observation event, or select certain variables or observations from a larger set. The third section deals with operations that produce summaries or involve other operations on groups of observations.

\section{Data input}

\subsection{Text files}

\subsection{Worksheets}

\subsection{Statistical software}

\subsection{Databases}

\subsection{Data acquisition from web}

\subsection{Data acquisition from physical devices}

\section{Pipes and tees}

\subsection{Processing data step by step}

\subsection{Pipes and tees in the Unix shell}

\subsection{Pipes and tees in R scripts}

\section{Row-wise data manipulations}

\subsection{Computations}

\subsection{Subsetting}

\subsection{Merging and joins}

\section{Column-wise data manipulations}

\subsection{Grouping}

\subsection{Summaries}

\subsection{Variable selection}

\section{Data output}

\subsection{Text files}

\subsection{Worksheets}

\subsection{Statistical software}

\subsection{Databases}

\subsection{Publication to the web}

\subsection{Control of physical devices}
38 changes: 38 additions & 0 deletions R.friends.Rnw
@@ -0,0 +1,38 @@
% !Rnw root = appendix.main.Rnw

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
@

\chapter{If and when R needs help}\label{chap:R:friends}

\section{Introduction}

S (R is sometimes called GNU S) is a very well designed language, but it was designed with a specific purpose in mind. No tool can handle every task one may want to throw at it efficiently and gracefully, and R is no different. However, R is good at working together with other languages and tools. As for most tasks in life, for data analysis one may find the best tools for one's toolbox scattered around and coming from different sources. What is important in any toolbox is that all the tools can be used together in an efficient and `fluid' combination. In this final chapter, I will give an overview of what is available, giving only minimal examples, as a teaser for readers to explore on their own what may be most useful for their jobs.

\section{R's limitations and strengths}

\subsection{Using the best tool for each job}

\subsection{R is great, but}

\subsection{On choice versus fashion}

\subsection{On finding one's own way around}

\subsection{Getting around performance issues}

\subsection{Re-using code written in other languages}

\section{C++}

\section{FORTRAN and C}

\section{Python}

\section{Java}

\section{JavaScript}

\section{sh, bash}
80 changes: 80 additions & 0 deletions R.more.plotting.Rnw
@@ -13,6 +13,7 @@ For executing the examples listed in this chapter you need first to load the fol

<<message=FALSE>>=
library(ggplot2)
library(viridis)
library(ggrepel)
library(ggpmisc)
library(xts)
@@ -33,6 +34,10 @@ search()
getwd()
@

\section[viridis]{\viridis}

A very interesting package that still needs to be described!
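
In the meantime, a minimal sketch of its use (assuming the \code{scale\_color\_viridis()} scale from \pkgname{viridis}; the data set and chunk label are arbitrary choices of mine):

<<viridis-teaser>>=
ggplot(faithful, aes(waiting, eruptions)) +
  geom_point(aes(color = eruptions)) +
  scale_color_viridis()
@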

\section[ggpmisc]{\ggpmisc}

\sloppy
@@ -414,6 +419,81 @@ ggplot(my.data, aes(x, y, colour = group)) +
stat_fit_residuals(formula = formula)
@

\subsection{Filtering observations based on local density}

The statistic \Rfunction{stat\_dens2d\_filter} works best with clouds of observations, so we generate some random data.

<<>>=
set.seed(1234)
nrow <- 200
my.2d.data <-
data.frame(
x = rnorm(nrow),
y = rnorm(nrow) + rep(c(-1, +1), rep(nrow / 2, 2)),
group = rep(c("A", "B"), rep(nrow / 2, 2))
)
@

In most recipes in this section we use \Rfunction{stat\_dens2d\_filter} to highlight observations with the \code{color} aesthetic. Other aesthetics can also be used.

By default, 1/10 of the observations are kept, those from the regions of lowest density.

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red")
@

Here we change the fraction to 1/3.

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red",
keep.fraction = 1/3)
@

We can also set a maximum number of observations to keep.

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red",
keep.number = 3)
@

We can also keep the observations from the densest areas instead of from the sparsest.

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red",
keep.sparse = FALSE)
@

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red",
keep.sparse = FALSE) +
facet_grid(~group)
@

In addition to \Rfunction{stat\_dens2d\_filter} there is \Rfunction{stat\_dens2d\_filter\_g}. The difference is that the first one computes the density on a plot-panel basis while the second one does it on a group basis. This makes a difference only when observations are grouped based on another aesthetic within each panel.

<<>>=
ggplot(my.2d.data, aes(x, y, color = group)) +
geom_point() +
stat_dens2d_filter(shape = 1, size = 3)
@

<<>>=
ggplot(my.2d.data, aes(x, y, color = group)) +
geom_point() +
stat_dens2d_filter_g(shape = 1, size = 3)
@


\subsection{Learning and/or debugging}

A very simple stat named \code{stat\_debug()} can save the work of adding print statements to the code of stats to get information about what data is being passed to the \code{compute\_group()} function. Because the code of this function is stored in a \code{ggproto} object, at the moment it is impossible to directly set breakpoints in it. This \code{stat\_debug()} may also help users diagnose problems with the mapping of aesthetics in their code or just get a better idea of how the internals of \ggplot work.
2 changes: 2 additions & 0 deletions R.plot-gantt-concordance.tex
@@ -0,0 +1,2 @@
\Sconcordance{concordance:R.plot-gantt.tex:R.plot-gantt.Rnw:%
1 11 1 10 0 34 1 38 0 12 1 4 0 6 1 8 0 1 1}
Binary file added R.plot-gantt-tikzDictionary
Binary file not shown.
65 changes: 65 additions & 0 deletions R.plot-gantt.Rnw
@@ -0,0 +1,65 @@
\subsection{Gantt charts}

This example is adapted from Richie Cotton's answer at \url{http://stackoverflow.com/a/3556020/2419892}.

<<gantt-1>>=
library(reshape2)
library(ggplot2)
library(ggpmisc)
library(tibble)
library(dplyr)
@

<<gantt-2>>=
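# task labels: the plain-text version immediately below is overwritten by a
# second version further down that uses LaTeX markup for the species names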
tasks <- c("Screen for photoreceptor mutants (M. truncatula)",
"Transfer photoreceptor mutants to accessions (M. truncatula)",
"Exp. with accessions (M. truncatula)",
"Exp. with photoreceptor mutants (M. truncatula)",
"Exp. with cultivars/lines (V. faba)",
"Exp. with segregating pop. (V. fava)",
"Exp. with photoreceptor mutants (A. thaliana)",
"Exp. with accessions (A. thaliana)")
tasks <- c("Screen for photoreceptor mutants (\\emph{M. truncatula})",
"Transfer photoreceptor mutants to accessions (\\emph{M. truncatula})",
"Exp. with accessions (\\emph{M. truncatula})",
"Exp. with photoreceptor mutants (\\emph{M. truncatula})",
"Exp. with cultivars/lines (\\emph{V. faba})",
"Exp. with segregating pop. (\\emph{V. fava})",
"Exp. with photoreceptor mutants (\\emph{A. thaliana})",
"Exp. with accessions (\\emph{A. thaliana})")
tasks <- factor(tasks, levels = rev(tasks))
levels(tasks)
dfr <- tibble(
task = tasks,
start.date = c(0.0, 1.0, 0.35, 2.0, 0.25, 2.25, 0.0, 1.0),
end.date = c(3.0, 4.0, 4.75, 4.5, 1.70, 4.70, 2.0, 3.0),
location = c("HU+HCM", "HU", "HU+BGU+SARDI", "HU+BGU+SARDI", "HU", "HU", "HU", "HU+HCM"),
# theme = c(),
species = c("M. truncatula", "M. truncatula", "M. truncatula", "M. truncatula", "V. faba", "V. faba", "A. thaliana", "A. thaliana")
)
levels(dfr$task)
dfr <- arrange(dfr, species, start.date)
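# reshape to long format so that the start and end dates of each task become
# two observations, later drawn as the two ends of a line segment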
mdfr <- melt(dfr, measure.vars = c("start.date", "end.date"))
levels(mdfr$task)
@

<<gantt-3>>=
fig.gantt <- ggplot(mdfr, aes(value, task)) +
# geom_debug() +
geom_line(aes(color = location), size = 8) +
# geom_line(aes(color = species), size = 3) +
xlab("Time (years)") +
ylab(NULL) +
xlim(0,5) +
# theme_minimal() +
theme_light() +
theme(legend.position = c(3.2, 3.7))
@

<<gantt-5>>=
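# tikzDevice writes the figure as TikZ/LaTeX code; presumably so that the
# \\emph{} markup in the task labels is typeset by LaTeX rather than drawn as plain text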
tikzDevice::tikz("gantt.tex", width = 6, height = 3)
print(fig.gantt)
dev.off()
@

