
Commit

Update for XeLaTeX and OTF fonts. Add index entries. Update for versions of packages as of 2016-10-22.
aphalo committed Oct 22, 2016
1 parent f4f0ab4 commit b94ad29
Showing 31 changed files with 121,182 additions and 75,069 deletions.
20 changes: 10 additions & 10 deletions R.as.calculator.Rnw
@@ -11,13 +11,13 @@ opts_knit$set(concordance=TRUE)

\section{Aims of this chapter}

In my experience, for those not familiar with computing programming or scripting languages, and who have mostly used computer programs through visual interfaces making heavy use of menus and icons, the best first step in learning \R is to learn the basics of the language through its use at the R command prompt. This will teach not only the syntax and grammar rules, but also give a glimpse at the advantages and flexibility of this approach to data analysis.
In my experience, for those not familiar with computer programming or scripting languages, and who have mostly used computer programs through visual interfaces that make heavy use of menus and icons, the best first step in learning \Rlang is to learn the basics of the language through its use at the R command prompt. This will not only teach the syntax and grammar rules, but also give a glimpse of the advantages and flexibility of this approach to data analysis.

Menu-driven programs are not necessarily bad; they are just unsuitable when there is a need to set very many options and to choose from many different actions. They are also difficult to maintain when extensibility is desired, and when independently developed modules of very different characteristics need to be integrated. Textual languages also have the advantage, to be dealt with in the next chapter, that command sequences can be stored as a human- and computer-readable text file that keeps a record of all the steps used and that in most cases makes it trivial to reproduce the same steps at a later time. Such scripts are also a very simple and handy way of communicating to others how to do a given data analysis.

\section{Working at the R console}

I assume here that you have installed or have had installed by someone else \R and \RStudio and that you are already familiar enough with \RStudio to find your way around its user interface. The examples in this chapter use only the console window, and results are printed to the console. The values stored in the different variables are visible in the Environment tab in \RStudio.
I\index{console} assume here that you have installed, or have had someone else install, \Rpgrm and \RStudio, and that you are already familiar enough with \RStudio to find your way around its user interface. The examples in this chapter use only the console window, and results are printed to the console. The values stored in the different variables are visible in the Environment tab in \RStudio.

In the console you can type commands at the \texttt{>} prompt.
When you end a line by pressing the return key, if the line can be interpreted as an R command, the result will be printed in the console, followed by a new \texttt{>} prompt.
@@ -27,7 +27,7 @@ When working at the command prompt, results are printed by default, but in other

The idea with these examples is that you learn by working out how different commands work based on the results of the example calculations listed. The examples are designed so that they allow the rules, and also a few quirks, to be found by `detective work'. This should hopefully lead to better understanding than just studying rules.

\section{Examples with numbers}
\section{Examples with numbers}\index{math operators}\index{math functions}

When working with arithmetic expressions the normal mathematical precedence rules are respected, but parentheses can be used to alter this order. Parentheses can be nested, and at all nesting levels the normal rounded parentheses are used. The number of opening (left side) and closing (right side) parentheses must be balanced, and they must be located so that each enclosed term is a valid mathematical expression. For example, while \code{(1 + 2) * 3} is valid, \code{(1 +) 2 * 3} is a syntax error, as \code{1 +} is incomplete and cannot be calculated.
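
A minimal sketch illustrating these rules (the chunk label \code{precedence-demo} is an arbitrary choice, not part of the original set of examples):

<<precedence-demo>>=
1 + 2 * 3         # '*' is evaluated before '+'
(1 + 2) * 3       # parentheses change the order of evaluation
((1 + 2) * 3) - 1 # parentheses can be nested
@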

@@ -47,7 +47,7 @@ log2(8)
exp(1)
@

One can use variables to store values. Variable names and all other names in R are case sensitive. Variables \texttt{a} and \texttt{A} are two different variables. Variable names can be quite long, but usually it is not a good idea to use very long names. Here I am using very short names, that is usually a very bad idea. However, in cases like these examples where the stored values have no real connection to the real world and are used just once or twice, these names emphasize the abstract nature.
One can use variables\index{variables}\index{assignment} to store values. The `usual' assignment operator is \Roperator{<-}. Variable names and all other names in R are case sensitive: variables \texttt{a} and \texttt{A} are two different variables. Variable names can be quite long, but usually it is not a good idea to use very long names. Here I am using very short names, which is usually also a bad idea. However, in cases like these examples, where the stored values have no connection to the real world and are used just once or twice, such short names emphasize their abstract nature.


<<numbers-2>>=
@@ -60,7 +60,7 @@ b
3e-2 * 2.0
@

There are some syntactically legal statements that are not very frequently used, but you should be aware that they are valid, as they will not trigger error messages, and may surprise you. The important thing is that you write commands consistently. \texttt{1 -> a} is valid but rarely used. The use of the equals sign (\code{=}) for assignment although valid is generally discouraged as it is seldom used as this use has not earlier been part of the \R language. Chaining assignments as in the first line below is sometimes used, and signals to the human reader that \code{a}, \code{b} and \code{c} are being assigned the same value.
There are some syntactically legal statements that are not very frequently used, but you should be aware that they are valid, as they will not trigger error messages, and may surprise you. The important thing is that you write commands consistently. The `backwards' assignment operator \Roperator{->}, giving code like \texttt{1 -> a}\index{assignment!leftwise}, is valid but rarely used. The use of the equals sign (\Roperator{=}) for assignment, although valid, is generally discouraged: it is seldom used, as this meaning of \Roperator{=} was not originally part of the \R language. Chaining\index{assignment!chaining} assignments, as in the first line below, is sometimes used, and signals to the human reader that \code{a}, \code{b} and \code{c} are being assigned the same value.

<<numbers-3, tidy=FALSE>>=
a <- b <- c <- 0.0
@@ -73,9 +73,9 @@ a = 3
a
@

Numeric variables can contain more than one value. Even single numbers are \code{vector}s of length one. We will later see why this is important. As you have seen above the results of calculations were printed preceded with \texttt{[1]}. This is the index or position in the vector of the first number (or other value) displayed at the head of the current line.
Numeric variables can contain more than one value. Even single numbers are \Rclass{vector}s of length one. We will later see why this is important. As you have seen above, the results of calculations were printed preceded by \texttt{[1]}. This is the index, or position in the vector, of the first number (or other value) displayed at the head of the current line.
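
A small added sketch (the chunk label \code{vector-printing} is an arbitrary choice) makes these indexes easier to see: printing a vector long enough to span several lines shows, at the start of each line, the index of the first value on that line. The exact indexes displayed depend on the width of your console.

<<vector-printing>>=
x <- 1:30 # long enough to be printed on more than one line
x
@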

One can use \texttt{c} `concatenate' to create a vector of numbers from individual numbers.
One can use \Rmethod{c} (`concatenate') to create a vector of numbers from individual numbers.

<<numbers-4>>=
a <- c(3,1,2)
@@ -88,7 +88,7 @@ d <- c(b, a)
d
@

One can also create sequences, or repeat values. In this case I leave to the reader to work out the rules by running these and his/her own examples.
One can also create sequences\index{sequence} using \Rmethod{seq}, or repeat values. In this case I leave it to the reader to work out the rules by running these examples and his/her own.

<<numbers-5>>=
a <- -1:5
@@ -101,7 +101,7 @@ d <- rep(-5, 4)
d
@

Now something that makes \R different from most other programming languages: vectorized arithmetic.
Now for something that makes \Rlang different from most other programming languages: vectorized arithmetic\index{vectorised arithmetic}.

<<numbers-6>>=
a + 1 # we add one to vector a defined above
@@ -110,7 +110,7 @@ a + b
a - a
@

It can be seen in first line above, another peculiarity of \R, that is frequently called ``recycling'': as vector \texttt{a} is of length 6, but the constant 1 is a vector of length 1, this 1 is extended by recycling into a vector of the same length as the longest vector in the statement.
The first line above shows another peculiarity of \R, frequently called ``recycling'':\index{recycling@``recycling''} vector \texttt{a} is of length 6, but the constant 1 is a vector of length 1, so this 1 is extended by recycling into a vector of the same length as the longest vector in the statement.
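
The added sketch below (using freshly defined vectors, so that it does not depend on values stored earlier; the chunk label is arbitrary) shows recycling explicitly, including the warning issued when the longer length is not a multiple of the shorter one.

<<recycling-demo>>=
long <- 1:6
short <- c(10, 20)
long + short             # 'short' is recycled three times
long + c(10, 20, 30, 40) # recycled with a warning: 6 is not a multiple of 4
@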

Make sure you understand what calculations are taking place in the chunks above, and also in the one below.

57 changes: 57 additions & 0 deletions R.data.Rnw
@@ -7,3 +7,60 @@ opts_knit$set(concordance=TRUE)

\chapter{Storing and manipulating data with R}\label{chap:R:data}

\section{Introduction}

By reading the previous chapters, you have already become familiar with base R's classes, methods, functions and operators for storing and manipulating data. Several recently developed packages provide somewhat different, and in my view easier, ways of working with data in R without compromising performance to a level that would matter outside the realm of `big data'. Some other recent packages emphasize computation speed, at some cost with respect to simplicity of use, and in particular intuitiveness. Of course, as with any user interface, much depends on one's own preferences and attitudes to data analysis. However, a package designed for maximum efficiency, like \pkgname{data.table}, requires the user to have a good understanding of computers in order to appreciate the compromises made and the unusual behavior compared to the rest of R. I will base this chapter on what I mostly use myself for everyday data analysis and scripting, and exclude the complexities of R programming and package development.

The chapter is divided into three sections. The first deals with reading data from files produced by other programs or instruments, or typed by users outside of R, with querying databases, and, very briefly, with reading data from the internet. The second section deals with transformations of the data that do not combine different observations, although they may combine different variables from a single observation event, or select certain variables or observations from a larger set. The third section deals with operations that produce summaries or involve other operations on groups of observations.

\section{Data input}

\subsection{Text files}

\subsection{Worksheets}

\subsection{Statistical software}

\subsection{Databases}

\subsection{Data acquisition from web}

\subsection{Data acquisition from physical devices}

\section{Pipes and tees}

\subsection{Processing data step by step}

\subsection{Pipes and tees in the Unix shell}

\subsection{Pipes and tees in R scripts}

\section{Row-wise data manipulations}

\subsection{Computations}

\subsection{Subsetting}

\subsection{Merging and joins}

\section{Column-wise data manipulations}

\subsection{Grouping}

\subsection{Summaries}

\subsection{Variable selection}

\section{Data output}

\subsection{Text files}

\subsection{Worksheets}

\subsection{Statistical software}

\subsection{Databases}

\subsection{Publication to the web}

\subsection{Control of physical devices}
38 changes: 38 additions & 0 deletions R.friends.Rnw
@@ -0,0 +1,38 @@
% !Rnw root = appendix.main.Rnw

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
@

\chapter{If and when R needs help}\label{chap:R:friends}

\section{Introduction}

S (R is sometimes called GNU S) is a very well designed language, but it was designed with a specific purpose in mind. No tool can handle every task one may want to throw at it efficiently and gracefully, and R is no different. However, R is good at working together with other languages and tools. As for most tasks in life, for data analysis one may find the best tools for one's toolbox scattered around and coming from different sources. What is important in any toolbox is that all the tools can be used together in an efficient and `fluid' combination. In this final chapter, I will give an overview of what is available, giving only minimal examples, as a teaser for readers to explore on their own what may be most useful for their jobs.

\section{R's limitations and strengths}

\subsection{Using the best tool for each job}

\subsection{R is great, but}

\subsection{On choice versus fashion}

\subsection{On finding one's own way around}

\subsection{Getting around performance issues}

\subsection{Re-using code written in other languages}

\section{C++}

\section{FORTRAN and C}

\section{Python}

\section{Java}

\section{JavaScript}

\section{sh, bash}
80 changes: 80 additions & 0 deletions R.more.plotting.Rnw
@@ -13,6 +13,7 @@ For executing the examples listed in this chapter you need first to load the fol

<<message=FALSE>>=
library(ggplot2)
library(viridis)
library(ggrepel)
library(ggpmisc)
library(xts)
@@ -33,6 +34,10 @@ search()
getwd()
@

\section[viridis]{\viridis}

A very interesting package that still needs to be described!
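
In the meantime, a minimal sketch of its use (assuming the \code{scale\_color\_viridis()} scale from \pkgname{viridis}; the data set and chunk label are arbitrary choices of mine):

<<viridis-teaser>>=
ggplot(faithful, aes(waiting, eruptions)) +
  geom_point(aes(color = eruptions)) +
  scale_color_viridis()
@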

\section[ggpmisc]{\ggpmisc}

\sloppy
@@ -414,6 +419,81 @@ ggplot(my.data, aes(x, y, colour = group)) +
stat_fit_residuals(formula = formula)
@

\subsection{Filtering observations based on local density}

The statistic \Rfunction{stat\_dens2d\_filter} works best with clouds of observations, so we generate some random data.

<<>>=
set.seed(1234)
nrow <- 200
my.2d.data <-
data.frame(
x = rnorm(nrow),
y = rnorm(nrow) + rep(c(-1, +1), rep(nrow / 2, 2)),
group = rep(c("A", "B"), rep(nrow / 2, 2))
)
@

In most recipes in this section we use \Rfunction{stat\_dens2d\_filter} to highlight observations with the \code{color} aesthetic. Other aesthetics can also be used.

By default, 1/10 of the observations are kept, those from the regions of lowest density.

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red")
@

Here we change the fraction to 1/3.

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red",
keep.fraction = 1/3)
@

We can also set a maximum number of observations to keep.

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red",
keep.number = 3)
@

We can also keep the observations from the densest areas instead of from the sparsest.

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red",
keep.sparse = FALSE)
@

<<>>=
ggplot(my.2d.data, aes(x, y)) +
geom_point() +
stat_dens2d_filter(color = "red",
keep.sparse = FALSE) +
facet_grid(~group)
@

In addition to \Rfunction{stat\_dens2d\_filter} there is \Rfunction{stat\_dens2d\_filter\_g}. The difference is that the first one computes the density on a plot-panel basis while the second one does it on a group basis. This makes a difference only when observations are grouped based on another aesthetic within each panel.

<<>>=
ggplot(my.2d.data, aes(x, y, color = group)) +
geom_point() +
stat_dens2d_filter(shape = 1, size = 3)
@

<<>>=
ggplot(my.2d.data, aes(x, y, color = group)) +
geom_point() +
stat_dens2d_filter_g(shape = 1, size = 3)
@


\subsection{Learning and/or debugging}

A very simple stat named \code{stat\_debug()} can save the work of adding print statements to the code of stats to get information about what data is being passed to the \code{compute\_group()} function. Because the code of this function is stored in a \code{ggproto} object, at the moment it is impossible to directly set breakpoints in it. This \code{stat\_debug()} may also help users diagnose problems with the mapping of aesthetics in their code or just get a better idea of how the internals of \ggplot work.
2 changes: 2 additions & 0 deletions R.plot-gantt-concordance.tex
@@ -0,0 +1,2 @@
\Sconcordance{concordance:R.plot-gantt.tex:R.plot-gantt.Rnw:%
1 11 1 10 0 34 1 38 0 12 1 4 0 6 1 8 0 1 1}
Binary file added R.plot-gantt-tikzDictionary
Binary file not shown.
65 changes: 65 additions & 0 deletions R.plot-gantt.Rnw
@@ -0,0 +1,65 @@
\subsection{Gantt charts}

This example is adapted from Richie Cotton's answer at \url{http://stackoverflow.com/a/3556020/2419892}.

<<gantt-1>>=
library(reshape2)
library(ggplot2)
library(ggpmisc)
library(tibble)
library(dplyr)
@

<<gantt-2>>=
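# task labels: the plain-text version immediately below is overwritten by a
# second version further down that uses LaTeX markup for the species names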
tasks <- c("Screen for photoreceptor mutants (M. truncatula)",
"Transfer photoreceptor mutants to accessions (M. truncatula)",
"Exp. with accessions (M. truncatula)",
"Exp. with photoreceptor mutants (M. truncatula)",
"Exp. with cultivars/lines (V. faba)",
"Exp. with segregating pop. (V. fava)",
"Exp. with photoreceptor mutants (A. thaliana)",
"Exp. with accessions (A. thaliana)")
tasks <- c("Screen for photoreceptor mutants (\\emph{M. truncatula})",
"Transfer photoreceptor mutants to accessions (\\emph{M. truncatula})",
"Exp. with accessions (\\emph{M. truncatula})",
"Exp. with photoreceptor mutants (\\emph{M. truncatula})",
"Exp. with cultivars/lines (\\emph{V. faba})",
"Exp. with segregating pop. (\\emph{V. fava})",
"Exp. with photoreceptor mutants (\\emph{A. thaliana})",
"Exp. with accessions (\\emph{A. thaliana})")
tasks <- factor(tasks, levels = rev(tasks))
levels(tasks)
dfr <- tibble(
task = tasks,
start.date = c(0.0, 1.0, 0.35, 2.0, 0.25, 2.25, 0.0, 1.0),
end.date = c(3.0, 4.0, 4.75, 4.5, 1.70, 4.70, 2.0, 3.0),
location = c("HU+HCM", "HU", "HU+BGU+SARDI", "HU+BGU+SARDI", "HU", "HU", "HU", "HU+HCM"),
# theme = c(),
species = c("M. truncatula", "M. truncatula", "M. truncatula", "M. truncatula", "V. faba", "V. faba", "A. thaliana", "A. thaliana")
)
levels(dfr$task)
dfr <- arrange(dfr, species, start.date)
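# reshape to long format so that the start and end dates of each task become
# two observations, later drawn as the two ends of a line segment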
mdfr <- melt(dfr, measure.vars = c("start.date", "end.date"))
levels(mdfr$task)
@

<<gantt-3>>=
fig.gantt <- ggplot(mdfr, aes(value, task)) +
# geom_debug() +
geom_line(aes(color = location), size = 8) +
# geom_line(aes(color = species), size = 3) +
xlab("Time (years)") +
ylab(NULL) +
xlim(0,5) +
# theme_minimal() +
theme_light() +
theme(legend.position = c(3.2, 3.7))
@

<<gantt-5>>=
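# tikzDevice writes the figure as TikZ/LaTeX code; presumably so that the
# \\emph{} markup in the task labels is typeset by LaTeX rather than drawn as plain text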
tikzDevice::tikz("gantt.tex", width = 6, height = 3)
print(fig.gantt)
dev.off()
@

