% !Rnw root = appendix.main.Rnw

<<>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'container-chunk')
@

\chapter{Base R: ``Collective Nouns''}\label{chap:R:collective}

\begin{VF}
The information that is available to the computer consists of a selected set of \emph{data} about the real world, namely, that set which is considered relevant to the problem at hand, that set from which it is believed that the desired results can be derived. The data represent an abstraction of reality\ldots

\VA{Niklaus Wirth}{\emph{Algorithms + Data Structures = Programs}, 1976}\nocite{Wirth1976}
\end{VF}

\section{Aims of this chapter}

Data-set organization and storage is one of the keys to efficient data analysis. A central question is how to keep together all the information that belongs together, say, all measurements from an experiment and the corresponding metadata such as the treatments applied and/or dates. The title ``collective nouns'' reflects the idea that a data set is a collection of data objects.

In this chapter you will become familiar with how data sets are usually managed in \Rlang. I use both abstract examples, to emphasize the general properties of data sets and the \Rlang classes available for their storage, and a few more specific examples, to demonstrate their use more concretely. While in chapter \ref{chap:R:as:calc} the focus was on atomic data types and objects, like vectors, useful for the storage of collections of values of a given type, like numbers, in the present chapter the focus is on the storage within a single object of heterogeneous data, such as a combination of factors, and character and numeric vectors. Broadly speaking, heterogeneous \emph{data containers}.

As in the previous chapter, I use diagrams to describe the structure of objects.

\index{data sets!their storage|(}

\section{Data from surveys and experiments}
\index{data sets!origin}\index{data sets!characteristics}
The data we plot, summarize and analyze in \Rlang in most cases originate from measurements done as part of experiments or surveys. From a statistical perspective, data collected mechanically from user interactions with web sites, or by crawling through internet content, also originate from surveys. The value of any data comes from knowing their origin, say the treatments applied to plants, or the country from which web site users connect. Sometimes several properties are of interest to describe the origin of the data, and in other cases observations consist of the measurement of multiple properties on each subject under study. Consequently, all software designed for data analysis implements ways of dealing with data sets as a whole, both during storage and when passing them as arguments to functions. A data set is a usually heterogeneous collection of data together with related information.

In \Rlang, lists are the most flexible type of objects useful for storing whole data sets. In most cases we do not need this much flexibility, so rectangular collections of observations are most frequently stored in a variation upon lists called data frames. These objects can have as their members the vectors and factors we described in chapter \ref{chap:R:as:calc}.
Any \Rlang object can have attributes, allowing objects to carry along additional bits of information. Some attributes, like comments, are part of \Rlang and aimed at the storage of ancillary information or metadata by users. Other attributes are used internally by \Rlang, and, finally, users can store arbitrary ancillary data in attributes created \emph{ad hoc}.

\section{Lists}\label{sec:calc:lists}
\index{lists|(}\qRclass{list}
In \Rlang, objects of class \Rclass{list} are in several respects similar to the vectors described in chapter \ref{chap:R:as:calc} but, differently from vectors, the members they contain can be heterogeneous, i.e., different members of the same list can belong to different classes. In addition, while the member elements of a vector must be \emph{atomic} values like numbers or character strings, any \Rlang object can be a list member, including other lists.

In \Rlang, the members of a list can be considered as following a sequence, accessible through numerical indexes, the same as the members of vectors. Members of a list, as well as members of a vector, can be named, and retrieved (indexed) through their names. In practice, named lists are more frequently used than named vectors. Lists are created using function \Rfunction{list()}, similarly to how \Rfunction{c()} is used for vectors.

\begin{explainbox}
In \Rlang lists can have as members not only objects storing data on observations and categories, but also function definitions, model formulas, unevaluated expressions, matrices, arrays, and objects of user defined classes.
\end{explainbox}

\begin{explainbox}
List and list-like objects are widely used in \Rlang because they make it possible to keep, for example, the data, the instructions for operations and the results from operations together in a single \Rlang object that can be saved, copied, etc.\ as a unit. This avoids the proliferation of multiple disconnected objects whose interrelations are encoded only in their names, or, worse, in separate notes or a person's memory, all approaches that are error prone. The model fit functions described in chapter \ref{chap:R:statistics} are good examples of this approach. The objects used to store the instructions to build plots with multiple layers, described in chapter \ref{chap:R:plotting}, are also good examples.
\end{explainbox}
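Peeking ahead, the value returned by model fitting function \Rfunction{lm()}, described in chapter \ref{chap:R:statistics}, illustrates this: the fitted-model object is implemented as a list. A minimal sketch, using the \code{cars} data set included in \Rlang (the object name \code{fit1} is of my choosing).

<<>>=
fit1 <- lm(dist ~ speed, data = cars) # fit a linear model
is.list(fit1)    # the fitted model object is a list
names(fit1)[1:4] # a few of its many members
@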
Our first list has as its members three different vectors, each one belonging to a different class: \code{numeric}, \code{character} and \code{logical}. The three vectors also differ in their length: 3, 1, and 2, respectively.\qRfunction{list()}\qRfunction{names()}

<<>>=
lst1 <- list(x = 1:3, y = "ab", z = c(TRUE, FALSE))
@

<<>>=
str(lst1)
names(lst1)
@

\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily, my shape/.style={
    rectangle split, rectangle split parts=#1, draw, anchor=north, minimum size=12mm},
array/.style={matrix of nodes,nodes={draw, minimum size=1mm, fill=black},column sep=2pt, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=1mm}}}]

\matrix[array] (array) {
1 & 2 & 3 \\
\rule{10mm}{.1pt} & \rule{10mm}{.1pt} & \rule{10mm}{.1pt}\\};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut lst1}};
\draw (array-2-1.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (nameh) {\rotatebox{90}{x\strut}};
\draw (array-2-2.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (namec) {\rotatebox{90}{y\strut}};
\draw (array-2-3.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (namew) {\rotatebox{90}{z\strut}};
%\draw (array-1-2.north)--++(90:3mm) node [above] (first) {Index};
\draw (array-1-3.east)--++(0:12.5mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:8mm) node [right]{\textsl{heterogeneous} class, \textsl{varying} length};
\draw (namew)--++(0:15mm) node [right]{\code{character} member names};
%
  \node [my shape=3, rectangle split, fill=blue!20] at (-1.3,-.25)
    {1\strut\nodepart{two}2\strut\nodepart{three}3\strut};
  \node [my shape=1, fill=red!20] at (0,-.25)
    {``ab''\strut};
  \node [my shape=2, fill=yellow!20] at (1.3,-.25)
    {TRUE\strut\nodepart{two}FALSE\strut};
%\draw (-0.6,+0.65) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut lst1}};
\end{tikzpicture}
\end{footnotesize}
\end{center}

\begin{warningbox}
With lists it is best to use informative names for indexing, as their members are heterogeneous, usually containing loosely related data. Names make code easier to understand and mistakes more visible. Using names also makes code more robust to future changes in the position of members in lists created upstream of our own \Rlang code. Below, we use both positional indices and names to highlight the similarities between lists and vectors.
\end{warningbox}

Lists can behave as vectors with heterogeneous elements as members, as we will describe next. Lists can also be nested, so tree-like structures are also possible (see section \ref{sec:calc:lists:nested} on page \pageref{sec:calc:lists:nested}).
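The warning above can be illustrated with a small sketch: after a new member is added at the front of a copy of \code{lst1}, positional index \code{1} no longer points to the same member, while indexing by name still retrieves the intended one.

<<>>=
lst1b <- c(list(w = 0), lst1) # new member at the front
lst1b[[1]]   # no longer vector x
lst1b[["x"]] # still vector x
@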
\begin{faqbox}{How to create an empty list?}
In the same way as \code{numeric()} by default creates a \code{numeric} vector of length zero, \Rfunction{list()} by default creates a \code{list} object with no members.

<<>>=
list()
@
\end{faqbox}

\subsection{Member extraction, deletion and insertion}

In\index{lists!member extraction|(}\index{lists!member indexing|see{lists, member extraction}}\index{lists!deletion and addition of members|(} section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing} we saw that the extraction operator \Roperator{[ ]} applied to a vector returns a vector, longer or shorter, possibly of length one, or even of length zero. Similarly, applying operator \Roperator{[ ]} to a list returns a list, possibly of different length: \code{lst1["x"]} or \code{lst1[1]} return a list containing only one member, the numeric vector stored at the first position of \code{lst1}. In the last statement in the chunk below, \code{lst1[c(1, 3)]} returns a list of length two, as expected.

<<>>=
lst1["x"]
lst1[1]
lst1[c(1, 3)]
@

As with vectors, negative positional indices remove members instead of extracting them. See page \pageref{par:calc:lists:rm} for a safer approach to the deletion of list members.

<<>>=
lst1[-1]
lst1[c(-1, -3)]
@

Using operator \Roperator{[[ ]]} (double square brackets) for indexing a list extracts the element stored in the list, in its original mode. In the example below, \code{lst1[["x"]]} and \code{lst1[[1]]} return a numeric vector. We might say that extraction operator \Roperator{[[ ]]} reaches ``deeper'' into the list than operator \Roperator{[ ]}. Operator \Roperator{\$}, used in the first statement below, provides a shorthand notation, equivalent to using \Roperator{[[ ]]} with a constant \code{character} value as argument.

<<>>=
lst1$x
lst1[["x"]]
lst1[[1]]
@

\begin{warningbox}
The default behavior also differs in that only \Roperator{\$} does partial matching of the member name (it recognizes incomplete names), which makes its use in scripts or package code dangerous.
\end{warningbox}

\begin{explainbox}\label{box:extraction:opers}
We mentioned above that indexing by name can be done either with double square brackets, \Roperator{[[ ]]}, or with \Roperator{\$}. Operators \Roperator{[ ]} and \Roperator{[[ ]]} work like normal \Rlang functions, accepting both constant values and variables as the arguments passed to them for indexing. In contrast, \Roperator{\$}, mainly intended for use when typing at the console, accepts only bare member names on its \emph{rhs}. With \Roperator{[[ ]]} the name of the variable or column is given as a character string, enclosed in quotation marks, or as a variable with mode \code{character}. A number as a positional index is also accepted.
<<>>=
lst0 <- list(abcd = 123, xyzw = 789)
lst0[[1]]
lst0[["abcd"]]
vct0 <- "abcd"
lst0[[vct0]]
@

When using \Roperator{\$}, the name is entered as a constant, without quotation marks, and cannot be a variable or a number.

<<>>=
lst0$abcd
lst0$ab
lst0$a
@

Both in the case of lists and data frames (see section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}), when using double square brackets, by default an exact match is required between the name in the object and the name used for indexing. In contrast, with \Roperator{\$}, an unambiguous partial match is silently accepted. For interactive use, partial matching is helpful in reducing typing. However, in scripts, and especially in \Rlang code in packages, it is best to avoid the use of \Roperator{\$}, as a partial match to a wrong variable present at a later time, e.g., when someone else revises the script, can lead to very difficult-to-diagnose errors.

In addition, as \Roperator{\$} is implemented by first attempting a match to the name and then calling \Roperator{[[ ]]}, using \Roperator{\$} for indexing can result in slightly slower performance compared to using \Roperator{[[ ]]}. It is possible to set \Rlang option \code{warnPartialMatchDollar} so that partial matching triggers a warning when using \Roperator{\$} to extract a member, which can be very useful when debugging.
\end{explainbox}

The class of the values returned by the two extraction operators differs correspondingly.

<<>>=
is.vector(lst1[1])
is.list(lst1[1])
is.vector(lst1[[1]])
is.list(lst1[[1]])
@

The two extraction operators can be used together as shown below, with \code{lst1[[1]]} extracting the vector from \code{lst1} and \code{[3]} extracting the member at position 3 of the vector.

<<>>=
lst1[[1]][3]
@

Extraction\label{par:calc:list:member:assign} operators can be used on the \emph{lhs} as well as on the \emph{rhs} of an assignment, and lists can be empty, i.e., be of length zero. The example below makes use of this to build a list step by step.

<<>>=
lst2 <- list()
lst2[["x"]] <- 1:3
lst2[["y"]] <- "ab"
lst2[["z"]] <- c(TRUE, FALSE)
@

\begin{playground}
Compare \code{lst2} to \code{lst1}, used for the examples above. Then run the code below and compare them again. Try to understand why \code{lst2} has changed as it did. Pay attention also to possible changes to the members' names.

<<>>=
lst2[["y"]] <- lst2[["x"]]
@
\end{playground}

\begin{explainbox}
\emph{Lists}, as usually defined in languages like \Clang, are based on pointers to memory locations, with pointers stored at each node. These pointers chain or link the different member nodes (this allows, for example, sorting of lists in place by modifying the pointers). In such implementations, indexing by position is not possible, or at least requires ``walking'' down the list, node by node. \Rlang does not implement pointers to ``addresses'', or locations, in memory. In \Rlang, \code{list} members can be accessed through positional indexes or member names, similarly to vectors. Of course, insertions and deletions in the middle of a list shift the positions of members, changing which member is pointed at by indexes past the modified location. The names, in contrast, remain valid.

<<>>=
list(a = 1, b = 2, c = 3)[-2]
@
\end{explainbox}

Three frequent operations on lists are concatenation, insertion and deletion.\index{lists!insert into}\index{lists!append to} The same functions as with vectors are used: function \Rfunction{c()} for concatenation and function \Rfunction{append()} to append and insert members. When a vector is concatenated with a list, the vector is first converted into a list; otherwise, usage is similar to that with vectors (see pages \pageref{par:calc:concatenate}--\pageref{par:calc:append:end}).
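A minimal sketch of concatenation with \Rfunction{c()}: the members of the two lists are pasted one after another, keeping their names. Function \Rfunction{append()}, used below, additionally accepts an insertion position through its parameter \code{after}.

<<>>=
c(list(a = 1, b = "b"), list(z = TRUE))
@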
<<>>=
lst3 <- append(lst1, list(yy = 1:10, zz = letters[5:1]), after = 2)
lst3
@

To\label{par:calc:lists:rm} delete a member from a list we assign \code{NULL} to it.

<<>>=
lst1$y <- NULL
lst1
@

To investigate the members contained in a list, function \Rfunction{str()} (\emph{structure}), used above, is convenient, especially when lists have many members. It formats lists more compactly than \code{print()} applied directly to a list.\label{par:calc:str}

<<>>=
print(lst1)
str(lst1)
@

\index{lists!deletion and addition of members|)}\index{lists!member extraction|)}

\subsection{Nested lists}\label{sec:calc:lists:nested}

Lists\index{lists!nested} can be nested, i.e., lists of lists can be constructed to an arbitrary depth. In the example below \code{lst4} and \code{lst5} are members of \code{lst6}, i.e., \code{lst4} and \code{lst5} are nested within \code{lst6}.

<<>>=
lst4 <- list("a", "aa", 10)
lst5 <- list("b", TRUE)
lst6 <- list(A = lst4, B = lst5) # nested
str(lst6)
@

A nested\index{lists!nested} list can alternatively be constructed within a single statement in which several member lists are created. Here we combine the three statements in the earlier chunk into a single one.

<<>>=
lst7 <- list(A = list("a", "aa", 10), B = list("b", TRUE))
str(lst7)
@

A list can contain a combination of \code{list} and \code{vector} members.

<<>>=
lst8 <- list(A = list("a", "aa", 10),
             B = list("b", TRUE),
             C = c(1, 3, 9),
             D = 4321)
str(lst8)
@

\begin{explainbox}
The logic behind the extraction of members of nested lists using indexing is the same as for simple lists, but applied recursively---e.g., \code{lst7[[2]]} extracts the second member of the outermost list, which is another list. As this is a list, its members can be extracted by using the extraction operator again: \code{lst7[[2]][[1]]}. It is important to remember that these concatenated extraction operations are written so that the leftmost operator is applied to the outermost list.

The example above uses the \Roperator{[[ ]]} operator, but the left-to-right precedence also applies to concatenated calls to \Roperator{[ ]} and to calls combining both operators.
\end{explainbox}
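Running the example from the box confirms the rule (using \code{lst7} as defined above).

<<>>=
lst7[[2]]      # member "B", itself a list
lst7[[2]][[1]] # first member of list "B"
@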
\begin{playground}
What\index{lists!nested} do you expect each of the statements below to return? \emph{Before running the code}, predict what value and of which mode each statement will return. You may use implicit or explicit calls to \Rfunction{print()}, or calls to \Rfunction{str()}, to visualize the structure of the different objects.

% not handled correctly by knitr, works at console.
<<>>=
LST9 <- list(A = list("a", "aa", "aaa"), B = list("b", "bb"))
# str(LST9)
LST9[2:1]
LST9[1]
LST9[[1]][2]
LST9[[1]][[2]]
LST9[2]
LST9[2][[1]]
@

\end{playground}

\begin{explainbox}\index{lists!structure}
When dealing with deep lists, it is sometimes useful to limit the number of levels of nesting returned by \Rfunction{str()} by means of a \code{numeric} argument passed to parameter \code{max.level}.

<<>>=
str(lst8, max.level = 1)
@

\end{explainbox}

Sometimes we need to flatten a list\index{lists!flattening}\index{lists!nested}, or a nested structure of lists within lists. Function \Rfunction{unlist()} is what should normally be used in such cases.

The list \code{lst10} is a nested system of lists, but all the ``terminal'' members are character strings. In other words, the terminal nodes are all of the same mode, allowing the list to be ``flattened'' into a character vector.\qRfunction{is.list()}

<<>>=
lst10 <- list(A = list("a", "aa", "aaa"), B = list("b", "bb"))
vct1 <- unlist(lst10)
vct1
is.list(lst10)
is.list(vct1)
mode(lst10)
mode(vct1)
names(lst10)
names(vct1)
@

The returned value is a vector with named member elements. We use function \Rfunction{str()} to figure out how this vector relates to the original list. The names, always of mode character, are based on the names of list members when available, while characters depicting positions as numbers are used for anonymous nodes. We can access the members of the vector either through numeric indexes or names.

<<>>=
str(vct1)
vct1[2]
vct1["A2"]
@

\begin{playground}
Function \Rfunction{unlist()}\index{lists!convert into vector} has two additional parameters, with default argument values, which we did not modify in the example above. These parameters are \code{recursive} and \code{use.names}, both of them expecting a \code{logical} value as an argument. Modify the statement \code{vct1 <- unlist(lst10)}, by passing \code{FALSE} as an argument to each of these two parameters in turn, and in each case study the value returned and how it differs with respect to the one obtained above.
\end{playground}

Function \Rfunction{unname()} can be used to remove names safely---i.e., without risk of altering the mode or class of the object.

<<>>=
unname(vct1)
unname(lst10)
@
\index{lists|)}

<<>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Data frames}\label{sec:R:data:frames}
\index{data frames|(}\qRclass{data.frame}
\index{worksheet@`worksheet'|see{data frame}}
Data frames are a special type of list, in which all members have the same length, giving origin to a matrix-like object, in which the columns can belong to different classes. Most commonly the member ``columns'' are vectors or factors, but they can also be matrices with the same number of rows as the enclosing data frame, or lists with the same number of members as there are rows in the enclosing data frame.

Data frames are central to most data manipulation and analysis procedures in \Rlang. They are commonly used to store observations, with \code{numeric} columns holding data for continuous variables and \code{factor} columns data for categorical variables. Binary variables can be stored in \code{logical} columns, and text data in \code{character} columns. Dates and times can be stored in columns of specific classes, such as \code{POSIXct}. In the diagram below, column \code{treatment} is a factor with two levels encoding two conditions, \code{hot} and \code{cold}. Columns \code{height} and \code{weight} are numeric vectors containing measurements.
\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily, my shape/.style={
    rectangle split, rectangle split parts=#1, draw, anchor=north, minimum size=12mm},
array/.style={matrix of nodes,nodes={draw, minimum size=1mm, fill=black},column sep=2pt, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=1mm}}}]

\matrix[array] (array) {
1 & 2 & 3 \\
\rule{10mm}{.1pt} & \rule{10mm}{.1pt} & \rule{10mm}{.1pt}\\};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut df1}};
\draw (array-2-1.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-17mm, yshift=-3mm, above] (nameh) {\rotatebox{180}{treatment\strut}};
\draw (array-2-2.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-14.4mm, yshift=-3mm, above] (namec) {\rotatebox{180}{height\strut}};
\draw (array-2-3.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-14.5mm, yshift=-3mm, above] (namew) {\rotatebox{180}{weight\strut}};
%\draw (array-1-2.north)--++(90:3mm) node [above] (first) {Index};
\draw (array-1-3.east)--++(0:12.5mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:8mm) node [right]{\textsl{heterogeneous} class, \textsl{same} length};
\draw (namew)--++(0:15mm) node [right]{\code{character} column names};
%
  \node [my shape=4, rectangle split, fill=green!20] at (-1.3,-.25)
    {hot\strut\nodepart{two}cold\strut\nodepart{three}hot\strut\nodepart{four}\ldots\strut};
  \node [my shape=4, fill=blue!20] at (0,-.25)
    {10.2\strut\nodepart{two}\phantom{1}8.3\strut\nodepart{three}12.0\strut\nodepart{four}\ldots\strut};
  \node [my shape=4, fill=blue!20] at (1.3,-.25)
    {2.2\strut\nodepart{two}3.3\strut\nodepart{three}2.5\strut\nodepart{four}\ldots\strut};
%\draw (-0.6,+0.65) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut a.list}};
\end{tikzpicture}
\end{footnotesize}
\end{center}

Data frames are created with constructor function \Rfunction{data.frame()}, with a syntax similar to that used for lists.\qRfunction{colnames()}\qRfunction{rownames()}\qRfunction{is.data.frame()}

<<>>=
df1 <- data.frame(treatment = factor(rep(c("hot", "cold"), 3)),
                  height = c(10.2, 8.3, 12.0, 9.0, 11.2, 8.7),
                  weight = c(2.2, 3.3, 2.5, 2.8, 2.4, 3.0))
df1
colnames(df1)
rownames(df1)
str(df1)
class(df1)
mode(df1)
is.data.frame(df1)
is.list(df1)
@

We can see above that, when printed, each row of a \code{data.frame} is preceded by a row name. Row names are character strings, just like column names. The \Rfunction{data.frame()} constructor by default adds row names representing running numbers. Default row names are rarely of much use, except to track insertions and deletions of rows during debugging.
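As mentioned at the start of the section, columns are not restricted to numbers and factors. A minimal sketch with a date-time column of class \code{POSIXct} (data and names invented for the example).

<<>>=
df.tm <- data.frame(when = as.POSIXct("2023-07-01 12:00:00", tz = "UTC") + 3600 * 0:2,
                    temp = c(18.2, 21.6, 23.1))
df.tm
str(df.tm)
@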
\begin{playground}
As the expectation is that all member variables (or ``columns'') have equal length, if vectors of different lengths are supplied as arguments, the shorter vector(s) is/are recycled, possibly several times, until the required full length is reached, as shown below for \code{treatment}.

<<>>=
df2 <- data.frame(treatment = factor(c("hot", "cold")),
                  height = c(10.2, 8.3, 12.0, 9.0, 11.2, 8.7),
                  weight = c(2.2, 3.3, 2.5, 2.8, 2.4, 3.0))
@

Are \code{df1} created above and \code{df2} created here equal?

\end{playground}

With function \Rfunction{class()} we can query the class of an \Rlang object (see section \ref{sec:rlang:mode} on page \pageref{sec:rlang:mode}). As we saw in the previous chunk, \code{list} and \code{data.frame} objects belong to two different classes. However, their \code{mode} is the same. Consequently, data frames inherit the methods and characteristics of lists, as long as these have not been hidden by new ones defined for data frames (for an explanation of \emph{methods}, see section \ref{sec:methods} on page \pageref{sec:methods}).

Extraction of individual member variables, or ``columns'', can be done as in a list, with operators \Roperator{[[ ]]} and \Roperator{\$} (see the call-out on page \pageref{box:extraction:opers}).

<<>>=
df1$height
df1[["height"]]
df1[[2]]
class(df1[["height"]])
@

In the same way as with lists, we can add member variables to data frames. Recycling takes place if needed.

<<>>=
df1$x2 <- 6:1
df1[["x3"]] <- "b"
str(df1)
@

\begin{playground}
We have added two columns to the data frame, and in the case of column \code{x3} recycling took place. This is where lists and data frames differ substantially in their behavior. In a data frame, although class and mode can be different for different member variables (columns), they are required to be vectors or factors of the same length (or a matrix with the same number of rows, or a list with the same number of members). In the case of lists, there is no such requirement, and recycling never takes place when adding a member. Compare the values returned below for \code{LST1} to those in the example above for \code{df1}.

<<>>=
LST1 <- list(x = 1:6, y = "a", z = c(TRUE, FALSE))
str(LST1)
LST1$x2 <- 6:1
LST1$x3 <- "b"
str(LST1)
@
\end{playground}

\begin{faqbox}{How to make a list of data frames?}
We create a list of data frames in the same way as we create a nested list of lists, or in fact a list of any other \Rlang objects. See section \ref{sec:calc:lists:nested} on page \pageref{sec:calc:lists:nested}.

<<>>=
list(df1, df2)
@
\end{faqbox}

\begin{faqbox}{How to create an empty data frame?}
In the same way as \code{numeric()} by default creates a \code{numeric} vector of length zero, \Rfunction{data.frame()} by default creates a \code{data.frame} with zero rows and no columns.

<<>>=
data.frame()
@
\end{faqbox}

\begin{faqbox}{How to add a new column to a data frame (at the front or the end)?}
In the same way as we can assign a new member to a list using the extraction operator \Roperator{[[ ]]}, we can add a new column to a data frame (see page \pageref{par:calc:list:member:assign}). If the column name does not already exist, the assigned vector or factor is appended as the last column (a shorter vector or factor is recycled only if the number of rows is a multiple of its length).

<<>>=
DF1 <- data.frame(A = 1:5, B = factor(5:1))
DF1[["C"]] <- 11:15
DF1
@

To add a column at the front, we can use function \Rfunction{cbind()} (column bind).

<<>>=
DF2 <- data.frame(A = 1:5, B = factor(5:1))
cbind(C = 11:15, DF2)
@
\end{faqbox}
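Because data frames are lists of columns, functions that operate on lists work column-wise on them; a quick check (a sketch using \code{df1} as modified above).

<<>>=
length(df1)        # number of member columns
sapply(df1, class) # class of each column
@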
Being two-dimensional and rectangular in shape, data frames behave similarly to matrices with respect to indexing and dimensions. They have two margins, rows and columns, and two indices identify the location of a member ``cell''. We provide some examples here, but please consult section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing} and section \ref{sec:matrix:array} on page \pageref{sec:matrix:array} for additional details.

Matrix-like notation allows simultaneous extraction from multiple columns, which is not possible with lists. The value returned is in most cases a ``smaller'' data frame, as in this example.

<<>>=
df1[2:3, 1:2]
@

<<>>=
# first column, df1[[1]] preferred
df1[ , 1]
# first column, df1[["treatment"]] or df1$treatment preferred
df1[ , "treatment"]
# first row
df1[1, ]
# first two rows of the third and fourth columns
df1[1:2, c(FALSE, FALSE, TRUE, TRUE, FALSE)]
# the rows for which the comparison is true
df1[df1$treatment == "hot" , ]
# the heights > 8
df1[df1$height > 8, "height"]
@

As explained earlier for vectors (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}), indexing can be present both on the right-hand side and the left-hand side of an assignment, allowing the replacement of individual values as well as of rectangular chunks.

The next few examples do assignments to ``cells'' of \code{df1}, either to one whole column or to individual values. The last statement in the chunk below copies a number from one location to another by using indexing of the same data frame both on the right side and the left side of the assignment.\qRoperator{[[ ]]}\qRoperator{[ ]}

<<>>=
df1[1, 2] <- 99
df1
df1[ , 2] <- -99
df1
df1[["height"]] <- c(10, 12)
df1
df1[1, 2] <- df1[6, 3]
df1
df1[3:6, 2] <- df1[6, 3]
df1
@

As with matrices, if we extract a single column from a data frame using matrix-like indexing, it is by default simplified into a vector or factor, i.e., the column dimension is dropped. By passing \code{drop = FALSE}, we can prevent this. Contrary to matrices, rows are not simplified in the case of data frames.

<<>>=
is.data.frame(df1[1, ])
is.data.frame(df1[ , 2])
is.data.frame(df1[ , "treatment"])
is.data.frame(df1[1:2, 2:3])
is.vector(df1[1, ])
is.vector(df1[ , 2])
is.factor(df1[ , "treatment"])
is.vector(df1[1:2, 2:3])
is.data.frame(df1[ , 1, drop = FALSE])
is.data.frame(df1[ , "treatment", drop = FALSE])
@

\begin{warningbox}
In contrast to matrices and data frames, the extraction operator \Roperator{[ ]} of tibbles---defined in package \pkgname{tibble}---never simplifies returned one-column tibbles into vectors (see section \ref{sec:data:tibble} on page \pageref{sec:data:tibble} for details on the differences between data frames and tibbles).
\end{warningbox}
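This difference can be demonstrated with a brief sketch, assuming package \pkgname{tibble} is installed.

<<>>=
library(tibble)
tb1 <- as_tibble(df1)
class(df1[ , "height"]) # simplified into a vector
class(tb1[ , "height"]) # remains a one-column tibble
@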
\begin{advplayground}
Usually data frames are created from lists or by passing individual vectors and factors to the constructor. It is also possible to construct data frames starting from matrices, other data frames, and named vectors, and combinations of them. In these cases additional nuances become important. We give only some examples here, as the details are well described in \code{help(data.frame)}.

We use a named numeric vector and a factor. The names are moved from the vector to the rows of the data frame! Consult \code{help(data.frame)} for an explanation.

<<>>=
vct1 <- c(one = 1, two = 2, three = 3, four = 4)
fct1 <- as.factor(c(1, 2, 3, 2))
df1 <- data.frame(fct1, vct1)
df1
df1$vct1
@

If we protect the vector with \Rlang's identity function \Rfunction{I()}, the names are not removed from the vector, as can be seen by extracting the column from the data frame.

<<>>=
df2 <- data.frame(fct1, I(vct1))
df2
df2$vct1
@

If we start with a matrix instead of a vector, the matrix is by default split into separate columns in the data frame. If the matrix has no column names, new ones are created.

<<>>=
mat1 <- matrix(1:12, ncol = 3)
df4 <- data.frame(fct1, mat1)
df4
@

If we protect the matrix with function \Rfunction{I()}, it is not split, and the whole matrix becomes a column in the data frame.

<<>>=
df5 <- data.frame(fct1, I(mat1))
df5
df5$mat1
@

If we start with a list, each member with a suitable number of elements, each member becomes a column in the data frame. Recycling is applied to members that are too short.

<<>>=
lst1 <- list(a = 4:1, b = letters[4:1], c = "n", d = "z")
df6 <- data.frame(fct1, lst1)
df6
@

The behavior is quite different if we protect the list with \Rfunction{I()}: the list is then added in whole as a variable or column in the data frame. In this case the length, or number of members, of the list itself must match the number of rows in the data frame, while the lengths of the individual members of the list can vary. This is similar to the default behavior of tibbles, but \Rlang data frames require explicit use of \Rfunction{I()} (see chapter \ref{chap:R:data} on page \pageref{chap:R:data} for details about package \pkgname{tibble}).

<<>>=
df7 <- data.frame(fct1, I(lst1))
df7
df7$lst1
@

What is this exercise about? Do check the documentation carefully and think of uses where the flexibility gained by using function \Rfunction{I()} to protect arguments passed to the \Rfunction{data.frame()} constructor can be useful. In addition, write code to extract individual members of the embedded matrices and lists using indexing, in a single \Rlang statement in each case. Finally, test whether the behavior is the same when assigning new member variables (or ``columns'') to an existing data frame.
\end{advplayground}

\subsection{Sub-setting data frames}\label{sec:calc:df:subset}
When\index{data frames!subsetting}\index{data frames!``filtering rows''} the names of data frames are long, complex conditions become awkward to write using indexing---i.e., subscripts. In such cases \Rfunction{subset()} is handy because evaluation is done in the ``environment'' of the data frame, i.e., the names of the columns are recognized if entered directly when writing the condition. Function \Rfunction{subset()} ``filters'' rows, usually corresponding to observations or experimental units. The condition is computed for each row, and if it returns \code{TRUE}, the row is included in the returned data frame, and excluded if \code{FALSE}.

We create a data frame with six rows and three columns. For column \code{y}, we rely on \Rlang automatically extending \code{"a"} by repeating it six times, while for column \code{z}, we rely on \Rlang automatically extending \code{c(TRUE, FALSE)} by repeating it three times.

<<>>=
df8 <- data.frame(x = 1:6, y = "a", z = c(TRUE, FALSE))
subset(df8, x > 3)
@

\begin{advplayground}
What is the behavior of \code{subset()} when the condition is \code{NA}?
Find the answer by writing code to test this, for a case where the tests for different rows return \code{NA}, \code{TRUE} and \code{FALSE}.
\end{advplayground}

When calling functions that return a vector, data frame, or other structure, the extraction operators \Roperator{[ ]}, \Roperator{[[ ]]}, or \Roperator{\$} can be appended to the rightmost parenthesis of the function call, in the same way as to the name of a variable holding the same data.

<<>>=
subset(df8, x > 3)[ , -3]
subset(df8, x > 3)[ , "x", drop = FALSE]
subset(df8, x > 3)[ , "x"]
@

\begin{advplayground}
When do extraction operators applied to data frames return a vector or factor, and when do they return a data frame? Please experiment with your own code examples to work out the answer.
\end{advplayground}

\begin{explainbox}
In the case of \Rfunction{subset()} we can select columns directly, as shown below, while for most other functions, extraction using operators \Roperator{[ ]}, \Roperator{[[ ]]} or \Roperator{\$} is needed.

<<>>=
subset(df8, x > 3, select = 2)
@

<<>>=
subset(df8, x > 3, select = x)
@

<<>>=
subset(df8, x > 3, select = "x")
@
\end{explainbox}

None of the examples in the last four code chunks alters the original data frame \code{df8}. We can store the returned value using a new name if we want to preserve \code{df8} unchanged, or we can assign the result to \code{df8}, deleting in the process the previously stored value.

\begin{warningbox}
In the examples above, the names in the expression passed as the second argument to \code{subset()} were searched for within \code{df8} and found. However, if they are not found in the data frame, objects with matching names are searched for in the global environment (as variables outside the data frame, visible from the user's workspace). There being no variable \code{A} in the data frame \code{df8}, vector \code{A} from the environment is silently used in the chunk below, resulting in a returned data frame with no rows, as \code{A > 3} returns \code{FALSE}.

<<>>=
A <- 1
subset(df8, A > 3)
@

This also applies to the expression passed as an argument to parameter \code{select}, here shown as a way of selecting columns based on names stored in a character vector.

<<>>=
columns <- c("x", "z")
subset(df8, select = columns)
@

The use of \Rfunction{subset()} is convenient, but more prone to bugs compared to directly using the extraction operator \Roperator{[ ]}. This same ``cost'' of achieving convenience applies to functions like \Rfunction{attach()} and \Rfunction{with()} described below. The longer a script is expected to be used, adapted and reused, the more careful we should be when using any of these functions. An alternative way of avoiding excessive verbosity is to keep the names of data frames short.
\end{warningbox}

A frequently used way of deleting a column by name from a data frame is to assign \code{NULL} to it---i.e., in the same way as members are usually deleted from \code{list}s. This approach modifies \code{df9} in place, rather than returning a modified copy of \code{df9}.

<<>>=
df9 <- df8
head(df9)
df9[["y"]] <- NULL
head(df9)
@

Alternatively, we can use negative indexing to remove columns from a copy of a data frame. In this example we remove a single column. As base \Rlang does not support negative indexing by name with the extraction operator, we need to find the numerical index of the column to delete. (See the examples above using \code{subset()} with bare names to delete columns.)
<<>>=
df8[ , -which(colnames(df8) == "y")]
@

Instead of the equality test, we can use operator \code{\%in\%} or function \code{grepl()} to create a \code{logical} vector useful for deleting or selecting multiple columns in a single statement, as shown below.
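A minimal sketch using \code{\%in\%} (with \code{grepl()}, a pattern would instead be matched against the column names).

<<>>=
df8[ , !colnames(df8) %in% c("y", "z"), drop = FALSE]
@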
\begin{playground}
In the code chunks above we deleted columns of \code{df8}, but, as we used the extraction operator, we modified only the returned copies, leaving \code{df8} itself unchanged. Thus we can reuse it here for a surprising trick. You should first untangle how it changes the positions of columns and rows, and afterwards think how and why indexing with the extraction operator \Roperator{[ ]} on both sides of the assignment operator \Roperator{<-} can be useful when working with data.

<<>>=
df8[1:6, c(1,3)] <- df8[6:1, c(3,1)]
df8
@
\end{playground}

\begin{warningbox}
Although in this last example we used numeric indexes to make it more interesting, in practice, especially in scripts or other code that will be reused, do use column or member names instead of positional indexes whenever possible. This makes code much more reliable, as changes elsewhere in the script could alter the order of columns and \emph{invalidate} numerical indexes. In addition, using meaningful names makes the programmer's intentions easier to understand.
\end{warningbox}

\subsection{Summarizing and splitting data frames}\label{sec:calc:df:split}\label{sec:calc:df:aggregate}
Function\index{data frames!summarizing} \Rfunction{summary()} can be used to obtain a summary from objects of most \Rlang classes, including data frames. It is also possible to use \Rloop{sapply()}, \Rloop{lapply()} or \Rloop{vapply()} to apply any suitable function to the data, by columns (see section \ref{sec:data:apply} on page \pageref{sec:data:apply} for a description of these functions and their use).

<<>>=
summary(df8)
@

\index{data frames!splitting}
\Rlang function \Rfunction{split()} makes it possible to split a data frame into a list of data frames, based on the levels of a factor, even if the rows are not ordered according to the factor's levels.

We create a data frame with six rows and three columns. In the case of column \code{z}, we rely on \Rlang to automatically extend \code{c("a", "b")} by repeating it three times so as to fill the six rows.

<<>>=
df10 <- data.frame(x1 = 1:6, x2 = c(1, 5, 4, 2, 6, 3), z = c("a", "b"))
@

<<>>=
split(df10, df10$z)
@

\begin{explainbox}
The same operation can be specified using a one-sided formula \code{\textasciitilde z} to indicate the grouping.

<<>>=
split(df10, ~ z)
@

\end{explainbox}

Function \Rfunction{unsplit()} can be used to reverse a split done by \Rfunction{split()}.

\begin{explainbox}
\Rfunction{split()} is sometimes used in combination with apply functions (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}) to compute group or treatment summaries. However, in most cases it is simpler to use \Rfunction{aggregate()} for computing such summaries.
\end{explainbox}

Related to the splitting of a data frame is the calculation of summaries based on a subset of cases, or, more commonly, summaries for all observations after grouping them based on the values in a column or the levels of a factor.

\begin{faqbox}{How to summarize one variable from a data frame by group?}
To summarize a single variable by group we can use \Rfunction{aggregate()}.

<<>>=
aggregate(x = iris$Petal.Length, by = list(iris$Species), FUN = mean)
@

\end{faqbox}

\begin{faqbox}{How to summarize numeric variables from a data frame by group?}
To summarize several variables we can use \Rfunction{aggregate()} (see section \ref{sec:dplyr:group:wise} on page \pageref{sec:dplyr:group:wise} for an alternative approach using package \pkgnameNI{dplyr}).

<<>>=
aggregate(x = iris[ , sapply(iris, is.numeric)], by = list(iris$Species), FUN = mean)
@

For these data, as the only non-numeric variable is \code{Species}, we could also have used the formula notation shown below.
\end{faqbox}

\begin{explainbox}
There\index{data frames!summarizing} is also a formula-based \Rfunction{aggregate()} method (or ``variant'') available (\Rlang \emph{formulas} are described in depth in section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas}). In \Rfunction{aggregate()}, the left-hand side (\emph{lhs}) of the formula indicates the variable to summarize and the right-hand side (\emph{rhs}) the factor used to split the data before summarizing them.

<<>>=
aggregate(x1 ~ z, FUN = mean, data = df10)
@

We can summarize more than one column at a time.
<<>>=
aggregate(cbind(x1, x2) ~ z, FUN = mean, data = df10)
@

If all the columns not used for grouping are valid input to the function passed as an argument to \code{FUN}, the formula can be simplified using \code{.}, with the meaning ``all columns except those on the \emph{rhs} of the formula''.
<<>>=
aggregate(. ~ z, FUN = mean, data = df10)
@

\end{explainbox}

Function \Rfunction{aggregate()} can also be used to aggregate time series data based on time intervals (see \code{help(aggregate)}).

\subsection{Re-arranging columns and rows}
\index{data frames!ordering rows}\index{data frames!ordering columns}
As with members of vectors and lists, to change the position of columns or rows in a data frame we use the extraction operator and indexing by name or position. In a matrix-like object, such as a data frame, the first index corresponds to rows and the second to columns.

The most direct way of changing the order of columns and/or rows in data frames (as in matrices and arrays) is to use subscripting. Once we know the original position and the target position, we can use column names or positions as indexes on the right-hand side, listing all columns to be retained, even those remaining at their original position.

<<>>=
df11 <- data.frame(A = 1:10, B = 3, C = c("A", "B"))
head(df11, 2)
df11 <- df11[ , c("B", "A", "C")]
head(df11, 2)
@

\begin{warningbox}
When using the extraction operator \Roperator{[ ]} on both the left-hand side and the right-hand side with a \code{numeric} vector as argument to swap two columns, the vectors or factors are swapped, while the names of the columns are not!
To retain the correspondence between column naming and column contents after swapping or rearranging the columns \emph{using numeric indices}, we need to separately move the names of the columns.
This may seem counterintuitive, unless we think in terms of positions being named rather than the contents of the columns being linked to the names.\qRfunction{colnames()}\qRfunction{colnames()<-}

<<>>=
df11 <- data.frame(A = 1:10, B = 3, C = c("A", "B"))
head(df11, 2)
df11[ , 1:2] <- df11[ , 2:1]
head(df11, 2)
colnames(df11)[1:2] <- colnames(df11)[2:1]
head(df11, 2)
@

\end{warningbox}

Taking into account that \Rfunction{order()} returns the indexes needed to sort a vector (see page \pageref{box:vec:sort}), we can use \Rfunction{order()} to generate the indexes needed to sort the rows of a data frame. In this case, the argument to \Rfunction{order()} is usually a column of the data frame being arranged. However, any vector of suitable length, including the result of applying a function to one or more columns, can be passed as an argument to \Rfunction{order()}. Function \Rfunction{order()} is not useful for sorting the columns of a data frame \emph{based on the data in those columns}, as it requires as input a vector spanning the columns, which is possible only when all columns are of the same class. (In the case of \Rclass{matrix} and \Rclass{array} objects this approach can be applied to any of their dimensions, as all their elements homogeneously belong to one class.)

\begin{faqbox}{How to order columns or rows in a data frame?}
We use column names or numeric indexes with the extraction operator \Roperator{[ ]} only on the \emph{rhs} of the assignment. For example, to arrange the columns of data set \code{iris} in decreasing alphabetical order, we use \Rfunction{sort()} as shown, or \Rfunction{order()} (see page \pageref{box:vec:sort}).

<<>>=
sorted_cols_iris <- iris[ , sort(colnames(iris), decreasing = TRUE)]
head(sorted_cols_iris, 6)
@

Similarly, we use the values in a column as the argument to \Rfunction{order()} to obtain the \code{numeric} indices needed to sort the rows.

<<>>=
sorted_rows_iris <- iris[order(iris$Petal.Length), ]
head(sorted_rows_iris, 6)
@

\end{faqbox}

\begin{advplayground}\index{data frames!ordering rows}
Create a new data frame containing three numeric columns with three different haphazard sequences of values and a factor with two levels. Call these columns \code{A}, \code{B}, \code{C} and \code{F}. 1) Sort the rows of the data frame so that the values in \code{A} are in decreasing order. 2) Sort the rows of the data frame according to increasing values of the sum of \code{A} and \code{B}, without adding a new column to the data frame or storing the vector of sums in a variable. In other words, do the sorting based on sums calculated on the fly. 3) Sort the rows by level of factor \code{F}, and 4) by level of factor \code{F} and by the values in \code{B} within each factor level. Hint: revisit the exercise on page \pageref{calc:ADVPG:order:sort} where the use of \Rfunction{order()} on factors is described.
\end{advplayground}

\subsection{Re-encoding or adding variables}

It is common that some variables need to be added to an existing data frame based on existing variables, either computed from their values or obtained by mapping, for example, treatments to sample codes already present in the data frame. In the second case, named\index{named vectors!mapping with} vectors can be used to replace the values of a variable or to add a variable to a data frame.

Mapping is possible because the length of the value returned by the extraction operator \Roperator{[ ]} is given by the length of the indexing vector (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}).
Although we show toy-like examples, this approach is most useful with data frames containing many rows.

If the existing variable is a character vector or factor, we need to create a named vector with the new values as data and the existing values as names.

<<>>=
df12 <-
  data.frame(genotype = rep(c("WT", "mutant1", "mutant2"), 2),
             value = c(1.5, 3.2, 4.5, 8.2, 7.4, 6.2))
mutant <- c(WT = FALSE, mutant1 = TRUE, mutant2 = TRUE)
df12$mutant <- mutant[df12$genotype]
df12
@

If the existing variable is an \code{integer} vector, we can use a vector without names, being careful that the positions in the \emph{mapping} vector match the values of the existing variable.

<<>>=
df13 <- data.frame(individual = rep(1:3, 2),
                   value = c(1.5, 3.2, 4.5, 8.2, 7.4, 6.2))
genotype <- c("WT", "mutant1", "mutant2")
df13$genotype <- genotype[df13$individual]
df13
@

\begin{advplayground}
Add a variable named \code{genotype} to the data frame below so that for individual \code{4} its value is \code{"WT"}, for individual \code{1} its value is \code{"mutant1"}, and for individual \code{2} its value is \code{"mutant2"}.

<<>>=
DF1 <- data.frame(individual = rep(c(2, 4, 1), 2),
                  value = c(1.5, 3.2, 4.5, 8.2, 7.4, 6.2))
@
\end{advplayground}

\subsection{Operating within data frames}\label{sec:calc:df:with}

In the case of computing new values from existing variables, named vectors are of limited use. Instead, variables in a data frame can be added or modified with the \Rlang functions \Rscoping{transform()}, \Rscoping{with()} and \Rscoping{within()}. These functions can be thought of as convenience functions, as the same computations can be done using the extraction operators to access individual variables, on the lhs, the rhs, or both (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}).
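Function \Rscoping{transform()} evaluates named expressions within the data frame and returns a modified copy of it. A minimal sketch (the example data frame is of my own making; the returned copy still needs to be saved by assignment).

<<>>=
df0 <- data.frame(A = 1:10, B = 3)
df0.new <- transform(df0, C = (A + B) / A) # add column C
head(df0.new, 2)
@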
In the case of \Rscoping{with()}, only one, possibly compound, code statement is affected, and this statement is passed as an argument. As before, we need to fully specify the left-hand side of the assignment. The value returned is the one returned by the statement passed as an argument; in the case of compound statements, it is the value returned by the last contained simple code statement to be executed. Consequently, if the intent is to modify the container, assignment to an individual member variable (a column in this case) is required.

In this example, column \code{A} of \code{df14} takes precedence over any variable \code{A} in the workspace, and the returned value is the expected one.

<<>>=
df14 <- data.frame(A = 1:10, B = 3)
df14$C <- with(df14, (A + B) / A) # add column
head(df14, 2)
@

In the case of \Rscoping{within()}, assignments in the argument to its second parameter affect the object returned, which is a copy of the container (in this case, a whole data frame) that still needs to be saved through assignment. Here the returned value is assigned to a different name, \code{df15}, so \code{df14} remains unchanged; assigning the result back to \code{df14} would have replaced the original.

<<>>=
df14$C <- NULL
df15 <- within(df14, C <- (A + B) / A) # modified copy
head(df15, 2)
@

In the example above, using \code{within()} makes little difference compared to using \Rscoping{with()} with respect to the amount of typing or clarity, but with multiple member variables being operated upon, as shown below, \Rscoping{within()} has an advantage, resulting in more concise and easier-to-understand code.

<<>>=
df16 <- within(df14,
               {C <- (A + B) / A
                D <- A * B
                E <- A / B + 1}
               )
head(df16, 2)
@

\begin{explainbox}
Repeatedly pre-pending the name of a \emph{container} such as a list or data frame to the name of each member variable being accessed can make \Rlang code verbose and difficult to understand. Functions \Rscoping{attach()} and its matching \Rscoping{detach()} allow us to change where \Rlang first looks for the names of objects we include in a code statement. When using a long name for a data frame, entering a simple calculation can easily result in a difficult-to-read statement. Here, even with a very short name, the difference is noticeable.

<<>>=
df14$C <- (df14$A + df14$B) / df14$A
df14$D <- df14$A * df14$B
df14$E <- df14$A / df14$B + 1
head(df14, 2)
@

Using\index{data frames!attaching}\label{par:calc:attach} \Rscoping{attach()} we can alter where \Rlang looks up names and consequently simplify the statement. With \Rscoping{detach()} we can restore the original state. It is important to remember that here we can only simplify the right-hand side of the assignment, while the ``destination'' of the result of the computation still needs to be fully specified on the left-hand side of the assignment operator. We include below only one statement between \Rscoping{attach()} and \Rscoping{detach()}, but multiple statements are allowed. Furthermore, if variables with the same name as the columns exist in the search path, these will take precedence, something that can result in bugs or crashes. As seen below, a message warns that variable \code{A} from the global environment will be used instead of column \code{A} of the attached \code{df17}. The returned value is, of course, not the desired one.

<<>>=
df17 <- data.frame(A = 1:10, B = 3)
A
attach(df17)
A
detach(df17)
A
@

<<>>=
attach(df17)
df17$C <- (A + B) / A
detach(df17)
head(df17, 2)
@

Use of \Rscoping{attach()} and \Rscoping{detach()}, which work as a pair of ON and OFF switches, can result in an undesired after-effect on name lookup if the script terminates after \Rscoping{attach()} is executed but before \Rscoping{detach()} is called, as the attached object is not detached. In contrast, \Rscoping{with()} and \Rscoping{within()}, being self-contained, guarantee that cleanup takes place. Consequently, the usual recommendation is to give preference to the use of \Rscoping{with()} and \Rscoping{within()} over \Rscoping{attach()} and \Rscoping{detach()}.
\end{explainbox}

<<>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Reshaping and editing data frames}\label{sec:calc:reshape}
\index{data frames!long vs.\ wide shape}

As mentioned above, in most cases \Rlang data rows represent measurement events or observations, possibly on multiple response variables, plus factors describing groupings, i.e., a ``long'' shape. However, when measurements are repeated in time, columns rather frequently represent observations of the same response variable at different times, i.e., a ``wide'' shape. Other cases exist where reshaping is needed. Function \Rfunction{reshape()} can convert wide data frames into long data frames and vice versa. See section \ref{sec:data:reshape} on page \pageref{sec:data:reshape} on package \pkgnameNI{tidyr} for an alternative approach to reshaping data with a friendlier user interface.

We start by creating a data frame of hypothetical data measured on two occasions.
With these data, if we wish, for example, to compute the growth of each subject as the difference in \code{Weight} and in \code{Height} between the two occasions, one approach is to reshape the data frame into a wider shape and subsequently subtract the columns.

<<>>=
# artificial data
df1 <- data.frame(id = rep(1:4, rep(2,4)),
                  Time = factor(rep(c("Before","After"), 4)),
                  Weight = rnorm(n = 8, mean = c(20.1, 30.8)),
                  Height = rnorm(n = 8, mean = c(9.5, 14.2)))
df1
# make it wider
df2 <- reshape(df1, timevar = "Time", idvar = "id", direction = "wide")
df2
# possible further calculation
within(df2,
       {
         Height.growth <- Height.After - Height.Before
         Weight.growth <- Weight.After - Weight.Before
       })
@

Alternatively, we may want to convert \code{df1} into a longer shape, with a single column of measurements and a new column indicating whether the measured variable was \code{Height} or \code{Weight}. For this operation to succeed, we need to add a column with a unique value for each row in \code{df1}, and one easy way is to copy the row names into a column. The names of the parameters of function \Rfunction{reshape()} are meaningful only when dealing with time, which makes the code below rather difficult to read. Note also that the user is responsible for passing the values of \code{times} in the correct order.

<<>>=
df1$ID <- rownames(df1) # unique ID for each row
# make it longer
reshape(df1,
        idvar = "ID",
        timevar = "Quantity",
        times = c("Weight", "Height"),
        v.names = "Value",
        direction = "long",
        varying = c("Weight", "Height"))
@

To edit a data frame programmatically, one can use the approaches already discussed, using the extraction operators \Roperator{[ ]} or \Roperator{[[ ]]} on the \emph{lhs} of \Roperator{<-} to replace member elements. This, in combination with functions like \Rfunction{gsub()}, makes it possible to ``edit'' the contents of data frames.
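For instance, a small sketch of such ``editing'' with \Rfunction{gsub()}, replacing a substring in one character column of an invented data frame.

<<>>=
df.ed <- data.frame(code = c("exp-1", "exp-2", "ctl-1"), y = c(3, 8, 5))
df.ed$code <- gsub("exp", "experiment", df.ed$code)
df.ed
@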
They can be used to store metadata accompanying the data stored in an object, which is important for reproducible research and data sharing. They can be set and read by user code, and they are also used internally by \Rlang, among other things, to store the class an object belongs to, column and row names in data frames and matrices, and the labels of levels in factors. Although most \Rlang objects have attributes, they are rarely displayed explicitly when an object is printed, while the structure of objects as displayed by function \Rfunction{str()} includes them. - -Although we rarely need to set or extract values stored in attributes explicitly, many of the features of \Rlang that we take for granted are implemented using attributes: column names in data frames are stored in an attribute. Matrices are vectors with additional attributes. - -<>= -df1 <- data.frame(x = 1:6, y = c("a", "b"), z = c(TRUE, FALSE, NA)) -df1 -attributes(df1) -str(df1) -@ - -Attribute \code{"comment"} is meant to be set by users to store a character string---e.g., to store metadata as text together with data. As comments are frequently used, \Rlang has functions for accessing and setting comments. \qRfunction{comment()}\qRfunction{comment()<-} - -<>= -comment(df1) -comment(df1) <- "this is stored as a comment" -comment(df1) -@ - -Functions like \Rfunction{names()}, \Rfunction{dim()} or \Rfunction{levels()} return values retrieved from attributes stored in \Rlang objects, whereas \Rfunction{names()<-}, \Rfunction{dim()<-} or \Rfunction{levels()<-} set (or unset with \code{NULL}) the value of the respective attributes. Dedicated query and set functions do not exist for all attributes. Functions \Rfunction{attr()}, \Rfunction{attr()<-} and \Rfunction{attributes()} can be used with any attribute. With \Rfunction{attr()} we query, and with \Rfunction{attr()<-} we set, individual attributes by name. With \Rfunction{attributes()} we retrieve all attributes of an object as a named \code{list}. In addition, method \Rfunction{str()} displays all components and structure of \Rlang objects, including their attributes. - -Continuing with the previous example, we can retrieve and set the value stored in the \code{"comment"} attribute using these functions. In the second statement we delete the value stored in the attribute by assigning \code{NULL} to it. - -<>= -attr(df1, "comment") -attr(df1, "comment") <- NULL -attr(df1, "comment") -comment(df1) # same as previous line -@ - -The \code{"names"} attribute of \code{df1} was set by the \code{data.frame()} constructor when it was created above. In the next example, in the first statement we retrieve the names, and implicitly print them. In the second statement, read from right to left, we retrieve the names, convert them to upper case and save them back to the same attribute. \qRfunction{colnames()}\qRfunction{colnames()<-} - -<>= -names(df1) -colnames(df1) # same as names() -colnames(df1) <- toupper(colnames(df1)) -colnames(df1) -attr(df1, "names") # same as previous line -@ - -\begin{advplayground} - In general, \Rlang objects do not have by default names assigned to members. As seen on page \pageref{par:calc:vector:map} we can give names to vector members during construction with a call to \Rfunction{c()} or we can assign names (set attribute \code{names}) with function \Rfunction{names()<-} to existing vectors. Lists behave almost the same as vectors, although members of nested objects can also be named.
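- -For example, a minimal sketch (with made-up member names) showing that the members of a nested list can themselves be named: - -<>= -LST1 <- list(a = 1:3, b = list(x = "A", y = "B")) -names(LST1) # names of the outer members -names(LST1$b) # names of the members of the nested list -@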
Data frames have attributes \code{names} and \code{row.names}, which can be accessed with functions \Rfunction{names()} or \Rfunction{colnames()}, and function \Rfunction{rownames()}, respectively. The attributes can be set with functions \Rfunction{names()<-} or \Rfunction{colnames()<-}, and \Rfunction{rownames()<-}. The \Rfunction{data.frame()} constructor sets (column) names and row names by default. The \Rfunction{matrix()} constructor by default does not set \code{dimnames} or \code{names} attributes. When names are assigned to a \code{matrix} with \Rfunction{names()<-}, the matrix behaves like a vector, and the names are assigned to individual members. Functions \Rfunction{dimnames()<-}, \Rfunction{colnames()<-} and \Rfunction{rownames()<-} are used to assign names to columns and rows. The matching functions \Rfunction{dimnames()}, \Rfunction{colnames()} and \Rfunction{rownames()} are used to access these values. - - When no names have been set, \Rfunction{names()}, \Rfunction{colnames()}, \Rfunction{rownames()}, and \Rfunction{dimnames()} return \code{NULL}. In contrast, \Rfunction{labels()}, intended to be used for printing, returns made-up names based on positions. - - Run the examples below and write similar examples for \code{list} and \code{data.frame}. For \code{matrix}, write an additional statement that uses \Rfunction{dimnames()<-} to set row and column names simultaneously. - -<>= -VCT1 <- 5:10 -names(VCT1) -labels(VCT1) -names(VCT1) <- letters[5:10] -names(VCT1) -labels(VCT1) -@ - -<>= -MAT1 <- matrix(1:10, ncol = 2) -dimnames(MAT1) -labels(MAT1) -colnames(MAT1) <- c("a", "b") -colnames(MAT1) -dimnames(MAT1) -labels(MAT1) -@ -\end{advplayground} - -We can add a new attribute, under our own control, as long as its name does not clash with those of existing attributes. - -<>= -attr(df1, "my.attribute") <- "this is stored in my attribute" -attributes(df1) -@ - -\begin{explainbox} -The attributes used internally by \Rlang can be directly modified by user code. In most cases this is unnecessary, as \Rlang provides pairs of functions to query and set the relevant attributes. This is true for the attributes \code{dim}, \code{names} and \code{levels}. In the example below we read the attributes from a matrix. - -<>= -mat1 <- matrix(1:10, ncol = 2) -attributes(mat1) -dim(mat1) -dimnames(mat1) -@ - -<>= -labels(mat1) -mat1 -@ - -<>= -attr(mat1, "dim") -attr(mat1, "dim") <- c(2, 5) -mat1 -@ - -<>= -attr(mat1, "dim") <- NULL -is.vector(mat1) -mat1 -@ - -In this case we could have used \Rfunction{dim()} instead of \Rfunction{attr()}. -\end{explainbox} - -\begin{warningbox} -There is no restriction on the creation, setting, resetting and reading of attributes, but not all functions and operators that can be used to modify objects will preserve non-standard attributes. This can be a problem when using some \Rlang packages, such as the \pkgname{tidyverse}. So, using private attributes is a double-edged sword that is usually worth considering only when designing a new class together with the corresponding methods for it. Good examples of extensive use of class-specific attributes are the values returned by model fitting functions like \Rfunction{lm()} (see section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}).
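- -As an illustration of how easily non-standard attributes can be silently lost, the minimal sketch below (using a made-up attribute name, \code{my.attr}) adds a private attribute to a matrix and shows that extraction with \Roperator{[ ]} drops it. - -<>= -mat2 <- matrix(1:10, ncol = 2) -attr(mat2, "my.attr") <- "some metadata" -names(attributes(mat2)) -names(attributes(mat2[1:4, ])) # "my.attr" has been dropped -@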
-\end{warningbox} - -<>= -rm(list = setdiff(ls(pattern="*"), to.keep)) -@ - -\index{attributes|)} - -\section{Saving and loading data} - -\subsection{Data sets in \Rlang and packages} -\index{data!loading data sets|(}\index{data!saving data sets|(} -To be able to present more meaningful examples, we need some real data. Here we use \code{cars}, one of the many data sets included in base \Rpgrm. Function \Rfunction{data()} is used to load data objects that are included in \Rlang or contained in packages (whether a call to \Rfunction{data()} is needed or not depends on how the package where the data is defined was configured). It is also possible to import data saved in files with \textit{foreign} formats, defined by other software or commonly used for data exchange. Package \pkgname{foreign}, included in the \Rlang distribution, as well as contributed packages make available functions capable of reading and decoding various foreign formats. How to read or import ``foreign'' data is discussed in \Rlang documentation in \emph{R Data Import/Export}, and in this book, in chapter \ref{chap:R:data:io} on page \pageref{chap:R:data:io}. It is also good to keep in mind that in \Rlang, URLs (Uniform Resource Locators) are accepted as arguments to the \code{file} or \code{path} parameter of many functions (see section \ref{sec:files:remote} on page \pageref{sec:files:remote}). - -In the next example we load data available in \Rlang package \pkgname{datasets} as \Rlang objects by calling function \Rfunction{data()}. The loaded \Rlang object \code{cars} is a data frame. (Package \pkgname{datasets} is part of the \Rpgrm distribution and always available.) - -<>= -data(cars) -@ - -%Once we have a data set available, the first step is usually to explore it, and we will do this with \code{cars} in section \ref{sec:calc:looking:at:data} on page \pageref{sec:calc:looking:at:data}. -%\index{data!loading data sets|)} - -\subsection{.rda files}\label{sec:data:rda} - -By\index{file formats!RDA ``R data, multiple objects''} default, at the end of a session, the current workspace containing the results of one's work is saved into a file called \code{.RData}. In addition to saving the whole workspace, it is possible to save one or more \Rlang objects present in the workspace to disk using the same file format (with file name tag \code{.rda} or \code{.Rda}). One or more objects, belonging to any mode or class, can be saved into a single file using function \Rfunction{save()}. Reading the file restores all the saved objects into the current workspace with their original names. These files are portable across most \Rlang versions---i.e., old formats can be read and written by newer versions of \Rpgrm, although the newer, default format may not be readable with earlier \Rpgrm versions. Whether compression is used, and whether the ``binary'' data is encoded into ASCII characters, allowing maximum portability at the expense of increased size, can be controlled by passing suitable arguments to \Rfunction{save()}, as sketched below.
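- -This sketch is not run here and assumes a hypothetical existing object \code{df0}; \code{ascii} and \code{compress} are parameters of \Rfunction{save()} itself. - -<>= -save(df0, file = "df0.rda", ascii = TRUE, compress = "xz") -@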
- -We create a data frame object and then save it to a file. The file name used can be any valid one in the operating system; however, to ensure compatibility with multiple operating systems, it is good to use only ASCII characters. Although not enforced, using the name tag \code{.rda} or \code{.Rda} is recommended. - -<>= -df1 <- data.frame(x = 1:5, y = 5:1) -df1 -save(df1, file = "df1.rda") -@ - -We delete the data frame object and confirm that it is no longer present in the workspace (see page \pageref{par:calc:remove} for details about \Rfunction{remove()} and \Rfunction{objects()}). - -<>= -remove(df1) -objects(pattern = "df1") -@ - -We read the file we earlier saved to restore the object.\qRfunction{load()} - -<>= -load(file = "df1.rda") -objects(pattern = "df1") -df1 -@ - -The default format used is binary and compressed, which results in smaller files. - -\begin{playground} -In the example above, only one object was saved, but one can simply give the bare names of additional objects as arguments separated by commas ahead of \code{file}. Just try saving more than one data frame to the same file. Then the data frames plus a few vectors. After creating each file, clear the workspace and then restore from the file the objects you saved. -\end{playground} - -Sometimes it is easier to supply the names of the objects to be saved as a vector of \code{character} strings passed as an argument to parameter \code{list} (in spite of the name, the argument passed must be a \code{vector}, not a \code{list}). One use case is saving a group of objects based on their names. In this case one can use \Rfunction{ls()} to obtain a vector of \code{character} strings with the names of objects matching a simple \code{pattern} or a complex \emph{regular expression} (see section \ref{sec:calc:regex} on page \pageref{sec:calc:regex}). The example below uses this approach in two steps, first saving in variable \code{dfs} a \code{character} \code{vector} with the names of the objects matching a pattern, and then using this saved vector as an argument to parameter \code{list} in the call to \Rfunction{save()}. - -<>= -dfs <- ls(pattern = "^df") -save(list = dfs, file = "my-dfs.rda") -@ - -The two statements above can be combined into a single statement by nesting the function calls. - -<>= -save(list = ls(pattern = "^df"), file = "my-dfs.rda") -@ - -\begin{playground} -Practice using different patterns with \Rfunction{objects()}. You do not need to save the objects to a file. Just have a look at the list of object names returned. -\end{playground} - -As a coda, I show how to clean up by deleting the two files we created. Function \Rfunction{file.remove()} can be used to delete files stored in the operating system file system, usually on a hard disk drive or a solid state drive, as long as the user has enough rights. No confirmation is requested, so care not to delete valuable files is required. Function \Rfunction{unlink()} is not an exact equivalent, as it can also delete folders and supports recursion through nested folders. The name \emph{unlink} is borrowed from that of the equivalent function in \osnameNI{Unix} and \osnameNI{Linux}. - -<>= -file.remove(c("my-dfs.rda", "df1.rda")) -@ - -\subsection{.rds files}\label{sec:data:rds} - -The\index{file formats!RDS ``R data, single object''} RDS format can be used to save individual objects instead of multiple objects (usually using file name tag \code{.rds}). Files in this format are read and saved with functions \Rfunction{readRDS()} and \Rfunction{saveRDS()}, respectively. The value returned by a call to \Rfunction{readRDS()} is the object read from the file on disk. When RDS files are read, differently from when RDA files are loaded, assigning the object read to a name is frequently the first step. This name can be any valid \Rlang name.
Of course, it is also possible to use the object returned by \Rfunction{readRDS()} as an argument to a function by nesting the function calls. - -<>= -saveRDS(df1, "df1.rds") -@ - -If we read the file at the \Rpgrm console, by default the read \Rlang object will be printed at the console. - -<>= -readRDS("df1.rds") -@ - -If we assign the read object to a different name, it is possible to check if the object read is identical to the one saved. - -<>= -df2 <- readRDS("df1.rds") -identical(df1, df2) -@ - -As above, we clean up by deleting the file. - -<>= -file.remove("df1.rds") -@ - -\subsection{\code{dput()}} - -In\index{file formats!R data ``deparsed object''} general, the use of \code{.rda} and \code{.rds} files is preferred. Function \Rfunction{dput()} is sometimes used to share data as part of a code chunk at StackOverflow, mostly as a convenient way of converting a data frame or list into plain text that can be pasted into the code chunk listing to reconstruct the object. If no argument is passed to parameter \code{file}, the result of deparsing an object is printed at the \Rlang console. - -<>= -dput(df1) -@ - -There exists a companion function \Rfunction{dget()} to recreate the object. -\index{data!saving data sets|)}\index{data!loading data sets|)} - -\begin{warningbox} - Output to, and input from, text-based file formats as well as to and from various binary formats \emph{foreign} to \Rlang is described in chapter \ref{chap:R:data:io} on page \pageref{chap:R:data:io}. -\end{warningbox} - -<>= -rm(list = setdiff(ls(pattern="*"), to.keep)) -@ - -\section{Plotting} -\index{plots!base R graphics} -In most cases the most effective way of obtaining an overview of a data set is by plotting it using multiple approaches. The base-\Rlang generic method \Rfunction{plot()} can be used to plot different kinds of data, as it has specializations suitable for different kinds of objects (see section \ref{sec:script:objects:classes:methods} on page \pageref{sec:script:objects:classes:methods} for a brief introduction to objects, classes and methods). In this section we only very briefly demonstrate the use of the most common base-\Rlang graphics functions. They are well described in the book \citebooktitle{Murrell2019} \autocite{Murrell2019}. We will not describe the Lattice (based on S's Trellis) approach to plotting \autocite{Sarkar2008}. Instead we describe in detail the use of the \emph{layered grammar of graphics} and plotting with package \ggplot in chapter \ref{chap:R:plotting} on page \pageref{chap:R:plotting}. - -\subsection{Plotting data} -It is possible to pass two variables (here columns from a data frame) directly as arguments to the \code{x} and \code{y} parameters of \Rfunction{plot()}. - -<>= -opts_chunk$set(opts_fig_narrow_square) -@ - -<>= -plot(x = cars$speed, y = cars$dist) -@ - -We can also use \Rfunction{with()} or \Rfunction{attach()} as described in section \ref{sec:calc:df:with} on page \pageref{sec:calc:df:with}. (Same plot as above, not shown.) - -<>= -with(cars, plot(x = speed, y = dist)) -@ - -However, it is best to use a \emph{formula} to specify the variables to be plotted on the $x$ and $y$ axes, passing additionally as an argument to parameter \code{data} a data frame containing these variables. The formula \code{dist \textasciitilde\ speed} is read as \code{dist} explained by \code{speed}---i.e., \code{dist} is mapped to the $y$-axis as the dependent variable and \code{speed} to the $x$-axis as the independent variable.
The names used in the formula are looked up as columns in the \code{data.frame} argument passed to \code{data}, similarly to when using \Rfunction{with()}. As described in section \ref{sec:stat:mf} on page \pageref{sec:stat:mf}, the same syntax is used to describe models to be fitted to observations. (Same plot as above, not shown.) - -<>= -plot(dist ~ speed, data = cars) -@ - -Within \Rlang there exist different specializations, or ``flavors,'' of method \Rfunction{plot()} that become active depending on the class of the variables passed as arguments: passing two numerical variables results in a scatter plot, as seen above. In contrast, passing one factor and one numeric variable to \code{plot()} results in a box-and-whiskers plot being produced. To exemplify this we need to use a different data set, here \code{chickwts}, as \code{cars} does not contain any factors. Use \code{help("chickwts")} to learn more about this data set, also included in \Rpgrm. - -<>= -plot(weight ~ feed, data = chickwts) -@ - -\subsection{Graphical output} -Graphical\index{file formats!PDF}\index{file formats!PNG} output, such as produced by \Rfunction{plot()}, is rendered by means of \emph{graphical output devices}. -When \Rlang is used interactively, a software device is opened automatically to render the graphical output on a physical device, usually the computer screen. The name of the \Rlang software device used may depend on the operating system (e.g., \osname{MS-Windows} or \osname{Linux}), or on the IDE (e.g., \RStudio). - -In \Rlang, software graphical devices do not necessarily generate output on a physical device like a printer, as several of these devices translate the plotting commands into a file format and save it to disk. Several different graphical devices are available in \Rlang and they differ in the kind of output they produce: raster or bitmap files (e.g., TIFF, PNG and JPEG formats), vector graphics files (e.g., SVG, EPS and PDF), or output to a physical device like the screen of a computer. Additional devices are available through contributed \Rlang packages. - -\RStudio makes it possible to export plots into graphic files through a menu-based interface in the \emph{Plots} viewer tab. This interface uses some of the \Rlang devices that are available at the console and through scripts. For the sake of reproducibility, it is preferable to include the \Rlang commands used to export plots in the scripts used for data analysis. - -Devices follow the paradigm of ON and OFF switches, opening and closing a destination for \code{print()}, \code{plot()} and related functions. Some devices producing a file as output save their output one plot at a time to single-page graphic files, while others write the file only when the device is closed, possibly as a multi-page file. - -When opening a device the user supplies additional information. For the PDF and SVG devices that produce output in a vector-graphics format, width and height of the output are specified in \emph{inches}. A default file name is used unless we pass a \code{character} string as an argument to parameter \code{file}. - -<>= -pdf(file = "output/my-file.pdf", width = 6, height = 5, onefile = TRUE) -plot(dist ~ speed, data = cars) -plot(weight ~ feed, data = chickwts) -dev.off() -@ - -Raster devices return bitmaps, and \code{width} and \code{height} are specified, in most cases, in \emph{pixels}.
- -<>= -png(file = "output/my-file.png", width = 600, height = 500) -plot(weight ~ feed, data = chickwts) -dev.off() -@ - -The approach of direct output to a software device is used in base \Rlang by \Rfunction{plot()} and its companions \Rfunction{text()}, \Rfunction{lines()}, and \Rfunction{points()}. \Rfunction{plot()} outputs a graph onto which the other three functions can add. The addition of plot components, as shown below, is done directly to the output device, i.e., when output is to the computer screen the partial plot is visible at each step. - -<>= -png(file = "output/my-file.png", width = 600, height = 500) -plot(dist ~ speed, data = cars) -text(x = 10, y = 110, labels = "some texts to be added") -dev.off() -@ - -This is not the only approach available in \Rpgrm for building complex plots. As we will see in chapter \ref{chap:R:plotting} on page \pageref{chap:R:plotting}, an alternative approach is to build a \emph{plot object} as a list of member components, which can be saved like any other \Rlang object. This object is later rendered as a whole on a graphical device by calling \code{print()} once. - -\index{data!exploration at the R console|)} -\index{data sets!their storage|)} - -\section{Further reading} -For\index{further reading!using the R language} further reading on the aspects of \Rlang discussed in the current chapter, I suggest the book \citetitle{Matloff2011} \autocite{Matloff2011}, with emphasis on the \Rlang language and programming. An in-depth description of plotting and graphic devices in \Rlang is available in the book \citetitle{Murrell2019} \autocite{Murrell2019}. - -<>= -rm(list = setdiff(ls(pattern="*"), to.keep)) -@ - -<>= -knitter_diag() -R_diag() -other_diag() -@ - ->>>>>>> Stashed changes diff --git a/R.intro.Rnw b/R.intro.Rnw index a4d2514e..a35d3d8b 100644 --- a/R.intro.Rnw +++ b/R.intro.Rnw @@ -1,233 +1,3 @@ -<<<<<<< Updated upstream -% !Rnw root = using-r.main.Rnw - -<>= -# opts_chunk$set(opts_fig_wide) -opts_knit$set(unnamed.chunk.label = 'intro-chunk') -opts_knit$set(concordance=TRUE) -@ - -\chapter{\Rlang: the Language and the Program}\label{chap:R:introduction} - -\begin{VF} -In a world of \ldots\ relentless pressure for more of everything, one can lose sight of the basic principles---simplicity, clarity, generality---that form the bedrock of good software. - -\VA{Brian W. Kernighan and Rob Pike}{\emph{The Practice of Programming}, 1999}\nocite{Kernighan1999} -\end{VF} - -\section{Aims of this chapter} - -I share some facts about the history and design of the \Rlang language so that you can gain a good vantage point from which to grasp the logic behind \Rlang's features, making it easier to understand and remember them. You will learn the distinction between the \Rpgrm program itself and the front-end programs, like \RStudio, frequently used together with \Rpgrm. - -You will also learn how to interact with \Rpgrm when sitting at a computer. You will learn the difference between typing commands interactively, reading each partial result from \Rlang on the screen as you enter them, and using \Rlang scripts containing multiple commands stored in a file to execute or run a ``job'' that saves results to another file for later inspection. - -I describe the steps taken in a typical scientific or technical study, including the data analysis work flow and the roles that \Rpgrm can play in it.
I share my views on the advantages and disadvantages of textual command languages such as \Rlang compared to menu-driven user interfaces, frequently used in other statistics software. I discuss the role of textual languages and \emph{literate programming} in the very important question of reproducibility of data analyses and mention how I have used them while writing and typesetting this book. - -\section{What is \Rlang?} - -\subsection{R as a language} -\index{R as a language@{\Rlang as a language}} -\Rlang is a computer language designed for data analysis and data visualization; however, in contrast to some other scripting languages, it is, from the point of view of computer programming, a complete language---it is not missing any important feature. In other words, no fundamental operations or data types are lacking \autocite{Chambers2016}. I attribute much of its success to the fact that its design achieves a very good balance between simplicity, clarity and generality. \Rlang excels at generality thanks to its extensibility at the cost of only a moderate loss of simplicity, while clarity is ensured by enforced documentation of extensions and support for both object-oriented and functional approaches to programming. The same three principles can also be easily followed by user code written in \Rlang. - -In the case of languages like \Cpplang, \Clang, \pascallang and \langname{FORTRAN}, multiple software implementations exist (different compilers and interpreters, i.e., pieces of software that translate programs encoded in these languages into \emph{machine code} instructions for computer processors to run). So, in addition to different flavours of each language stemming from different definitions, e.g., versions of international standards, different implementations of the same standard may have, usually small, unintentional and intentional differences. - -Most people think\index{R as a language@{\Rlang as a language}}\index{R as a program@{\Rlang as a program}} of \Rpgrm as a computer program, similar to \pgrmname{SAS} or \pgrmname{SPSS}. \Rpgrm is indeed a computer program---a piece of software---but it is also a computer language, implemented in the \Rpgrm program. At the moment, differently from most other computer languages, this distinction is of little practical importance, as the \Rpgrm program is the only widely used implementation of the \Rlang language. - -\Rlang started as a partial implementation of the then relatively new \Slang language \autocite{Becker1984,Becker1988}. When designed, \Slang, developed at Bell Labs in the U.S.A., provided a novel way of carrying out data analyses. \Slang evolved into \Splang \autocite{Becker1988}. \Splang was available as a commercial program, most recently from TIBCO, U.S.A. \Rlang started as a poor man's home-brewed implementation of \Slang, for use in teaching, developed by Robert Gentleman and Ross Ihaka at the University of Auckland, in New Zealand \autocite{Ihaka1996}. Initially \Rpgrm, the program, implemented a subset of the \Slang language. The \Rpgrm program evolved until only relatively few differences between \Slang and \Rlang remained. These remaining differences are intentional---thought of as significant improvements. In more recent times \Rlang overtook \Splang in popularity. The \Rlang language is not standardised, and no formal definition of its grammar exists. Consequently, the \Rlang language is defined by the behavior of its implementation in the \Rpgrm program.
- -What makes \Rlang different from \pgrmname{SPSS}, \pgrmname{SAS}, etc., is that \Slang was designed from the start as a computer programming language. This may look unimportant for someone not actually needing or willing to write software for data analysis. However, in reality it makes a huge difference because \Rlang is easily extensible, both using the \Rlang language for implementation and by calling from \Rlang functions and routines written in other computer programming languages such as \Clang, \Cpplang, \langname{FORTRAN}, \pythonlang or \javalang. This flexibility means that new functionality can be easily added, and easily shared with a consistent \Rlang-based user interface. In other words, instead of having to switch between different pieces of software to do different types of analyses or plots, one can usually find a package that will make new tools seamlessly available within \Rlang. - -The name ``base \Rlang\index{base R@{base \Rlang}}'' is used to distinguish \Rlang itself, as in the \Rpgrm executable included in the \Rpgrm distribution and its default packages, from \Rlang in a broader sense, which includes contributed packages. A few packages are included in the \Rpgrm distribution, but most \Rlang packages are independently developed extensions and separately distributed. The number of freely available open-source \Rlang packages is huge, on the order of 20\,000. - -The most important advantage of using a language like \Rlang is that instructions to the computer are given as text. This makes it easy to repeat or \emph{reproduce} a data analysis. Textual instructions serve to communicate to other people what has been done in a way that is unambiguous. Sharing the instructions themselves avoids a translation from a set of instructions to the computer into text readable to humans---say the materials and methods section of a paper. - -\begin{explainbox} -Readers with programming experience will notice that some features of \Rlang differ from those in other programming languages. \Rlang does not have the strict type checks of \langname{Pascal} or \Cpplang. It has operators that can take vectors and matrices as operands. Reliable and fast \Rlang code tends to rely on different \emph{idioms} than well-written \langname{Pascal} or \Cpplang code. -\end{explainbox} - -\subsection{R as a computer program} -\index{R as a program@{\Rpgrm as a program}} -\index{Windows@{\textsf{Windows}}|see{\textsf{MS-Windows}}} -The \Rpgrm program itself is open-source, i.e., its source code is available for anybody to inspect, modify and use. Only a small fraction of users will directly contribute improvements to the \Rpgrm program itself, but doing so is possible, and those contributions are important in making \Rpgrm extremely reliable. The executable, the \Rpgrm program we actually use, can be built for different operating systems and computer hardware. The members of the \Rpgrm developing team aim to keep the results obtained from calculations done on all the different builds and computer architectures as consistent as possible. The idea is to ensure that computations return consistent results not only across updates to \Rpgrm but also across different operating systems like \osname{Linux}, \osname{Unix} (including \osname{OS X}), and \osname{MS-Windows}, and computer hardware. - -\begin{figure} - \centering - \includegraphics[width=0.85\textwidth]{figures/R-console-r} - \caption[The R console]{The \Rpgrm console where the user can type textual commands one by one.
Here the user has typed \code{print("Hello")} and \textit{entered} it by ending the line of text by pressing the ``enter'' key. The result of running the command is displayed below the command. The character at the head of the input line, a ``$>$'' in this case, is called the command prompt, signaling where a command can be typed in. Commands entered by the user are displayed in red, while results returned by \Rlang are displayed in blue. ``\code{[1]}'' can be ignored here; its meaning is explained on page \pageref{par:print:vec:index}}.\label{fig:intro:console} -\end{figure} - -The \Rpgrm program does not have a full-fledged graphical user interface (GUI), or menus from which to start different types of analyses. Instead, the user types the commands at the \Rpgrm console and the result is displayed starting on the next line (Figure \ref{fig:intro:console}). The same textual commands can also be saved into a text file, line by line, and such a file, called a ``script,'' can substitute for the direct typing of the same sequence of commands at the console (writing and use of \Rlang scripts are explained in chapter \ref{chap:R:scripts} on page \pageref{chap:R:scripts}). When we work at the console, typing in commands one by one, we use \Rlang \emph{interactively}. When we run a script, we may say that we run a ``batch job.'' The two approaches described above are available in the \Rpgrm program itself. - -\begin{explainbox} -As \Rpgrm is essentially a command-line application, it can be used on what nowadays are frugal computing resources, equivalent to a personal computer of three decades ago. \Rpgrm can run even on the Raspberry Pi\index{Raspberry Pi}, a single-board computer with the processing power of a modest smart phone (see \url{https://r4pi.org/}). At the other end of the spectrum, on really powerful servers, \Rpgrm can be used for the analysis of big data sets with millions of observations. How powerful a computer is needed for a given data analysis task depends on the size of the data sets, on how patient one is, on the ability to select efficient algorithms and on writing ``good'' code. -\end{explainbox} - -\section{Using \Rlang}\label{sec:intro:using:R} - -\subsection{Editors and IDEs} -Integrated Development Environments (IDEs)\index{integrated development environment}\index{IDE|see{integrated development environment}} are normally used when developing computer programs. IDEs provide a centralized user interface from within which the different tools used to create and test a computer program can be accessed and used in coordination. Most IDEs include a dedicated editor capable of syntax highlighting (automatically colouring ``code words'' based on their role in the programming language), and even able to report some mistakes in advance of running the code. One could describe such an editor as the equivalent of a word processor with spelling and grammar checking, which can alert about spelling and syntax errors for a computer language like \Rlang instead of for a natural language like English. IDEs frequently add other features that help navigation of the program's source code and make access to documentation easy. - -It is nowadays very common to use an IDE as a front-end or middleman between the user and the \Rpgrm program. Computations are still done in the \Rpgrm program, which is \emph{not} built into the IDEs. Of the available IDEs for \Rpgrm, \RStudio is currently the most popular by a wide margin. Recent versions of \RStudio support \pythonlang in addition to \Rlang.
- -\begin{explainbox} - Readers with programming experience may be already familiar with Microsoft's free \pgrmname{Visual Studio Code} or the open-source \pgrmname{Eclipse} IDEs, for which add-ons supporting \Rpgrm are available. -\end{explainbox} - -The main window of IDEs is in most cases divided into windows or panes, possibly with tabs. In \RStudio one has access to the \Rpgrm console, a text editor, a file-system browser, a pane for graphical output, and access to several additional tools such as for installing and updating extension packages. Although \RStudio supports very well the development of large scripts and packages, it is currently, in my opinion, also the best possible way of using \Rpgrm at the console, as it has the \Rpgrm help system very well integrated both in the editor and the \Rlang console. Figure \ref{fig:intro:rstudio} shows the main window displayed by \RStudio after running the same script as shown at the \Rpgrm console (Figure \ref{fig:intro:script}) and at the operating system command prompt (Figure \ref{fig:intro:shell}). By comparing these three figures it is clear that \RStudio is really only a software layer between the user and an unmodified \Rpgrm executable. In \RStudio the script was sourced by pressing the ``Source'' button at the top of the editor panel. \RStudio, in response to this, generated the code needed to source the file and ``entered'' it at the console (Figure \ref{fig:intro:rstudio}, lower left screen panel, text in purple), the same console where we can directly type this same \Rpgrm command if we wish. - -\begin{figure} - \centering - \includegraphics[width=\linewidth]{figures/Rstudio-script} - \caption[Script in Rstudio]{The \RStudio interface after running the script that is visible in tab \texttt{my-script.R} of the editor pane (top left). Here I used the ``Source'' button to run the script and \Rpgrm printed the results to the \Rpgrm console in the lower left pane. The lower right pane shows a list of files, including the script open in the editor. The upper right pane displays a list of the objects currently visible in the user workspace, in this case only object \code{a}, created by the code in the second line of the \Rlang script.}\label{fig:intro:rstudio} -\end{figure} - -\begin{explainbox} -When a script is run, if an error is triggered, \RStudio automatically finds the location of the error, a feature you will find useful when running code from exercises in this book. Other features are beyond what one needs for simple everyday data analysis and are aimed at package development and report generation. Tools for debugging, code profiling, benchmarking of code and unit tests make it possible to analyze and improve performance as well as help with quality assurance and certification of \Rlang packages, and exceed what you will need for this book's exercises and simple data analysis. \RStudio also integrates support for file version control, which is not only useful for package development, but also for keeping track of progress or for concurrent work with collaborators in the analysis of data. -\end{explainbox} - -The ``desktop'' version of \RStudio, which one installs and uses locally, runs on most modern operating systems, such as \osname{Linux}, \osname{Unix}, \osname{OS X}, and \osname{MS-Windows}. There is also a server version that runs on \osname{Linux}, as well as a cloud service (\url{https://posit.cloud/}) providing shared access to such a server. The \RStudio server is used remotely through a web browser.
The user interface is almost the same in all cases. Desktop and server versions are both distributed as unsupported free software and as supported commercial software. - -\RStudio and other IDEs support saving of their state and some settings per working folder under the name of \emph{project}, so that work on a data analysis can be interrupted and later continued, even on a different computer. As mentioned in section \ref{sec:R:workspace} on page \pageref{sec:R:workspace}, when working with \Rlang we keep related files in a folder. - -In this book I provide only a minimum of guidance on the use of \RStudio, and no guidance for other IDEs. To learn more about \RStudio, please, read the documentation available through \RStudio's help menu and keep at hand a printed copy of the \RStudio cheat sheet while learning how to use it. This and other useful \Rlang-related cheatsheets can be downloaded at \url{https://posit.co/resources/cheatsheets/}. Additional instructions on the use of \RStudio, including a video, are available through the Resources menu entry at the book website at \url{https://www.learnr-book.info/}. - -\subsection{\Rlang sessions and workspaces}\label{sec:R:workspace} - -We use \emph{session} to describe the interactive execution from start to finish of one running instance of the \Rpgrm program. We use \emph{workspace} to name the imaginary space where all objects currently available in an \Rpgrm session are stored. In \Rpgrm the whole workspace can be stored in a single file on disk at the end of, or during, a session and restored later into another session, possibly on a different computer. Usually when working with \Rpgrm we dedicate a folder in disk storage to store all files from a given data analysis project. We normally keep in this folder files with data to read in, scripts, a file storing the whole contents of the workspace, named by default \code{.RData}, and a text file with the history of commands entered interactively, named by default \code{.Rhistory}. The user's files within this folder can be located in nested folders. There are no strict rules on how the files should be organised or on their number. The recommended practice is to avoid crowded folders and folders containing unrelated files. It is a good idea to keep in a given folder and workspace the work in progress for a single data-analysis project or experiment, so that the workspace can be saved and restored easily between sessions and work continued from where one left it independently of work done in other workspaces. The folder where files are currently read and saved is in \Rpgrm documentation called the \emph{current working directory}. When opening an \code{.RData} file the current working directory is automatically set to the location where the \code{.RData} file was read from. - -\begin{warningbox} -\RStudio projects are implemented as a file with a name ending in \code{.Rproj} plus a hidden folder named \code{.Rproj.user}, located under the same folder where scripts, data, \code{.RData} and \code{.Rhistory} are stored. These are managed by \RStudio and should not be modified or deleted by the user. Only in the very rare case of their corruption should they be deleted, and the \RStudio project created again from scratch. Files \code{.RData} and \code{.Rhistory} should not be deleted by the user, except to reset the \Rlang workspace. However, this is unnecessary, as it can also be easily achieved from within \Rpgrm.
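- -For example, a minimal sketch (not meant to be run casually, as it deletes all objects) of how the workspace can be emptied from within \Rpgrm: - -<>= -rm(list = ls()) # remove all objects visible in the workspace -@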
-\end{warningbox} - -\subsection{Using \Rlang interactively} - -Decades ago users communicated with computers through a physical terminal (keyboard plus text-only screen) that was frequently called a \emph{console}\index{console}. A text-only interface to a computer program, in most cases a window or a pane within a graphical user interface, is still called a console. In our case, the \Rpgrm console (Figure \ref{fig:intro:console}). This is the native user interface of \Rpgrm. - -Typing commands at the \Rpgrm console is useful when one is playing around, rather aimlessly exploring things, or trying to understand how an \Rpgrm function or operator we are not familiar with works. Once we want to keep track of what we are doing, there are better ways of using \Rpgrm, which allow us to keep a record of how an analysis has been carried out. The different ways of using \Rpgrm are not mutually exclusive, so most users will use the \Rpgrm console to test individual commands and plot data during the first stages of exploration. As soon as we decide how we want to plot or analyze the data, it is best to start using scripts. This is not enforced in any way by \Rpgrm, but scripts are what really brings to light the most important advantages of using a programming language for data analysis. In Figure \ref{fig:intro:console} we can see how the \Rpgrm console looks. The text in red has been typed in by the user, except for the prompt \code{\textcolor{red}{$>$}}, and the text in blue is what \Rpgrm has displayed in response. It is essentially a dialogue between user and \Rpgrm. The console can \emph{look} different when displayed within an IDE like \RStudio, but the only difference is in the appearance of the text rather than in the text itself (cf.\ Figures \ref{fig:intro:console} and \ref{fig:intro:console:rstudio}). - -\begin{figure} - \centering - \includegraphics[width=\linewidth]{figures/r-console-rstudio} - \caption[The R console]{The \Rpgrm console embedded in \RStudio. The same commands have been typed in as in Figure \ref{fig:intro:console}. Commands entered by the user are displayed in purple, while results returned by \Rpgrm are displayed in black.}\label{fig:intro:console:rstudio} -\end{figure} - -The two previous figures showed the result of entering a single command. Figure \ref{fig:intro:console:capture} shows how the console looks after the user has entered several commands, each as a separate line of text. - -\begin{figure} - \centering - \includegraphics[width=\linewidth]{figures/r-console-capture} - \caption[The R console]{The \Rpgrm console after several commands have been entered. Commands entered by the user are displayed in red, while results returned by \Rpgrm are displayed in blue.}\label{fig:intro:console:capture} -\end{figure} - -The examples in this book require only the console window for user input. Menu-driven programs are not necessarily bad; they are just unsuitable when there is a need to set very many options and choose from many different actions. They are also difficult to maintain when extensibility is desired, and when independently developed modules of very different characteristics need to be integrated. Textual languages also have the advantage, to be addressed in later chapters, that command sequences can be stored in human- and computer-readable text files. Such files constitute a record of all the steps used, and in most cases, make it trivial to manually reproduce the same steps at a later time.
Scripts are a very simple and handy way of communicating to other users how a given data analysis has been done or can be done. - -\begin{explainbox} -In the console one types commands at the \code{>} prompt. When one ends a line by pressing the return or enter key, if the line can be interpreted as an \Rlang command, the result will be printed at the console, followed by a new \code{>} prompt. -If the command is incomplete, a \code{+} continuation prompt will be shown, and you will be able to type in the rest of the command. For example, if the whole calculation that you would like to do is $1 + 2 + 3$ and you enter \code{1 + 2 +} on one line, you will get a continuation prompt where you will be able to type \code{3}. However, if you type \code{1 + 2}, the result will be calculated, and printed. -\end{explainbox} - -For example, one can search for a help page at the \Rpgrm console. - -Below are the first code example and first playground in the book. This first example is for illustration only, and you can return to them later, as only on page \pageref{sec:R:install} do I discuss how to install or get access to the \Rpgrm program. - -<>= -help("sum") -?sum -@ - -\begin{playground} -Look at help for some other functions like \code{mean()}, \code{var()}, \code{plot()} and, why not, \Rfunction{help()} itself! - -<>= -help(help) -@ -\end{playground} - -\begin{warningbox} -When trying to access help related to \Rlang extension packages through \Rlang's built-in help, make sure the package is loaded into the current \Rlang session, as described on page \pageref{sec:packages:install}, before calling \Rfunction{help()}. -\end{warningbox} - -When using \RStudio there are easier ways of navigating to a help page than calling function \Rfunction{help()} by typing its name. For example, with the cursor on the name of a function in the editor or console, pressing the \textsf{F1} key opens the corresponding help page in the help pane. Letting the cursor hover for a few seconds over the name of a function at the \Rpgrm console will open ``bubble help'' for it. If the function is defined in a script or another file that is open in the editor pane, one can directly navigate from the line where the function is called to where it is defined. In \RStudio one can also search for help through the graphical interface. The \Rlang manuals can also be accessed most easily through the Help menu in \RStudio or \pgrmname{RGUI}. - -\subsection{Using \Rlang in a ``batch job''} - -To run a script\index{scripts}\index{batch job} we first need to prepare one in a text editor. Figure \ref{fig:intro:script} shows the console immediately after running the script file shown in the text editor. As before, red text, the command \code{source("my-script.R")}, was typed by the user, and the blue text in the console is what was displayed by \Rpgrm as a result of this action. The title bar of the console shows ``R-console,'' while the title bar of the editor shows the \emph{path} to the script file that is open and ready to be edited followed by ``R-editor.'' - -\begin{figure} - \centering - \includegraphics[width=\linewidth]{figures/R-console-script} - \caption[Script sourced at the R console]{Screen capture of the \Rpgrm console and editor just after running a script. The upper pane shows the \Rpgrm console, and the lower pane, the script file in an editor. }\label{fig:intro:script} -\end{figure} - -\begin{warningbox} -When working at the command prompt, most results are printed by default.
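- -The minimal sketch below illustrates the difference (the statements are not run here): typed at the console, the first statement displays its result automatically, while in a sourced script only the second statement, with its explicit call to \Rfunction{print()}, produces visible output. - -<>= -1 + 2 # result printed automatically at the console, silent in a sourced script -print(1 + 2) # result printed in both cases -@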
Within scripts, in contrast, one needs to call function \Rfunction{print()} explicitly when a result is to be displayed. -\end{warningbox} - -A true ``batch job'' is not run at the \Rpgrm console but at the operating system command prompt, or shell. The shell is the console of the operating system---\osname{Linux}, \osname{Unix}, \osname{OS X}, or \osname{MS-Windows}. Figure \ref{fig:intro:shell} shows how running a script at the Windows command prompt looks. A script can be run at the operating system prompt to do time-consuming calculations with the output saved to a file. One may use this approach on a server, say, to leave a large data analysis job running overnight or even for several days. - -\begin{figure} - \centering - \includegraphics[width=\linewidth]{figures/windows-cmd-script} - \caption[Script at the Windows cmd prompt]{Screen capture of the \osname{MS-Windows} command console just after running the same script. Here we use \code{Rscript} to run the script; the exact syntax will depend on the operating system in use. In this case, \Rpgrm prints the results at the operating system console or shell, rather than in its own \Rpgrm console.}\label{fig:intro:shell} -\end{figure} - -Within \RStudio desktop it is possible to access the operating system shell through the tab named ``Terminal'' and through the menu. It is also possible to run jobs in the background in tab ``Background jobs'', i.e., while simultaneously using the \Rpgrm console. This is made possible by concurrently running two or more instances of the \Rpgrm program. - -\section{Reproducible data analysis with \Rlang} -\index{reproducible data analysis|(} -Statistical concepts and procedures are not only important after data are collected but also crucial at the design stage of any data-based study. Rather frequently, we deal with existing data from the real world or from model simulations already at the planning stage of an experiment or survey. Statistics provides the foundation for the design of experiments and surveys, data analysis and data visualization. This is similar to the role played by grammar and vocabulary in communication in a natural language like English. Statistics makes possible decision-making based on partial evidence (or samples), but it is also a means of communication. Data visualization also plays a key role in the written and oral communication of study conclusions. \Rlang is useful throughout all stages of the research process, from design of studies to communication of the results. - -During recent years the lack of reproducibility in scientific research, frequently described as a \emph{reproducibility crisis}, has been broadly discussed and analysed \autocite{Gandrud2015}. One of the problems faced when attempting to reproduce scientific and technical studies is reproducing the data analysis. More generally, under any situation where accountability is important, from scientific research to decision making in commercial enterprises, industrial quality control and safety, and environmental impact assessments, being able to reproduce a data analysis reaching the same conclusions from the same data is crucial. Thus, an unambiguous description of the steps taken for an analysis is a requirement. Currently, most approaches to reproducible data analysis are based on automating report generation and including, as part of the report, all the computer commands used.
- -A reliable record of what commands have been run on which data is especially difficult to keep when issuing commands through menus and dialogue boxes in a graphical user interface or by interactively typing commands as text at a console. Even working interactively at the \Rpgrm console using copy and paste to include commands and results in a report typed in a word processor is error prone and laborious. The use and archiving of \Rlang scripts alleviates this difficulty. - -However, a further requirement to achieve reproducibility is the consistency between the saved and reported output and the \Rlang commands reported as having been used to produce them, saved separately when using scripts. This creates an error-prone step between data analysis and reporting. To solve this problem, an approach to data analysis inspired by what is called \emph{literate programming} \autocite{Knuth1984a} was developed: running an especially formatted script that produces a document, used to report the analysis, that includes the \Rlang code used for the analysis, the results of running this code and any explanatory text needed to describe the methodology used and interpret the results of the analysis. - -Although a system capable of producing such reports with \Rlang, called \pkgname{Sweave} \autocite{Leisch2002}, has been available for a couple of decades, it was rather limited and not supported by an IDE, making its use rather tedious. Package \pkgname{knitr} \autocite{Xie2013} further developed the approach and, together with its integration into \RStudio, made the use of this type of report much easier. Less sophisticated reports, called \Rlang \emph{notebooks}, formatted as HTML files, can be created directly from ordinary \Rlang scripts containing no special formatting. Notebooks are HTML files that show as text the code used interspersed with the results, and can embed the actual source script used to generate them. - -Package \pkgname{knitr} supports the writing of reports with the textual explanations encoded using either \Markdown or \Latex as markup for text-formatting instructions. While \Markdown (\url{https://daringfireball.net/projects/markdown/}) is an easy-to-learn and easy-to-use text markup approach, \Latex \autocite{Lamport1994} is based on \TeX \autocite{Knuth}, the most powerful typesetting engine freely available. There are different flavours of \Markdown, including \Rmarkdown (see \url{https://rmarkdown.rstudio.com/}) with special support for \Rlang code. \Quarto (see \url{https://quarto.org/}) was recently released as an enhancement of \Rmarkdown, improving typesetting and styling and providing a single system capable of generating a broad selection of outputs. When used together with \Rlang, \Quarto relies on package \pkgname{knitr} for the key step in the conversion, so in a strict sense \Quarto does not replace it. - -Because of the availability of these approaches to the generation of reports, the \Rlang language is extremely useful when reproducibility is important. Both \pkgname{knitr} and \Quarto are powerful and flexible enough to write whole books, such as this very book you are now reading, produced with \Rpgrm, \pkgname{knitr} and \LaTeX. All pages in the book are generated directly from the source files; all plots and other \Rlang output shown are generated by \Rpgrm and included automatically in the typeset version. All diagrams are generated by \LaTeX during the typesetting step.
The only exceptions are the figures in this chapter, which have been manually captured from the computer screen. Why am I using this approach? First, because I want to make sure that every bit of code, as you see it printed, runs without error. In addition, I want to make sure that the output displayed below every line or chunk of \Rlang language code is exactly what \Rpgrm returns. Furthermore, it saves a lot of work for me as author, as I can just update \Rpgrm and all the packages used to their latest version, and build the book again, after any changes needed to keep it up to date and free of errors. By using these tools and markup in plain text files, the indices, cross-references, citations and list of references are all generated automatically. - -Although the use of these tools is very important, they are outside the scope of this book and well described in other books dedicated to them \autocite{Gandrud2015,Xie2013}. When using \Rlang in this way, a good command of \Rlang as a language for communication with both humans and computers is very useful. -\index{reproducible data analysis|)} - -\section{Getting ready to use \Rlang}\label{sec:R:install} - -As the book is designed with the expectation that readers will run code examples as they read the text, you have to ensure access to the \Rpgrm program before reading the next chapter. It is likely that your school, employer or teacher has already enabled access to \Rpgrm. If not, or if you are reading the book on your own, you should install \Rpgrm or secure access to an on-line service. Using \RStudio or another IDE can facilitate the use of \Rpgrm, but all the code in the remaining chapters makes use only of \Rpgrm and packages available through CRAN. - -I have written an \Rlang package, named \pkgname{learnrbook}, containing original data and computer-readable listings for all code examples and exercises in the book. It also contains code and data that makes it easier to install the packages used in later chapters. It is available through CRAN. \textbf{It is not necessary for you to install this or any other packages until section \ref{sec:packages:install} on page \pageref{sec:packages:install}, where I explain how to install and use \Rlang packages.} - -\begin{faqbox}{Are there any resources to support the \emph{Learn R: As a Language} book?} -Please, visit \url{https://www.learnr-book.info/} to find additional material related to this book, including additional free chapters. Up-to-date instructions for software installation are provided on-line at this and other sites, as these instructions are likely to change after the publication of the book. -\end{faqbox} - -\begin{faqbox}{How to install the \textsf{R} program in my computer?} -Installation of \Rpgrm varies depending on the operating system and computer hardware, and is in general similar to that of other software under a given operating system distribution. For most types of computer hardware the current version of \Rpgrm is available through the Comprehensive \Rlang Archive Network (CRAN) at \url{https://cran.r-project.org/}. Especially in the case of Linux distributions, \Rpgrm can frequently be installed as a component of the operating system distribution. There are some exceptions, such as the \textsl{R4Pi}\index{Raspberry Pi} distribution of \Rpgrm for the Raspberry Pi, which is maintained independently (\url{https://r4pi.org/}).
Installers for Linux, Windows and MacOS are available through CRAN (\url{https://cran.r-project.org/}), together with brief but up-to-date installation instructions.
\end{faqbox}

\begin{faqbox}{How to install the \textsf{RStudio} IDE in my computer?}
\RStudio installers are available at Posit's web site (\url{https://posit.co/products/open-source/rstudio/}); the free version is suitable for running the code examples and exercises in the book. In many cases the IT staff at your employer or school will install them, or they may already be included in the default computer setup.
\end{faqbox}

\begin{faqbox}{How to get access to \textsf{RStudio} as a cloud service?}
An alternative that is very well suited for courses, or for learning as part of a group, is the \RStudio cloud service, recently renamed Posit cloud (\url{https://posit.co/products/cloud/cloud/}). For individual use a free account is in many cases enough, and for groups a low-cost teacher's account works very well.
\end{faqbox}

\section{Further reading}
Suggestions\index{further reading!shell scripts in Unix and Linux} for further reading depend on how you plan to use \Rlang. If you envision yourself running batch jobs under \pgrmname{Linux} or \pgrmname{Unix}, you would profit from learning to write shell scripts. Because \pgrmname{bash} is widely used nowadays, \citebooktitle{Newham2005} \autocite{Newham2005} can be recommended. If you aim at writing \Rlang code that is going to be reused, and have some familiarity with \Clang, \Cpplang or \javalang, reading \citetitle{Kernighan1999} \autocite{Kernighan1999} will provide a mostly language-independent view of programming as an activity and help you master the all-important tricks of the trade. The history of \Rlang, and its relation to \Slang, is best told by those who were involved at the early stages of its development: \citeauthor{Chambers2016} (\citeyear{Chambers2016}, Chapter 2) and \citeauthor{Ihaka1998} (\citeyear{Ihaka1998}).

<<>>=
knitter_diag()
R_diag()
other_diag()
@

\chapter{Using the Book to Learn \Rlang}

\begin{VF}
The important part of becoming a programmer is learning to think like a programmer. You don't need to know the details of a programming language by heart, you can just look that stuff up.\vspace{1.5ex}

The treasure is in the structure, not the nails.

\VA{P. Burns}{\emph{Tao Te Programming}, 2012}\nocite{Burns2012}
\end{VF}

\section{Aims of this chapter}

In this chapter I describe how I imagine the book can be used most effectively to learn the \Rlang language. Because learning \Rlang also involves remembering what one has previously learnt and later forgotten, this book and other sources will also need to be used as references. Learning to use \Rlang effectively also involves learning how to search for information and how to ask questions of other users, for example, through on-line forums. Thus, I also give advice on how to find answers to \Rlang-related questions and how to use the available documentation.

\section{Approach and structure}

Depending on previous experience, reading \emph{Learn R: As a Language} will be about exploring a new world or revisiting a familiar one.
In both cases \emph{Learn R: As a Language} aims to be a travel guide, neither a traveler's account nor a cookbook of \Rlang recipes. It can be used as a course book, as supplementary reading or for self-instruction, and also as a reference.\vspace{1ex} My hope is that, as a guide to the use of \Rlang, this book will remain useful to readers as they gain experience and develop skills.

\noindent
\emph{I encourage readers to approach \Rlang like a child approaches his or her mother tongue when learning to speak: do not struggle, just play, and fool around with \Rlang! If the going gets difficult and frustrating, take a break! If you get a new insight, take a break to enjoy the victory!\vspace{1ex}
}

In \Rlang, like in most ``rich'' languages, there are multiple ways of coding the same operations. I have included code examples that aim to strike a balance between execution speed and readability. One could write equivalent \Rlang books using substantially different code examples. Keep this in mind when reading the book and using \Rlang. Keep also in mind that it is impossible to remember everything about \Rlang, and as a user you will frequently need to consult the documentation, even while doing the exercises in this book. The \Rlang language, in a broad sense, is vast because it can be expanded with independently developed packages. Learning to use \Rlang mainly consists of learning the basics plus developing the skill of finding your way in \Rlang, its documentation and on-line question-and-answer forums.

Readers should not aim at remembering all the details presented in the book; this is impossible for most of us. Using this and other books and documentation effectively as references depends on a good grasp of a broad picture of how \Rlang works and on learning how to navigate the documentation; i.e., it is more important to remember abstractions, the situations in which they are used, and function names, than the details of how to use them. Developing a sense of when one needs to be careful not to fall into a ``language trap'' is also important.

The book can be used both as a text book for learning \Rlang and as a reference. It starts with simple concepts and language elements, progressing towards more complex language structures and uses. Along the way readers will find, in each chapter, descriptions and examples for the common (usual) cases and the exceptions. Some books hide the exceptions and counterintuitive features from learners to make the learning easier; I have instead included these, but marked them using icons and marginal bars. There are two reasons for choosing this approach. First, the boundary between boringly easy and frustratingly challenging is different for each of us, and varies depending on the subject dealt with. So, I hope the marks will help readers predict what to expect, how much effort to put into each section, and even what to read and what to skip. Second, if I had hidden the tricky bits of the \Rlang language, I would have made readers' later use of \Rlang more difficult. It would have also made the book less useful as a reference.

The book contains many code examples as well as exercises. I expect readers will run the code examples and try as many variations of them as needed to develop an understanding of the ``rules'' of the \Rlang language, e.g., how the function or feature exemplified works. This is what long-time users of \Rlang do when facing an unfamiliar feature or a gap in their understanding.
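For instance, when unsure how missing values are handled by summary functions, one might try variations like those below at the console (an invented illustration, not one of the book's exercises).

<<eval=FALSE>>=
sum(c(1, 2, NA))                 # returns NA
sum(c(1, 2, NA), na.rm = TRUE)   # NA removed before summing
mean(c(1, 2, NA), na.rm = TRUE)  # the same convention applies to mean()
@

Comparing the values returned by such small variations is usually faster than, and a good complement to, reading a help page from top to bottom.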
Readers new to \Rlang should read at least chapters \ref{chap:R:introduction} to \ref{chap:R:functions} sequentially, possibly skipping the parts of the text and exercises marked as advanced. However, I expect it to be most useful for these readers not to completely skip the descriptions of unusual features and special cases, but rather to skim enough of them to get an idea of what special situations they may face as \Rlang users. Exercises should not be skipped, as they are a key component of the didactic approach used.

Readers already familiar with \Rlang will be able to read the chapters in the book in any order, as the need arises. Marginal bars and icons, and the backward and forward cross references among sections, make it possible for readers to \emph{select a suitable path} within the book, both when learning \Rlang and when using the book as a reference.

I expect \emph{Learn R: As a Language} to remain useful as a reference to those readers who use it to learn \Rlang. It will also be useful as a reference to readers already familiar with \Rlang. To support the use of the book as a reference, I have been thorough with indexing, including many carefully chosen terms, their synonyms and the names of all \Rlang objects and constructs discussed, collecting them in three alphabetical indexes: \emph{General index}, \emph{Index of R names by category}, and \emph{Alphabetic index of R names}, starting at pages \pageref{idx:general}, \pageref{idx:rcats} and \pageref{idx:rindex}, respectively. I have also included back and forward cross references linking related sections throughout the whole book.

\section{Typographic and naming conventions}

\subsection{Call outs}

Marginal bars and icons are used in the book to signal which content is advanced or included with a specific aim. The following icons and colours are used.

%\begin{infobox}
%Signals ancillary information, in most cases unrelated to \Rlang as a language.
%\end{infobox}

\begin{explainbox}
Signals in-depth explanations of specific \Rlang features or general programming concepts. Several of these explanations make reference to programming concepts or features of the \Rlang language that are explained later in the book. Readers new to \Rlang and computer programming can safely skip these call outs on a first reading of the book. To become proficient in the use of \Rlang, these readers are expected to return to these call outs at a later time, without hurry, preferably with a cup of coffee or tea. Readers with more experience, like those possibly reading individual chapters or using the book as a reference, will find these in-depth explanations useful.
\end{explainbox}

\begin{warningbox}
Signals important bits of information that must be remembered when using \Rlang---i.e., explanations of some unusual, but important, feature of the language, or concepts that in my experience are easily missed by those new to \Rlang.
\end{warningbox}

\begin{faqboxNI}{Frequently asked question}
Signals my answer to a question that I expect to be useful to readers, based on the popularity of similar or related questions posted in on-line forums. When reading through the book, they highlight things that are worth special attention. When using the book as a reference, they help find solutions to frequently encountered difficulties.
\end{faqboxNI}

\begin{playground}
Signals \emph{playgrounds} containing open-ended exercises---ideas and pieces of \Rlang code to play with at the \Rlang console.
I expect readers to run these examples both as is and after creating variations by editing the code, studying the output, or diagnostic messages, returned by \Rlang in each case. Playgrounds are numbered by chapter for easy reference.
\end{playground}

\begin{advplayground}
Signals \emph{advanced playgrounds}, sections that require more time to play with before grasping the concepts than regular \emph{playground} sections. Numbered by chapter together with the other playgrounds.
\end{advplayground}

\subsection{Code conventions and syntax highlighting}

Small sections of program code interspersed within the main text receive the name of \emph{code chunks}. In this book \Rlang code chunks are typeset in a typewriter font, using colour to highlight the different elements of the syntax, such as variables, functions, constant values, etc. \Rlang code elements embedded in the text are similarly typeset, but always black. For example, in the code chunk below \code{mean()} and \code{print()} are functions; 1, 5 and 3 are constant numeric values, and \code{z} is the name of a variable where the result of the computation done in the first line of code is stored. The line starting with \code{\#\#} shows what is printed or shown when executing the second statement: \code{[1] 1}. In the book \code{\#\#} is used as a marker to signal output from \Rlang; it is not part of the output. As \code{\#} is the marker for comments in the \Rlang language, prepending \code{\#} to the output makes it possible to copy and paste into the \Rlang console the whole contents of the code chunks as they appear in the book.

<<>>=
z <- mean(1, 5, 3)
print(z)
@

When explaining general concepts I use short abstract names, while for real-life examples I use descriptive names. Although not required, for clarity, I use abstract names that hint at the structure of the objects stored, such as \code{mat1} for a matrix, \code{vct4} for a vector and \code{df3} for a data frame. This convention resembles that followed by the base \Rlang documentation.

Code in playgrounds either works in isolation or, if it depends on objects created in the examples in the main text, this is mentioned within the playground. In playgrounds I use names in capital letters, so that they are distinct. The code outside playgrounds does reuse objects created earlier in the same section, and occasionally in earlier sections of the same chapter.

\subsection{Diagrams}

To describe data objects I use diagrams inspired by Joseph N. Hall's PEGS (Perl Graphical Structures) \autocite{Hall1997}. I use colour fill to highlight the type of the stored objects. I use the ``signal'' sign for the names of whole objects and of their component members, the former with a thicker border. Below is an example from chapter \ref{chap:R:as:calc}.
\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=codeshadecolor},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}},
row 1 column 1/.style={nodes={draw}}}]

\matrix[array] (array) {
1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 & & & & & & & & & \\};
\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-10.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\textcolor{blue}{\ \code{<name>}\strut}};
\draw (array-1-1.north)--++(90:3mm) node [above] (first) {First index};
\draw (array-1-10.east)--++(0:3mm) node [right]{positional indices};
\draw (array-2-10.east)--++(0:3mm) node [right]{Members or \textcolor{blue}{\code{<value>}}};
\node [align=center, anchor=south] at (array-2-9.north west|-first.south) (8) {element at index 9};
\draw (8)--(box);
%
\end{tikzpicture}
\end{footnotesize}
\end{center}

For code structure I use diagrams based on boxes and arrows, and to describe the flow of code execution, the usual flow charts.

In the different diagrams, I use the notation \textcolor{blue}{\code{<value>}}, \textcolor{blue}{\code{<statement>}}, \textcolor{blue}{\code{<name>}}, etc., as generic placeholders indicating \emph{any valid value}, \emph{any valid \Rlang statement}, \emph{any valid \Rlang name}, etc.

\section{Finding answers to problems}

\subsection{What are the options}

First of all, do not panic! Every programmer, even one with decades of experience, gets stuck with problems from time to time, and can run out of ideas for a while. This is normal, and happens to all of us.

It is important to learn how to find answers as part of the routine of using \Rlang. First of all, one can read the documentation of the function or object that one is trying to use, which in many cases also includes use examples. \Rlang's help pages tell how to use individual functions or objects. In contrast, \Rlang's manual \emph{An Introduction to R} and books describe what functions or overall approaches to use for different tasks.

Reading the documentation and books does not always help. Sometimes one can become blind to the obvious by being too familiar with a piece of code, as also happens when writing in a natural language like English. A second useful step is, thus, looking at the code with ``different eyes'': those of a friend or workmate, or your own eyes a day or a week later.

One can also seek help in specialized on-line forums or from peers or ``local experts''. If searching in forums for existing questions and answers fails to yield a useful answer, one can write a new question in a forum.

When searching for answers, asking for advice or reading books, one can be confronted with different ways of approaching the same tasks. Do not allow this to overwhelm you; in most cases it will not matter which approach you use, as many computations can be done in \Rpgrm, as in any computer language, in several different ways, still obtaining the same result. Use the alternative that you find easier to understand.

\subsection{R's built-in help}

Every object available in base \Rlang or exported by an \Rlang extension package (functions, methods, classes, data) is documented in \Rlang's help system. Sometimes a single help page documents several \Rlang objects. Not only are help pages always available, they are also structured consistently, with a title, a short description, and frequently a detailed description. In the case of functions, parameter names, their purpose and the expected arguments are always described, as well as the returned value. Usually at the bottom of help pages, several examples of the use of the objects or functions are given. How to access \Rpgrm help is described in section \ref{sec:intro:using:R} on page \pageref{sec:intro:using:R}.
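As a quick taste of the help system, the calls below (using \code{mean()} merely as an example) are among the most common ways of querying the built-in documentation from the console.

<<eval=FALSE>>=
help("mean")              # open the help page of function mean()
?mean                     # shorthand for the call above
help.search("regression") # search installed help pages by topic
citation()                # how to cite R itself in publications
@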
In addition to help pages, \Rpgrm's distribution includes useful manuals as PDF or HTML files. These manuals are also available at \url{https://rstudio.github.io/r-manuals/}, restyled for easier reading in web browsers. Many packages also contain \emph{vignettes}, such as User Guides or articles describing the algorithms used and/or containing use case examples. In the case of some packages, a web site with documentation in HTML format is also available. Package documentation can also be accessed through CRAN. The DESCRIPTION file of a package provides contact information for the maintainer, links to web sites, and instructions on how to report bugs. Similar information, plus a short description, is frequently also available in a README file.

Error messages tend to be terse in \Rpgrm, and may require some lateral thinking and/or ``experimentation'' to understand the real cause behind problems. Learning to interpret error messages is necessary to become a proficient user of \Rpgrm, so forcing errors and warnings with purposely written ``bad'' code is a useful exercise.

\subsection{Online forums}\label{sec:intro:net:help}

\subsubsection*{Netiquette}
When posting requests for help, one needs to abide by what is usually described as ``netiquette'', which in many respects also applies to asking for help in person or by e-mail from a peer or local expert. Preference among sources of information depends on what one finds easier to use. Consideration towards others' time is necessary, but has to be balanced against wasting too much of one's own time.

In\index{netiquette}\index{network etiquette} most internet forums, a certain behavior is expected from those asking and answering questions. Some types of misbehavior, like the use of offensive or inappropriate language, will usually result in the user losing writing rights in a forum. Occasional minor misbehavior usually results in the original question not being answered, and the problem instead being highlighted in a comment. In general, following the steps listed below will greatly increase your chances of getting a detailed and useful answer.

\begin{itemize}
  \item Do your homework: first search for existing answers to your question, both online and in the documentation. (Do mention that you attempted this without success when you post your question.)
  \item Provide a clear explanation of the problem, and all the relevant information. The version of \Rpgrm, the operating system, and any packages loaded and their versions can be important.
  \item If at all possible, provide a simplified and short, but self-contained, code example that reproduces the problem (sometimes called a \emph{reprex}; an illustration follows this list).
  \item Be polite.
  \item Contribute to the forum by answering other users' questions when you know the answer.
\end{itemize}
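What such a minimal, self-contained code example might look like is sketched below. It is invented for this purpose, not taken from an actual forum post: it creates the artificial data it needs and triggers the surprising behavior being asked about.

<<eval=FALSE>>=
# Why is the computed mean NA?
df1 <- data.frame(x = 1:5, y = c(2, 4, NA, 8, 10))
mean(df1$y)                # returns NA, not the mean of the four values
mean(df1$y, na.rm = TRUE)  # a likely answer to such a question
@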
\begin{explainbox}
Carefully preparing a reproducible example\index{reproducible example} (``reprex'') is crucial. A \emph{reprex} is a self-contained and as simple as possible piece of computer code that triggers (and so demonstrates) a problem. If possible, when data are needed, a data set included in base \Rpgrm or artificial data generated within the reprex code should be used. If the problem can only be reproduced with one's own data, then one needs to provide a minimal subset of the data that still triggers the problem.

While preparing a \emph{reprex} one has to simplify the code, and sometimes this step alone makes the nature of the problem clear. Always, before posting a reprex online, check it with the latest versions of \Rpgrm and of any packages being used. If sharing data, be careful about confidential information and either remove or mangle it.

I must say that about two out of three times I prepare a \emph{reprex}, it allows me to find the root of the problem and a solution or a work-around on my own. Preparing a \emph{reprex} takes some effort, but it is worthwhile even if it ends up not being posted on-line.

\Rlang package \pkgname{reprex} and its \RStudio add-in simplify the creation of reproducible code examples, by creating and copying to the clipboard a reprex encoded in \Markdown, ready to be pasted into a question at \stackoverflow or into an issue at \GitHub. See \url{https://reprex.tidyverse.org/} for details.
\end{explainbox}

\subsubsection*{StackOverflow}

Nowadays, \stackoverflow (\url{http://stackoverflow.com/}) is the best question-and-answer (Q\,\&\,A) support site for \Rpgrm. Within the \stackoverflow site there is an \Rlang collective. In most cases, searching for existing questions and their answers will be all that you need to do. If asking a question, make sure that it is really a new question. If there is some question that looks similar, make clear how your question is different.

\stackoverflow has a user-rights system based on reputation, and questions and answers can be up- and down-voted. Questions with the most up-votes are listed at the top of searches, and the most voted answers to each question are also displayed first. Whoever asks a question is expected to accept correct answers. If the questions or answers one writes are up-voted, one gains reputation (expressed as a number). As one accumulates reputation, one gains badges and additional rights, such as editing other users' questions and answers or, later on, even deleting wrong answers or off-topic questions from the system. This sounds complicated, but it works extremely well at ensuring that the base of questions and answers is relevant and correct, without relying heavily on nominated \emph{moderators}. When using \stackoverflow, do contribute by accepting correct answers, up-voting questions and answers that you find useful, down-voting those you consider poor, and flagging or correcting errors you may discover.

Being careful in the preparation of a reproducible example\index{reproducible example}\index{reprex|see{reproducible example}} is important both when asking a question at \stackoverflow and when reporting a bug to the maintainer of any piece of software. For the question to be reliably answered, or the problem to be fixed, the person answering the question needs to be able to reproduce the problem and, after modifying the code, needs to be able to test whether the problem has been solved or not.
However, even if you are facing a problem caused by your own misunderstanding of how \Rlang works, the simpler the example, the more likely that someone will quickly realize what your intention was when writing the code that produces a result different from what you expected. Even when it is not possible to create a reprex, one should clearly ask only one thing per question.

The code of conduct (\url{https://stackoverflow.com/conduct}) and the help pages that explain the expected behavior (\url{https://stackoverflow.com/help}) are available at the site and are worthwhile reading before using the site actively for the first time.

\subsubsection*{Contacting the author}

The best way to get in contact with me about this book is by raising an issue at \url{https://github.com/aphalo/learnr-book-crc/issues}. Issues can be used to ask support questions related to the book, to report mistakes, and to suggest changes to the text, diagrams and/or example code. Edits to the manuscript of this book can be submitted as pull requests.

Issues are raised by filling in an on-line form, at a web page that also contains brief instructions. GitHub issues are a very efficient way of keeping track of corrections that need to be done. As support questions usually reveal unclear explanations or other problems, raising issues to ask them facilitates the task of improving the book and keeping it up-to-date.

\section{Further reading}

At the end of each chapter, a section like this one gives suggestions for further reading on related subjects. To understand what programming as an activity is, you can read \citetitle{Burns2012} \autocite{Burns2012}. It will make learning to program in \Rlang easier, both practically and emotionally. In \citeauthor{Burns2012}'s words, ``This is a book about what goes on in the minds of programmers''.

% !Rnw root = appendix.main.Rnw

<<>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'plotting-chunk')
@

\chapter{R Extensions: Grammar of Graphics}\label{chap:R:plotting}

\begin{VF}
The commonality between science and art is in trying to see profoundly---to develop strategies of seeing and showing.

\VA{Edward Tufte's answer to Charlotte Thralls}{\emph{An Interview with Edward R.\ Tufte}, 2004}\nocite{Zachry2004}
\end{VF}

%\dictum[Edward Tufte]{The commonality between science and art is in trying to see profoundly---to develop strategies of seeing and showing.}

\index{geometries ('ggplot2')|see{grammar of graphics, geometries}}
%\index{geom@\texttt{geom}|see{grammar of graphics, geometries}}
%\index{functions!geom@\texttt{geom}|see{grammar of graphics, geometries}}
\index{statistics ('ggplot2')|see{grammar of graphics, statistics}}
%\index{stat@\texttt{stat}|see{grammar of graphics, statistics}}
%\index{functions!stat@\texttt{stat}|see{grammar of graphics, statistics}}
\index{scales ('ggplot2')|see{grammar of graphics, scales}}
%\index{scale@\texttt{scale}|see{grammar of graphics, scales}}
%\index{functions!scale@\texttt{scale}|see{grammar of graphics, scales}}
\index{coordinates ('ggplot2')|see{grammar of graphics, coordinates}}
\index{themes ('ggplot2')|see{grammar of graphics, themes}}
%\index{theme@\texttt{theme}|see{grammar of graphics, themes}}
%\index{functions!theme@\texttt{theme}|see{grammar of graphics, themes}}
\index{facets ('ggplot2')|see{grammar of graphics, facets}}
\index{annotations ('ggplot2')|see{grammar of graphics, annotations}}
\index{aesthetics ('ggplot2')|see{grammar of graphics, aesthetics}}

\section{Aims of this chapter}

Three main data plotting systems are available to \Rlang users: base \Rlang, package \pkgname{lattice} \autocite{Sarkar2008} and package \pkgname{ggplot2} \autocite{Wickham2016}, the last one being the most recent and currently the most popular system available in \Rlang for plotting data. There are even two different sets of graphics primitives (i.e., those used to produce the simplest graphical elements, such as lines and symbols) available in \Rlang: those in base \Rlang and a newer set in the \pkgname{grid} package \autocite{Murrell2011}.

In this chapter you will learn the concepts of the layered grammar of graphics, on which package \pkgname{ggplot2} is based. You will also learn how to build several types of data plots with package \pkgname{ggplot2}. As a consequence of the popularity and flexibility of \pkgname{ggplot2}, many contributed packages extending its functionality have been developed and deposited in public repositories. However, I will focus mainly on package \pkgname{ggplot2}, only briefly describing a few of these extensions.

\section{Packages used in this chapter}

<<>>=
citation(package = "ggplot2")
@

If the packages used in this chapter are not yet installed on your computer, you can install them as shown below, as long as package \pkgname{learnrbook} is already installed.

<<eval=FALSE>>=
install.packages(learnrbook::pkgs_ch_ggplot)
@

To run the examples included in this chapter, you first need to load some packages from the library (see section \ref{sec:script:packages} on page \pageref{sec:script:packages} for details on the use of packages).

<<>>=
library(learnrbook)
library(scales)
library(ggplot2)
library(ggrepel)
library(gginnards)
library(broom)
library(ggpmisc)
library(ggbeeswarm)
library(lubridate)
library(tibble)
library(dplyr)
library(patchwork)
@

<<>>=
theme_set(theme_gray(14))
@

<<>>=
# set to TRUE to test non-executed code chunks and rendering of plots
eval_plots_all <- FALSE
@

\section{The components of a plot}
I\index{data visualization!concepts} start by briefly presenting concepts central to data visualisation, following the \citetitle{Koponen2019} \autocite{Koponen2019}. Plots, like text, are a medium used to convey information.
It is worthwhile keeping this in mind. As with text, the design of plots needs to consider what needs to be highlighted to convey the take-home message. The style of a plot should match the expectations and the plot-reading abilities of the intended audience. One needs to be careful to avoid ambiguities and, most importantly of all, not to misinform. Data visualisations, like text, need to be planned, revised, commented upon, and revised again, until the best way of expressing the message is found. The flexibility of the grammar of graphics supports this approach to designing and producing high-quality data visualizations for different audiences very well.

Of course, when exploring data, fancy details of graphical design are irrelevant, but flexibility remains important, as it makes it possible to look at data from many differing angles, highlighting different aspects of them. In the same way as boiler-plate text and text templates have specific but limited uses, all-in-one functions for producing plots do not support the design of original data visualizations well. They tend to get the job done, but lack the flexibility needed to do the best job of communicating information. This being a book about languages, the focus of this chapter is on the layered grammar of graphics.

The plots described in this chapter are classified as \emph{statistical graphics}\index{statistical graphics} within the broader field of data visualisation. Plots such as scatter plots include points (geometric objects) that by their position, shape, colour or some other property directly convey information. The location of these points in the plot ``canvas'' or ``plotting area'', given by the values of their $x$ and $y$ coordinates, describes properties of the data. Any deviation from the expected mapping of observations to coordinates is misleading, because it conveys wrong or false information to the audience.

A \emph{data label}\index{data visualization!data labels} is connected to an observation, but its position can be displaced as long as its link to the corresponding observation can be inferred, e.g., from the direction of an arrow or even simple proximity. Data labels provide ancillary information, such as the name of a gene or place.

\emph{Annotations}\index{data visualization!annotations} are additions to a plot that have no connection to individual observations, but rather to all the observations taken together, e.g., a text like ``n = 200'' indicating the number of observations, usually included in a corner or margin of the plot free of observations.

Axis and tick labels, legends and keys make it possible for the reader to retrieve the original values represented in the plot as graphical elements. Other features of visualisations, even when not carrying additional information, affect the ease with which a plot can be read, and its accessibility to readers with visual constraints such as colour blindness. These features include the size of text and symbols, the thickness of lines, the choice of font face, the choice of colour palette, etc.

Because of the different lengths of time available for the audience to interact with visualizations, plots designed to be included in books and journals are, in general, unsuitable for oral presentations, and vice versa. It is important to keep in mind the role played by plots in informing the audience, and what information can be expected to be of interest to different audiences and in different situations.
The grammar of graphics and its extensions provide enough flexibility to tailor the design of plots to different uses, and also to easily create variations of a given plot.

\section{The grammar of graphics}\label{sec:plot:intro}
\index{grammar of graphics!elements|(}
What separates \ggplot from base \Rlang and trellis/lattice plotting functions is the use of a layered grammar of graphics\index{grammar of graphics} (the reason behind ``gg'' in the name of package \pkgname{ggplot2}). What is meant by grammar in this case is that plots are assembled piece by piece using different ``nouns'' and ``verbs'' \autocite{Cleveland1985,Wickham2010}. Instead of using a single function with many arguments, plots are assembled by combining different elements with operators \code{+} and \verb|%+%|. Furthermore, the construction is mostly semantics-based and, to a large extent, how plots look when printed, displayed, or exported to a bitmap or vector-graphics file is controlled by themes.

Plotting can be thought of as translating or mapping the observations or data into a graphical language. Properties of graphical (or geometrical) objects are used to represent different aspects of the data. An observation can consist of multiple recorded values. Say, an observation of air temperature may be defined by a position in 3-dimensional space and a point in time, in addition to the temperature itself. An observation of the size and shape of a plant can consist of height, stem diameter, number of leaves, size of individual leaves, length of roots, fresh mass, dry mass, etc. For example, an effective way of studying and/or communicating the relationship between height and stem diameter in plants is to plot observations as points using cartesian coordinates\index{grammar of graphics!cartesian coordinates}, \emph{mapping} stem diameter to the $x$ axis and height to the $y$ axis.

The grammar of graphics makes it possible to design plots by combining various elements in ways that are nearly orthogonal. In other words, the majority of the possible combinations of ``words'' yield valid plots, as long as the rules of the grammar are respected. This flexibility makes \ggplot extremely powerful, as types of plots not considered when package \pkgname{ggplot2} was designed can be easily created.

\begin{warningbox}
When a ggplot is built, the whole plot and its components are created as \Rlang objects that can be saved in the workspace or written to a file. These objects encode a recipe for constructing the plot, not its final graphical representation. The graphical representation is generated when the object is printed, explicitly or not. Thus, the same \code{"gg"} plot object can be rendered into different bitmap and vector graphic formats for display and/or printing.
\end{warningbox}

The transformation of a set of data or observations into a rendered graphic with package \pkgname{ggplot2} can be represented as a flow of information, but also as a sequence of actions. What prevents this flexibility from becoming a burden on users is that, in most cases, adequate defaults are used when the user does not provide explicit ``instructions''.
The recipe to build a plot needs to specify a) the data to use, b) which variable to map to which graphical property (or aesthetic), c) which layers to add and which geometric representation to use, d) the scales that establish the link between data values and aesthetic values, e) a coordinate system (affecting only aesthetics $x$, $y$ and possibly $z$), and f) a theme to use. The result of constructing a plot object using the grammar of graphics is an \Rlang object containing a ``recipe for a plot'', including the data, which behaves similarly to other \Rlang objects.

\subsection{The words of the grammar}
Before building a plot step by step, I introduce the different components of a ggplot recipe, or the words of the grammar of graphics.

\paragraph{Data}
The\index{grammar of graphics!data} data to be plotted must be available as a \code{data.frame} or \code{tibble}, with the data stored so that each row represents a single observation event, and the columns are the different values observed in that single event. In other words, in long form (so-called ``tidy data'') as described in chapter \ref{chap:R:data}. The variables to be plotted can be \code{numeric}, \code{factor}, \code{character}, and time or date stored as \code{POSIXct}. (Some extensions to \pkgname{ggplot2} add support for other types of data, such as time series.)

\paragraph{Mapping}
When\index{grammar of graphics!mapping of data} constructing a plot, data variables have to be mapped to aesthetics\index{plots!aesthetics} (or graphic properties). Most plots will have an $x$ dimension, which is one of the \emph{aesthetics}, and a variable containing numbers (or categories) mapped to it. The position on a 2D plot of, say, a point, will be determined by the $x$ and $y$ aesthetics, while in a 3D plot three aesthetics, $x$, $y$ and $z$, need to be mapped. Many aesthetics are not related to coordinates; they are properties, like color, size, shape, line type, or even rotation angle, each adding an additional dimension on which to represent the values of variables and/or constants.

\paragraph{Statistics}
Statistics\index{grammar of graphics!statistics} are ``words'' that represent the calculation of summaries or some other operation on the values in the data. When \emph{statistics} are used for a computation, the returned value is passed to a \emph{geometry}, and consequently adding a \emph{statistic} also adds a layer to the plot. For example, \ggstat{stat\_smooth()} fits a smoother, and \ggstat{stat\_summary()} applies a summary function such as \code{mean()}. Most statistics are applied by group when the data have been grouped by mapping additional aesthetics, such as color, to a factor.

\paragraph{Geometries}
\sloppy%
Geometries\index{grammar of graphics!geometries} are ``words'' that describe the graphical representation of the data: for example, \gggeom{geom\_point()} plots a point or symbol for each observation or summary value, while \gggeom{geom\_line()} draws line segments between observations. Some geometries rely by default on statistics, but most ``geoms'' default to the identity statistic. Each time a \emph{geometry} is used to add a graphical representation of data to a plot, one says that a new \emph{layer} has been added. The\index{plots!layers} grammar of graphics allows plots to contain multiple layers.
The name \emph{layer} reflects the fact that each new layer added is plotted on top of the layers already present in the plot, or rather, when a plot is printed the layers are generated in the order in which they were added to the plot object. For example, one layer in a plot can display the observations, another layer a regression line fitted to them, and a third one may contain annotations, such as an equation or a text label.

\paragraph{Positions}
Positions\index{grammar of graphics!positions} are ``words'' that determine whether, and how, graphical plot elements are displaced relative to their original $x$ and $y$ coordinates. They are one of the arguments accepted by \emph{geometries}. Position \ggposition{position\_identity()} introduces no displacement, while, for example, \ggposition{position\_stack()} makes it possible to create stacked bar plots and stacked area plots. Positions will be discussed together with geometries, as they are always subordinate to them.

\paragraph{Scales}
Scales\index{grammar of graphics!scales} give the ``translation'' or mapping between data values and the aesthetic values to be actually plotted. Mapping a variable to the ``color'' aesthetic (also recognized when spelled ``colour'') only tells that different values stored in the mapped variable will be represented by different colors. A scale, such as \ggscale{scale\_color\_continuous()}, will determine which color in the plot corresponds to which value in the variable. Scales can also define transformations on the data, which are used when mapping data values to aesthetic values. All continuous scales support transformations---e.g., in the case of the $x$ and $y$ aesthetics, positions on the plotting region or graphic viewport will be affected by the transformation, while the original values are used for the tick labels along the axes or in keys for shapes, colors, etc. Scales are used for all aesthetics, both for continuous variables, such as numbers, and for categorical ones, such as factors. The grammar of graphics allows only one scale per \emph{aesthetic} and plot. This restriction is imposed by design to avoid ambiguity (e.g., it ensures that the red color will have the same ``meaning'' in all plot layers where the \code{color} \emph{aesthetic} is mapped to data). Scales have limits that are set automatically unless supplied explicitly.

\paragraph{Coordinate systems}
The\index{grammar of graphics!coordinates} most frequently used coordinate system when plotting data, the cartesian system, is the default for most \emph{geometries}. In the cartesian system, $x$ and $y$ are represented as distances on two orthogonal (at 90$^\circ$) axes. Additional coordinate systems are available in \pkgname{ggplot2} and through extensions. For example, in the polar system of coordinates, the $x$ values are mapped to angles around a central point and the $y$ values to the radius. Setting limits to a coordinate system changes the region of the plotting space visible in the plot, but does not discard observations. In other words, observations hidden by coordinate limits, i.e., not visible in the rendered plot, are still included in the computations done by \emph{statistics}, while observations excluded by scale limits are ignored.

\paragraph{Themes}
How\index{grammar of graphics!themes} plots look when displayed or printed can be altered by means of themes. A plot can be saved without adding a theme and then printed or displayed using different themes.
Also, individual theme elements can be changed, and whole new themes defined. This adds a lot of flexibility and helps in separating the data-representation aspects from those related to graphical design.

\paragraph{Operators}
The\index{grammar of graphics!operators} elements described above are assembled into a ggplot object using operator \Roperator{+} and, exceptionally, \Roperator{\%+\%}. The choice of these operators makes sense, as ggplot objects are built by sequentially adding members or elements to them.
\index{grammar of graphics!elements|)}

\begin{warningbox}
The functions corresponding to the different elements of the grammar of graphics have distinctive names, with the first few letters hinting at their roles: aesthetic mappings (\code{aes}), geometric elements (\code{geom\_\ldots}), statistics (\code{stat\_\ldots}), scales (\code{scale\_\ldots}), coordinate systems (\code{coord\_\ldots}), and themes (\code{theme\_\ldots}).
\end{warningbox}

\subsection{The workings of the grammar}\label{sec:plot:workings}
\index{grammar of graphics!plot structure|(}
\index{grammar of graphics!plot workings|(}
A \code{"gg"} plot object is an \Rlang object of mode \code{"list"} containing the recipe and the data needed to construct a plot. It is self-contained in the sense that the only requirement for rendering it into a graphical representation is the availability of package \pkgname{ggplot2}. A \code{"gg"} object contains the data in one or more data frames, and instructions encoded as functions and parameters, but not yet a rendering of the plot into graphical objects. Both the data transformations and the rendering of the plot into drawing instructions (encoded as graphical objects or \emph{grobs}) take place at the time of printing or exporting the plot, e.g., when saving a bitmap to a file.

To understand ggplots, one should first think in terms of the graphical organization of the plot: there is always a background layer onto which other layers, composed of different graphical objects, are laid. Each layer contains related graphical objects originating from the same data. The last layer added is the topmost and the first one added the lowermost. Graphical objects in upper layers occlude those in the layers below them if their locations overlap. Although the layers in a ggplot frequently share the same data and the same mappings to aesthetics, this is not a requirement. It is possible to build ggplots with independent layers, although always with shared scales and plotting area.

%%% Drawing of a plot with layers

A second perspective on ggplots is that of the process of converting the data into a graphical representation that can be printed on paper or viewed on a computer screen. The transformations applied to the data to achieve this can be thought of as a data-flow process divided into stages. The diagram in Figure \ref{fig:ggplot:stages} represents a single self-contained layer in a plot. The data supplied by the user are transformed in stages into instructions to draw a graphical representation. In \pkgname{ggplot2} and its documentation, graphical features are called \emph{aesthetics}, with the correspondence between the values in the data and the values of an aesthetic controlled by \emph{scales}. The values in the data are summarized by \emph{statistics}. However, when no summaries are needed, layers make use of \Rfunction{stat\_identity()}, which copies its input to its output unchanged.
\emph{Geometries} provide the ``recipe'' used to generate graphical objects from the mapped data.\vspace{2ex}

\begin{figure}
{\sffamily
\centering
\resizebox{\linewidth}{!}{%
  \begin{tikzpicture}[auto]
    \node [b] (data) {layer\\ data};
    \node [cc, right = of data] (mapping1) {\textbf{start}};
    \node [b, right = of mapping1] (statistic) {statistic};
    \node [cc, right = of statistic] (mapping2) {\textbf{after\\ stat}};
    \node [b, right = of mapping2] (geometry) {geometry + scale};
    \node [cc, right = of geometry] (mapping3) {\textbf{after\\ scale}};
    \node [b, right = of mapping3] (render) {layer\\ grobs};

    \path [ll] (mapping1) -- (data) node[near end,above]{a};
    \path [ll] (statistic) -- (mapping1) node[near end,above]{b};
    \path [ll] (mapping2) -- (statistic) node[near end,above]{c};
    \path [ll] (geometry) -- (mapping2) node[near end,above]{d};
    \path [ll] (mapping3) -- (geometry) node[near end,above]{e};
    \path [ll] (render) -- (mapping3) node[near end,above]{f};
  \end{tikzpicture}}}
  \caption[Stages of data flow in a ggplot layer]{Abstract diagram of the data transformations in a ggplot layer, showing the stages at which mappings between variables and graphic aesthetics take place.}\label{fig:ggplot:stages}
\end{figure}

Function \Rfunction{aes()} is used to define mappings to aesthetics. The default for \Rfunction{aes()} is for the mapping to take place at the \textbf{start} (leftmost circle in the diagram above), mapping names in the user data to aesthetics such as $x$, $y$, colour, shape, etc. The statistic can alter the mapped data, but in most cases not which aesthetics they are mapped to. Statistics can add default mappings for additional aesthetics. In addition, the default mappings of the data returned by the statistic can be modified by user code at this later stage, \textbf{after stat}. Default mappings can be modified again at the \textbf{after scale} stage.

\begin{explainbox}
Statistics always return a mapping to the same aesthetics that they require as input. However, the values mapped to these aesthetics at the \textbf{after stat} stage are in most cases different from those at \textbf{start}. Many statistics return additional variables, which are not mapped by default to any aesthetic. These variables facilitate variations on how the results from a given type of data summary are added to plots, including the use of a geometry different from the default set by the statistic. In this case the user has to override the default mappings at the \textbf{after stat} stage. The additional variables returned by statistics are listed in their documentation. (See section \ref{sec:plot:mappings} on page \pageref{sec:plot:mappings} for details.)
\end{explainbox}

\begin{warningbox}
As mentioned above, all ggplot layers include a statistic and a geometry. From the perspective of the construction of a plot using the grammar, both \code{stats} and \code{geoms} are layer constructor functions. While \code{stats} take a \code{geom} as one of their arguments, \code{geoms} take a \code{stat} as one of their arguments. Thus, in both cases a \code{stat} and a \code{geom} are added as a layer, and their role and position in the data flow remain the same, i.e., the diagram in Figure \ref{fig:ggplot:stages} applies independently of how the layers are added to the plot. The default statistic of many geometries is \ggstat{stat\_identity()}, making them behave, when added to a plot, as if the layer they create contained no statistic.
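This symmetry is easy to see in practice. Using the \code{mtcars} data from the examples that follow, the two statements below are equivalent ways of adding the same layer, as each function supplies the other member of the pair as its default (a sketch relying on defaults for all unstated arguments).

<<eval=FALSE>>=
# a geometry using its default statistic, "smooth"
ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_smooth(method = "lm", formula = y ~ x)
# the companion statistic using its default geometry, also "smooth"
ggplot(mtcars, aes(x = disp, y = mpg)) +
  stat_smooth(method = "lm", formula = y ~ x)
@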
\end{warningbox}

There are some statistics in \pkgname{ggplot2} that have companion geometries that can be used (almost) interchangeably. This tends to lead to confusion, and so in this book only geometries that have \ggstat{stat\_identity()} as their default are described as geometries, in section \ref{sec:plot:geometries}. Those that by default use other statistics, like \gggeom{geom\_smooth()}, are described through the companion statistic only, \ggstat{stat\_smooth()} in this example, in section \ref{sec:plot:statistics}.

A ggplot can have a single layer or many layers, but when a ggplot has more than one layer, the data flow, computations and generation of graphical objects take place independently for each layer. As mentioned above, most ggplots do not have fully independent layers; rather, the layers share the same data and aesthetic mappings at the \textbf{start}. Past this point, computations in a layer are always independent of those in other layers, except that for a given aesthetic only one scale is allowed per plot.

\index{grammar of graphics!plot workings|)}
\index{grammar of graphics!plot structure|)}

\subsection{Plot construction}
\index{grammar of graphics!plot construction|(}

As the use of the grammar is easier to demonstrate by example than to explain in words, I will show how to build plots of increasing complexity, starting from the simplest possible. All elements of a plot have defaults, although in some cases these defaults result in empty plots. Defaults make it possible to create a plot very succinctly. When building a plot step by step, the different viewpoints described in the previous section are relevant: the static structure of the plot's \Rlang object, the final graphic output, and the transformations that the data undergo ``in transit'' from the recipe stored in an object to the graphic output. In this section I emphasize the syntax of the grammar and how it translates into a plot.

Function \code{ggplot()} by default constructs an empty plot. This is similar to how \code{character()}, \code{numeric()}, etc., construct empty vectors. This empty skeleton of a plot is displayed, when printed, as a grey rectangle.

<<>>=
ggplot()
@

A data frame passed as an argument to \code{data}, without adding a mapping, results in the same empty grey rectangle (not shown). Data frame \Rdata{mtcars} is a data set included in \Rlang (to read a description, type \code{help("mtcars")} at the \Rlang command prompt).

<<>>=
ggplot(data = mtcars)
@

Once the data are available, a graphical or geometric representation needs to be selected. The geometry used, such as \code{geom\_point()} or \code{geom\_line()}, drawing separate points for the observations or connecting them with lines, respectively, defines the type of plot. A mapping defines which property of the geometric elements will be used to represent the values of a variable in the user's data. Most geometries require mappings to both the $x$ and $y$ aesthetics, as these establish the position of the geometrical shapes, like points or lines, in the plotting area. Additional aesthetics, like color, make use of default scales and palettes. These defaults can be overridden with \code{scale} functions added to the plot (see section \ref{sec:plot:scales}).

Mapping, at the \textbf{start} stage, \code{disp} to the $x$ and \code{mpg} to the $y$ aesthetic makes the ranges of the values available. They are used to find default limits for the $x$ and $y$ scales, as reflected in the plot axes.
The $x$ and $y$ extents of the plotting area now match the ranges of the mapped variables, expanded by a small margin. The axis labels also reflect the names of the mapped variables; however, there are no graphical elements yet displayed for the individual observations.% ({\small\textsf{data $\to$ aes $\to$ \emph{ggplot object}}})

<<>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg))
@

Observations are made visible by the addition of a suitable \emph{geometry} or \code{geom} to the plot recipe. Below, adding \gggeom{geom\_point()} makes the observations visible as points or symbols. %({\small\textsf{data $\to$ aes $\to$ geom $\to$ \emph{ggplot object}}})

<<>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point()
@

\begin{warningbox}
In the examples above, the plots were printed automatically, which is the default at the \Rlang console. However, as with other \Rlang objects, ggplots can be assigned to a variable.

<<>>=
p1 <- ggplot(data = mtcars,
             mapping = aes(x = disp, y = mpg)) +
  geom_point()
@

and printed at a later time, as well as saved to and read from files on disk.

<<>>=
print(p1)
@

Layers and other elements can also be added to a saved ggplot, as the saved objects are not the graphical representations of the plots themselves, but instead a \emph{recipe}, plus the data, needed to build them.
\end{warningbox}

\begin{advplayground}
As for any \Rlang object, \code{str()} displays the structure of \code{"gg"} objects. In addition, package \pkgname{ggplot2} provides a \code{summary()} method for \code{"gg"} plot objects.

As you make progress through the chapter, use these methods to explore the \code{"gg"} plot objects you construct, paying attention to layers, and to global vs.\ layer-specific data and mappings. You will learn how the plot components are stored as members of \code{"gg"} plot objects.
\end{advplayground}

Although \emph{aesthetics} are usually mapped to variables in the data, constant aesthetic values can be passed as arguments to layer functions, consistently controlling a property of all the elements in a layer. While variables in \code{data} can be mapped using \code{aes()} either as whole-plot defaults, as shown above, or within individual layers, constant values for aesthetics have to be set, as shown here, as named arguments passed directly to layer functions, instead of to a call to \code{aes()}.

<<>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(color = "blue", shape = "square")
@

\begin{warningbox}
Mapping an aesthetic to a constant value within a call to \Rfunction{aes()} adds a column containing this value to the data frame received as input by the \code{stat}. This value is not interpreted as an aesthetic value, but instead as a data value. Below is the plot above, but using a call to \Rfunction{aes()}.

<<>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(mapping = aes(color = "blue", shape = "square"))
@

The plot contains red circles instead of blue squares! In principle, one could correct this plot by adding suitable \code{scales}, but this would still be wasteful, as many copies of the constant \code{"blue"} would be unnecessarily stored in the \code{"gg"} plot object.
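For completeness, a sketch of such a correction using identity scales, which pass the mapped values through unchanged, is shown below; setting the constants directly in the layer, as shown earlier, remains the preferable approach.

<<eval=FALSE>>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(mapping = aes(color = "blue", shape = "square")) +
  scale_color_identity() +  # use mapped values as colors, as-is
  scale_shape_identity()    # use mapped values as shapes, as-is
@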
\end{warningbox}

While a geometry directly constructs, during rendering, a graphical representation of the observations or summaries in the data it receives as input, a \emph{statistic} or \code{stat} ``sits'' in-between the data and a \code{geom}, applying some computation, which usually, but not always, produces a statistical summary of the data. Here \ggstat{stat\_smooth()} fits a linear regression (see section \ref{sec:stat:LM:regression} on page \pageref{sec:stat:LM:regression}) and passes the resulting predicted values to \gggeom{geom\_line()}. Passing \code{method = "lm"} selects \code{lm()} as the model-fitting function. Passing \code{formula = y ~ x} sets the model to be fitted. This plot has two layers, one from geometry \gggeom{geom\_point()} and one from geometry \gggeom{geom\_line()}.%({\small\textsf{data $\to$ aes $\to$ stat $\to$ geom $\to$ \emph{ggplot object}}})

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)
@

The plots above relied on defaults for \emph{scales}, \emph{coordinates} and \emph{themes}. In the examples below the defaults are overridden by arguments that produce differently rendered plots. Adding \ggscale{scale\_y\_log10()} applies a logarithmic transformation to the values mapped to $y$. This works like plotting using graph paper with rulings spaced according to a logarithmic scale. Tick marks continue to be expressed in the original units, but statistics are applied to the transformed data. In other words, the transformation specified in the scale affects the values in advance of the \textbf{start} stage, before they are mapped to aesthetics and passed to \emph{statistics}. Thus, in this example the linear regression is fitted to \code{log10()}-transformed $y$ values and the original $x$ values.%({\small\textsf{data $\to$ aes $\to$ stat $\to$ geom $\to$ scale $\to$ \emph{ggplot object}}})

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  scale_y_log10()
@

The range limits of a scale can be set manually, instead of automatically as by default. These limits create a virtual \emph{window into the data}: out-of-bounds (oob) observations, those outside the scale limits, remain hidden and are not mapped to aesthetics---i.e., these observations are not included in the graphical representation or used in calculations. Crucially, when using \emph{statistics} the computations are only applied to observations that fall within the limits of all scales in use. These limits \emph{indirectly} affect the plotting area when the plotting area is automatically set based on the range of the (within-limits) data---even the mapping to values of a different aesthetic may change when a subset of the data is selected by manually setting the limits of a scale.

In contrast to \emph{scale limits}, \emph{coordinates}\index{grammar of graphics!cartesian coordinates} function as a \emph{zoomed view} into the plotting area, and do not affect which observations are visible to \emph{statistics}. The coordinate system, as expected, is also determined by this grammar element---below, Cartesian coordinates, which are the default, are added, but setting $y$ limits overrides the default ones.
%({\small\textsf{data $\to$ aes $\to$ stat $\to$ geom $\to$ coordinate $\to$ theme $\to$ \emph{ggplot object}}})

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  coord_cartesian(ylim = c(15, 25))
@

The next example uses a coordinate system transformation. When the transformation is applied to the coordinate system, it affects only the plotting---it sits between the \code{geom} and the rendering of the plot. The transformation is applied to the values that were returned by \emph{statistics}. The fitted straight line is plotted on the transformed coordinates as a curve, because the model was fitted to the untransformed data, yielding untransformed predicted values. The coordinate transformation is applied to these predicted values before plotting. (Other coordinate systems are described in sections \ref{sec:plot:sf} and \ref{sec:plot:circular} on pages \pageref{sec:plot:sf} and \pageref{sec:plot:circular}, respectively.)

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  coord_trans(y = "log10")
@

Themes affect the rendering of plots at the time of printing---they can be thought of as style sheets defining the graphic design. A complete theme can override the default gray theme. The plot is the same, the observations are represented in the same way, the limits of the axes are the same and all text is the same. On the other hand, how these elements are rendered by different themes can be drastically different.% ({\small\textsf{data $\to$ aes $\to$ geom $\to$ theme $\to$ \emph{ggplot object}}})

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  theme_classic()
@

Both the base font size and the base font family can be changed. The base font size controls the size of all text elements, as other sizes are defined relative to the base size. How the plot looks changes when using the same theme as in the previous example, but with a different base point size and font family for text elements. (The use of themes is discussed in section \ref{sec:plot:themes} on page \pageref{sec:plot:themes}.)

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  theme_classic(base_size = 20, base_family = "serif")
@

How to set axis labels, tick positions and tick labels will be discussed in depth in section \ref{sec:plot:scales} on page \pageref{sec:plot:scales}. Function \code{labs()} is \emph{a convenience function} used to set the title and subtitle of a plot and to replace the default \code{name} of scales, in this case, those used for axis labels---the default \code{name} of a scale is the name of the mapped variable. In the call to \code{labs()} the names of aesthetics are used as if they were formal parameters, with character strings or \Rlang expressions as arguments. Below, \code{x} and \code{y} are the names of the two \emph{aesthetics} to which two variables in \code{data} were mapped, \code{disp} and \code{mpg}. Formal parameters \code{title} and \code{subtitle} add these plot elements. (The escaped character \verb|\n| stands for new line, see section \ref{sec:calc:character} on page \pageref{sec:calc:character}.)
<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  labs(x = "Engine displacement (cubic inches)",
       y = "Fuel use efficiency\n(miles per gallon)",
       title = "Motor Trend Car Road Tests",
       subtitle = "Source: 1974 Motor Trend US magazine")
@

As elsewhere in \Rlang, when a value is expected, either a value stored in a variable or a more complex statement returning a suitable value can be passed as an argument to be mapped to an \emph{aesthetic}. In other words, the values to be plotted do not need to be stored as variables (or columns) in the data frame passed as an argument to parameter \code{data}; they can also be computed from these variables. Below, miles per gallon, \code{mpg}, is plotted against the engine displacement per cylinder by dividing \code{disp} by \code{cyl} within the call to \code{aes()}.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp / cyl, y = mpg)) +
  geom_point()
@

Each of the elements of the grammar exemplified above is implemented in multiple functions, and in addition these functions accept arguments that can be used to modify their behavior. Multiple data objects as well as multiple mappings can coexist within a single \code{"gg"} plot object. Packages and user code can define new \emph{geometries}, \emph{statistics}, \emph{scales}, \emph{coordinates} and even implement new \emph{aesthetics}. Individual elements in a \emph{theme} can be modified and new complete \emph{themes} created, re-used and shared. I describe below how to use the grammar of graphics to construct different types of data visualizations, both simple and complex. Because the different elements interact, I introduce some of them first briefly in sections other than where I describe them in depth.
\index{grammar of graphics!plot construction|)}

\subsection{Plots as \Rlang objects}\label{sec:plot:objects}
\index{grammar of graphics!plots as R objects|(}
\code{"gg"} plot objects and their components behave as other \Rlang objects. Operators and methods for the \code{"gg"} class are available. As above, a \code{"gg"} plot object saved as \code{p1} is used below.

<>=
@

In the previous section operator \code{+} was used to assemble the plots from ``anonymous'' \Rlang objects. Saved or ``named'' objects can also be combined with \code{+}.

<>=
p1 + stat_smooth(geom = "line", method = "lm", formula = y ~ x)
@

Above, plot elements were added one by one, with operator \code{+}. Multiple components can also be added in a single operation. Like individual components, sets of components stored in a list can be saved in a variable and added to multiple plots. This ensures consistency and makes coordinated alterations to a set of plots easier. \emph{Throughout this chapter I use this approach to achieve conciseness and to highlight what is different and what is not among plots in related examples.}

<>=
p.ls <- list(
  stat_smooth(geom = "line", method = "lm", formula = y ~ x),
  scale_y_log10())
@

<>=
p1 + p.ls
@

\begin{playground}
  Reproduce the examples in the previous section, using \code{p1} defined above as a basis instead of building each plot from scratch.
\end{playground}

\begin{warningbox}
\index{grammar of graphics!structure of plot objects|(}
The separation of plot construction and rendering is possible because \code{"gg"} objects are self-contained. A copy of the data object passed as argument is saved within the plot object, similarly to model-fit objects.
In the example above, \code{p1} by itself could be saved to a file on disk and loaded into a clean \Rlang session, even on another computer, and rendered as long as package \ggplot and its dependencies are available. Another consequence of storing a copy of the data in the plot object is that later changes to the data object used to create a \code{"gg"} object are \emph{not} reflected in newly rendered plots from this object: the \code{"gg"} object needs to be created anew.
\end{warningbox}

\begin{explainbox}
The \emph{recipe} for a plot is stored in a \code{"gg"} plot object. Objects of class \code{"gg"} are of mode \code{"list"}. In \Rlang, lists can contain heterogeneous members, and \code{"gg"} objects contain data, function definitions, and unevaluated expressions. In other words, they contain the data plus the instructions to transform the data, to map them into graphic objects, and various aspects of the rendering, from scale limits to the typefaces to use. (\Rlang lists are described in section \ref{sec:calc:lists} on page \pageref{sec:calc:lists}.)

Top-level members of the \code{"gg"} plot object \code{p1}, a simple plot, are displayed below with method \code{summary()}, which shows the components without making the structure of the object explicit.

<>=
summary(p1)
@

Method \code{str()} shows the structure of objects and can also be used to advantage with ggplots (long output not shown). Alternatively, \code{names()} extracts the names of the top-level members of \code{p1}.

<>=
names(p1)
@
\end{explainbox}

\begin{advplayground}
Explore in more detail the different members of object \code{p1}. For example, the code statement below extracts member \code{"layers"} from object \code{p1} and displays its structure.

<>=
str(p1$layers, max.level = 1)
@

How many layers are present in this case?
\end{advplayground}
\index{grammar of graphics!structure of plot objects|)}
\index{grammar of graphics!plots as R objects|)}

\subsection{Scales and mappings}\label{sec:plot:mappings}
\index{grammar of graphics!mapping of data|(}
\index{grammar of graphics!aesthetics|(}
In \ggplot a \emph{mapping} describes which variable in \code{data} is mapped to which \code{aesthetic}, or graphic feature of a plot, such as $x$, $y$, color, fill, shape, linewidth, etc. In \ggplot a \emph{scale} describes the correspondence between \emph{values} in the mapped variable and values of the graphic feature. Below, the numeric variable \code{cyl} is mapped to the \code{color} aesthetic. As the variable is \code{numeric}, a continuous color scale is used. Out of the multiple continuous color scales available, \ggscale{scale\_color\_continuous()} is the default.

<>=
p2 <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg, color = cyl)) +
  geom_point()
p2
@

Without changing the \code{mapping}, a different-looking plot can be created by changing the scale used. Below, in addition, a palette is selected with \code{option = "magma"} and the range of colours used from this palette is adjusted with \code{end = 0.85}.

<>=
p2 + scale_color_viridis_c(option = "magma", end = 0.85)
@

Conceptually, changing the scale used for the color aesthetic does not modify the plot, except for the colours used. There is a separation between the semantic structure of the plot and its graphic design. Still, how the audience interacts with and perceives the plot depends on both of these concerns.
Some scales, like those for \code{color}, exist in multiple ``flavors,'' suitable for numeric (continuous) variables or for factor (discrete) values. If \code{cyl} is converted into a \code{factor}, a discrete color scale is used instead of a continuous one. Out of the different discrete scales, \ggscale{scale\_color\_discrete()} is used by default.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, color = factor(cyl))) +
  geom_point()
@

If \code{cyl} is converted into an \code{ordered} factor, an ordinal color scale is used, by default \ggscale{scale\_color\_ordinal()} (plot not shown).

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, color = ordered(cyl))) +
  geom_point()
@

The scales for other aesthetics function in a similar way to those for color. Scales are described in detail in section \ref{sec:plot:scales} on page \pageref{sec:plot:scales}.

In the examples above for simple plots, based on data contained in a single data frame, mappings were established by passing the value returned by the call to \Rfunction{aes()} as the argument to parameter \code{mapping} of \Rfunction{ggplot()}.

Arguments passed to the \code{data} and/or \code{mapping} parameters of \Rfunction{ggplot()} work as defaults for all layers in a plot. In layer functions, statistics and geometries, arguments passed to parameters with these same names override the whole-plot defaults if present. Consequently, the code below creates a plot, \code{p3}, identical to \code{p2} above.

<>=
p3 <-
  ggplot() +
  geom_point(data = mtcars,
             mapping = aes(x = disp, y = mpg, color = cyl))
p3
@

These examples demonstrate two different approaches that are equally convenient for simple plots with a single layer. However, if a plot has multiple layers based on the same data, the approach used for \code{p2} makes this clear and is concise. If each layer uses different data and/or different mappings, the second approach is necessary.

\begin{explainbox}
In some cases, when flexibility is needed while constructing complex plots with multiple layers, other \emph{idioms} can be preferable, e.g., when assembling a plot from ``pieces'' stored in variables or built programmatically.

The default mapping can also be added directly with the \code{+} operator, instead of being passed as an argument to \Rfunction{ggplot()}.

<>=
ggplot(data = mtcars) +
  aes(x = disp, y = mpg) +
  geom_point()
@

It is also possible to have a default mapping for the whole plot, but no default data.

<>=
ggplot() +
  aes(x = disp, y = mpg) +
  geom_point(data = mtcars)
@

A mapping saved in a variable (example below), as well as a mapping returned by a function call (shown above for \code{aes()}), can be passed as an argument to parameter \code{mapping}.

<>=
my.mapping <- aes(x = disp, y = mpg)
ggplot(data = mtcars,
       mapping = my.mapping) +
  geom_point()
@

In all these examples, the plot remains unchanged (not shown). However, the flexibility of the grammar allows the assembly of plots from separately constructed pieces and the reuse of these pieces by storing them in variables. These approaches can be very useful in scripts that construct consistently formatted sets of plots, or when the same mapping needs to be used consistently in multiple plots.
\end{explainbox}

The mapping to aesthetics in the call to \Rfunction{aes()} does not have to be to a variable from \code{data} as in the examples above.
A code statement that returns a value computed from one or more variables from \code{data} is also accepted. Computations during mapping help avoid the proliferation of variables in the data frames containing observations. In this simple example, \code{mpg} in miles per gallon is converted into km per litre during mapping.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg * 0.43)) +
  geom_point()
@

\begin{explainbox}
Operations applied to the \code{data} before they are plotted are usually implemented in \code{stats}. Sometimes it is convenient to directly modify the whole-plot default \code{data} before it reaches the layer's \code{stat} function. One approach is to pass a function to parameter \code{data} of the layer function. This argument must be the definition of a function accepting a data frame as its first argument and returning a data frame. When the argument to \code{data} is a function definition instead of the usual data frame, the function is applied to the plot's default data and the data frame returned by the function is used as the \code{data} in the layer. In the example below, an anonymous function defined in-line extracts a subset of the rows. The observations in the extracted rows are highlighted in the plot by overplotting them with smaller yellow shapes.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(size = 4) +
  geom_point(data = function(x){subset(x = x, cyl == 4)},
             color = "yellow", size = 1.5)
@

The argument passed above to \code{data} is a function definition, not a function call. Thus, if a function is passed by name, no parentheses are used. No arguments can be passed to the function, except for the default \code{data} passed by position to its first parameter. Consequently, it is not possible to pass function \code{subset} directly. The anonymous function above is needed to be able to pass \code{cyl == 4} as an argument.

The plot's default data can also be operated upon using the \pkgname{magrittr} pipe operator, but not the pipe operator native to \Rlang (\Roperator{\textbar >}) or the dot-pipe operator from \pkgname{wrapr} (see section \ref{sec:data:pipes} on page \pageref{sec:data:pipes}). In this approach, the dot (\code{.}) placeholder at the head of the pipe stands for the plot's default \code{data} object. The code statement calls function \Rfunction{subset()} with \code{cyl == 4} passed as argument. The plot, not shown, is as in the example above.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(size = 4) +
  geom_point(data = . %>% subset(x = ., cyl == 4), color = "yellow",
             size = 1.5)
@

A third possible approach is to test the condition within the call to \Rfunction{aes()}. In this approach it is not possible to extract a subset of rows. Making some observations invisible by reducing their size seems straightforward. However, setting \code{size = 0} draws a very small point, still visible. Of the various possible approaches, setting size to \code{NA} skips the rows, and \code{na.rm = TRUE} silences the expected warning. This is a roundabout approach to subsetting. Notice that \ggscale{scale\_size\_identity()} is also needed. The plot, not shown, when rendered does not differ from the two examples above.
<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(size = 4) +
  geom_point(color = "yellow",
             mapping = aes(size = ifelse(cyl == 4, 1.5, NA)),
             na.rm = TRUE) +
  scale_size_identity()
@

As is usual in \Rlang, multiple approaches can be used to the same end.
\end{explainbox}

\begin{explainbox}
\emph{Late mapping}\index{grammar of graphics!mapping of data!late} of variables to aesthetics has been possible in \pkgname{ggplot2} for a long time, using a notation in which the name of a variable returned by a statistic is enclosed between \code{..}, but this notation was deprecated some time ago and replaced by \ggscale{stat()}. In both cases, this imposed a limitation: it was impossible to map a computed variable to the same aesthetic as input to the statistic and to the geometry in the same layer. There were also some other quirks that prevented passing some arguments to the geometry through the dots \code{...} parameter of a statistic.

Since version 3.3.0 of \pkgname{ggplot2} the syntax used for mapping variables to aesthetics is based on functions \ggscale{stage()}, \ggscale{after\_stat()} and \ggscale{after\_scale()}. Function \ggscale{after\_stat()} replaces both \ggscale{stat()} and the \code{..} notation.
\end{explainbox}

%Variables in the data frame passed as argument to \code{data} are mapped to aesthetics before they are received as input by a statistic (possibly \code{stat\_identity()}). The mappings of variables in the data frame returned by statistics are the input to the geometry. Those statistics that operate on \textit{x} and/or \text{y} return a transformed version of these variables, by default also mapped to these aesthetics. However, in most cases other variables in addition to \textit{x} and/or \text{y} are included in the \code{data} returned by a \emph{statistic}. Although their default mapping is coded in the statistic functions' definitions, the user can modify this default mapping explicitly within a call to \code{aes()} using \ggscale{after\_stat()}, which lets us differentiate between the data frame supplied by the user and that returned by the statistic. The third stage was not accessible in earlier versions of \pkgname{ggplot2}, but lack of access was usually not insurmountable. Now this third stage can be accessed with \ggscale{after\_scale()} making coding simpler.
%
%User-coded transformations of the data are best handled at the third stage using scale transformations. However, when the intention is to jointly display or combine different computed variables returned by a statistic we need to set the desired mapping of original and computed variables to aesthetics at more than one stage.
%
The documentation of \pkgname{ggplot2} gives several good examples of cases when the new mapping syntax is useful. I give here a different example, a polynomial fitted to data using \Rfunction{rlm()}. RLM is a procedure that, before computing the residual sums of squares, automatically assigns weights to the individual residuals in an attempt to protect the estimated fit from the influence of extreme observations or outliers. When using this and similar methods it is of interest to plot the residuals together with the weights. One approach is to map weights to a gradient between two colours. The code below constructs a data frame containing artificial data that includes an extreme value or outlier.
<>=
set.seed(4321)
X <- 0:10
Y <- (X + X^2 + X^3) + rnorm(length(X), mean = 0, sd = mean(X^3) / 4)
df1 <- data.frame(X, Y)
df2 <- df1
df2[6, "Y"] <- df1[6, "Y"] * 10
@

In the first plot \ggscale{after\_stat()} is used to map variable \code{weights}, computed by the statistic, to the \code{colour} aesthetic. In \ggstat{stat\_fit\_residuals()}, \gggeom{geom\_point()} is used by default. This figure shows the raw residuals with no weights applied (mapped to $y$ by default), and the computed weights (with range 0 to 1) encoded by colours ranging between red and blue.

<>=
ggplot(data = df2, mapping = aes(x = X, y = Y)) +
  stat_fit_residuals(formula = y ~ poly(x, 3, raw = TRUE), method = "rlm",
                     mapping = aes(colour = after_stat(weights)),
                     show.legend = TRUE) +
  scale_color_gradient(low = "red", high = "blue", limits = c(0, 1),
                       guide = "colourbar")
@

In the second plot weighted residuals are mapped to the $y$ aesthetic, and weights, as above, to the color aesthetic. A call to \ggscale{stage()} can distinguish the mapping ahead of the statistic (\code{start}) from that after the statistic, i.e., ahead of the geometry. As above, the default geometry, \gggeom{geom\_point()}, is used. The mapping in this example can be read as: variable \code{X} from the data frame \code{df2} is mapped to the \textit{x} aesthetic at all stages. Variable \code{Y} from the data frame \code{df2} is mapped to the \textit{y} aesthetic ahead of the computations in \ggstat{stat\_fit\_residuals()}. After the computations, variables \code{y} and \code{weights} in the data frame returned by \ggstat{stat\_fit\_residuals()} are multiplied and mapped to the \textit{y} aesthetic ahead of \gggeom{geom\_point()}.\label{chunk:plot:weighted:resid}

<>=
ggplot(df2) +
  stat_fit_residuals(formula = y ~ poly(x, 3, raw = TRUE),
                     method = "rlm",
                     mapping = aes(x = X,
                                   y = stage(start = Y,
                                             after_stat = y * weights),
                                   colour = after_stat(weights)),
                     show.legend = TRUE) +
  scale_color_gradient(low = "red", high = "blue", limits = c(0, 1),
                       guide = "colourbar")
@

\begin{explainbox}
When fitting models to observations with \Rfunction{lm()}, the un-weighted residuals are used to compute the sum of squares. Function \Rfunction{lm()} supports the use of weights supplied by the user. In \Rfunction{rlm()} the weights are computed automatically.
\end{explainbox}

\index{grammar of graphics!mapping of data|)}
\index{grammar of graphics!aesthetics|)}

\section{Geometries}\label{sec:plot:geometries}
\index{grammar of graphics!geometries|(}

Different geometries support different \emph{aesthetics} (Table \ref{tab:plot:geoms}). While \gggeom{geom\_point()} supports \code{shape}, and \gggeom{geom\_line()} supports \code{linetype}, both support \code{x}, \code{y}, \code{color} and \code{size}. In this section I describe frequently used \code{geometries} from package \ggplot and from a few packages that extend \ggplot. The graphic output from some code examples will not be shown, with the expectation that readers will run the code to see the plots.

Mainly for historical reasons, \emph{geometries} accept a \emph{statistic} as an argument, in the same way as \emph{statistics} accept a \emph{geometry} as an argument. In this section I only describe \emph{geometries} that have \ggstat{stat\_identity()} as their default \emph{statistic}. In section \ref{sec:plot:stat:summaries} (page \pageref{sec:plot:stat:summaries}) I describe other \emph{geometries} together with the \emph{statistics} they use by default.
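As a sketch of this equivalence (plot not shown, but expected to match the smoother example earlier in this chapter), the same two-layer plot can be coded with \gggeom{geom\_line()} as the entry point, passing the statistic by name and forwarding its arguments.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  # equivalent to stat_smooth(geom = "line", ...)
  geom_line(stat = "smooth", method = "lm", formula = y ~ x)
@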
\begin{table}
  \caption[Geometries]{\ggplot geometries described in section \ref{sec:plot:geometries}, packages where they are defined, and the aesthetics supported. The default statistic is in all cases \ggstat{stat\_identity()}.}\vspace{1ex}\label{tab:plot:geoms}
  \centering
  \begin{tabular}{llp{8.25cm}}
    \toprule
    Geometry & Package & Aesthetics \\
    \midrule
    \code{geom\_point} & \pkgnameNI{ggplot2} & x, y, shape, size, fill, color, alpha \\
    \code{geom\_point\_s} & \pkgnameNI{ggpp} & x, y, size, linetype, linewidth, fill, color, alpha \\
    \code{geom\_pointrange} & \pkgnameNI{ggplot2} & x, y, ymin, ymax, shape, size, linetype, linewidth, fill, color, alpha \\
    \code{geom\_errorbar} & \pkgnameNI{ggplot2} & x, ymin, ymax, linetype, linewidth, color, alpha \\
    \code{geom\_linerange} & \pkgnameNI{ggplot2} & x, ymin, ymax, linetype, linewidth, color, alpha \\
    \code{geom\_line} & \pkgnameNI{ggplot2} & x, y, linetype, linewidth, color, alpha \\
    \code{geom\_segment} & \pkgnameNI{ggplot2} & x, y, xend, yend, linetype, linewidth, color, alpha \\
    \code{geom\_step} & \pkgnameNI{ggplot2} & x, y, linetype, linewidth, color, alpha \\
    \code{geom\_path} & \pkgnameNI{ggplot2} & x, y, linetype, linewidth, color, alpha \\
    \code{geom\_curve} & \pkgnameNI{ggplot2} & x, y, xend, yend, linetype, linewidth, color, alpha \\
    \code{geom\_area} & \pkgnameNI{ggplot2} & x, y, (ymin = 0), linetype, linewidth, fill, color, alpha \\
    \code{geom\_ribbon} & \pkgnameNI{ggplot2} & x, ymin and ymax, linetype, linewidth, fill, color, alpha \\
    \code{geom\_align} & \pkgnameNI{ggplot2} & x or y, xmin or xmax, ymin or ymax, linetype, linewidth, fill, color, alpha \\
    \code{geom\_rect} & \pkgnameNI{ggplot2} & xmin, xmax, ymin, ymax, linetype, linewidth, fill, color, alpha \\
    \code{geom\_tile} & \pkgnameNI{ggplot2} & x, y, width, height, linetype, linewidth, fill, color, alpha \\
    \code{geom\_col} & \pkgnameNI{ggplot2} & x, y, width, linetype, linewidth, fill, color, alpha \\
    \code{geom\_rug} & \pkgnameNI{ggplot2} & x or y, linewidth, color, alpha \\
    \code{geom\_hline} & \pkgnameNI{ggplot2} & yintercept, linetype, linewidth, color, alpha \\
    \code{geom\_vline} & \pkgnameNI{ggplot2} & xintercept, linetype, linewidth, color, alpha \\
    \code{geom\_abline} & \pkgnameNI{ggplot2} & intercept, slope, linetype, linewidth, color, alpha \\
    \code{geom\_text} & \pkgnameNI{ggplot2} & x, y, label, face, family, angle, size, color, alpha \\
    \code{geom\_label} & \pkgnameNI{ggplot2} & x, y, label, face, family, (angle), size, fill, color, alpha \\
    \code{geom\_text\_npc} & \pkgnameNI{ggpp} & x, y, label, face, family, angle, size, color, alpha \\
    \code{geom\_label\_npc} & \pkgnameNI{ggpp} & x, y, label, face, family, (angle), size, fill, color, alpha \\
    \code{geom\_text\_repel} & \pkgnameNI{ggrepel} & x, y, label, face, family, angle, size, color, alpha \\
    \code{geom\_label\_repel} & \pkgnameNI{ggrepel} & x, y, label, face, family, size, fill, color, alpha \\
    \code{geom\_sf} & \pkgnameNI{ggplot2} & fill, color \\
    \code{geom\_table} & \pkgnameNI{ggpp} & x, y, label, size, color, angle \\
    \code{geom\_plot} & \pkgnameNI{ggpp} & x, y, label, vp.width, vp.height, angle \\
    \code{geom\_grob} & \pkgnameNI{ggpp} & x, y, vp.width, vp.height, label \\
    \code{geom\_blank} & \pkgnameNI{ggplot2} & --- \\
    \bottomrule
  \end{tabular}
\end{table}

\subsection{Point}\label{sec:plot:geom:point}
\index{grammar of graphics!point geometry|(}

As seen in examples above, \gggeom{geom\_point()} can be
used to add a layer with observations represented by ``points'' or symbols. In \emph{scatter plots} the variables mapped to the $x$ and $y$ aesthetics are both continuous (\code{numeric}), while in \emph{dot plots} one of them is discrete (\code{factor} or \code{ordered}) and the other continuous. The plots in the examples above have been scatter plots.

\index{plots!scatter plot|(}The first examples of the use of \gggeom{geom\_point()} are for \textbf{scatter plots}, as \code{disp} and \code{mpg} are \code{numeric} variables. In the examples above a third variable, \code{cyl}, was represented by colors. While the color aesthetic can be used with all \code{geoms}, other aesthetics can be used only with some \code{geoms}; for example the \code{shape} aesthetic can be used only with \gggeom{geom\_point()} and \code{geoms} like \gggeom{geom\_pointrange()} and \gggeom{geom\_point\_s()}. The values in the shape aesthetic are discrete, and consequently only discrete values can be mapped to it.

<>=
p.base <-
  ggplot(data = mtcars,
         mapping = aes(x = disp, y = mpg, shape = factor(cyl))) +
  geom_point()
p.base
@

\begin{playground}
Try a different mapping: \code{disp} $\rightarrow$ \code{color}, \code{cyl} $\rightarrow$ \code{x}, keeping the mapping \code{mpg} $\rightarrow$ \code{y} unchanged. Continue by using \code{help(mtcars)} and/or \code{names(mtcars)} to see what other variables are available, and then try the combinations that trigger your curiosity---i.e., explore the data.
\end{playground}

Adding \ggscale{scale\_shape\_discrete()}, the scale already used by default, but passing \code{solid = FALSE} as argument creates a version of the same plot based on open shapes, still selected automatically.

<>=
p.base +
  scale_shape_discrete(solid = FALSE)
@

In contrast to ``filled'' shapes that obey both the \code{color} and \code{fill} \emph{aesthetics}, open shapes obey only \code{color}, similarly to solid shapes. Function \ggscale{scale\_shape\_manual()} can be used to choose the shape used for each value in the mapped factor. Below, ``open'' shapes are used, as they reveal partial overlaps better than solid shapes (plot not shown).\label{chunk:filled:symbols}

<>=
p.base +
  scale_shape_manual(values = c("circle open",
                                "square open",
                                "diamond open"))
@

It is also possible to use characters as shapes. The character is centered on the position of the observation. As the numbers used as symbols are self-explanatory, the default guide is removed by passing \code{guide = "none"} (plot not shown).\label{chunk:plot:point:char}

<>=
p.base +
  scale_shape_manual(values = c("4", "6", "8"), guide = "none")
@

A variable from \code{data} can be mapped to more than one aesthetic, allowing redundant aesthetics. This makes it possible to create figures that, even if they use color, remain readable when reproduced as black-and-white images or viewed by people affected by color blindness.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg,
                     shape = factor(cyl), color = factor(cyl))) +
  geom_point()
@

\index{plots!scatter plot|)}
\index{plots!dot plot|(}The next examples of the use of \gggeom{geom\_point()} are for \textbf{dot plots}, as \code{mpg} is a \code{numeric} variable but \code{factor(cyl)} is discrete. Dot plots are prone to have overlapping observations, and one way of making these points visible is to make them partly transparent by setting a constant value smaller than one for the \code{alpha} \emph{aesthetic}.
<>=
ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = mpg)) +
  geom_point(alpha = 1/3)
@

Function\label{par:plot:pos:jitter} \ggposition{position\_identity()}, which is the default, does not alter the coordinates or position of observations, as shown in all examples above. To make overlapping observations visible, instead of making the points semitransparent as above, it is possible to randomly displace them along the axis mapped to the discrete variable, $x$ in this case. This is called \emph{jitter}, and can be added using \ggposition{position\_jitter()} as argument to formal parameter \code{position} of \code{geoms}. The amount of jitter is set by numeric arguments passed to \code{width} and/or \code{height}, given as a fraction of the distance between adjacent factor levels in the plot.

<>=
ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = mpg)) +
  geom_point(position = position_jitter(width = 0.25, height = 0))
@

\begin{warningbox}
  The name as a character string can also be used when no arguments need to be passed to the \emph{position} function, and for some positions by passing numerical arguments to specific parameters of geometries. However, the default amount of jitter, 40\% of the resolution of the data, tends to be rarely optimal (plot not shown).

<>=
ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = mpg, colour = factor(cyl))) +
  geom_point(position = "jitter")
@
\end{warningbox}

\index{plots!dot plot|)}
\index{plots!bubble plot|(}
\textbf{Bubble plots} are scatter or dot plots in which the size of the points or bubbles varies following the values of a continuous variable mapped to the \code{size} \emph{aesthetic}. There are two approaches to this mapping: the values in the mapped variable can determine either the area of the points or their radii. Although the radius is sometimes used, due to how visual perception works, mapping values to the area is perceived as closer to a linear mapping than mapping them to the radius. Below, the weights of the cars (in units of 1000~lb) are mapped to the area of the points. Open circles are used because of overlaps.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, color = factor(cyl), size = wt)) +
  scale_size_area() +
  geom_point(shape = "circle open", stroke = 1.5)
@

\begin{playground}
If a radius-based scale is used instead of an area-based one, the perceived size differences are larger, i.e., the ``impression'' on the viewer is different. In the plot above, replace \code{scale\_size\_area()} with \code{scale\_size\_radius()}.

Display the plot, look at it carefully. Check the numerical values of some of the weights of the cars, and assess if your perception of the plot matches the numbers behind it.
\end{playground}

\index{plots!bubble plot|)}

As a final example summarizing the use of \gggeom{geom\_point()}, the scatter plot below combines different \emph{aesthetics} and their \emph{scales}.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, shape = factor(cyl),
                     fill = factor(cyl), size = wt)) +
  geom_point(alpha = 0.33, color = "black") +
  scale_size_area() +
  scale_shape_manual(values = c("circle filled",
                                "square filled",
                                "diamond filled"))
@

\begin{playground}
Play with the code in the chunk above. Remove or change each of the mappings and the scale, display the new plot, and compare it to the one above. Continue playing with the code until you are sure you understand what graphical element in the plot is added or modified by each individual argument or ``word'' in the code statement.
\end{playground}
\index{grammar of graphics!point geometry|)}

It is common to draw error bars together with points representing means or medians. These can be added in a single layer with \gggeom{geom\_pointrange()} with values mapped to the \code{x}, \code{y}, \code{ymin} and \code{ymax} aesthetics, using \code{y} for the point and \code{ymin} and \code{ymax} for the ends of the line segment. Two other \emph{geometries}, \gggeom{geom\_linerange()} and \gggeom{geom\_errorbar()}, draw only a segment or a segment with capped ends, respectively. These three \code{geoms} are frequently used together with \code{stats} that compute summaries by group. However, summary values calculated before plotting can alternatively be passed as \code{data}.

\subsection{Rug}\label{sec:plot:rug}
\index{plots!rug margins|(}

Rug plots are rarely used by themselves. Instead, they are usually an addition to scatter plots, as they make it easier to see the distribution of observations along the $x$- and/or $y$-axes. An example of the use of \gggeom{geom\_rug()} follows. By default, rugs are drawn on the left and bottom edges of the plotting area. By passing \code{sides = "btlr"} they are drawn on the bottom, top, left and right margins. Any combination of the four characters can be used to control the drawing of the rugs.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_rug(sides = "btlr")
@

\begin{warningbox}
  Rug plots are most useful when the local density of observations in a continuous variable is not too high, otherwise rugs become too cluttered and the ``rug threads'' overlap. When overlap is moderate, making the segments semitransparent by setting the \code{alpha} aesthetic to a constant value smaller than one can make the variation in density easier to appreciate. When the number of observations is large, marginal density plots are preferred.
\end{warningbox}
\index{plots!rug margins|)}

\subsection{Line and area}\label{sec:plot:line}

\index{grammar of graphics!various line and path geometries|(}\index{plots!line plot|(}
\textbf{Line plots} are normally created using \gggeom{geom\_line()} and, occasionally, using \gggeom{geom\_path()}. These two \code{geoms} differ in the sequence they follow when connecting values: \gggeom{geom\_line()} connects observations based on the ordering of the \code{x} values while \gggeom{geom\_path()} uses the order in the data. Aesthetic \code{linewidth} controls the thickness of lines and \code{linetype} the patterns of dashes and dots.

In a line plot, observations, or the subset of observations corresponding to a group, are joined by straight lines. Below, a different data set, \Rdata{Orange}, with data on the growth of five orange trees (see \code{help(Orange)}) is used. By mapping \code{Tree} to \code{linetype} the observations become grouped, and a separate line is plotted for each tree.

\label{plot:fig:lines}
<>=
ggplot(data = Orange,
       mapping = aes(x = age, y = circumference, linetype = Tree)) +
  geom_line()
@
\index{plots!line plot|)}

\begin{warningbox}
Before \ggplot 3.4.0 the \code{size} aesthetic controlled the width of lines. Aesthetic \code{linewidth} was added in \ggplot 3.4.0 and the use of the \code{size} aesthetic for lines deprecated.
\end{warningbox}

\index{plots!step plot|(}%
Geometry \gggeom{geom\_step()} plots only vertical and horizontal lines to join the observations, creating a stepped line or ``staircase''.
Parameter \code{direction}, with default \code{"hv"}, controls the ordering of the horizontal and vertical lines.

<>=
ggplot(data = Orange,
       mapping = aes(x = age, y = circumference, linetype = Tree)) +
  geom_step()
@
\index{plots!step plot|)}

\begin{playground}
Using the following toy data, make three plots using \code{geom\_line()}, \code{geom\_path()}, and \code{geom\_step()} to add a layer. How do they differ?

<>=
toy.df <- data.frame(x = c(1,3,2,4), y = c(0,1,0,1))
@
\end{playground}

\index{plots!filled-area plot|(}
While \gggeom{geom\_line()} draws a line joining observations, \gggeom{geom\_area()} supports, in addition, filling the area below the line according to the \code{fill} \emph{aesthetic}. In some cases, it is useful to stack the areas, e.g., when the values plotted represent parts of a bigger whole. In the next, contrived, example, the areas representing the growth of the five orange trees are stacked (visually summed) using \code{position = "stack"} in place of the default \code{position = "identity"}. The visibility of the lines for individual trees is improved by changing their color and width from the defaults. (Compare the $y$ axis of the figure below to that drawn using \code{geom\_line()} on page \pageref{plot:fig:lines}.)

<>=
p1 <- # will be used again later
  ggplot(data = Orange,
         mapping = aes(x = age, y = circumference, fill = Tree)) +
  geom_area(position = "stack", color = "white", linewidth = 1)
p1
@

\gggeom{geom\_ribbon()} draws two lines based on the \code{x}, \code{ymin} and \code{ymax} \emph{aesthetics}, with the space between the lines filled according to the \code{fill} \emph{aesthetic}. \gggeom{geom\_polygon()} is similar to \gggeom{geom\_path()} but connects the first and last observations, forming a closed polygon that obeys the \code{fill} aesthetic.

\index{plots!filled-area plot|)}

\index{plots!reference lines|(}
Finally,\label{sec:plot:vhline} three \emph{geometries} draw lines across the whole plotting area: \gggeom{geom\_hline()}, \gggeom{geom\_vline()} and \gggeom{geom\_abline()}. The first two draw horizontal and vertical lines, respectively, while the third one draws straight lines with the position determined by the \emph{aesthetics} \code{slope} and \code{intercept}. The lines drawn with these three geoms extend to the edge of the plotting area.

\gggeom{geom\_hline()} and \gggeom{geom\_vline()} require a single parameter (or aesthetic), \code{yintercept} and \code{xintercept}, respectively. Different from other geoms, the data for these aesthetics can be passed as a constant numeric vector containing multiple values. The reason for this is that these geoms are most frequently used to annotate plots rather than to plot observations. Vertical lines can be used to highlight time points, here the ages of 1, 2, and 3 years.

<>=
p1 +
  geom_vline(xintercept = 365 * 1:3, color = "gray75") +
  geom_vline(xintercept = 365 * 1:3, linetype = "dashed")
@

\begin{playground}
  Change the order of the two layers in the example above. How did the figure change? What order is best? Would the same order be the best for a scatter plot? And would it be necessary to add two \code{geom\_vline()} layers?
\end{playground}

Similarly to \gggeom{geom\_hline()} and \gggeom{geom\_vline()}, \gggeom{geom\_abline()} draws a straight line, accepting as parameters (or as aesthetics) values for the \code{intercept}, $a$, and the \code{slope}, $b$.
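As a minimal sketch (plot not shown), with arbitrarily chosen values for the intercept and slope, such a reference line can be added to the \Rdata{mtcars} scatter plot used earlier.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  # intercept and slope values chosen arbitrarily for illustration
  geom_abline(intercept = 30, slope = -0.04, linetype = "dashed")
@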
\index{plots!reference lines|)}

\index{plots!segments and arrows|(}
Disconnected straight-line segments and arrows, one for each observation or row in the data, can be plotted with \gggeom{geom\_segment()}, which accepts \code{x}, \code{xend}, \code{y} and \code{yend} as mapped aesthetics. \gggeom{geom\_spoke()} uses a polar parametrization instead, with a different set of aesthetics: \code{x} and \code{y} for the origin, and \code{angle} and \code{radius} for the segment. Similarly, \gggeom{geom\_curve()} draws curved segments, with the curvature, control points, and angles controlled through parameters. These three \emph{geometries} support arrowheads at the ends of segments or curves, controlled through parameter \code{arrow} (not through an aesthetic).
\index{plots!segments and arrows|)}
\index{grammar of graphics!various line and path geometries|)}

\subsection{Column}\label{sec:plot:col}
\index{grammar of graphics!column geometry|(}
\index{plots!column plot|(}

The \emph{geometry} \gggeom{geom\_col()} can be used to create \emph{column plots}, where each bar represents an observation or row in the \code{data} (frequently means or totals previously computed from the primary observations).

\begin{warningbox}
In other contexts column plots are frequently called bar plots. \Rlang users not yet familiar with \ggplot are frequently surprised by the default behavior of \gggeom{geom\_bar()}, as it uses \ggstat{stat\_count()} to produce a histogram, rather than plotting values as is (see section \ref{sec:plot:histogram} on page \pageref{sec:plot:histogram}). \gggeom{geom\_col()} is identical to \gggeom{geom\_bar()} but with \code{"identity"} as the default statistic.
\end{warningbox}

Using very simple artificial data helps demonstrate how variations of column plots can be obtained. The data are for two groups, hypothetical males and females.

<>=
set.seed(654321)
my.col.data <-
  data.frame(treatment = factor(rep(c("A", "B", "C"), 2)),
             group = factor(rep(c("male", "female"), c(3, 3))),
             measurement = rnorm(6) + c(5.5, 5, 7))
@

The first plot includes data for \code{"female"} subjects extracted using a nested call to \Rfunction{subset()}. Except for \code{x} and \code{y}, default mappings are used for all \emph{aesthetics}.

<>=
opts_chunk$set(opts_fig_medium)
@

<>=
ggplot(subset(my.col.data, group == "female"),
       mapping = aes(x = treatment, y = measurement)) +
  geom_col()
@

The\label{par:plot:pos:stack} bars above are excessively wide; passing \code{width = 0.5} makes the bars narrower, using only half the distance between the levels on the $x$ axis. Setting \code{color = "white"} overrides the default color of the lines bordering the bars. Below, both males and females are included and \code{group} is mapped to the \code{fill} aesthetic. The default argument for \code{position} in \gggeom{geom\_col()} is \ggposition{position\_stack()}. Function \ggposition{position\_fill()} is similar to \ggposition{position\_stack()} but divides the stacked values by their sum, i.e., the individual stacked ``slices'' of the column display proportions instead of absolute values.
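As a sketch of this difference (plot not shown), with \ggposition{position\_fill()} every column spans from 0 to 1 and the stacked ``slices'' display the proportion contributed by each group.

<>=
ggplot(my.col.data,
       mapping = aes(x = treatment, y = measurement, fill = group)) +
  # position_fill() rescales each stack of values to sum to 1
  geom_col(width = 0.5, position = position_fill())
@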
<>=
opts_chunk$set(opts_2fig_very_wide)
@

<>=
p.base <-
  ggplot(my.col.data,
         mapping = aes(x = treatment, y = measurement, fill = group))
@

<>=
p1 <- p.base + geom_col(width = 0.5) + ggtitle("stack (default)")
@

Using \code{position = "dodge"}\label{par:plot:pos:dodge} to override the default \code{position = "stack"}, the columns for males and females are plotted side by side.\qRfunction{position\_dodge()}

<>=
p2 <- p.base + geom_col(position = "dodge") + ggtitle("dodge")
@

<>=
p1 + p2
@

<>=
opts_chunk$set(opts_fig_wide)
@

\begin{playground}
Change the argument to \code{position}, or let the default be active, until you understand its effect on the figure. What is the difference between \emph{positions} \code{"identity"}, \code{"dodge"}, \code{"stack"}, and \code{"fill"}?
\end{playground}

\begin{playground}
Use constants as arguments for \emph{aesthetics} or map variable \code{treatment} to one or more of the \emph{aesthetics} recognized by \gggeom{geom\_col()}, such as \code{color}, \code{fill}, \code{linetype}, \code{size}, \code{alpha} and \code{width}.
\end{playground}

\index{grammar of graphics!column geometry|)}
\index{plots!column plot|)}

\subsection{Tiles}\label{sec:tileplot}
\index{grammar of graphics!tile geometry|(}
\index{plots!tile plot|(}
\textbf{Tile plots} and \textbf{heat maps} are useful when observations are available on a regular rectangular 2D grid. The grid can, for example, represent locations in space as well as combinations of the levels of two discrete classification criteria. The color or darkness of the tiles informs about the value of the observations. A layer with square or rectangular tiles can be added with \gggeom{geom\_tile()}.

Data from 100 random draws from the $F$ distribution with degrees of freedom $\nu_1 = 2, \nu_2 = 20$ are used in the examples.

<>=
set.seed(1234)
randomf.df <- data.frame(F.value = rf(100, df1 = 2, df2 = 20),
                         x = rep(letters[1:10], 10),
                         y = LETTERS[rep(1:10, rep(10, 10))])
@

\gggeom{geom\_tile()} requires aesthetics $x$ and $y$, with no defaults, and \code{width} and \code{height}, with defaults that make all tiles of equal size, filling the plotting area. Variable \code{F.value} is mapped to \code{fill}.

<>=
ggplot(data = randomf.df,
       mapping = aes(x, y, fill = F.value)) +
  geom_tile()
@

Below, setting \code{color = "gray75"} and \code{linewidth = 1} makes the tile borders visible. Whether highlighting these borders improves a tile plot or not depends on whether the individual tiles correspond to values of a categorical or a continuous variable. E.g., when rows of tiles correspond to genes and columns to discrete treatments, visible tile borders are preferable. In contrast, when the tiles are an approximation to a continuous surface, like measurements on a regular spatial grid, it is best to suppress the tile borders.

<>=
ggplot(data = randomf.df,
       mapping = aes(x, y, fill = F.value)) +
  geom_tile(color = "gray75", linewidth = 1)
@

\begin{playground}
Play with the arguments passed to parameters \code{color} and \code{linewidth} in the example above, considering what features of the data are most clearly perceived in each of the plots you create.
\end{playground}

Continuous fill scales can be used to control the appearance. Below, code for a tile plot based on a gray gradient, with missing values in red, is shown (plot not shown).
<>=
ggplot(data = randomf.df,
       mapping = aes(x, y, fill = F.value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "gray15", high = "gray85", na.value = "red")
@

In contrast to \gggeom{geom\_tile()}, \gggeom{geom\_rect()} draws rectangular tiles based on the position of the corners, mapped to aesthetics \code{xmin}, \code{xmax}, \code{ymin} and \code{ymax}. In this case tiles can vary in size and do not need to be contiguous. The filled rectangles can be used, for example, to highlight a rectangular region in a plot (see the example on page \pageref{par:plot:inset:zoom}).
\index{plots!tile plot|)}
\index{grammar of graphics!tile geometry|)}

\subsection{Simple features (sf)}\label{sec:plot:sf}
\index{grammar of graphics!sf geometries|(}
\index{plots!maps and spatial plots|(}

\ggplot version 3.0.0 or later supports, with \gggeom{geom\_sf()} and its companions \gggeom{geom\_sf\_text()}, \gggeom{geom\_sf\_label()}, and \ggstat{stat\_sf()}, the plotting of shape data similarly to geographic information systems (GIS). This makes it possible to display data on maps, for example, using different fill values for different regions. The special \emph{coordinate} \code{coord\_sf()} can be used to select different projections for maps. The \emph{aesthetic} used is called \code{geometry} and, contrary to all the other aesthetics described above, the values to be mapped are of class \code{sfc}, containing \emph{simple features} data with multiple components. Manipulation of simple features data is supported by package \pkgname{sf}. Normal geometries can be used together with \ggstat{stat\_sf\_coordinates()} to add other graphical elements to maps. This subject exceeds the scope of this book, so a single and very simple example is shown below.

<>=
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
ggplot(nc) +
  geom_sf(mapping = aes(fill = AREA), color = "gray90")
@
\index{grammar of graphics!sf geometries|)}
\index{plots!maps and spatial plots|)}

\subsection{Text}\label{sec:plot:text}
\index{grammar of graphics!text and label geometries|(}
\index{plots!text in|(}
\index{plots!maths in|(}
Geometries \gggeom{geom\_text()} and \gggeom{geom\_label()} are used to add textual data labels and annotations to plots.

For \gggeom{geom\_text()} and \gggeom{geom\_label()}, the aesthetic \code{label} provides the text to be plotted and aesthetics \code{x} and \code{y}, the location of the labels. The size of the text is controlled by the \code{size} aesthetic, while the font is selected by the \code{family} and \code{fontface} aesthetics. Below, the whole-plot default mappings for the \code{color} and \code{size} aesthetics are overridden within \gggeom{geom\_text()}.

<>=
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg,
                     color = factor(cyl), size = wt, label = cyl)) +
  geom_point(alpha = 1/3) +
  geom_text(color = "darkblue", size = 3)
@

Aesthetics \code{angle}, expressed in degrees, and \code{vjust} and \code{hjust} can be used to rotate the text and adjust its vertical and horizontal justification. The default value of 0.5 for both \code{hjust} and \code{vjust} sets the center of the text at the supplied \code{x} and \code{y} coordinates. \emph{``Vertical'' and ``horizontal'' for text justification are relative to the text, not the plot.} This is important when \code{angle} is different from zero.
Values larger than 0.5 shift the label left or down, and values smaller than 0.5, right or up with respect to its \code{x} and \code{y} coordinates. A value of 1 or 0 sets the text so that its edge is at the supplied coordinate. Values outside the range $0\ldots 1$ shift the text even farther away, however, still using units based on the length or height of the text label. Recent versions of \pkgname{ggplot2} make it possible to set the justification using character constants for alignment: \code{"left"}, \code{"middle"}, \code{"right"}, \code{"bottom"}, \code{"center"} and \code{"top"}, and two special alignments, \code{"inward"} and \code{"outward"}, that automatically vary based on the position in the plotting area.

Below, \gggeom{geom\_text()} is used together with \gggeom{geom\_point()}, similarly to how data labels are added to a plot.

<>=
my.data <-
  data.frame(x = 1:5,
             y = rep(2, 5),
             label = c("ab", "bc", "cd", "de", "ef"))
@

<>=
ggplot(data = my.data,
       mapping = aes(x, y, label = label)) +
  geom_text(angle = 90, hjust = 1.5, size = 4) +
  geom_point()
@

In the case of \gggeom{geom\_label()} the text is enclosed in a box that obeys the \code{fill} \emph{aesthetic} and additional parameters (described starting on page \pageref{start:plot:label}) allowing control of the shape and size of the box. Before \ggplot 3.5.0, \gggeom{geom\_label()} did not support rotation with the \code{angle} aesthetic.

\begin{playground}
Modify the example above to use \gggeom{geom\_label()} instead of \gggeom{geom\_text()} using, in addition, the \code{fill} aesthetic.
\end{playground}

A serif font is set by passing \code{family = "serif"}. The names \code{"sans"} (the default), \code{"serif"} and \code{"mono"} are recognized by all graphics devices on all operating systems. They do not necessarily correspond to identical fonts on different computers or for different graphic devices, but instead to fonts that are similar. Additional fonts are available for specific graphic devices, such as the 35 ``PDF'' fonts provided by the \code{pdf()} device. In this case, their names can be queried with \code{names(pdfFonts())}.

<>=
ggplot(data = my.data,
       mapping = aes(x, y, label = label)) +
  geom_text(angle = 90, hjust = 1.5, size = 4, family = "serif") +
  geom_point()
@

\begin{playground}
In the examples above the character strings were all of the same, short, length. Redo the plots above with longer character strings of various lengths mapped to the \code{label} \emph{aesthetic}. Do also play with the justification of these labels.
\end{playground}

\begin{warningbox}
\Rlang\index{plots!fonts} and \ggplot support the use of UNICODE\index{UNICODE}, such as UTF8\index{UTF8} character encodings, in strings. If your editor or IDE supports their use, then you can type Greek letters and simple maths symbols directly, and they \emph{may} show correctly in labels if a suitable font is loaded and an extended encoding like UTF8 is in use by the operating system. Even if UTF8 is in use, text is not fully portable unless the same font is available\index{portability}, as even if the character positions are standardized for many languages, most UNICODE fonts support at most a small number of languages. In principle one can use this mechanism to mix, in the same figure, labels using other alphabets and languages with numerous symbols, like Chinese.
-Furthermore, the support for fonts, and consequently for character sets, in \Rlang is output-device dependent. The font encoding used by \Rlang by default depends on the default locale settings of the operating system, which can also lead to garbage printed to the console or wrong characters being plotted when running the same code on a computer different from the one where the script was created. Not all is lost, though, as \Rlang can be coerced to use system fonts and Google fonts with functions provided by packages \pkgname{showtext} and \pkgname{extrafont}. Encoding-related problems are common, especially under MS-Windows.
-\end{warningbox}
-
-Plotting (mathematical) expressions involves mapping to the \code{label} aesthetic character strings that can be parsed as expressions, and setting \code{parse = TRUE} (see section \ref{sec:plot:plotmath} on page \pageref{sec:plot:plotmath}). Below, the character strings are assembled using \Rfunction{paste()} but, of course, they could also have been typed in as constant values. This use of \Rfunction{paste()} is an example of the recycling of shorter vectors, \code{"alpha["} and \code{"]"}, to match the length of \code{1:5} (see section \ref{sec:vectors} on page \pageref{sec:vectors}).
-
-<<>>=
-my.data <-
-  data.frame(x = 1:5, y = rep(2, 5), label = paste("alpha[", 1:5, "]", sep = ""))
-my.data$label
-@
-
-Text and labels do not automatically expand the plotting area past their anchoring coordinates. In the example below, \code{expand\_limits(x = 5.2)} ensures that the text is not clipped at the edge of the plotting area.
-
-<<>>=
-ggplot(data = my.data,
-       mapping = aes(x, y, label = label)) +
-  geom_text(hjust = -0.2, parse = TRUE, size = 6) +
-  geom_point() +
-  expand_limits(x = 5.2)
-@
-
-In the example above, the text to be parsed was mapped to the \code{label} aesthetic using character strings previously added to the data frame \code{my.data}. It is also possible, and usually preferable, to build suitable character strings on the fly, with a nested call or code statement passed as an argument within the call to \code{aes()} (plot identical to the previous one, not shown).
-
-<<>>=
-ggplot(data = my.data,
-       mapping = aes(x, y, label = paste("alpha[", x, "]", sep = ""))) +
-  geom_text(hjust = -0.2, parse = TRUE, size = 6) +
-  geom_point()
-@
-
-Geometry \gggeom{geom\_label()} obeys the same aesthetics as \gggeom{geom\_text()} (except for \code{angle} in \ggplot versions before 3.5.0) and, additionally, \code{label.size} for the width of the border line, \code{label.r} for the roundness of the box corners, \code{label.padding} for the space between the text boundary and the box boundary, and \code{fill} for the fill color of the box.
-
-\label{start:plot:label}
-<<>>=
-my.data <-
-  data.frame(x = 1:5, y = rep(2, 5),
-             label = c("one", "two", "three", "four", "five"))
-
-ggplot(data = my.data,
-       mapping = aes(x, y, label = label)) +
-  geom_label(hjust = -0.2, size = 6,
-             label.size = 0,
-             label.r = unit(0, "lines"),
-             label.padding = unit(0.15, "lines"),
-             fill = "yellow", alpha = 0.5) +
-  geom_point() +
-  expand_limits(x = 5.6)
-@
-
-\begin{playground}
-Starting from the example above, play with the arguments to the different parameters and with the mappings to \emph{aesthetics} to get an idea of the variations in the design that they allow. For example, use thicker border lines and increase the padding so that a visually well-balanced margin is retained. You may also try mapping the \code{fill} and \code{color} \emph{aesthetics} to factors in the data.
-
-\end{playground}
-
-If\index{grammar of graphics!text and label geometries!repulsive} the parameter \code{check\_overlap} of \gggeom{geom\_text()} is set to \code{TRUE}, text overlap will be avoided by suppressing the text that would otherwise overlap other text. \emph{Repulsive} versions of \gggeom{geom\_text()} and \gggeom{geom\_label()}, \gggeom{geom\_text\_repel()} and \gggeom{geom\_label\_repel()}, are available in package \pkgname{ggrepel}. These \emph{geometries} avoid overlaps by automatically repositioning the text or labels. Please read the package documentation for details of how to control the repulsion strength and direction, and the properties of the segments linking the labels to the position of their data coordinates. Nearly all aesthetics supported by \code{geom\_text()} and \code{geom\_label()} are supported by the repulsive versions. In addition, given that a segment connects the label or text to its anchor point, several properties of these segments can also be controlled with aesthetics or arguments.
-
-<<>>=
-ggplot(data = mtcars,
-       mapping = aes(x = disp, y = mpg,
-                     color = factor(cyl), size = wt, label = cyl)) +
-  scale_size() +
-  geom_point(alpha = 1/3) +
-  geom_text_repel(color = "black", size = 3,
-                  min.segment.length = 0.2, point.padding = 0.1)
-@
-\index{plots!maths in|)}
-\index{plots!text in|)}
-\index{grammar of graphics!text and label geometries|)}
-
-\subsection{Plot insets}\label{sec:plot:insets}
-\index{grammar of graphics!inset-related geometries|(}
-\index{plots!insets|(}
-
-The support for insets in \pkgname{ggplot2} is confined to \code{annotation\_custom()}, which was designed to be used for static annotations expected to be the same in each panel of a plot (the use of annotations is described in section \ref{sec:plot:annotations}). Package \pkgname{ggpp} provides geoms that mimic \code{geom\_text()} in relation to the \emph{aesthetics} used, but that, similarly to \code{geom\_sf()}, expect the column in \code{data} mapped to the \code{label} aesthetic to be a list of objects containing multiple pieces of information, rather than an atomic vector. Three geometries are currently available: \gggeom{geom\_table()}, \gggeom{geom\_plot()} and \gggeom{geom\_grob()}.
-
-\begin{warningbox}
-Given that \gggeom{geom\_table()}, \gggeom{geom\_plot()} and \gggeom{geom\_grob()} will rarely use a mapping inherited from the whole plot, by default they do not inherit it. Either the mapping should be supplied as an argument to these functions or their parameter \code{inherit.aes} must be explicitly set to \code{TRUE}.
-\end{warningbox}
-
-\index{plots!inset tables|(}
-Tables can be added as plot insets with \gggeom{geom\_table()} by mapping a list of data frames (or tibbles) to the \code{label} \emph{aesthetic}. Positioning, justification, and angle work as for \gggeom{geom\_text()} and are applied to the whole table. The table(s) are constructed as \pkgname{grid} \code{grob} objects and added to the \code{gg} plot object as one layer.
-
-The code below builds a \code{tibble} containing summaries from the \code{mtcars} data set, with the summary values formatted as character strings, adds this tibble as the single member of a list, and stores this list as a column named \code{table.inset} in another \code{tibble}, named \code{table.tb}, together with the \code{x} and \code{y} coordinates for its location as an inset.
-
-\begin{explainbox}
-The code uses functions from the \pkgname{tidyverse} (see section \ref{sec:dplyr:group:wise} on page \pageref{sec:dplyr:group:wise}).
-Data frames and base \Rlang functions could have been used instead (see section \ref{sec:calc:df:aggregate} on page \pageref{sec:calc:df:aggregate}).
-\end{explainbox}
-
-<<>>=
-mtcars |>
-  group_by(cyl) |>
-  summarize("mean wt" = format(mean(wt), digits = 3),
-            "mean disp" = format(mean(disp), digits = 2),
-            "mean mpg" = format(mean(mpg), digits = 2)) -> my.table
-table.tb <- tibble(x = 500, y = 35, table.inset = list(my.table))
-@
-
-As with text labels, justification is interpreted in relation to table-text orientation; however, the default, \code{"inward"}, rarely needs to be changed if one sets the $x$ and $y$ coordinates to the location of the inset corner farthest from the center of the plot. The inset table is added at its native size, which is determined by the \code{size} aesthetic applied to the text in it.
-
-<<>>=
-ggplot(data = mtcars,
-       mapping = aes(x = disp, y = mpg, color = factor(cyl), size = wt)) +
-  scale_size() +
-  geom_point() +
-  geom_table(data = table.tb,
-             mapping = aes(x = x, y = y, label = table.inset),
-             color = "black", size = 3)
-@
-
-Parsed text, using \Rlang's \emph{plotmath} syntax, is supported in tables on a cell-by-cell basis, with fallback to plain text in case of parsing errors.
-
-\begin{explainbox}
-The \emph{geometry} \gggeom{geom\_table()} uses functions from package \pkgname{gridExtra} to build a graphical object for the table. A table theme can be passed as an argument to \gggeom{geom\_table()}.
-\end{explainbox}
-\index{plots!inset tables|)}
-
-\index{plots!inset plots|(}
-Geometry \gggeom{geom\_plot()} works similarly to \code{geom\_table()} but insets a ggplot within another ggplot. Thus, instead of expecting a list of data frames or tibbles to be mapped to the \code{label} aesthetic, it expects a list of ggplots (objects of class \code{gg}). Inset plots can be very useful for zooming in on parts of a main plot where observations are crowded and for displaying summaries based on the observations shown in the main plot. The inset plots are nested in viewports that constrain the dimensions of the inset plot. Aesthetics \code{vp.height} and \code{vp.width} set the size of the viewports---with defaults of 1/3 of the height and width of the plotting area of the main plot. Themes can be applied separately to the main and inset plots.
-
-In the first example of inset plots, the summaries shown above as a column of the inset table are instead inset as a column plot. The code below creates a tibble containing the plot to be inset.
-
-<<>>=
-mtcars |>
-  group_by(cyl) |>
-  summarize(mean.mpg = mean(mpg)) |>
-  ggplot(data = _,
-         mapping = aes(factor(cyl), mean.mpg, fill = factor(cyl))) +
-  scale_fill_discrete(guide = "none") +
-  scale_y_continuous(name = NULL) +
-  geom_col() +
-  theme_bw(8) -> my.plot
-plot.tb <- tibble(x = 500, y = 35, plot.inset = list(my.plot))
-@
-
-<<>>=
-ggplot(data = mtcars,
-       mapping = aes(x = disp, y = mpg, color = factor(cyl))) +
-  geom_point() +
-  geom_plot(data = plot.tb,
-            aes(x = x, y = y, label = plot.inset),
-            vp.width = 1/2,
-            hjust = "inward", vjust = "inward")
-@
-
-In the second example the plot inset is a zoomed-in view of a region of the base plot. The code to build this plot is split into three chunks. \code{p.main} is the plot to be used as the base for the final plot.
-
-<<>>=
-p.main <-
-  ggplot(data = mtcars,
-         mapping = aes(x = disp, y = mpg, color = factor(cyl))) +
-  geom_point()
-@
-
-\code{p.inset} is the plot to be used as the inset; the call to \code{coord\_cartesian()} zooms into \code{p.main}; the call to \code{labs()} removes the redundant axis labels; the call to \code{scale\_color\_discrete()} removes the redundant guide in the inset; and the calls to \code{theme\_bw()} and \code{theme()} change the theme and font size for the inset.
-
-<<>>=
-p.inset <- p.main +
-  coord_cartesian(xlim = c(270, 330), ylim = c(14, 19)) +
-  labs(x = NULL, y = NULL) +
-  scale_color_discrete(guide = "none") +
-  theme_bw(8) + theme(aspect.ratio = 1)
-@
-
-As in the previous example, \gggeom{geom\_plot()} adds the inset, in this case with constant values for aesthetics. The call to \code{annotate()} using \gggeom{geom\_rect()} adds the rectangle highlighting the zoomed-in region in the main plot.\label{par:plot:inset:zoom}
-
-<<>>=
-p.main +
-  geom_plot(x = 480, y = 34, label = list(p.inset), vp.height = 1/2) +
-  annotate(geom = "rect", fill = NA, color = "black",
-           xmin = 270, xmax = 330, ymin = 14, ymax = 19,
-           linetype = "dotted")
-@
-\index{plots!inset plots|)}
-\index{plots!inset graphical objects|(}
-Geometry \gggeom{geom\_grob()} differs very little from \code{geom\_plot()} but insets \pkgname{grid} graphical objects, called \code{grob} for short. This approach is very flexible, as grobs can be vector graphics as well as contain rasters (or bitmaps). In most cases, the grobs need to be created first, either using functions from package \pkgname{grid} to draw them or by converting other types of objects into grobs. Geometry \gggeom{geom\_grob()} is as flexible as \gggeom{annotation\_custom()} with respect to the grobs, but behaves as a \emph{geometry}. Below, two bitmaps are added as ``labels'' to the base plot.
-
-The bitmaps are read from PNG files (included as examples in package \pkgname{ggpp}).
-
-<<>>=
-file1.name <-
-  system.file("extdata", "Isoquercitin.png",
-              package = "ggpp", mustWork = TRUE)
-Isoquercitin <- magick::image_read(file1.name)
-file2.name <-
-  system.file("extdata", "Robinin.png",
-              package = "ggpp", mustWork = TRUE)
-Robinin <- magick::image_read(file2.name)
-@
-
-The two bitmaps are converted into \code{grobs}, added as two separate members to a list, and the list added as a column to a \code{data.frame} named \code{grob.tb}. The coordinates for the position of each \code{grob} as well as the size of each viewport are also added to this \code{data.frame}.
-
-<<>>=
-grob.tb <-
-  data.frame(x = c(0, 100), y = c(10, 20), height = 1/3, width = c(1/2),
-             grobs = I(list(grid::rasterGrob(image = Isoquercitin),
-                            grid::rasterGrob(image = Robinin))))
-@
-
-The two \code{grobs} are added as a single plot layer to an empty plot. Insets like these can be added to any base plot.
-
-<<>>=
-ggplot() +
-  geom_grob(data = grob.tb,
-            mapping = aes(x = x, y = y, label = grobs,
-                          vp.height = height, vp.width = width),
-            hjust = "inward", vjust = "inward")
-@
-\index{plots!inset graphical objects|)}
-
-\begin{explainbox}
-Grid graphics\index{grid graphics coordinate systems} provide the low-level functions that \pkgname{ggplot2} uses under the hood. Package \pkgname{grid} supports different types of units for expressing the coordinates of positions. In the \pkgname{ggplot2} user interface \code{"native"} data coordinates are used with only a few exceptions.
-Package \pkgname{grid} supports the use of physical units like \code{"mm"} as well as relative units like \code{"npc"} (\emph{normalized parent coordinates}). Positions expressed in npc units are numbers in the range 0 to 1, relative to the dimensions of the current \emph{viewport}, with origin at the lower left corner.
-
-Package \pkgname{ggplot2} interprets $x$ and $y$ coordinates in \code{"native"} data coordinates. A rather general solution is provided by package \pkgname{ggpp} through the \emph{aesthetics} \code{npcx} and \code{npcy} and \emph{geometries} that support them: at the time of writing, \gggeom{geom\_text\_npc()}, \gggeom{geom\_label\_npc()}, \gggeom{geom\_table\_npc()}, \gggeom{geom\_plot\_npc()} and \gggeom{geom\_grob\_npc()}. These \emph{geometries} are useful for annotating plots and adding insets at positions relative to the plotting area that remain consistent across different plots, or across panels when using facets with free axis limits. Being geometries, they allow the elements added to different panels, and their positions, to differ freely.
-
-<<>>=
-ggplot(data = mtcars,
-       mapping = aes(x = disp, y = mpg, color = factor(cyl))) +
-  geom_point() +
-  geom_label_npc(npcx = 0.5, npcy = 0.9, label = "a label", color = "black")
-@
-
-\end{explainbox}
-
-\index{grammar of graphics!inset-related geometries|)}
-\index{plots!insets|)}
-\index{grammar of graphics!geometries|)}
-
-\section{Statistics}\label{sec:plot:statistics}
-\index{grammar of graphics!statistics|(}
-All statistics, except \ggstat{stat\_identity()}, modify the \code{data} they receive before passing it to a geometry. Most statistics compute a specific summary from the data, but there are exceptions. More generally, they make it possible to integrate computations on the data into the plotting work flow. This saves effort but, more importantly, helps ensure that the data and summaries within a given plot are consistent. Table \ref{tab:plot:stats} lists all the statistics used in the chapter.
-
-When a factor is mapped to an aesthetic, each level creates a group. For example, in the first plot example in section \ref{sec:plot:line} on page \pageref{sec:plot:line} the grouping resulted in separate lines. The grouping is not so obvious with other aesthetics, but it is not different. Most \emph{statistics} operate separately on the data for each group, returning an independent summary for each group. Mapping a continuous variable to an aesthetic does not create groups. All aesthetics, including \code{x} and \code{y}, follow this pattern, thus a factor mapped to \code{x} also creates a group for each level of the factor.
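-
-As a minimal sketch of this last point (based on the \code{mtcars} data set; plot not shown), mapping \code{factor(cyl)} to \code{x} creates one group per level of the factor, so \ggstat{stat\_summary()}, described below, computes one mean per group.
-
-<<stat-groups-sketch>>=
-# one group per level of factor(cyl), thus one mean per group
-ggplot(data = mtcars,
-       mapping = aes(x = factor(cyl), y = mpg)) +
-  stat_summary(fun = "mean", geom = "point")
-@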
-
-\begin{table}
-  \caption[Statistics]{\ggplot statistics described in section \ref{sec:plot:statistics}, packages where they are defined, their default geometry and the aesthetics they use as input for computations.}\vspace{1ex}\label{tab:plot:stats}
-  \centering
-  \begin{tabular}{llll}
-    \toprule
-    Statistic & Package & Geometry & Aesthetics \\
-    \midrule
-    \code{stat\_function} & \pkgnameNI{ggplot2} & \code{geom\_function} & x \\
-    \code{stat\_summary} & \pkgnameNI{ggplot2} & \code{geom\_pointrange} & x, y \\
-    \code{stat\_smooth} & \pkgnameNI{ggplot2} & \code{geom\_smooth} & x, y, weight \\
-    \code{stat\_poly\_line} & \pkgnameNI{ggpmisc} & \code{geom\_smooth} & x, y, weight \\
-    \code{stat\_poly\_eq} & \pkgnameNI{ggpmisc} & \code{geom\_text} & x, y, weight \\
-    \code{stat\_fit\_tb} & \pkgnameNI{ggpmisc} & \code{geom\_table} & x, y, weight \\
-    \code{stat\_bin} & \pkgnameNI{ggplot2} & \code{geom\_bar} & x, y \\
-    \code{geom\_histogram} & \pkgnameNI{ggplot2} & --- & x, y \\
-    \code{stat\_bin2d} & \pkgnameNI{ggplot2} & \code{geom\_tile} & x, y \\
-    \code{stat\_bin\_hex} & \pkgnameNI{ggplot2} & \code{geom\_hex} & x, y \\
-    \code{stat\_density} & \pkgnameNI{ggplot2} & \code{geom\_area} & x, y \\
-    \code{geom\_density} & \pkgnameNI{ggplot2} & --- & x, y \\
-    \code{stat\_density\_2d} & \pkgnameNI{ggplot2} & \code{geom\_density\_2d} & x, y \\
-    \code{stat\_boxplot} & \pkgnameNI{ggplot2} & \code{geom\_boxplot} & x, y \\
-    \code{stat\_ydensity} & \pkgnameNI{ggplot2} & \code{geom\_violin} & x, y \\
-    \code{geom\_violin} & \pkgnameNI{ggplot2} & --- & x, y \\
-    \code{geom\_quasirandom} & \pkgnameNI{ggbeeswarm} & --- & x, y \\
-    \code{stat\_ma\_line} & \pkgnameNI{ggpmisc} & \code{geom\_smooth} & x, y \\
-    \code{stat\_ma\_eq} & \pkgnameNI{ggpmisc} & \code{geom\_text} & x, y \\
-    \code{stat\_centroid} & \pkgnameNI{ggpmisc} & \code{geom\_point} & x, y \\
-    \code{stat\_quant\_line} & \pkgnameNI{ggpmisc} & \code{geom\_smooth} & x, y \\
-    \code{stat\_quant\_eq} & \pkgnameNI{ggpmisc} & \code{geom\_text} & x, y \\
-    \code{stat\_identity} & \pkgnameNI{ggplot2} & \code{geom\_point} & --- \\
-    \bottomrule
-  \end{tabular}
-\end{table}
-
-\subsection{Functions}\label{sec:plot:function}
-\index{grammar of graphics!function statistic|(}
-\index{plots!plots of functions|(}
-Statistic \ggstat{stat\_function()} is the simplest to use and understand, even if unusual. It generates $y$ values by applying an \Rlang function to a sequence of $x$ values. The range of the \code{numeric} variable mapped to \code{x} determines the range of $x$ values used.
-
-Any \Rlang function, user defined or not, can be used as long as it is vectorized, with the length of the returned vector equal to the length of the vector passed as argument to its first parameter. The argument passed to parameter \code{n} of \ggstat{stat\_function()} determines the length of the generated vector of $x$ values. The data frame returned contains these as the $x$ values, and the values returned by the function as the $y$ values.
-
-The code to plot the Normal probability density function is very simple, relying on the defaults \code{n = 101} and \code{geom = "path"}.
-
-<<>>=
-ggplot(data = data.frame(x = c(-3,3)),
-       mapping = aes(x = x)) +
-  stat_function(fun = dnorm)
-@
-
-Using a named list, additional arguments can be passed to the function when it is called to generate the data (plot not shown).
-
-<<>>=
-ggplot(data = data.frame(x = c(-3,4)),
-       mapping = aes(x = x)) +
-  stat_function(fun = dnorm, args = list(mean = 1, sd = .5))
-@
-
-\begin{playground}
-Edit the code above so as to plot in the same figure three curves, either for three different values of \code{mean} or for three different values of \code{sd}.
-\end{playground}
-
-Named user-defined functions (not shown) and anonymous functions (below) can also be used.
-
-<<>>=
-ggplot(data = data.frame(x = 0:1),
-       mapping = aes(x = x)) +
-  stat_function(fun = function(x, a, b){a + b * x^2},
-                args = list(a = 1, b = 1.4))
-@
-
-\begin{playground}
-Edit the code above to use a different function, such as $e^{x + k}$, adjusting the argument(s) passed through \code{args} accordingly. Do this by means of an anonymous function, and by means of an equivalent named function defined by your code.
-\end{playground}
-
-\index{plots!plots of functions|)}
-\index{grammar of graphics!function statistic|)}
-
-\subsection{Summaries}\label{sec:plot:stat:summaries}
-\index{grammar of graphics!summary statistic|(}
-\index{plots!data summaries|(}
-\index{plots!means}\index{plots!medians}\index{plots!error bars}
-The summaries discussed in this section can be superimposed on raw data plots, or plotted on their own. Beware that if scale limits are manually set, the summaries will be calculated from the subset of observations within these limits. Scale limits can be altered when explicitly defining a scale or by means of functions \Rfunction{xlim()} and \Rfunction{ylim()}. See section \ref{sec:plot:coord} on page \pageref{sec:plot:coord} for an explanation of how coordinate limits can be used to zoom into a plot without excluding $x$ and $y$ values from the data.
-
-It is possible to summarize data on the fly when plotting. The simultaneous calculation of measures of central tendency and of variation in \ggstat{stat\_summary()} allows them to be added together in the same plot layer.
-
-Data frame \code{fake.data}, constructed below, contains normally distributed artificial values in variable \code{y}, in two groups distinguished by the levels of factor \code{group}.
-
-<<>>=
-fake.data <- data.frame(
-  y = c(rnorm(10, mean = 2, sd = 0.5),
-        rnorm(10, mean = 4, sd = 0.7)),
-  group = factor(c(rep("A", 10), rep("B", 10))))
-@
-
-Different summaries can be computed by \ggstat{stat\_summary()} because the summary function used is passed as an argument. The function passed as an argument can be one returning a single value, like \code{mean()}, or one returning a central value and the extremes of a range. The \code{geom} used needs to be able to use the output of the function.
-
-Below, a base plot is assigned to \code{p1.base}.
-
-<<>>=
-p1.base <-
-  ggplot(data = fake.data, mapping = aes(y = y, x = group)) +
-  geom_point(shape = "circle open")
-@
-
-Using only defaults triggers a message but creates a useful plot with means and standard errors.
-
-<<>>=
-p1.base + stat_summary()
-@
-
-<<>>=
-opts_chunk$set(opts_2fig_very_wide)
-@
-
-Below is code for a similar plot, \code{p1}, with only means highlighted in red, using \gggeom{geom\_point()}. This is the same \code{geom} described in section \ref{sec:plot:geom:point} on page \pageref{sec:plot:geom:point}.
-
-<<>>=
-p1 <-
-  p1.base +
-  stat_summary(fun = "mean", geom = "point", color = "red", shape = "-", size = 15)
-@
-
-A function that returns a central value like the mean plus confidence or other limits has to be passed to parameter \code{fun.data} instead of to \code{fun}.
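-
-As a minimal sketch of such a function (an illustration of the expected return format, not part of the original set of examples; plot not shown), a user-defined function can return a one-row data frame with columns \code{y}, \code{ymin} and \code{ymax}, the format expected by the default geometry \gggeom{geom\_pointrange()}. Here the limits are arbitrarily set at the mean plus and minus one median absolute deviation.
-
-<<mean-mad-sketch>>=
-mean_mad <- function(x) {
-  # y is the central value, ymin and ymax the range limits
-  data.frame(y = mean(x),
-             ymin = mean(x) - mad(x),
-             ymax = mean(x) + mad(x))
-}
-p1.base + stat_summary(fun.data = mean_mad, color = "red")
-@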
-In \code{p2} below, confidence intervals for $p = 0.99$, computed assuming normality, are added.
-
-<<>>=
-p2 <- p1.base +
-  stat_summary(fun.data = "mean_cl_normal", fun.args = list(conf.int = 0.99),
-               color = "red", size = 0.7, linewidth = 1, alpha = 0.5)
-@
-
-<<>>=
-p1 + p2
-@
-
-The intervals can also be computed without assuming normality, using the empirical distribution estimated from the data by bootstrap, by passing \code{"mean\_cl\_boot"} instead of \code{"mean\_cl\_normal"}.
-
-For $\bar{x} \pm \mathrm{s.e.}$, the default, \code{"mean\_se"} can be passed as argument to \code{fun.data} to avoid the message seen above, and for $\bar{x} \pm \mathrm{s.d.}$, \code{"mean\_sdl"} should be passed as argument.
-
-As sketched above, it is also possible to use user-defined functions instead of the functions exported by package \ggplot (based on those in package \Hmisc). As the arguments to the summary function, except for the first one containing the variable in \code{data} mapped to the $y$ aesthetic, are supplied as a named list through parameter \code{fun.args}, the names used for the parameters in the function definition need only match the names in this list.
-
-<<>>=
-opts_chunk$set(opts_fig_wide)
-@
-
-Means, or other summaries, computed by group based on the factor mapped to the \code{x} aesthetic (\code{class} in this example) can be plotted as columns by passing \code{"col"} as an argument for parameter \code{geom}. With this approach there is no need to compute the summaries in advance of plotting (plot shown farther down, with error bars added).
-
-<<>>=
-p2.base <-
-  ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
-  stat_summary(geom = "col", fun = mean)
-@
-
-Error bars can be added to the column plot. Passing \code{linewidth = 1} makes the lines of the error bars thicker. The default \emph{geometry} in \ggstat{stat\_summary()} is \gggeom{geom\_pointrange()}; passing \code{"linerange"} as an argument for \code{geom} removes the points at the top edge of the bars.
-
-<<>>=
-p2.base +
-  stat_summary(geom = "linerange", fun.data = "mean_cl_normal",
-               linewidth = 1, color = "red")
-@
-
-Passing \code{"errorbar"} instead of \code{"linerange"} to \code{geom} results in traditional ``capped'' error bars. However, this type of error bar has been criticized as adding unnecessary clutter to plots \autocite{Tufte1983}. Aesthetic \code{width} controls the width of the caps at the tips of the bars.
-
-When calculated values for the summaries are already available in \code{data}, equivalent plots can be obtained by mapping the summary values from \code{data} to the \emph{aesthetics} \code{x}, \code{y}, \code{ymax} and \code{ymin} and using the \code{geoms} \gggeom{geom\_errorbar()} and \gggeom{geom\_linerange()}, with their default for \code{stat}, \ggstat{stat\_identity()}, to add a plot layer.
-
-\begin{explainbox}
-A layer can be added to a plot directly with a \code{geom}, possibly passing a \code{stat} as an argument to it. In this book I have usually avoided this alternative syntax, except when not overriding \ggstat{stat\_identity()}, the usual default. The two code statements below are equivalent.
-
-<<>>=
-ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
-  geom_col(stat = "summary", fun = mean)
-
-ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
-  stat_summary(geom = "col", fun = mean)
-@
-\end{explainbox}
-\index{plots!data summaries|)}
-\index{grammar of graphics!summary statistic|)}
-
-\subsection{Smoothers and models}\label{sec:plot:smoothers}
-\index{plots!smooth curves|(}
-\index{plots!fitted curves|(}
-\index{plots!statistics!smooth}
-
-For describing or highlighting relationships between pairs of continuous variables, using a line, straight or curved, in a plot is very effective. Drawing lines that provide a meaningful and accurate description of the relationship requires basing the lines on predictions from models fitted to the observations. Fitted models frequently also make it possible to assess the reliability of the estimates. See section \ref{sec:stat:mf} on page \pageref{sec:stat:mf} for a description of the model fitting procedures underlying the plotting described in the current section.
-
-The \code{stat} \ggstat{stat\_smooth()} fits a smooth curve to observations when the scales for $x$ and $y$ are continuous---the corresponding \emph{geometry} \gggeom{geom\_smooth()} uses this \emph{statistic}, and differs only in how arguments are passed to formal parameters. In the first example, \ggstat{stat\_smooth()} is used with a loess smoother, selected explicitly. When no \code{method} is passed as argument, the type of smoother is automatically chosen based on the number of observations, and the choice is informed by a message. In \emph{statistics}, the \code{formula} must be stated using the names of the $x$ and $y$ aesthetics, rather than the names of the variables mapped to them, i.e., in this example, not their names in the \code{mtcars} data frame. Splines are described in section \ref{sec:stat:splines} on page \pageref{sec:stat:splines}. When the number of observations is small enough to make it possible, they are usually plotted as points together with the smoother. The observations can be plotted on top of the smoother or the smoother on top of the observations, as done here.
-
-<<>>=
-p3 <-
-  ggplot(data = mtcars, mapping = aes(x = disp, y = mpg)) +
-  geom_point()
-@
-
-<<>>=
-p3 + stat_smooth(method = "loess", formula = y ~ x)
-@
-
-A model different from the default one can be used. Below, a linear regression is fitted with \Rfunction{lm()}. Fitting of linear models is explained in section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}.
-
-<<>>=
-p3 + stat_smooth(method = "lm", formula = y ~ x)
-@
-
-These data can be grouped, here by mapping \code{factor(cyl)} to the \code{color} \emph{aesthetic}. With three groups, three separate linear regressions are fitted, and displayed as three straight lines. Each line is accompanied by a confidence band for the ``true'' location of the curve.
-
-<<>>=
-p3 + aes(color = factor(cyl)) +
-  stat_smooth(method = "lm", formula = y ~ x)
-@
-
-To obtain a single fitted smoother, in this case a single linear regression line across the three groups, the grouping is cancelled below by mapping the \code{color} \emph{aesthetic} to a constant value in \ggstat{stat\_smooth()}, the layer function. This local argument overrides the default \code{color} mapping set in \code{ggplot()}. The use of \code{"black"} is arbitrary; any other color definition known to \Rlang could have been used instead of it.
-
-<<>>=
-p3 + aes(color = factor(cyl)) +
-  stat_smooth(method = "lm", formula = y ~ x, color = "black")
-@
-
-A different linear model, a second-degree polynomial in this example, is fitted below by passing a different argument to \code{formula} than in the example above for linear regression.
-
-<<>>=
-p3 + aes(color = factor(cyl)) +
-  stat_smooth(method = "lm", formula = y ~ poly(x, 2), color = "grey20")
-@
-
-It is possible to use other types of models, including GAM and GLM, as smoothers. Next, I give two simple examples of the use of \code{nls()} to fit a model non-linear in its parameters (see section \ref{sec:stat:NLS} on page \pageref{sec:stat:NLS} for details about fitting this same model with \code{nls()}). In both examples the model fitted is the Michaelis-Menten equation, describing the rate of a chemical reaction (\code{rate}) as a function of reactant concentration (\code{conc}). \Rdata{Puromycin} is a data set included in the \Rlang distribution. Function \Rfunction{SSmicmen()}, used in the first example, is also from \Rlang, and is a \emph{self-starting}\index{self-starting functions} implementation of the Michaelis-Menten equation. Thanks to this, even though the fit is done with an iterative algorithm, starting values for the parameters to be fitted are not needed. Passing \code{se = FALSE} suppresses the attempt to compute a confidence band, as confidence bands are not supported by the \code{predict()} method for model fits done with function \Rfunction{nls()}.
-
-<<>>=
-ggplot(data = Puromycin,
-       mapping = aes(conc, rate, color = state)) +
-  geom_point() +
-  geom_smooth(method = "nls", formula = y ~ SSmicmen(x, Vm, K), se = FALSE)
-@
-
-In the second example the code describing the equation is passed as argument to \code{formula}, with starting values passed as a named list to \code{start}. The names used for the parameters to be estimated by fitting the model can be chosen at will, within the restrictions of the \Rlang language, but of course the names used in \code{formula} and \code{start} must match each other. As for other models, \code{x} and \code{y} are the names of the aesthetics to which the observations have been mapped.
-
-<<>>=
-ggplot(data = Puromycin,
-       mapping = aes(conc, rate, color = state)) +
-  geom_point() +
-  geom_smooth(method = "nls",
-              method.args = list(formula = y ~ (Vmax * x) / (k + x),
-                                 start = list(Vmax = 200, k = 0.05)),
-              se = FALSE)
-@
-
-In some cases it is desirable to annotate plots with fitted model equations or fitted parameters. One way of achieving this is by fitting the model and then extracting the parameters to manually construct text strings to use for text or label annotations. However, package \pkgname{ggpmisc} makes it possible to automate such annotations in many cases. This package also provides \ggstat{stat\_poly\_line()}, which is similar to \ggstat{stat\_smooth()} but with \code{method = "lm"} consistently as its default, irrespective of the number of observations.
-
-<<>>=
-my.formula <- y ~ x + I(x^2)
-p3 + aes(color = factor(cyl)) +
-  stat_poly_line(formula = my.formula, color = "black") +
-  stat_poly_eq(formula = my.formula, mapping = use_label(c("eq", "F")),
-               color = "black", label.x = "right")
-@
-
-This same package makes it possible to annotate plots with summary tables from a model fit. The argument passed to \code{tb.vars} replaces the names of the columns in the table.
-
-<<>>=
-ggplot(data = mtcars,
-       mapping = aes(x = disp, y = mpg, color = factor(cyl))) +
-  stat_poly_line(formula = my.formula, color = "black") +
-  stat_fit_tb(method.args = list(formula = my.formula),
-              color = "black",
-              tb.vars = c(Parameter = "term",
-                          Estimate = "estimate",
-                          "s.e." = "std.error",
-                          "italic(t)" = "statistic",
-                          "italic(P)" = "p.value"),
-              label.y = "top", label.x = "right") +
-  geom_point() +
-  expand_limits(y = 40)
-@
-
-Package \pkgname{ggpmisc} provides additional \emph{statistics} for the annotation of plots based on fitted models supported by package \pkgname{broom} and its extensions. It also supports lines and equations for quantile regression and major axis regression. Please see the package documentation for details.
-
-\index{plots!smooth curves|)}
-\index{plots!fitted curves|)}
-
-\subsection{Frequencies and counts}\label{sec:histogram}\label{sec:plot:histogram}
-\index{plots!histograms|(}
-
-When the number of observations is rather small, it is possible to rely on the density of graphical elements, such as points, to convey the density of the observations. For example, scatter plots using well-chosen values for transparency, \code{alpha}, can give a satisfactory impression of the density. Rug plots, described in section \ref{sec:plot:rug} on page \pageref{sec:plot:rug}, can also satisfactorily convey the density of observations along the $x$ and/or $y$ axes. Such approaches do not involve computations, while the \emph{statistics} described in this section do. Frequencies by value-range (or bins) and empirical density functions are summaries especially useful when the number of observations is large. These summaries can be computed in one or more dimensions.
-
-Histograms are defined by how the plotted values are calculated. Although histograms are most frequently plotted as bar plots, many bar or ``column'' plots are not histograms. Although rarely done in practice, a histogram could be plotted using a different \emph{geometry} together with \ggstat{stat\_bin()}, the \emph{statistic} used by default by \gggeom{geom\_histogram()}. This \emph{statistic} bins the observations before computing frequencies, and is suitable for observations on a continuous scale, usually mapped to the \code{x} aesthetic. When a factor is mapped to \code{x}, \ggstat{stat\_count()}, the default \code{stat} of \gggeom{geom\_bar()}, can be used instead. These two \emph{geometries} are described in this section about statistics because they default to using statistics different from \code{stat\_identity()} and consequently summarize the data.
-
-The code below constructs a data frame containing an artificial data set.
-
-<<>>=
-set.seed(54321)
-my.data <-
-  data.frame(X = rnorm(600),
-             Y = c(rnorm(300, -1, 1), rnorm(300, 1, 1)),
-             group = factor(rep(c("A", "B"), c(300, 300))) )
-@
-
-A default and usually suitable number of bins is automatically selected by the \ggstat{stat\_bin()} statistic; below, \code{bins = 15} sets the number of bins manually. In a histogram plot the variable mapped onto the \code{y} \emph{aesthetic} is not from \code{data} but is instead computed by the statistic as the number of observations falling in each \emph{bin}.
-
-<<>>=
-ggplot(data = my.data, mapping = aes(x = X)) +
-  geom_histogram(bins = 15)
-@
-
-\begin{explainbox}
-A reason to add layers with \gggeom{geom\_histogram()}, instead of with \ggstat{stat\_bin()} or \ggstat{stat\_count()}, is that its name is easier to remember.
-
-<<>>=
-ggplot(data = my.data,
-       mapping = aes(x = Y, fill = group)) +
-  stat_bin(bins = 15, position = "dodge")
-@
-\end{explainbox}
-
-The grouping created by mapping a factor to an additional \emph{aesthetic} results in two separate histograms. The position of the two groups of bars with respect to each other affects what information is most easily visible in the plot. The position is controlled with \emph{position} functions (see section \ref{sec:plot:positions} on page \pageref{sec:plot:positions} for details). With \code{position = "dodge"} bars are plotted side by side, with \code{position = "stack"}, the default, plotted one above the other, and with \code{position = "identity"} overlapping. In this last case passing \code{alpha = 0.5} ensures that occluded bars can be seen.
-
-<<>>=
-opts_chunk$set(opts_2fig_very_wide)
-@
-
-<<>>=
-p.base <-
-  ggplot(data = my.data,
-         mapping = aes(x = Y, fill = group))
-@
-
-<<>>=
-p1 <- p.base + geom_histogram(bins = 15, position = "dodge")
-@
-
-In addition to \code{count}, \code{density}, scaled so that the area of each histogram integrates to one, is returned, and is mapped below using \Rfunction{after\_stat()}.
-
-<<>>=
-p2 <- p.base + geom_histogram(mapping = aes(y = after_stat(density)),
-                              bins = 15, position = "dodge")
-@
-
-<<>>=
-p1 + p2
-@
-
-\emph{Statistic} \ggstat{stat\_bin2d()}, and its matching \emph{geometry} \gggeom{geom\_bin2d()}, by default compute a frequency histogram in two dimensions, along the \code{x} and \code{y} \emph{aesthetics}. The \code{count} for each 2D bin is mapped to the \code{fill} aesthetic, with a lighter-colored value being equivalent to a taller bar in a 1D histogram.
-
-<<>>=
-opts_chunk$set(opts_fig_wide)
-@
-
-<<>>=
-p.base <-
-  ggplot(data = my.data,
-         mapping = aes(x = X, y = Y)) +
-  facet_wrap(facets = vars(group))
-@
-
-<<>>=
-p.base + stat_bin2d(bins = 8)
-@
-
-\emph{Statistic} \ggstat{stat\_bin\_hex()}, and its matching \emph{geometry} \gggeom{geom\_hex()}, differ from \ggstat{stat\_bin2d()} only in their use of hexagonal instead of square bins and tiles.
-
-<<>>=
-p.base + stat_bin_hex(bins = 8)
-@
-
-Like \ggstat{stat\_bin()}, \ggstat{stat\_bin2d()} and \ggstat{stat\_bin\_hex()} compute \code{density} in addition to \code{count}, and densities can be plotted by mapping them to the \code{fill} aesthetic.
-\index{plots!histograms|)}
-
-\subsection{Density functions}\label{sec:plot:density}
-\index{plots!density plot!1 dimension|(}
-\index{plots!statistics!density}
-Empirical density functions are the continuous equivalent of a histogram: instead of being calculated using bins, they are estimated by smoothing. They can be estimated in 1 or 2 dimensions (1D or 2D). As with histograms, it is possible to use different \emph{geometries} with them. Examples of \gggeom{geom\_density()} used to create 1D density plots follow. A semitransparent fill is used in addition to color (the plots are shown side by side further below).
-
-<<>>=
-opts_chunk$set(opts_2fig_very_wide)
-@
-
-<<>>=
-p3 <-
-  ggplot(data = my.data,
-         mapping = aes(x = Y, color = group, fill = group)) +
-  geom_density(alpha = 0.3)
-@
-
-Changing the mapping from \code{x = Y} to \code{x = X} creates a density plot for \code{X}.
-
-<<>>=
-p4 <-
-  ggplot(data = my.data,
-         mapping = aes(x = X, color = group, fill = group)) +
-  geom_density(alpha = 0.3)
-@
-
-<<>>=
-p3 + p4
-@
-\index{plots!density plot!1 dimension|)}
-
-\index{plots!density plot!2 dimensions|(}
-\index{plots!statistics!density 2d}
-
-2D density plots, using the same data as the 1D plots above, follow.
-In the first example, \ggstat{stat\_density\_2d()} creates two 2D density ``maps'' shown using isolines, with \code{group} mapped to the \code{color} \emph{aesthetic}. Isolines can be used even when the empirical distributions overlap.
-
-<<>>=
-opts_chunk$set(opts_fig_narrow_square)
-@
-
-<<>>=
-ggplot(data = my.data,
-       mapping = aes(x = X, y = Y, color = group)) +
-  stat_density_2d()
-@
-
-Below, the 2D density for each group is plotted in a separate panel, with \code{level}, a variable computed by \code{stat\_density\_2d()}, mapped to the \code{fill} \emph{aesthetic}.
-
-<<>>=
-opts_chunk$set(opts_fig_wide)
-@
-
-<<>>=
-ggplot(data = my.data,
-       mapping = aes(x = X, y = Y)) +
-  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
-  facet_wrap(facets = vars(group))
-@
-
-<<>>=
-opts_chunk$set(opts_fig_narrow)
-@
-\index{plots!density plot!2 dimensions|)}
-
-\subsection{Box and whiskers plots}\label{sec:boxplot}
-\index{box plots|see{plots, box and whiskers plot}}
-\index{plots!box and whiskers plot|(}
-
-Box and whiskers plots, or just box plots, are summaries that convey some of the properties of a distribution. They are calculated and plotted with \ggstat{stat\_boxplot()} or the matching \gggeom{geom\_boxplot()}. Although box plots can be plotted based on just a few observations, they are not useful unless each box plot is based on more than 10 to 15 observations. In the next example every sixth row of the data frame \code{my.data}, which has \Sexpr{nrow(my.data)} rows, is used.
-
-<<>>=
-opts_chunk$set(opts_2fig_very_wide)
-@
-
-<<>>=
-p.base <-
-  ggplot(data = my.data[c(TRUE, rep(FALSE, 5)) , ],
-         mapping = aes(x = group, y = Y))
-@
-
-<<>>=
-p1 <- p.base + stat_boxplot()
-@
-
-As with other \emph{statistics}, their appearance obeys both \emph{aesthetics} such as \code{color}, and parameters specific to box plots: \code{outlier.color}, \code{outlier.fill}, \code{outlier.shape}, \code{outlier.size}, \code{outlier.stroke} and \code{outlier.alpha}, which affect outliers similarly to the equivalent \code{aesthetics}. The shape and width of the ``box'' can be adjusted with \code{notch}, \code{notchwidth} and \code{varwidth}. Notches in box plots play a role similar to that of confidence limits for means.
-
-<<>>=
-p2 <-
-  p.base +
-  stat_boxplot(notch = TRUE, width = 0.4,
-               outlier.color = "red", outlier.shape = "*", outlier.size = 5)
-@
-
-<<>>=
-p1 + p2
-@
-
-\index{plots!box and whiskers plot|)}
-
-\subsection{Violin plots}\label{sec:plot:violin}
-\index{plots!violin plot|(}
-
-Violin plots are a more recent development than box plots, and are usable with relatively large numbers of observations. They could be thought of as a sort of hybrid between an empirical density function (see section \ref{sec:plot:density} on page \pageref{sec:plot:density}) and a box plot (see section \ref{sec:boxplot} on page \pageref{sec:boxplot}). As is the case with box plots, they are particularly useful when comparing distributions of related data, side by side. They can be created with \gggeom{geom\_violin()} as shown in the examples below.
-
-<<>>=
-p3 <- p.base +
-  geom_violin(aes(fill = group), alpha = 0.16) +
-  geom_point(alpha = 0.33, size = 1.5, color = "black", shape = 21)
-@
-
-As with other \emph{geometries}, their appearance obeys both the usual \emph{aesthetics} such as color, and others specific to these types of visual representation.
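-
-As a minimal sketch of two such violin-specific adjustments (plot not shown), \code{trim = FALSE} extends the tails of the violins past the range of the observations, and \code{draw\_quantiles} marks the requested quantiles with line segments.
-
-<<violin-params-sketch>>=
-# un-trimmed tails, with quartiles and median marked by lines
-p.base +
-  geom_violin(trim = FALSE, draw_quantiles = c(0.25, 0.5, 0.75))
-@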
-
-Other types of displays related to violin plots are \emph{beeswarm} plots and \emph{sina} plots, which can be produced with \emph{geometries} defined in packages \pkgname{ggbeeswarm} and \pkgname{ggforce}, respectively. A minimal example of a beeswarm plot is shown below. See the documentation of these packages for details about the many options in their use.
-
-<<>>=
-p4 <- p.base + geom_quasirandom()
-@
-
-<<>>=
-p3 + p4
-@
-
-\index{plots!violin plot|)}
-\index{grammar of graphics!statistics|)}
-
-<<>>=
-opts_chunk$set(opts_fig_medium)
-@
-
-\section{Flipped plot layers}\label{sec:plot:flipped}
-\index{grammar of graphics!flipped axes|(}
-\index{grammar of graphics!swap axes}
-\index{grammar of graphics!orientation}
-\index{grammar of graphics!horizontal geometries}
-\index{grammar of graphics!horizontal statistics}
-
-Although it is the norm to design plots so that the independent variable is on the $x$ axis, i.e., mapped to the \code{x} aesthetic, there are situations where swapping the roles of $x$ and $y$ is useful. In \pkgname{ggplot2} this is described as \emph{flipping the orientation} of a plot or of a plot layer. In the present section I exemplify both cases where the flipping is automatic and cases where it requires user intervention. Some geometries like \gggeom{geom\_point()} are symmetric in the \textit{x} and \textit{y} aesthetics, but others like \gggeom{geom\_line()} operate differently on \textit{x} and \textit{y}. This is also the case for most \emph{statistics}.
-
-Starting from \ggplot version 3.3.5, most geometries and statistics for which it is meaningful support flipping using a new syntax. This new approach is different from the flipping of the coordinate system (which is expected to be deprecated in the future), and conceptually similar to that implemented by package \pkgname{ggstance}. However, instead of defining new horizontal layer functions as in \pkgname{ggstance}, in \ggplot the orientation of many layer functions can change. This has made package \pkgname{ggstance} nearly redundant and the coding of flipped plots easier and more intuitive. Although \ggplot has offered \ggcoordinate{coord\_flip()} for a long time, flipping of plot coordinates affects the whole plot rather than individual layers.
-
-When a factor is mapped to $x$ or $y$, flipping is automatic. A factor creates groups, and summaries are computed per group, i.e., per level of the factor, irrespective of the factor being mapped to the $x$ or the $y$ aesthetic. There are also cases that require user intervention: flipping must be requested manually if both $x$ and $y$ are mapped to continuous variables. This is, for example, the case with \ggstat{stat\_smooth()} and with \gggeom{geom\_line()}.
-
-\begin{figure}
-  \centering%
-  {\sffamily%
-\resizebox{0.8\linewidth}{!}{%
-\begin{tikzpicture}[auto]
-  \node [b] (data) {data};
-  \node [bo, right = of data] (statistic) {statistic};
-  \node [b, right = of statistic] (geometry) {geometry};
-  \node [b, right = of geometry] (render) {rendered\\plot};
-
-  \path [ll] (statistic) -- (data) node[near end,above]{\ \ \ \ \ \ $x \rightleftarrows y$};
-  \path [ll] (geometry) -- (statistic) node[near end,above]{\ \ \ \ \ \ $y \rightleftarrows x$};
-  \path [ll] (render) -- (geometry) node[near end,above]{};
-  \end{tikzpicture}}\\[2.5ex]
-  \resizebox{0.8\linewidth}{!}{%
-\begin{tikzpicture}[auto]
-  \node [b] (data) {data};
-  \node [b, right = of data] (statistic) {statistic};
-  \node [bo, right = of statistic] (geometry) {geometry};
-  \node [b, right = of geometry] (render) {rendered\\plot};
-
-  \path [ll] (statistic) -- (data) node[near end,above]{};
-  \path [ll] (geometry) -- (statistic) node[near end,above]{\ \ \ \ \ \ $x \rightleftarrows y$};
-  \path [ll] (render) -- (geometry) node[near end,above]{\ \ \ \ \ \ $y \rightleftarrows x$};
-  \end{tikzpicture}}}
-  \caption[Flipped layers diagram]{Flipped layers. Top diagram, flipped aesthetics in a statistic with \code{orientation = "y"}; bottom diagram, flipped aesthetics in a geometry with \code{orientation = "y"}. During flipping, related aesthetics such as \code{xmin} and \code{ymin} are also swapped, but this is not shown in the diagram.}\label{fig:plot:flip:stat}
-\end{figure}
-
-In statistics, passing \code{orientation = "y"} as argument results in calculations applied after swapping the mappings of the \code{x} and \code{y} aesthetics. After applying the calculations, the mappings of the $x$ and $y$ and related aesthetics are swapped back (Figure \ref{fig:plot:flip:stat}).
-
-In geometries, passing \code{orientation = "y"} also results in flipping of the aesthetics (Figure \ref{fig:plot:flip:stat}). For example, in \gggeom{geom\_line()}, flipping changes the drawing of the lines. Normally observations are sorted along the $x$ axis before drawing the line segments connecting them. After flipping, as $x$ and $y$ are swapped, observations are sorted along the $y$ axis before drawing the connecting segments. The variables shown on each axis remain the same, as does the position of points drawn with \gggeom{geom\_point()}, but the line connecting them is different: in the example below only two segments are the same in the flipped plot and in the ``normal'' one.
-
-<<>>=
-opts_chunk$set(opts_2fig_very_wide)
-@
-
-<<>>=
-p.base <-
-  ggplot(data = mtcars[1:8, ],
-         mapping = aes(x = hp, y = mpg)) +
-  geom_point()
-p1 <- p.base + geom_line() + ggtitle("Not flipped")
-p2 <- p.base + geom_line(orientation = "y") + ggtitle("Flipped")
-p1 + p2
-@
-
-The next pair of examples demonstrates automatic flipping using \ggstat{stat\_boxplot()}. Factor \code{Species} is mapped first to $x$ and then to $y$. In both cases the same boxplots are computed and plotted for each level of the factor. Statistics \ggstat{stat\_boxplot()}, \ggstat{stat\_summary()}, \ggstat{stat\_bin()} and \ggstat{stat\_density()} behave similarly with respect to flipping.
-
-<<>>=
-p3 <-
-  ggplot(data = iris,
-         mapping = aes(x = Species, y = Sepal.Length)) +
-  stat_boxplot()
-@
-
-<<>>=
-p4 <-
-  ggplot(data = iris,
-         mapping = aes(x = Sepal.Length, y = Species)) +
-  stat_boxplot()
-@
-
-<<>>=
-p3 + p4
-@
-
-In the case of \code{stats} that do computations on a single variable mapped to the \code{x} or \code{y} aesthetic, flipping is also automatic.
-
-<<>>=
-p5 <-
-  ggplot(data = iris,
-         mapping = aes(x = Sepal.Length, color = Species)) +
-  stat_density(geom = "line", position = "identity")
-@
-
-<<>>=
-p6 <-
-  ggplot(data = iris,
-         mapping = aes(y = Sepal.Length, color = Species)) +
-  stat_density(geom = "line", position = "identity")
-@
-
-<<>>=
-p5 + p6
-@
-
-<<>>=
-opts_chunk$set(opts_fig_narrow)
-@
-
-\begin{explainbox}
-In the case of ordinary least squares (OLS), regressions of $y$ on $x$ and of $x$ on $y$ in most cases yield different fitted lines, even though $R^2$ remains the same. This is due to the assumption that $x$ values are known, either set or measured without error, i.e., not subject to uncertainty. Under this assumption, all unexplained variation in the data is attributed to $y$. See Chapter \ref{chap:R:case:fitted:models} on page \pageref{chap:R:case:fitted:models} or consult a Statistics book such as \citetitle{Holmes2019} \autocite[][pp.\ 168--170]{Holmes2019} for additional information.
-\end{explainbox}
-
-With two continuous variables mapped, the default is to take $x$ as independent and $y$ as dependent. Passing \code{"x"} (the default) or \code{"y"} as argument to parameter \code{orientation} indicates which of $x$ or $y$ is the independent or explanatory variable.
-
-<<>>=
-opts_chunk$set(opts_2fig_very_wide)
-@
-
-<<>>=
-p.base <-
-  ggplot(data = iris,
-         mapping = aes(x = Sepal.Length, y = Petal.Length)) +
-  geom_point() +
-  facet_wrap(~Species, scales = "free")
-@
-
-<<>>=
-p.base + stat_smooth(method = "lm", formula = y ~ x)
-@
-
-Passing \code{orientation = "y"} to \gggeom{geom\_smooth()} is equivalent to swapping $x$ and $y$ in the model \code{formula}. The looser the correlation, the more different are the lines fitted before and after flipping.
-
-<<>>=
-p.base + stat_smooth(method = "lm", formula = y ~ x, orientation = "y")
-@
-
-The two variables in the example above are both response variables, not directly connected by cause and effect, and with measurements subject to similar errors. Neither of the two fitted models comes close to fulfilling the assumptions.
-
-\begin{explainbox}
-Flipping the orientation of plot layers with \code{orientation = "y"} is not equivalent to flipping the whole plot with \ggcoordinate{coord\_flip()}. In the first case, which axis is considered independent for computations changes, but not the positions of the axes in the plot; in the second case, the positions of the $x$ and $y$ axes in the plot are swapped. So, when coordinates are flipped, the $x$ aesthetic is plotted on the vertical axis and the $y$ aesthetic on the horizontal axis, but the role of the variable mapped to the \code{x} aesthetic remains that of explanatory variable. (Use of \ggcoordinate{coord\_flip()} will likely be deprecated in the future.)
-
-<<>>=
-p.base +
-  stat_smooth(method = "lm", formula = y ~ x) +
-  coord_flip()
-@
-\end{explainbox}
-
-In package \ggpmisc (version $\geq$ 0.4.1) statistics related to model fitting have an \code{orientation} parameter, as those from package \ggplot do, but in addition they accept formulas in which $x$ is on the lhs and $y$ on the rhs, such as \code{formula = x \~{} y}, providing a syntax consistent with \Rlang's model fitting functions. With two calls to \ggstat{stat\_poly\_line()}, the first using the default \code{formula = y \~{} x}, and the second using \code{formula = x \~{} y} to force the flipping of the fitted model, the plot below contains two fitted lines per panel, with the flipped ones highlighted as red lines and yellow bands.
-
-<<>>=
-p.base +
-  stat_poly_line() +
-  stat_poly_line(formula = x ~ y, color = "red", fill = "yellow")
-@
-
-In\index{plots!major axis regression}\label{par:ma:example} the case of the \code{iris} data used for these examples, both approaches to linear regression used above are wrong. The correct approach is to not assume that there is a variable that can be considered independent and another dependent on it, but instead to use a method like major axis (MA) regression, as below.
-
-<<>>=
-p.base + stat_ma_line()
-@
-
-%A related problem is when we need to summarize in the same plot layer $x$ and $y$ values. A simple example is adding a point with coordinates given by the means along the $x$ and $y$ axes as we need to pass these computed means simultaneously to \gggeom{geom\_point()}. Package \ggplot provides \ggstat{stat\_density\_2d()} and \ggstat{stat\_summary\_2d()}. However, \ggstat{stat\_summary\_2d()} uses bins, and is similar to \ggstat{stat\_density\_2d()} in how the computed values are returned. Package \pkgname{ggpmisc} provides two dimensional equivalents of \ggstat{stat\_summary()}: \ggstat{stat\_centroid()}, which applies the same summary function along $x$ and $y$, and \ggstat{stat\_summary\_xy()}, which accepts one function for $x$ and one for $y$.
-%
-%<<>>=
-%ggplot(data = iris,
-%       mapping = aes(x = Sepal.Length, y = Petal.Length)) +
-%  geom_point() +
-%  stat_centroid(color = "red") +
-%  facet_wrap(~Species, scales = "free")
-%@
-%
-%<<>>=
-%ggplot(data = iris,
-%       mapping = aes(x = Sepal.Length, y = Petal.Length)) +
-%  geom_point() +
-%  stat_centroid(geom = "rug", sides = "trbl",
-%                color = "red", linewidth = 1.5) +
-%  facet_wrap(~Species, scales = "free")
-%@
-%
-%\begin{playground}
-%Which of the plots in the last two chunks above can be created by adding two layers with \ggstat{stat\_summary()}? Recreate this plot using \ggstat{stat\_summary()}.
-%\end{playground}
-%
-\index{grammar of graphics!flipped axes|)}
-
-<<>>=
-opts_chunk$set(opts_fig_narrow)
-@
-
-\section{Facets}\label{sec:plot:facets}
-\index{grammar of graphics!facets|(}
-\index{plots!trellis-like}\index{plots!coordinated panels}
-Facets are used in a special kind of plot containing multiple panels in which the panels share some properties. These sets of coordinated panels are a useful tool for visualizing complex data. These plots became popular through the \code{trellis} graphs in \langname{S}, and the \pkgname{lattice} package in \Rlang. The basic idea is to have rows and/or columns of plots with common scales, all plots showing values for the same response variable. This is useful when there are multiple classification factors in a data set. Similar-looking plots, but with free scales or with the same scale but a `floating' intercept, are sometimes also useful. In \ggplot there are two possible types of facets: facets organized in a grid, and facets along a single `axis' of variation, possibly wrapped into two or more rows. These are produced by adding \Rfunction{facet\_grid()} or \Rfunction{facet\_wrap()}, respectively. Below, \gggeom{geom\_point()} is used in the examples, but faceting can be used with plots containing layers created with any \code{geom} or \code{stat}.
-
-<<>>=
-opts_chunk$set(opts_fig_wide)
-@
-
-A single-panel plot, saved as \code{p.base}, will be used throughout this section to demonstrate how the same plot changes when facets are added.
-
-<<>>=
-p.base <-
-  ggplot(data = mtcars,
-         mapping = aes(x = wt, y = mpg)) +
-  geom_point()
-p.base
-@
-
-A grid of panels has two dimensions, \code{rows} and \code{cols}. These dimensions in the grid of plot panels can be ``mapped'' to factors. Until recently a formula syntax was the only one available. Although this notation has been retained, the preferred syntax is currently to use the parameters \code{rows} and \code{cols}. The argument passed to \code{cols} in this example is factor \code{cyl} retrieved from \code{data} with a call to \code{vars()}. The ``headings'' of the panels, or \emph{strip labels}, are by default the names or labels of the levels of the factor.
-
-<<>>=
-p.base + facet_grid(cols = vars(cyl))
-@
-
-Using \Rfunction{facet\_wrap()} the same plot can be coded as follows.
-
-<<>>=
-p.base + facet_wrap(facets = vars(cyl), nrow = 1)
-@
-
-By default, all panels share the same scale limits and share the plotting space evenly, but these defaults can be overridden.
-
-<<>>=
-p.base + facet_wrap(facets = vars(cyl), nrow = 1, scales = "free_y")
-@
-
-<<>>=
-p.base + facet_grid(cols = vars(cyl), scales = "free_y", space = "free_y")
-@
-
-Margins, added with \code{margins = TRUE}, display an additional column or row of panels with the combined data.
-
-<<>>=
-p.base + facet_grid(cols = vars(cyl), margins = TRUE)
-@
-
-<<>>=
-opts_chunk$set(opts_fig_narrow_square)
-@
-
-To obtain a 2D grid, factors have to be passed as arguments to both \code{rows} and \code{cols}.
-
-<<>>=
-p.base + facet_grid(rows = vars(vs), cols = vars(am), labeller = label_both)
-@
-
-<<>>=
-opts_chunk$set(opts_fig_wide)
-@
-
-Each faceting dimension can be mapped to more than one factor, as below. As the levels are not self-explanatory, \code{label\_both} is passed as argument to \code{labeller} so that factor names are included in the \emph{strip labels} together with the levels.\qRfunction{label\_both()}
-
-<<>>=
-p.base + facet_grid(cols = vars(vs, am), labeller = label_both)
-@
-
-When faceting generates many panels, wrapping them into several rows helps keep the shape of the whole plot manageable. In this example the number of levels is small, and no wrapping takes place by default. In cases when more panels are present, wrapping into two or more continuation rows is the default. Here, wrapping is forced with \code{nrow = 2}. When using \Rfunction{facet\_wrap()} there is only one dimension, and the parameter is called \code{facets}, instead of \code{rows} or \code{cols}.
-
-<<>>=
-opts_chunk$set(opts_fig_narrow_square)
-@
-
-<<>>=
-p.base + facet_wrap(facets = vars(cyl), nrow = 2)
-@
-
-<<>>=
-opts_chunk$set(opts_fig_wide)
-@
-
-\begin{explainbox}
-By default panel headings are the names of the levels of the factor they are based on. Changing these names is one way of changing the labels. This approach can be used to add mathematical expressions or Greek letters in the panel headings. Below, the factor labels in the data frame passed as argument to \code{data} are first set to strings that can be parsed into plotmath expressions. Then, in the call to \Rfunction{facet\_grid()}, or to \Rfunction{facet\_wrap()}, a function definition, \code{label\_parsed}, is passed as argument to \code{labeller}.\qRfunction{label\_parsed()}
-
-<<>>=
-mtcars$cyl12 <- factor(mtcars$cyl,
-                       labels = c("alpha", "beta", "sqrt(x, y)"))
-ggplot(data = mtcars,
-       mapping = aes(mpg, wt)) +
-  geom_point() +
-  facet_grid(cols = vars(cyl12), labeller = label_parsed)
-@
-
-The labels of the levels of the factor used in faceting can also be combined with a character string template.
Passing as argument to \code{labeller} function \Rfunction{label\_bquote()} and using a plotmath expression as its argument makes this possible. See section \ref{sec:plot:plotmath} for an example of the use of \code{bquote()}, the \Rlang function on which \Rfunction{label\_bquote()} is built.
-
-<>=
-p.base +
-  facet_grid(cols = vars(cyl),
-             labeller = label_bquote(cols = .(cyl)~"cylinders"))
-@
-\end{explainbox}
-
-\index{grammar of graphics!facets|)}
-
-\section{Positions}\label{sec:plot:positions}
-
-<>=
-opts_chunk$set(opts_2fig_very_wide)
-@
-
-Position functions are passed as argument to the \code{position} parameter of \code{geoms}. They displace the positions (the values mapped to \code{x} and/or \code{y} aesthetics) away from their original position. Different position functions differ in what displacement is applied. Table \ref{tab:plots:position} lists most of the position functions available. Functions \ggposition{position\_stack()} and \ggposition{position\_fill()} were already described on page \pageref{par:plot:pos:stack}, with stacked column and area plots. Function \ggposition{position\_dodge()} was used in plots with side-by-side columns on page \pageref{par:plot:pos:dodge} and \ggposition{position\_jitter()} was used in dot plot examples on page \pageref{par:plot:pos:jitter}.
-
-\begin{table}
-  \caption[Positions]{Position functions from packages \ggplot and \ggpp. The table is divided into two sections. A. Positions that only return the modified $x$ and $y$ values. B. Equivalent positions that additionally return a copy of the unmodified $x$ and $y$ values. The last column describes the type of displacement: fixed uses constant values supplied in the call; random uses random values for the displacement, within a maximum distance set by the user.}\vspace{1ex}\label{tab:plots:position}
-  \centering
-  \noindent
-  \begin{tabular}{@{}llp{6.25cm}l@{}}
-    \toprule
-    Position & Package & Parameters & Displ. \\
-    \midrule
-    A. \textit{Origin not kept} & & & \\ \addlinespace
-    \code{position\_identity} & \ggplot & --- & none \\
-    \code{position\_stack} & \ggplot & vjust, reverse & fixed \\
-    \code{position\_fill} & \ggplot & vjust, reverse & fixed \\
-    \code{position\_dodge} & \ggplot & width, preserve, padding, reverse & fixed \\
-    \code{position\_dodge2} & \ggplot & width, preserve, padding, reverse & fixed \\
-    \code{position\_jitter} & \ggplot & width, height, seed & rand. \\
-    \code{position\_nudge} & \ggplot & x, y & fixed \\
-    \midrule
-    B. \textit{Origin kept} & & & \\ \addlinespace
-    \code{position\_stack\_keep} & \ggpp & vjust, reverse & fixed \\
-    \code{position\_fill\_keep} & \ggpp & vjust, reverse & fixed \\
-    \code{position\_dodge\_keep} & \ggpp & width, preserve, padding, reverse & fixed \\
-    \code{position\_dodge2\_keep} & \ggpp & width, preserve, padding, reverse & fixed \\
-    \code{position\_jitter\_keep} & \ggpp & width, height, seed & rand. \\
-    \code{position\_nudge\_keep} & \ggpp & x, y & fixed \\
-%    \midrule
-%    C. \textit{Computed and kept} & & & \\ \addlinespace
-%    \code{position\_nudge\_to} & \ggpp & \raggedright x, y, x.action, y.action, kept.origin & comp. \\
-%    \code{position\_nudge\_line} & \ggpp & \raggedright x, y, xy\_relative, abline, method, formula, direction, line\_nudge, kept.origin & comp. \\
-%    \code{position\_nudge\_center} & \ggpp & \raggedright x, y, center\_x, center\_y, direction, obey\_grouping, kept.origin & comp. \\
-%    \midrule
-%    D. 
\textit{Combined and kept} & & & \\ \addlinespace
-%    \code{position\_stacknudge} & \ggpp & \raggedright vjust, reverse, x, y, direction, kept.origin & fixed \\
-%    \code{position\_fillnudge} & \ggpp & \raggedright vjust, reverse, x, y, direction, kept.origin & fixed \\
-%    \code{position\_dodgenudge} & \ggpp & \raggedright width, preserve, x, y, direction, kept.origin & fixed \\
-%    \code{position\_dodge2nudge} & \ggpp & \raggedright width, preserve, x, y, direction, kept.origin & fixed \\
-%    \code{position\_jitternudge} & \ggpp & \raggedright width, height, seed, x, y, direction, nudge.from, kept.origin & mix. \\
-    \bottomrule
-  \end{tabular}
-\end{table}
-
-The difference between \ggposition{position\_stack()} and \ggposition{position\_fill()} is illustrated by the example below.
-
-<>=
-p.base <-
-  ggplot(data = Orange,
-         mapping = aes(x = age, y = circumference, fill = Tree))
-@
-
-<>=
-p1 <- p.base + geom_area(position = "stack", color = "white", linewidth = 1) +
-  ggtitle("stack")
-p2 <- p.base + geom_area(position = "fill", color = "white", linewidth = 1) +
-  ggtitle("fill")
-@
-
-<>=
-p1 + p2
-@
-
-Position \ggposition{position\_nudge()} is used to consistently displace positions, and is most frequently used with \gggeom{geom\_text()} and \gggeom{geom\_label()} when adding data labels. When position functions are used to add data labels, it is common to add a segment linking the data point to the label. For this to be possible, position functions have to keep the original position. Position functions from package \ggplot discard the original positions, while the position functions from packages \ggpp and \ggrepel keep them in \code{data} under a different name. Table \ref{tab:plots:position} is divided into sections. The only difference between the position functions in the two sections of the table is in whether the original position is kept or not, i.e., those from package \ggpp are backwards compatible with those from package \ggplot.
-
-The displacements introduced by jitter and nudge differ in that jitter is random, and nudge deterministic. In each case the displacement can be separately adjusted vertically and horizontally. Jitter, as shown above, is useful when we want to make overlapping points visible. Nudge is most frequently used with data labels to avoid occluding points or other graphical features.
-
-Layer function \gggeom{geom\_point\_s()} from package \pkgname{ggpp} is used below to make the displacement visible by drawing an arrow connecting the original and displaced positions of each observation. We need to use the \code{\_keep} flavor of the position functions for arrows to be drawn.
-
-<>=
-p.base <-
-  ggplot(data = mtcars,
-         mapping = aes(x = factor(cyl), y = mpg)) +
-  geom_point(colour = "blue")
-p3 <- p.base +
-  geom_point_s(position = position_jitter_keep(width = 0.35, height = 0.6),
-               colour = "red") +
-  ggtitle("jitter")
-@
-
-The amount of nudging is set by a distance expressed in data units through parameters \code{x} and \code{y}. (Factors have mode \code{numeric} and each level is represented by an integer, thus the distance between consecutive levels of a factor is 1.)
-
-<>=
-p4 <- p.base +
-  geom_point_s(position = position_nudge_keep(x = 0.25, y = 1),
-               colour = "red") +
-  ggtitle("nudge")
-@
-
-<>=
-p3 + p4
-@
-
-<>=
-opts_chunk$set(opts_fig_narrow)
-@
-
-\section{Scales}\label{sec:plot:scales}
-\index{grammar of graphics!scales|(}
-
-In earlier sections of this chapter, most examples have used the default \emph{scales}. In this section I describe in more detail the use of \emph{scales}. 
There are \emph{scales} available for all the different \emph{aesthetics} recognized by \code{geoms}, such as position aesthetics (\code{x, y, z}), \code{size}, \code{shape}, \code{linewidth}, \code{linetype}, \code{color}, \code{fill}, \code{alpha} or transparency, and \code{angle}. Scales determine how values in \code{data} are mapped to values of an \emph{aesthetic}, and, optionally, also how these values are labeled.
-
-Depending on the characteristics of the variables in \code{data} being mapped, \emph{scales} can be continuous or discrete, for \code{numeric} or \code{factor} variables in \code{data}, respectively. Some \emph{aesthetics}, like \code{size} and \code{color}, are inherently continuous but others like \code{linetype} and \code{shape} are inherently discrete. For inherently continuous aesthetics, both discrete and continuous scales are available, while, obviously, for those inherently discrete only discrete scales are available.
-
-The scales used by default have default mappings of data values to aesthetic values (e.g., which color value corresponds to $cyl = 4$ and which one to $cyl = 8$). For each \emph{aesthetic} such as \code{color}, there are multiple scales to choose from when creating a plot, both continuous and discrete (e.g., 20 different color scales in \ggplot 3.4.3). In addition, some scales implement multiple palettes.
-
-\begin{warningbox}
-As seen in previous sections, \emph{aesthetics} in a plot layer, in addition to being determined by mappings, can also be set to constant values. Aesthetics set to constant values are not mapped to data, and are consequently independent of scales.
-\end{warningbox}
-
-The most direct mapping to data is \code{identity}, with the values in the mapped variable directly interpreted as aesthetic values. In a color scale, say \ggscale{scale\_color\_identity()}, the variable in the data would be encoded with values such as \code{"red"}, \code{"blue"}---i.e., valid \Rlang colours. In a simple mapping using \ggscale{scale\_color\_discrete()}, levels of a factor, such as \code{"treatment"} and \code{"control"}, would be represented as distinct colours with the correspondence between factor levels and individual colours set automatically. In contrast, with \code{scale\_color\_manual()} the user explicitly provides the mapping between factor levels and colours by passing arguments to the scale functions' parameters \code{breaks} and \code{values}.
-
-The details of the mapping of a continuous variable to an \emph{aesthetic} are controlled with a continuous scale such as \code{scale\_color\_continuous()}. In this case values in a \code{numeric} variable will be mapped into a continuous range of colours. How the correspondence between numeric values and colors is controlled can vary among scales. In the case of color, some scales use complex palettes, while others implement simple gradients between two or three colours.
-
-\begin{explainbox}
-In some scales missing values, or \code{NA}, can be assigned an aesthetic value, such as color, while in other cases \code{NA} values are always skipped instead of plotted. The reverse, mapping values in data to \code{NA} as aesthetic value, is in some cases also possible.
-\end{explainbox}
-
-\subsection{Axis and key labels}\label{sec:plot:scale:name}\label{sec:plot:labs}
-\index{plots!labels|(}
-\index{plots!title|(}
-\index{plots!subtitle|(}
-\index{plots!tag|(}
-\index{plots!caption|(}
-First I describe a feature common to all scales, their \code{name}. 
The default \code{name} of all scales is the name of the variable or the expression mapped to it. In the case of the \code{x}, \code{y} and \code{z} \emph{aesthetics} the \code{name} given to the scale is used for the axis labels. For other \emph{aesthetics} the name of the scale becomes the ``heading'' or \emph{key title} of the guide or key. All scales have a \code{name} parameter to which a character string or an \Rlang expression (see section \ref{sec:plot:plotmath}) can be passed as an argument to override the default. In scales that add a key or guide, passing \code{guide = "none"} to the scale function removes the key corresponding to the scale.
-
-Convenience functions \Rfunction{xlab()} and \Rfunction{ylab()} can be used to set the axis labels.
-Convenience function \Rfunction{labs()} can be used to manually set axis labels, key/guide titles, and the title and other labels for the plot as a whole. For the names of scales, \Rfunction{labs()} accepts the names of aesthetics as if they were formal parameters, and uses \code{title}, \code{subtitle}, \code{caption}, \code{tag}, and \code{alt} for the labels of the plot as a whole. The text passed to \code{alt} is not visible in the plot but is expected to be made available to web browsers and used to enhance accessibility. (The sizes of the title and subtitle may seem too big; see section \ref{sec:plot:themes} on page \pageref{sec:plot:themes} on how to change the theme.)
-
-<>=
-p.base <-
-  ggplot(data = Orange,
-         mapping = aes(x = age, y = circumference, color = Tree)) +
-  geom_line() +
-  geom_point()
-@
-
-<>=
-p.base +
-  expand_limits(y = 0) +
-  labs(title = "Growth of orange trees",
-       subtitle = "Starting from 1968-12-31",
-       caption = "see Draper, N. R. and Smith, H. (1998)",
-       tag = "A",
-       alt = "A data plot",
-       x = "Time (d)",
-       y = "Circumference (mm)",
-       color = "Tree\nnumber")
-@
-
-When passing names directly to scales, the plot title and subtitle can be added with function \Rfunction{ggtitle()} by passing either character strings or \Rlang expressions as arguments.
-
-<>=
-p.base +
-  expand_limits(y = 0) +
-  scale_x_continuous(name = "Time (d)") +
-  scale_y_continuous(name = "Circumference (mm)") +
-  ggtitle(label = "Growth of orange trees",
-          subtitle = "Starting from 1968-12-31")
-@
-
-\begin{playground}
-Make an empty plot (\code{ggplot()}) and add to it as title an \Rlang expression producing $y = b_0 + b_1 x + b_2 x^2$. (Hint: have a look at the examples for the use of expressions in the \code{plotmath} demo in \Rlang by typing \code{demo(plotmath)} at the \Rlang console.)
-\end{playground}
-
-\index{plots!tag|)}
-\index{plots!caption|)}
-\index{plots!subtitle|)}
-\index{plots!title|)}
-\index{plots!labels|)}
-
-\subsection{Continuous scales}\label{sec:plot:scales:continuous}
-\index{grammar of graphics!continuous scales|(}
-I start by listing the most frequently used arguments to the continuous scale functions: \code{name}, \code{breaks}, \code{minor\_breaks}, \code{labels}, \code{limits}, \code{expand}, \code{na.value}, \code{trans}, \code{guide}, and \code{position}. The value of \code{name} is used for axis labels or the key title (see previous section). The arguments to \code{breaks} and \code{minor\_breaks} override the default locations of major and minor ticks and grid lines. Setting them to \code{NULL} suppresses the ticks. By default the tick labels are generated from the value of \code{breaks}, but an argument to \code{labels} of the same length as \code{breaks} will replace these defaults. 
The values of \code{limits} determine both the range of values in the data included and the plotting area as described above---by default the out-of-bounds (\code{oob}) observations are replaced by \code{NA}, but it is possible to instead ``squish'' these observations towards the edge of the plotting area, as shown below. The argument to \code{expand} determines the size of the margins or padding added to the area delimited by the \code{limits} when setting the ``visual'' plotting area. The value passed to \code{na.value} is used as a replacement for \code{NA} valued observations---most useful for \code{color} and \code{fill} aesthetics. The transformation object passed as an argument to \code{trans} determines the transformation used---the transformation affects the rendering, but breaks and tick labels remain expressed in the original data units. The argument to \code{guide} determines the type of key or removes the default key. Depending on the scale in question, not all these parameters are available. A family of continuous scales, \emph{binned scales}, was added in \ggplot 3.3.0. These scales map a continuous variable from \code{data} onto a discrete gradient of aesthetic values, but are otherwise very similar.
-
-The code below constructs data frame \code{fake2.data}, containing artificial data.
-
-<>=
-fake2.data <-
-  data.frame(y = c(rnorm(20, mean = 20, sd = 5),
-                   rnorm(20, mean = 40, sd = 10)),
-             group = factor(c(rep("A", 20), rep("B", 20))),
-             z = rnorm(40, mean = 12, sd = 6))
-@
-
-\subsubsection{Limits}
-
-Limits are relevant to all kinds of \emph{scales}. Limits are set through parameter \code{limits} of the different scale functions. They can also be set with convenience functions \code{xlim()} and \code{ylim()} in the case of the \code{x} and \code{y} \emph{aesthetics}, and more generally with function \code{lims()} which, like \code{labs()}, takes arguments named according to the name of the \emph{aesthetics}. The \code{limits} argument of scales accepts vectors, factors or a function computing them from \code{data}. In contrast, the convenience functions do not accept functions as their arguments.
-
-In the next example, by setting ``hard'' limits, some observations are excluded from the plot: they are not seen by \code{stats} and \code{geoms}, i.e., hard limits in scales subset observations in \code{data} at the \code{start} stage (see Figure \ref{fig:ggplot:stages} on page \pageref{fig:ggplot:stages}). More precisely, the off-limits observations are converted to \code{NA} values before they are passed as \code{data} to \code{stats}, and subsequently discarded with a warning.
-
-<>=
-opts_chunk$set(opts_2fig_very_wide)
-@
-
-<>=
-p1.base <-
-  ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
-  geom_point()
-@
-
-<>=
-p1 <- p1.base + scale_y_continuous(limits = c(0, 100))
-@
-
-To set only one limit leaving the other free, \code{NA} is used as a boundary.
-
-<>=
-p2 <- p1.base + scale_y_continuous(limits = c(50, NA))
-@
-
-<>=
-p1 + p2
-@
-
-Convenience functions \Rfunction{ylim()} and \Rfunction{xlim()} can be used to set the limits of the default $x$ and $y$ scales in use. Below, \Rfunction{ylim()} is used, but \Rfunction{xlim()} works identically except for the scale it modifies (plot identical to \code{p2} above, not shown).
-
-<>=
-p1.base + ylim(50, NA)
-@
-
-In general, setting hard limits should be avoided, even though a warning is issued when \code{NA} values are omitted, as it is easy to unwillingly subset the data being plotted. 
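-
-As an alternative to discarding out-of-bounds observations, they can be ``squished'' to the edge of the plotting area by passing a different out-of-bounds function to parameter \code{oob}. This is a minimal sketch assuming function \code{squish()} from package \pkgname{scales} is available; it replaces out-of-bounds values by the nearest limit instead of by \code{NA} (plot not shown).
-
-<>=
-# observations with y < 50 are moved to y = 50 instead of being dropped
-p1.base + scale_y_continuous(limits = c(50, NA), oob = scales::squish)
-@
-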
-To expand the limits rather than restrict them, it is preferable to use function \Rfunction{expand\_limits()} as it safely \emph{expands} the dynamically computed default limits of a scale---the scale limits will grow past the requested expanded limits when needed to accommodate all observations. The arguments to \code{x} and \code{y} are numeric vectors of length one or two each, matching how the limits of the $x$ and $y$ continuous scales are defined. Below, the limits are expanded to include the origin.
-
-<>=
-opts_chunk$set(opts_fig_medium)
-@
-
-<>=
-p1.base + expand_limits(y = 0, x = 0)
-@
-
-The \code{expand} parameter of the scales plays a different role than \Rfunction{expand\_limits()}. It adds a ``margin'' or padding around the plotting area. The actual plotting area is given by the scale limits, set either dynamically or manually. Only very rarely are plots drawn so that observations sit on top of the axes; avoiding this is a key role of \code{expand}. Rug plots and marginal annotations can make it necessary to expand the plotting area more than the default of 5\% on each margin.
-
-In the example below, the upper edge of the plotting area is expanded by adding 0.01 units of padding and the expansion at the bottom is set to zero.
-
-<>=
-opts_chunk$set(opts_2fig_very_wide)
-@
-
-<>=
-p2.base <-
-  ggplot(data = fake2.data,
-         mapping = aes(fill = group, color = group, x = y)) +
-  stat_density(alpha = 0.3, position = "identity")
-@
-
-<>=
-p1 <-
-  p2.base + scale_y_continuous(expand = expansion(add = c(0, 0.01)))
-@
-
-Using multipliers has the advantage that the expansion is proportional. A similar effect to that above is achieved using multipliers: 10\% of the range of the \code{limits} at the top and none at the bottom.
-
-<>=
-p2 <-
-  p2.base + scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
-@
-
-<>=
-p1 + p2
-@
-
-\begin{playground}
-Compare the rendered plot from \code{p2.base} to \code{p1} and \code{p2} displayed above. What has been the effect of using \Rfunction{expansion()}? Try different values as arguments for \code{add} and \code{mult}.
-\end{playground}
-
-The direction of a scale can be reversed using a transformation (see section \ref{sec:plot:scales:trans} on page \pageref{sec:plot:scales:trans}). Scales \ggscale{scale\_x\_reverse()} and \ggscale{scale\_y\_reverse()} use by default the necessary transformation. However, inconsistently, \Rfunction{xlim()} and \Rfunction{ylim()} can be used to reverse the scale direction by passing the numeric values for the limits in decreasing order.
-
-\begin{playground}
-Test what the result is when the first limit is larger than the second one. Is it the same as when setting these same values as limits with \code{ylim()}? or by replacing \code{scale\_y\_continuous()} with \code{scale\_y\_reverse()}?
-
-<>=
-p1.base + scale_y_continuous(limits = c(100, 0))
-@
-\end{playground}
-
-\subsubsection{Breaks and their labels}\label{sec:plot:scales:ticks}
-
-Parameter \code{breaks}\index{plots!scales!tick breaks} is used not only to set the location of ticks along the axis in scales for the \code{x} and \code{y} aesthetics, but also for the keys or guides of other continuous scales such as those for color. Parameter \code{labels}\index{plots!scales!tick labels} is used to set the break labels, including tick labels. The argument passed to each of these parameters can be a vector or a function. The default is to compute ``good'' breaks based on the limits and to use nice numbers suitable for labels. 
Examples in this section are for continuous scales; see section \ref{sec:plot:scales:time:date} on page \pageref{sec:plot:scales:time:date} for break labels in time and date scales.
-
-When manually setting breaks, labels for the \code{breaks} are automatically computed unless overridden.
-
-<>=
-p3.base <-
-  ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
-  geom_point()
-@
-
-<>=
-p3.base + scale_y_continuous(breaks = c(20, pi * 10, 40, 60))
-@
-
-The default breaks are computed by function \Rfunction{pretty\_breaks()} from \pkgname{scales}. The argument passed to its parameter \code{n} determines the target number of ticks to be generated automatically, but the actual number of ticks computed may be slightly different depending on the range of the data.
-
-<>=
-p3 <-
-  p3.base + scale_y_continuous(breaks = pretty_breaks(n = 7))
-@
-
-We can set tick labels manually, in parallel to the setting of \code{breaks}, by passing as arguments two vectors of equal length. Below, an expression is used to include a Greek letter in one of the labels.
-
-<>=
-p4 <-
-  p3.base +
-  scale_y_continuous(breaks = c(20, pi * 10, 40, 60),
-                     labels = c("20", expression(10*pi), "40", "60"))
-@
-
-<>=
-p3 + p4
-@
-
-Package \pkgname{scales} provides several functions for the automatic generation of tick labels. For example, function \code{percent()} can be used to display tick labels as percentages when the values mapped from \code{data} are expressed as decimal fractions. This ``transformation'' is applied only to the tick labels.
-
-<>=
-p5 <-
-  ggplot(data = fake2.data, mapping = aes(x = z, y = y / max(y))) +
-  geom_point() +
-  scale_y_continuous(labels = percent)
-@
-
-\sloppy
-For currency, functions \code{dollar()} and \code{comma()} can be used to format the numbers in the labels. Function \code{scientific\_format()} formats numbers using exponents of 10---useful for log-transformed scales. Additional functions, \code{label\_number(scale\_cut = cut\_short\_scale())}, \code{label\_log()}, or \code{label\_number(scale\_cut = cut\_si("g"))}, provide other options. As shown below, some of these functions can be useful with untransformed continuous scales.
-<>=
-p6 <-
-  ggplot(data = fake2.data, mapping = aes(x = z, y = y * 1000)) +
-  geom_point() +
-  scale_y_continuous(name = "Mass",
-                     labels = label_number(scale_cut = cut_si("g")))
-@
-
-<>=
-p5 + p6
-@
-
-<>=
-opts_chunk$set(opts_fig_medium)
-@
-
-\begin{explainbox}
-Function \Rfunction{label\_number()} and the similar functions listed above build new functions based on the arguments passed to them, and these function definitions are the values they return. Thus, in the example above, even if the statement passed as argument to \code{labels} is a function call, the value actually ``received'' by \code{scale\_y\_continuous()} is a function definition. Some packages define additional functions that work similarly to those from package \pkgname{scales}.
-\end{explainbox}
-
-\subsubsection{Transformed scales}\label{sec:plot:scales:trans}
-
-The\index{plots!scales!transformations} default scales used by the \code{x} and \code{y} aesthetics, \ggscale{scale\_x\_continuous()} and \ggscale{scale\_y\_continuous()}, accept a user-supplied transformation function as an argument to \code{trans} with default \code{trans = "identity"} (no transformation). Package \pkgname{scales} defines several transformations that can be used as arguments for \code{trans}. User-defined transformations can also be implemented and used, as shown in the sketch below.
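-
-As an illustration of the mechanism, a minimal sketch follows, defining a cube-root transformation with function \Rfunction{trans\_new()} from package \pkgname{scales}. The cube root and the name \code{cubert\_trans} are arbitrary choices for this example, not predefined in \pkgname{scales} (plot not shown).
-
-<>=
-# a new transformation object: forward and inverse functions
-# (the cube root as written here is defined for positive values)
-cubert_trans <-
-  trans_new(name = "cubert",
-            transform = function(x) x^(1/3),
-            inverse = function(x) x^3)
-ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
-  geom_point() +
-  scale_y_continuous(trans = cubert_trans)
-@
-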
In addition, there are predefined convenience scale functions for \code{log10}, \code{sqrt} and \code{reverse}.
-
-\begin{warningbox}
-  Consistently with maths functions in \Rlang, the names of the scales are \ggscale{scale\_x\_log10()} and \ggscale{scale\_y\_log10()}, rather than \ggscale{scale\_y\_log()}, because in \Rlang function \code{log()} computes the natural logarithm.
-\end{warningbox}
-
-Axis tick labels display the original values, not the transformed ones, and the argument to \code{breaks} also refers to these. Using \ggscale{scale\_y\_log10()} a $\log_{10}$ transformation is applied to the $y$ values.
-
-<>=
-ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
-  geom_point() +
-  scale_y_log10(breaks = c(10, 20, 50, 100))
-@
-
-\begin{playground}
-Using a transformation in a scale is not equivalent to applying the same transformation on the fly when mapping a variable to the $x$ (or $y$) \emph{aesthetic}. How does the plot produced by the code below differ from the plot using the transformed scale, shown above?
-
-<>=
-ggplot(data = fake2.data, mapping = aes(x = z, y = log10(y))) +
-  geom_point()
-@
-\end{playground}
-
-For the most common transformations, like \Rfunction{log10()}, scales with those transformations as their default are available. In other cases, as mentioned above, the transformation is set by passing an argument to parameter \code{trans} of continuous scale functions that by default do not apply a transformation. Below, a predefined transformation, \code{"reciprocal"} or $1/y$, is used (plot not shown).
-
-<>=
-ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
-  geom_point() +
-  scale_y_continuous(trans = "reciprocal")
-@
-
-Natural logarithms are important in growth analysis as the slope against time gives the relative growth rate. The growth data for orange trees, from data set \code{Orange}, are plotted using \code{"log"} as the transformation. Breaks are set using the original values.
-
-<>=
-opts_chunk$set(opts_fig_wide)
-@
-
-<>=
-ggplot(data = Orange,
-       mapping = aes(x = age, y = circumference, color = Tree)) +
-  geom_line() +
-  geom_point() +
-  scale_y_continuous(trans = "log", breaks = c(20, 50, 100, 200))
-@
-
-\begin{explainbox}
-In the examples above, and in practice most frequently, transformations are applied to the position aesthetics, \code{x} and \code{y}. As the grammar of graphics is consistent, most, if not all, continuous scales also accept transformations. In some cases, applying a transformation to a size or color scale helps convey the information contained in the data.
-\end{explainbox}
-
-\subsubsection{Position of $x$ and $y$ axes}
-\index{plots!axis position}
-
-The default position of axes can be changed through parameter \code{position}, using character constants \code{"bottom"}, \code{"top"}, \code{"left"} and \code{"right"}.
-
-<>=
-ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
-  geom_point() +
-  scale_x_continuous(position = "top") +
-  scale_y_continuous(position = "right")
-@
-
-\subsubsection{Secondary axes}
-
-It\index{plots!secondary axes} is also possible to add secondary axes with ticks displayed in a transformed scale.
-
-<>=
-ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
-  geom_point() +
-  scale_y_continuous(sec.axis = sec_axis(~ . ^-1, name = "1/y") )
-@
-
-It is also possible to use different \code{breaks} and \code{labels} than for the main axes, and to provide a different \code{name} to be used as a secondary axis label. 
-
-<>=
-ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
-  geom_point() +
-  scale_y_continuous(sec.axis = sec_axis(~ . / 2.3521458,
-                                         name = expression(km / l),
-                                         breaks = c(5, 7.5, 10, 12.5)))
-@
-\index{grammar of graphics!continuous scales|)}
-
-\subsection{Time and date scales for $x$ and $y$}\label{sec:plot:scales:time:date}
-\index{grammar of graphics!time and date scales|(}
-Time scales are similar to continuous scales for \code{numeric} values. In \Rlang and many other computing languages, time values are stored as numeric values subject to special interpretation (see section \ref{sec:data:datetime} on page \pageref{sec:data:datetime}). Times stored as objects of class \code{POSIXct} (or \code{POSIXlt}) can be mapped to continuous \emph{aesthetics} such as \code{x}, \code{y}, \code{color}, etc. Special scales for different aesthetics are available for time-related data.
-
-Limits and breaks are preferably set using constant values of class \code{POSIXct}. These are most easily input with the functions in packages \pkgname{lubridate} or \pkgname{anytime} that convert dates and times from character strings.
-
-<>=
-opts_chunk$set(opts_fig_very_wide)
-@
-\begin{explainbox}
-In the next two chunks scale limits subset a part of the observations present in \code{data}. Passing \code{na.rm = TRUE} when calling the \code{geom} functions silences warning messages.
-\end{explainbox}
-
-<>=
-ggplot(data = weather_wk_25_2019.tb,
-       mapping = aes(x = with_tz(time, tzone = "EET"),
-                     y = air_temp_C)) +
-  geom_line(na.rm = TRUE) +
-  scale_x_datetime(name = NULL,
-                   breaks = ymd_h("2019-06-11 12", tz = "EET") + days(0:1),
-                   limits = ymd_h("2019-06-11 00", tz = "EET") + days(c(0, 2))) +
-  scale_y_continuous(name = "Air temperature (C)") +
-  expand_limits(y = 0)
-@
-
-As\index{plots!scales!axis labels} for numeric scales, breaks and the corresponding labels can be set to values different from the defaults. For example, if all observations have been collected within a single day, default tick labels will show hours and minutes. With several years, the labels will show only dates. The default labels are frequently good enough. Below, both the breaks and the format of the labels are set through parameters passed in the call to \ggscale{scale\_x\_datetime()}.
-
-<>=
-ggplot(data = weather_wk_25_2019.tb,
-       mapping = aes(x = with_tz(time, tzone = "EET"),
-                     y = air_temp_C)) +
-  geom_line(na.rm = TRUE) +
-  scale_x_datetime(name = NULL,
-                   date_breaks = "1 hour",
-                   limits = ymd_h("2019-06-16 00", tz = "EET") + hours(c(6, 18)),
-                   date_labels = "%H:%M") +
-  scale_y_continuous(name = "Air temperature (C)") +
-  expand_limits(y = 0)
-@
-
-\begin{playground}
-The formatting strings used are those supported by \Rfunction{strptime()}, and \code{help(strptime)} lists them. Change, in the two examples above, the $y$-axis labels used and the limits---e.g., include a single hour or a whole week of data, check which tick labels are produced by default and then pass as an argument to \code{date\_labels} different format strings, taking into account that, in addition to the \emph{conversion specification} codes, format strings can include additional text.
-\end{playground}
-
-In date scales tick labels are created with functions \Rfunction{label\_date()} or \Rfunction{label\_date\_short()}. In the case of time scales, tick labels are created with function \Rfunction{label\_time()}. As shown for continuous scales, calls to these functions can be passed as arguments to the scales. 
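-
-For example, \Rfunction{label\_date\_short()} builds labels in which the parts of a date that repeat between successive ticks are omitted. This is a minimal sketch, assuming as above that data frame \code{weather\_wk\_25\_2019.tb} is available (plot not shown).
-
-<>=
-ggplot(data = weather_wk_25_2019.tb,
-       mapping = aes(x = with_tz(time, tzone = "EET"),
-                     y = air_temp_C)) +
-  geom_line(na.rm = TRUE) +
-  # compact labels: repeated year and month parts are not repeated
-  scale_x_datetime(name = NULL, labels = label_date_short()) +
-  scale_y_continuous(name = "Air temperature (C)")
-@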
-
-\index{grammar of graphics!time and date scales|)}
-
-\subsection{Discrete scales for $x$ and $y$}
-\index{grammar of graphics!discrete scales|(}
-
-In\index{plots!scales!limits} the case of ordered or unordered factors, the tick labels are by default the names of the factor levels. Consequently, one roundabout way of obtaining the desired tick labels is to set them as factor labels in the data frame. This approach is not recommended, as in many cases the text of the desired tick labels may not be valid \Rlang names, making code more complex by the need to \emph{escape} these names each time they are used. It is best to use simple mnemonic short names for factor levels and variables, and to set suitable labels through scales.
-
-Scales \ggscale{scale\_x\_discrete()} and \ggscale{scale\_y\_discrete()} can be used to reorder and select the factor levels without altering the data. When using this approach to subset the data, it is necessary to pass \code{na.rm = TRUE} in the call to layer functions to avoid warnings. Below, arguments passed to \code{limits} and \code{labels} in the call to \code{scale\_x\_discrete()} manually convert level names to uppercase (plot not shown, identical plot shown farther down using alternative code).
-
-<>=
-opts_chunk$set(opts_fig_narrow)
-@
-<>=
-ggplot(data = mpg,
-       mapping = aes(x = class, y = hwy)) +
-  stat_summary(geom = "col", fun = mean, na.rm = TRUE) +
-  scale_x_discrete(limits = c("compact", "subcompact", "midsize"),
-                   labels = c("COMPACT", "SUBCOMPACT", "MIDSIZE"))
-@
-
-If, as above, the replacement is with the same names in upper case, passing function \Rfunction{toupper()} automates the operation. In addition, the code becomes independent of the labels used in the data. This is a more general and less error-prone approach. Any function, user defined or not, that converts the values of \code{limits} into the desired values can be passed as an argument to \code{labels}. This example, for completeness, sets scale names and limits, as well as the width of the columns.
-
-<>=
-ggplot(data = mpg,
-       mapping = aes(x = class, y = hwy)) +
-  stat_summary(geom = "col", fun = mean, na.rm = TRUE, width = 0.6) +
-  scale_x_discrete(name = "Vehicle class",
-                   limits = c("compact", "subcompact", "midsize"),
-                   labels = toupper) +
-  scale_y_continuous(name = "Petrol use efficiency (mpg)", limits = c(0, 30))
-@
-
-The order of the columns in the plot follows the order of the levels in the factor, thus changing the ordering of the levels in factor \code{mpg\$class} changes the plot. This approach makes sense when the new ordering needs to be computed based on values in \code{data}, but it can still be applied in the plotting code. Below, the breaks, and together with them the columns, are ordered based on the \code{mean()} of variable \code{hwy} by means of a call to \Rfunction{reorder()} within the call to \Rfunction{aes()}.
-
-<>=
-ggplot(data = mpg,
-       mapping = aes(x = reorder(x = factor(class), X = hwy, FUN = mean),
-                     y = hwy)) +
-  stat_summary(geom = "col", fun = mean)
-@
-\index{grammar of graphics!discrete scales|)}
-<>=
-opts_chunk$set(opts_fig_wide)
-@
-
-\subsection{Size and line width}
-\index{grammar of graphics!size scales|(}
-\index{grammar of graphics!linewidth scales|(}
-\begin{warningbox}
-  The \code{linewidth} aesthetic was added to package \ggplot in version 3.4.0. Previously, aesthetic \code{size} described the width of lines as well as the size of text and points or shapes. Below I describe the scales as implemented in version 3.4.0 and later. 
-\end{warningbox}
-
-For the \code{size} \emph{aesthetic}, several scales are available: discrete, ordinal, continuous and binned. They are similar to those already described above. Geometries \gggeom{geom\_point()}, \gggeom{geom\_text()}, and \gggeom{geom\_label()} obey the \code{size} aesthetic as expected. Size scales can be used with continuous numeric variables, dates and times, and with discrete variables. Examples of the use of \ggscale{scale\_size()} and \ggscale{scale\_size\_area()} were given in section \ref{sec:plot:geom:point} on page \pageref{sec:plot:geom:point}. Scale \ggscale{scale\_size\_radius()} is rarely used as it does not match human visual size perception.
-
-A similar set of scales is available for \code{linewidth} as for \code{size}: discrete, ordinal, continuous and binned. Geometries \gggeom{geom\_line()}, \gggeom{geom\_hline()}, \gggeom{geom\_vline()}, \gggeom{geom\_abline()}, \gggeom{geom\_segment()}, \gggeom{geom\_curve()} and related ones obey the \code{linewidth} aesthetic. Geometry \gggeom{geom\_pointrange()} obeys both aesthetics: as expected, \code{size} is used for the size of the point and \code{linewidth} for the line segment. In geometries \gggeom{geom\_bar()}, \gggeom{geom\_col()}, \gggeom{geom\_area()}, \gggeom{geom\_ribbon()} and all other geometric elements bordered by lines, \code{linewidth} controls the width of these border lines. Like lines, these borders and segments also obey the \code{linetype} aesthetic.
-
-\begin{warningbox}
-  Using \code{linewidth} makes code incompatible with versions of \ggplot prior to 3.4.0, while continuing to use \code{size} will trigger deprecation messages in newer versions of \ggplot. Eventually, the use of \code{size} for lines will become an error, so, when possible, it is preferable to use the new \code{linewidth} aesthetic.
-\end{warningbox}
-
-\index{grammar of graphics!linewidth scales|)}
-\index{grammar of graphics!size scales|)}
-
-\subsection{Color and fill}
-\index{grammar of graphics!color and fill scales|(}
-\index{plots!with colors|(}
-
-Color and fill scales are similar, but they affect different elements of the plot. All visual elements in a plot obey the \code{color} \emph{aesthetic}, but only elements that have an inner region and a boundary obey both \code{color} and \code{fill} \emph{aesthetics}. The boundary does not need to be rendered as a line when the plot is displayed, but it must exist. This is the case for \gggeom{geom\_area()} and \gggeom{geom\_ribbon()}, which in recent versions of \ggplot are displayed with lines only on some edges. Only a subset of the shapes supported by \gggeom{geom\_point()} can be filled. There are separate but equivalent sets of scales available for these two \emph{aesthetics}. I will describe in more detail the \code{color} \emph{aesthetic} and give only some examples for \code{fill}. I will, however, start by reviewing how colors are defined and used in \Rlang.
-
-\subsubsection{Color definitions in R}\label{sec:plot:colors}
-\index{color!definitions|(}
-Colors can be specified in \Rlang not only through character strings with the names of previously defined colors, but also directly as strings describing the RGB (red, green and blue) components as hexadecimal numbers (in base 16, expressed using 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F as ``digits'') such as \code{"\#FFFFFF"} for white, \code{"\#000000"} for black, or \code{"\#FF0000"} for the brightest available pure red. 
-
-The list of color names\index{color!names} known to \Rlang can be obtained by typing \Rfunction{colors()} at the \Rlang console.
-Given the number of colors available, subsetting them based on their names is frequently a good first step. Function \code{colors()} returns a character vector. Using \code{grep()} it is possible to find the names containing a given character substring, in this example \code{"dark"}.
-
-<>=
-length(colors())
-grep("dark", colors(), value = TRUE)
-@
-
-The RGB values for a color definition are returned by function \Rfunction{col2rgb()}.
-
-<>=
-col2rgb("purple")
-col2rgb("#FF0000")
-@
-
-Color definitions in \Rlang can contain a \emph{transparency} component described by an \code{alpha} value, which by default is not returned.
-
-<>=
-col2rgb("purple", alpha = TRUE)
-@
-
-With function \Rfunction{rgb()} one can define new colors. Enter \code{help(rgb)} for more details.
-
-<>=
-rgb(1, 1, 0)
-rgb(1, 1, 0, names = "my.color")
-rgb(255, 255, 0, names = "my.color", maxColorValue = 255)
-@
-
-As described above, colors can be defined in the RGB \emph{color space}; however, other color models such as HSV (hue, saturation, value) can also be used to define colours.\qRfunction{hsv()}
-
-<>=
-hsv(c(0, 0.25, 0.5, 0.75, 1), 0.5, 0.5)
-@
-
-Probably more useful for scales than HSV colors are those returned by function \Rfunction{hcl()}, for hue, chroma and luminance. While the ``value'' and ``saturation'' in HSV are based on physical values, the ``chroma'' and ``luminance'' values in HCL are based on human visual perception. Colours with equal luminance will be seen as equally bright by an ``average'' human. In a scale based on different hues but equal chroma and luminance values, as used by default by package \ggplot, all colours are perceived as equally bright. The hues need to be expressed as angles in degrees, with values between zero and 360.
-
-<>=
-hcl(c(0, 0.25, 0.5, 0.75, 1) * 360)
-@
-
-It is also important to remember that humans can distinguish only a limited set of colours, and even smaller color gamuts can be reproduced by screens and printers. Furthermore, variation from individual to individual exists in color perception, including different types of color blindness. It is important to take this into account when choosing the colors used in illustrations.
-\index{color!definitions|)}
-
-\subsection{Continuous color-related scales}
-\sloppy
-Continuous color scales \ggscale{scale\_color\_continuous()}, \ggscale{scale\_color\_gradient()}, \ggscale{scale\_color\_gradient2()}, \ggscale{scale\_color\_gradientn()}, \ggscale{scale\_color\_date()} and \ggscale{scale\_color\_datetime()} give a smooth continuous gradient between two or more colours. They are used with \code{numeric}, \code{date} and \code{datetime} data. A corresponding set of \code{fill} scales is also available. Other scales like \ggscale{scale\_color\_viridis\_c()} and \ggscale{scale\_color\_distiller()} are based on ready-made palettes, i.e., sets of color gradients chosen to work well under multiple conditions and for human vision, including different types of color blindness.
-
-\subsection{Discrete color-related scales}
-\sloppy
-Color scales \ggscale{scale\_color\_discrete()}, \ggscale{scale\_color\_hue()} and \ggscale{scale\_color\_grey()} are used with categorical data stored as factors. Other scales like \ggscale{scale\_color\_viridis\_d()} and \ggscale{scale\_color\_brewer()} provide discrete sets of colours based on palettes. 
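-
-For example, the default hue-based discrete color scale can be replaced by one based on a ColorBrewer palette. This is a minimal sketch using the \code{mpg} data set; the choice of palette \code{"Dark2"} is arbitrary (plot not shown).
-
-<>=
-ggplot(data = mpg,
-       mapping = aes(x = displ, y = hwy, color = class)) +
-  geom_point() +
-  # one distinct color per level of factor class, from a ready-made palette
-  scale_color_brewer(palette = "Dark2")
-@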
-
-\subsection{Binned scales}\label{sec:binned:scales}
-\index{grammar of graphics!binned scales|(}
-Before version 3.3.0 of \pkgname{ggplot2} only two types of scales were available, continuous and discrete. A third type of scale, called \emph{binned}, was added in version 3.3.0 and implemented for all the aesthetics where relevant. These scales are to be used with continuous variables, but they discretize the continuous values into bins or classes, each for a range of values, and then represent them in the plot using a discrete set of aesthetic values from a gradient. We re-do the figure shown on page \pageref{chunk:plot:weighted:resid}, replacing \ggscale{scale\_color\_gradient()} with \ggscale{scale\_color\_binned()}.
-
-<>=
-@
-
-<>=
-@
-
-<>=
-ggplot(data = df2) +
-  stat_fit_residuals(formula = y ~ poly(x, 3, raw = TRUE),
-                     method = "rlm",
-                     mapping = aes(x = X,
-                                   y = stage(start = Y,
-                                             after_stat = y * weights),
-                                   colour = after_stat(weights)),
-                     show.legend = TRUE) +
-  scale_color_binned(low = "red", high = "blue", limits = c(0, 1),
-                     guide = "colourbar", n.breaks = 5)
-@
-
-The advantage of binned scales is that they facilitate the fast reading of the plot, while their disadvantage is the decreased resolution of the scale. The choice of a binned vs.\ continuous scale, and the number and boundaries of the bins, set by the argument passed to parameter \code{n.breaks} or to \code{breaks}, need to be chosen carefully, taking into account the audience, the time available to the viewer to peruse the plot, and the density of observations. Transformations are also allowed in these scales as in others.
-
-\index{grammar of graphics!binned scales|)}
-
-\subsection{Identity scales}
-\index{grammar of graphics!identity color scales|(}
-In the case of identity scales, the mapping is one-to-one to the data. For example, if we map the \code{color} or \code{fill} \emph{aesthetic} to a variable using \ggscale{scale\_color\_identity()} or \ggscale{scale\_fill\_identity()}, the mapped variable must already contain valid color definitions. In the case of mapping \code{alpha}, the variable must contain numeric values in the range 0 to 1.
-
-We use a data frame containing a variable \code{colors} with character strings interpretable as the names of color definitions known to \Rlang. We then use them directly in the plot by adding \ggscale{scale\_color\_identity()}.
-
-<>=
-df3 <- data.frame(X = 1:10, Y = dnorm(10), colors = rep(c("red", "blue"), 5))
-
-ggplot(data = df3, mapping = aes(x = X, y = Y, color = colors)) +
-  geom_point() +
-  scale_color_identity()
-@
-
-\begin{playground}
-How does the plot look, if the identity scale is deleted from the example above? Edit and re-run the example code.
-
-While using the identity scale, how would you need to change the code example above, to produce a plot with green and purple points?
-\end{playground}
-\index{grammar of graphics!identity color scales|)}
-\index{plots!with colors|)}
-\index{grammar of graphics!color and fill scales|)}
-\index{grammar of graphics!scales|)}
-
-\section{Adding annotations}\label{sec:plot:annotations}
-\index{grammar of graphics!annotations|(}
-The idea of annotations is that they add plot elements that are not directly connected to individual observations in \code{data}. Some, like company logos, could be called ``decorations'', but others, like text indicating the number of observations or even an inset plot or table, may convey information about the data set as a whole. 
They can be drawn referenced to the ``native'' data coordinates used in the plot, even though their position does not itself represent observations. Annotations are distinct from data labels. Annotations are added to a ggplot with function \Rfunction{annotate()} as plot layers (each call to \code{annotate()} creates a new layer). To achieve the behavior expected of annotations, \Rfunction{annotate()} does not inherit the default \code{data} or \code{mapping} of variables to \emph{aesthetics}. Annotations frequently make use of the \code{"text"} or \code{"label"} \emph{geometries} with character strings as data, possibly to be parsed as expressions. In addition, for example, the \code{"segment"} geometry can be used to add arrows.
-
-\begin{warningbox}
-While layers added to a plot using \emph{geometries} and \emph{statistics} respect faceting, layers added with \Rfunction{annotate()} are replicated unchanged in all panels of a faceted plot. The reason is that annotation layers accept \emph{aesthetics} only as constant values, which are the same for every panel, as no grouping is possible without a \code{mapping} to \code{data}. Alternatives, using new geometries, are provided by package \ggpp.
-\end{warningbox}
-
-Function \Rfunction{annotate()} takes the name of a geometry as its argument, in the example below, \code{"text"}. Function \Rfunction{aes()} is not used, as only mappings to constant values are accepted. These values can be vectors, thus layers added with \Rfunction{annotate()} can add multiple graphic objects of the same type to a plot.
-
-<>=
-ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
-  geom_point() +
-  annotate(geom = "text",
-           label = "origin",
-           x = 0, y = 0,
-           color = "blue",
-           size = 4)
-@
-
-\begin{playground}
-Play with the values of the arguments to \Rfunction{annotate()} to vary the position, size, color, font family, font face, rotation angle and justification of the annotation.
-\end{playground}
-
-\index{plots!insets as annotations|(}
-It is relatively common to use inset tables, plots, bitmaps or vector graphics as annotations. In section \ref{sec:plot:insets} on page \pageref{sec:plot:insets} \code{geoms} from package \ggpp were used to create insets in plots. An older alternative is to use \Rfunction{annotation\_custom()} to add grobs (\pkgname{grid} graphical objects) to a ggplot. To add another plot, or the same plot, as an inset, it first needs to be converted into a grob. A ggplot can be converted with function \Rfunction{ggplotGrob()}. In this example the inset is a zoomed-in window into the main plot. In addition to the grob, coordinates for its location are passed in native data units.
-
-<>=
-p <- ggplot(data = fake2.data, mapping = aes(x = z, y = y)) +
-  geom_point()
-p + expand_limits(x = 40) +
-  annotation_custom(ggplotGrob(p + coord_cartesian(xlim = c(5, 10), ylim = c(20, 40)) +
-                                 theme_bw(10)),
-                    xmin = 21, xmax = 40, ymin = 30, ymax = 60)
-@
-
-This approach has the limitation, shared with the use of \Rfunction{annotate()}, that if used together with faceting, the inset is added identically to all plot panels.
-\index{plots!insets as annotations|)}
-
-In the next example, expressions are used as annotations as well as for tick labels. Do notice that we use recycling when setting the breaks, as \code{c(0, 0.5, 1, 1.5, 2) * pi} is equivalent to \code{c(0, 0.5 * pi, pi, 1.5 * pi, 2 * pi)}. Annotations are plotted at their own position, unrelated to any observation in the data, but using the same coordinates and units as used for plotting the data. 
-
-<>=
-ggplot(data = data.frame(x = c(0, 2 * pi)),
-       mapping = aes(x = x)) +
-  stat_function(fun = sin) +
-  scale_x_continuous(
-    breaks = c(0, 0.5, 1, 1.5, 2) * pi,
-    labels = c("0", expression(0.5~pi), expression(pi),
-               expression(1.5~pi), expression(2~pi))) +
-  labs(y = "sin(x)") +
-  annotate(geom = "text",
-           label = c("+", "-"),
-           x = c(0.5, 1.5) * pi, y = c(0.5, -0.5),
-           size = 20) +
-  annotate(geom = "point",
-           color = "red",
-           shape = 21,
-           fill = "white",
-           x = c(0, 1, 2) * pi, y = 0,
-           size = 6)
-@
-
-\begin{playground}
-Modify the plot above to show the cosine instead of the sine function, replacing \code{sin} with \code{cos}. This is easy, but the catch is that you will need to relocate the annotations.
-\end{playground}
-
-\begin{warningbox}
-Function \Rfunction{annotate()} cannot be used with \code{geom = "vline"} or \code{geom = "hline"} in the way it can with \code{geom = "line"} or \code{geom = "segment"}. Instead, \gggeom{geom\_vline()} and/or \gggeom{geom\_hline()} can be used directly, passing constant arguments to them. See section \ref{sec:plot:line} on page \pageref{sec:plot:vhline}.
-\end{warningbox}
-\index{grammar of graphics!annotations|)}
-
-\section{Coordinates and circular plots}\label{sec:plot:circular}\label{sec:plot:coord}
-\index{grammar of graphics!polar coordinates|(}
-\index{plots!circular|(}
-Circular plots can be thought of as plots equivalent to those described earlier in this chapter but drawn using a different system of coordinates. This is a key insight that the grammar of graphics, as implemented in \ggplot, makes use of. To obtain circular plots we use the same \emph{geometries}, \emph{statistics} and \emph{scales} we have been using with the default system of Cartesian coordinates. The only thing that we need to do is to add a different system of \textit{coordinates}, \ggcoordinate{coord\_polar()}, to override the default. Of course, only some observed quantities can be better perceived in circular plots than in Cartesian plots.
-When using polar coordinates, the \code{x} and \code{y} \textit{aesthetics} correspond to the angle and radial distance, respectively.
-
-\subsection{Wind-rose plots}
-\index{plots!wind rose|(}
-Some types of data are more naturally expressed on polar coordinates than on Cartesian coordinates. The clearest example is wind direction, from which the name derives. In some cases of time series data with a strong periodic variation, polar coordinates can be used to highlight any phase shifts or changes in frequency. A more mundane application is to plot variation in a response variable through the day with a clock-face-like representation of time of day.
-
-Wind-rose plots are frequently histograms or density plots drawn on a polar system of coordinates (see sections \ref{sec:plot:histogram} and \ref{sec:plot:density} on pages \pageref{sec:plot:histogram} and \pageref{sec:plot:density}, respectively, for a description of the use of these \emph{statistics} and \emph{geometries}). We will use them for examples where we plot wind speed and direction data, measured once per minute during 24~h (from package \pkgname{learnrbook}).
-
-A circular histogram of wind directions with 30-degree-wide bins is created using \ggstat{stat\_bin()}. The counts represent the number of minutes during 24~h when the wind direction was within each bin. 
-
-<>=
-p.base <-
-  ggplot(data = viikki_d29.dat,
-         mapping = aes(x = WindDir_D1_WVT)) +
-  coord_polar() +
-  scale_x_continuous(breaks = c(0, 90, 180, 270),
-                     labels = c("N", "E", "S", "W"),
-                     limits = c(0, 360),
-                     expand = c(0, 0),
-                     name = "Wind direction")
-
-p.base +
-  stat_bin(color = "black", fill = "gray50", geom = "bar",
-           binwidth = 30, na.rm = TRUE) +
-  labs(y = "Frequency (min/d)")
-@
-
-For an equivalent plot, using an empirical density, \ggstat{stat\_density()} is used instead of \ggstat{stat\_bin()}, and \gggeom{geom\_polygon()} instead of \gggeom{geom\_bar()}. The \code{name} of the \code{y} scale is also updated.
-
-<>=
-p.base +
-  stat_density(color = "black", fill = "gray50",
-               geom = "polygon", linewidth = 1) +
-  labs(y = "Density (/1)")
-@
-
-The final wind-rose plot example is a 2D density plot with facets, using \Rfunction{facet\_wrap()} to create separate panels for AM and PM. This plot uses fill to describe the density of observations for different combinations of wind directions and speeds, the radius ($y$ \emph{aesthetic}) to represent wind speeds and the angle ($x$ \emph{aesthetic}) to represent wind direction.
-
-<>=
-opts_chunk$set(opts_fig_very_wide)
-@
-
-<>=
-ggplot(data = viikki_d29.dat,
-       mapping = aes(x = WindDir_D1_WVT, y = WindSpd_S_WVT)) +
-  coord_polar() +
-  stat_density_2d(mapping = aes(fill = after_stat(level)),
-                  geom = "polygon") +
-  scale_x_continuous(breaks = c(0, 90, 180, 270),
-                     labels = c("N", "E", "S", "W"),
-                     limits = c(0, 360),
-                     expand = c(0, 0),
-                     name = "Wind direction") +
-  scale_y_continuous(name = "Wind speed (m/s)") +
-  facet_wrap(~factor(ifelse(hour(solar_time) < 12, "AM", "PM")))
-@
-\index{plots!wind rose|)}
-<>=
-opts_chunk$set(opts_fig_narrow)
-@
-
-\subsection{Pie charts}
-\index{plots!pie charts|(}
-
-\begin{warningbox}
-Pie charts are more difficult to read than bar charts because our brain is better at comparing lengths than angles. If used at all, pie charts should only be used to show composition, or fractional components that add up to a total, and only if the number of ``pie slices'' is small (rule of thumb: seven at most). In general, however, they are best avoided.
-\end{warningbox}
-
-A pie chart of counts is like a bar plot in which angles, instead of heights, describe the counts. \gggeom{geom\_bar()}, which defaults to using \code{stat\_count()}, together with \ggcoordinate{coord\_polar()} creates a pie chart. The brewer scale supplies the palette for the fills, while the color of the border lines is set with \code{color = "black"}.
-
-<<>>=
-ggplot(data = mpg,
-       mapping = aes(x = factor(1), fill = factor(class))) +
-  geom_bar(width = 1, color = "black") +
-  coord_polar(theta = "y") +
-  scale_fill_brewer() +
-  scale_x_discrete(breaks = NULL) +
-  labs(x = NULL, fill = "Vehicle class")
-@
-\index{plots!pie charts|)}
-\index{plots!circular|)}
-\index{grammar of graphics!polar coordinates|)}
-
-\begin{playground}
-Edit the code for the pie chart above to obtain a bar chart. Which one of the two plots is easier to read?
-\end{playground}
-
-\section{Themes}\label{sec:plot:themes}
-\index{grammar of graphics!themes|(}
-\index{plots!styling|(}
-In \ggplot, \emph{themes} are the equivalent of style sheets. They determine how the different elements of a plot are rendered when displayed, printed or saved to a file. 
\emph{Themes} do not alter what aesthetics or scales are used to plot the observations or summaries, but instead how text labels, titles, axes, grids, plotting-area background and grid, etc., are formatted and whether they are displayed or not. Package \ggplot includes several predefined \emph{theme constructors} (usually described as \emph{themes}), and independently developed extension packages define additional ones. These constructors return complete themes, which, when added to a plot, replace as a whole any theme already present. In addition to choosing among these already available \emph{complete themes}, users can modify the ones already present by adding \emph{incomplete themes} to a plot. When used in this way, \emph{incomplete themes} usually are created on the fly. It is also possible to create new theme constructors that return complete themes, similar to \code{theme\_gray()} from \ggplot.
-
-\subsection{Complete themes}
-\index{grammar of graphics!complete themes|(}
-The theme used by default is \ggtheme{theme\_gray()} with default arguments. In \pkgnameNI{ggplot2}, predefined themes are defined as constructor functions, with parameters. These parameters allow changing some ``base'' properties. The \code{base\_size} for text elements is given in points, and affects all text elements in the returned theme object, as the size of these elements is by default defined relative to the base size. Another parameter, \code{base\_family}, allows the font family to be set. These functions return complete themes.
-
-\begin{warningbox}
-\emph{Themes} have no effect on layers produced by \emph{geometries}, as themes do not alter \emph{mappings}, \emph{scales} or \emph{aesthetics}. In the name \ggtheme{theme\_bw()}, black-and-white refers to the color of the background of the plotting area and labels. If the \emph{color} or \emph{fill} \emph{aesthetics} are mapped or set to a constant in the figure, these will be respected irrespective of the theme. We cannot convert a color figure into a black-and-white one by adding a \emph{theme}. For color gradients, an alternative is to use a greyscale gradient by changing the \emph{scales} used to map values to aesthetics. For discrete scales, a different aesthetic can be used, for example, \code{shape} or \code{linetype} instead of \code{color}.
-\end{warningbox}
-
-Even the default \ggtheme{theme\_gray()} can be added to a plot, to replace the theme already present with a newly constructed one created with arguments different from the default ones. Below, a serif font at a larger size than the default is used.
-
-<>=
-ggplot(data = fake2.data,
-       mapping = aes(x = z, y = y)) +
-  geom_point() +
-  theme_gray(base_size = 18,
-             base_family = "serif")
-@
-
-\begin{playground}
-Change the code in the previous chunk to use, one at a time, each of the predefined themes from \ggplot: \ggtheme{theme\_bw()}, \ggtheme{theme\_classic()}, \ggtheme{theme\_minimal()}, \ggtheme{theme\_linedraw()}, \ggtheme{theme\_light()}, \ggtheme{theme\_dark()} and \ggtheme{theme\_void()}.
-\end{playground}
-
-\begin{explainbox}
-Predefined ``themes'' like \ggtheme{theme\_gray()} are, in reality, not themes but instead constructors of theme objects. The \emph{themes} they return when called depend on the arguments passed to their parameters. In other words, \code{theme\_gray(base\_size = 15)} creates a different theme than \code{theme\_gray(base\_size = 11)}. In this case, as the sizes of different text elements are defined relative to the base size, the size of all text elements changes in coordination. 
A frequent idiom is to create a plot without specifying a theme, and then add the theme when printing or saving it. This can save work, for example, when producing different versions of the same plot for a publication and a talk.

<<>>=
p.base <-
  ggplot(data = fake2.data,
         mapping = aes(x = z, y = y)) +
  geom_point()
print(p.base + theme_bw())
@

It is also possible to change the theme used by default in the current \Rlang session with \Rfunction{theme\_set()}.

<<>>=
old_theme <- theme_set(theme_bw(15))
@

Similar to other functions used to change options in \Rlang, \Rfunction{theme\_set()} returns the previous setting. By saving this value to a variable, here \code{old\_theme}, we are able to restore the previous default, or undo the change.

<<>>=
theme_set(old_theme)
@

\begin{explainbox}
The use of a grey background as default for plots is unusual. This graphic design decision originates in the typesetter's desire to maintain a uniform luminosity throughout the text and plots on a page. Many scientific journals require, or at least prefer, a more traditional graphic design. Theme \ggtheme{theme\_bw()} is the most versatile of the traditional designs supported, as it includes a box and thus works well both for individual plots and for plots with facets. Theme \ggtheme{theme\_classic()}, lacking a box and grid, works well for individual plots as is, but needs changes to the facet strips when used with facets.
\end{explainbox}
\index{grammar of graphics!complete themes|)}

\subsection{Incomplete themes}
\index{grammar of graphics!incomplete themes|(}
To create a significantly different theme, and/or reuse it in multiple plots, it is best to create a new constructor, or a modified complete theme, as described in section \ref{sec:plot:theme:new} on page \pageref{sec:plot:theme:new}. In other cases it is enough to tweak individual theme settings for a single plot. Below, overlapping $x$-axis tick labels are avoided by rotating them. When rotating the labels it is also necessary to change their justification, as justification is relative to the orientation of the text.

<<>>=
ggplot(data = fake2.data,
       mapping = aes(x = z + 1000, y = y)) +
  geom_point() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  theme(axis.text.x = element_text(angle = 33, hjust = 1, vjust = 1))
@

\begin{playground}
Play with the code in the last chunk above, modifying the values used for \code{angle}, \code{hjust} and \code{vjust}. (Angles are expressed in degrees, and justification with values between 0 and 1.)
\end{playground}

A less elegant approach is to use a smaller font size. Within \Rfunction{theme()}, function \Rfunction{rel()} can be used to set the size relative to the base size. In this example, we use \code{axis.text.x} so as to change the size of the tick labels only for the $x$ axis.

<<>>=
ggplot(fake2.data, aes(z + 100, y)) +
  geom_point() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
  theme(axis.text.x = element_text(size = rel(0.6)))
@

Theme definitions follow a hierarchy, allowing us to modify the formatting of groups of similar elements, as well as of individual elements. In the chunk above, using \code{axis.text} instead of \code{axis.text.x} would have affected the tick labels on both the $x$ and $y$ axes.
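As a quick check of this hierarchy, in the sketch below \code{axis.text} is targeted instead, shrinking the tick labels on both axes at once.

<<>>=
ggplot(fake2.data, aes(z + 100, y)) +
  geom_point() +
  # axis.text is the parent of axis.text.x and axis.text.y in the
  # hierarchy, so this setting affects the tick labels on both axes
  theme(axis.text = element_text(size = rel(0.6)))
@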
\begin{playground}
Modify the example above, so that the tick labels on the $x$-axis are blue and those on the $y$-axis red, and so that the font size is the same for both axes, but changed from the default. Consult the documentation for \code{theme()} to find out the names of the elements that need to be given new values. For examples, see \citebooktitle{Wickham2016} \autocite{Wickham2016} and \citebooktitle{Chang2018} \autocite{Chang2018}.
\end{playground}

The formatting of other text elements can be adjusted in a similar way, as well as the thickness of axes, the length of tick marks, grid lines, etc. However, in most cases these are graphic design elements that are best kept consistent throughout sets of plots, and best handled by creating a new \emph{theme}.

\begin{warningbox}
If you want to both add a \emph{complete theme} and modify some of its elements, you should add the whole theme before modifying it with \code{+ theme(...)}. This may seem obvious once one has a good grasp of the grammar of graphics, but can be disconcerting at first.
\end{warningbox}

It is also possible to modify the default theme used for rendering all subsequent plots.

<<>>=
old_theme <- theme_update(text = element_text(color = "darkred"))
@

Having saved the previous default to \code{old\_theme}, it can be restored when needed.

<<>>=
theme_set(old_theme)
@
\index{grammar of graphics!incomplete themes|)}

\subsection{Defining a new theme}\label{sec:plot:theme:new}
\index{grammar of graphics!creating a theme|(}
Themes can be defined either from scratch or by modifying existing themes and saving the modified version. As discussed above, it is also possible to define a new, parameterized theme constructor function.

Unless we plan to reuse the new theme widely, there is usually no need to define a new function. We can simply save the modified theme to a variable and add it to different plots as needed. As we will be adding a ``ready-built'' theme object rather than a function, we do not use parentheses.

<<>>=
my_theme <- theme_bw(15) + theme(text = element_text(color = "darkred"))
p.base + my_theme
@

\begin{playground}
It is always good to learn to recognize error messages. One way of doing this is by generating errors on purpose. So, do add parentheses to the statement in the code chunk above and study the error message.
\end{playground}

\begin{explainbox}
Creating a new theme constructor similar to those from package \ggplot can be fairly simple if the changes are few. As the implementation details of theme objects may change in future versions of \ggplot, the safest approach is to rely only on the public interface of the package. The functions exported by package \ggplot can be wrapped inside a new function that modifies the theme before returning it. The interface (the parameters) of the wrapped function can be replicated in the new one, and the arguments passed along to the wrapped function, as is or modified. If needed, additional parameters can be handled by code in the wrapper function. Below, a wrapper on \ggtheme{theme\_gray()} is constructed, retaining a compatible interface but adding a new base parameter, \code{base\_color}. A different default is used for \code{base\_family}. The key detail is passing \code{complete = TRUE} to \Rfunction{theme()}, as this tags the returned theme as being usable by itself, resulting in the replacement of any theme already in a plot when it is added.
<<>>=
my_theme_gray <-
  function (base_size = 11,
            base_family = "serif",
            base_line_size = base_size/22,
            base_rect_size = base_size/22,
            base_color = "darkblue") {

    theme_gray(base_size = base_size,
               base_family = base_family,
               base_line_size = base_line_size,
               base_rect_size = base_rect_size) +

      theme(line = element_line(color = base_color),
            rect = element_rect(color = base_color),
            text = element_text(color = base_color),
            title = element_text(color = base_color),
            axis.text = element_text(color = base_color),
            complete = TRUE)
  }
@

Our own theme constructor, created without too much effort, is ready to be used. To avoid surprising users, it is good to make \code{my\_theme\_grey()} a synonym of \code{my\_theme\_gray()}, following \ggplot practice.

<<>>=
my_theme_grey <- my_theme_gray
@

A plot created using \code{my\_theme\_gray()} with the text color set to dark red.

<<>>=
p.base + my_theme_gray(15, base_color = "darkred")
@
\end{explainbox}
\index{grammar of graphics!creating a theme|)}
\index{plots!styling|)}
\index{grammar of graphics!themes|)}

\section{Composing plots}
\index{plots!composing|(}
While facets make it possible to create plots with panels that share the same mappings and data (see section \ref{sec:plot:facets} on page \pageref{sec:plot:facets}), plot composition makes it possible to combine separately created \code{"gg"} plot objects into a single plot. Composition before rendering makes it possible to automate the correct alignments, ensure consistency of text sizes and even merge duplicate guides or keys. Composite plots can save space on the screen or page, but more importantly they can bring together data visualizations that need to be compared or read as a whole.

Package \pkgname{patchwork} defines a simple grammar for composing plots created with \ggplot, which I have used earlier in the chapter to display pairs of plots side by side. Composition with \pkgname{patchwork} can also include grid graphical objects. The plot composition grammar uses operators \Roperator{+}, \Roperator{|} and \Roperator{/}, although \pkgname{patchwork} provides additional tools for defining complex layouts of panels. While \Roperator{+} allows different layouts, \Roperator{|} composes panels side by side, and \Roperator{/} composes panels on top of each other. The plots to be used as panels can be grouped using parentheses. The operands must be whole plots; below, this is ensured by saving each plot to a variable. When composing anonymous plots, they must be enclosed in parentheses to ensure that the correct methods are dispatched for the operators.

Three simple plots, \code{p1}, \code{p2} and \code{p3}, will be used below.

<<>>=
p1 <- ggplot(mpg, aes(displ, cty, color = factor(cyl))) +
  geom_point() +
  theme(legend.position = "top")
p2 <- ggplot(mpg, aes(displ, cty, color = factor(year))) +
  geom_point() +
  theme(legend.position = "top")
p3 <- ggplot(mpg, aes(factor(model), cty)) +
  geom_point() +
  theme(axis.text.x =
          element_text(angle = 90, hjust = 1, vjust = 0.5))
@

<<>>=
opts_chunk$set(opts_fig_very_wide_square)
@

\begin{playground}
A combined plot can be simply assembled using the operators (plot not shown).

<<eval=FALSE>>=
p1 | p2 / p3
(p1 | p2) / p3
@

The operators used for composition are the arithmetic ones, and even if used for a different purpose they still obey the precedence rules of mathematics. The order of precedence can be altered, as done above, using parentheses. Run the examples above after creating the three plots. Modify the code, trying different ways of organizing the three panels.
\end{playground}
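Among the additional layout tools mentioned above is function \Rfunction{plot\_layout()}, which can, among other things, gather the panels' guides into a shared area. A sketch, to be run after creating the three plots (plot not shown):

<<eval=FALSE>>=
# collect the guides of all panels into one shared area
((p1 | p2) / p3) +
  plot_layout(guides = "collect")
@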
A title for the whole plot and a letter as tag for each panel are added as a whole-plot annotation.

<<>>=
((p1 | p2) / p3) +
  plot_annotation(title = "Fuel use in city traffic:", tag_levels = 'a')
@

<<>>=
opts_chunk$set(opts_fig_wide)
@

Recent versions of package \pkgname{patchwork} provide tools for the creation of complex layouts, the addition of insets, and the combination in the same layout of plots and other graphic objects such as bitmaps, photographs and even tables.

\begin{advplayground}
Package \pkgname{patchwork} can be very useful. Study the documentation and its examples, and try to think how it could be useful to you. Then try to compose plots like those you could use in your work or studies.
\end{advplayground}

\index{plots!composing|)}

\section[Using plotmath expressions]{Using \code{plotmath} expressions}\label{sec:plot:plotmath}
\index{plotmath}
\index{plots!math expressions|(}
Plotmath expressions are similar to \Rlang expressions, but they are targeted at the creation of mathematical annotations and are used in graphical output like plots. In some respects they are similar to the math mode of \LaTeX. The syntax sometimes feels awkward and takes some time to learn, but it gets the job done.

\begin{explainbox}
The main limitation to producing rich text annotations in \Rlang, similar to those possible using \LaTeX\ or HTML, lies at the core of the \Rpgrm program. There is work in progress and improvements can be expected in coming years. Meanwhile, the already implemented enhancements gradually appear as enhanced features in \ggplot and its extensions.

Package \pkgname{ggtext} provides rich-text (basic \langname{HTML} and \Markdown) support for \ggplot, both for annotations and for data visualization. This is an alternative to the use of \Rlang expressions.
\end{explainbox}

In sections \ref{sec:plot:function} and \ref{sec:plot:text}, simple examples of the use of \Rlang expressions for labelling plots were given. The \code{demo(plotmath)} demo and the help page \code{help(plotmath)} provide enough information to start using expressions in plots. Although expressions are shown here in the context of plotting, they are also used in other contexts in \Rlang code.

In general it is possible to create \emph{expressions} explicitly with function \Rfunction{expression()}, or by parsing a character string. In \ggplot, for some plot elements (layers created with \gggeom{geom\_text()} and \gggeom{geom\_label()}, and the strip labels of facets) the parsing is delayed and applied to mapped character variables in \code{data}. In contrast, for titles, subtitles, captions, axis labels, etc.\ (anything that is defined within \Rfunction{labs()}) the expressions have to be entered explicitly, or saved as such into a variable, with the variable passed as an argument.

When plotting expressions using \gggeom{geom\_text()}, the parsing of character strings is signaled by passing \code{parse = TRUE} in the call to the layer function. In the case of facets' strip labels, whether they are parsed or not depends on the \emph{labeller} function used. An additional twist is the possibility of combining static character strings with values taken from \code{data} (see section \ref{sec:plot:facets} on page \pageref{sec:plot:facets}).
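For example, the \emph{labeller} \Rfunction{label\_parsed()} from \ggplot parses the character strings used as strip labels into expressions. A minimal sketch with fabricated data follows; the column \code{fake.label} holds strings written using the syntax of expressions.

<<>>=
fake_facets.df <-
  data.frame(x = rep(1:3, 2),
             y = c(1, 4, 9, 1, 8, 27),
             fake.label = rep(c("alpha[1]", "beta^2"), each = 3))
ggplot(fake_facets.df, aes(x, y)) +
  geom_point() +
  # label_parsed() parses each strip label into an expression
  facet_wrap(~fake.label, labeller = label_parsed)
@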
The most difficult thing to remember when writing expressions is how to connect the different parts. A tilde (\code{\textasciitilde}) adds space between symbols. An asterisk (\code{*}) can also be used as a connector, and is usually needed when dealing with numbers. Using a space is allowed in some situations, but not in others. To include within an expression text that should not be parsed, it is enclosed in quotation marks, which may themselves need to be escaped. For a long list of examples, have a look at the output and code displayed by \code{demo(plotmath)} at the \Rlang command prompt.

Expressions are frequently used for axis labels, e.g., when the units or symbols require the use of superscripts or Greek letters.

<<>>=
p1 + labs(y = expression("Fuel use"~~(m~g^{-1})),
          x = "Engine displacement (L)")
@

<<>>=
set.seed(54321) # make sure we always generate the same data
my.data <-
  data.frame(x = 1:5,
             y = rnorm(5),
             greek.label = paste("alpha[", 1:5, "]", sep = ""))
@

In the next example, the $x$-axis label is a Greek $\alpha$ character with $i$ as subscript, and the $y$-axis label includes a superscript in the units. For the title we use a character string, but the subtitle is a rather complex expression.

Each observation has as data label a subscripted $\alpha$. When using a \emph{geometry}, instead of expressions we map character strings that can be parsed into expressions, in other words, character strings written using the syntax of expressions. We need to set \code{parse = TRUE} in the call to the \emph{geometry} so that the strings, instead of being plotted as is, are parsed into expressions before the plot is rendered.

<<>>=
ggplot(my.data, aes(x, y, label = greek.label)) +
  geom_point() +
  geom_text(angle = 45, hjust = 1.2, parse = TRUE) +
  labs(x = expression(alpha[i]),
       y = expression(Speed~~(m~s^{-1})),
       title = "Using expressions",
       subtitle = expression(sqrt(alpha[1] + frac(beta, gamma))))
@

As parsing character strings is an alternative way of creating expressions, this approach can also be used in other situations. For example, a character string stored in a variable can be parsed with \Rfunction{parse()}, as done below for the \code{title}. Below, tick labels are also set to expressions, taking advantage of the fact that \Rfunction{expression()} accepts multiple arguments separated by commas, returning a vector of expressions.

<<>>=
my_eq.char <- "alpha[i]"
ggplot(my.data, aes(x, y)) +
  geom_point() +
  labs(title = parse(text = my_eq.char)) +
  scale_x_continuous(name = expression(alpha[i]),
                     breaks = c(1,3,5),
                     labels = expression(alpha[1], alpha[3], alpha[5]))
@

A different approach (no example shown) would be to use \Rfunction{parse()} explicitly for each individual label, something that might be needed if the tick labels need to be ``assembled'' programmatically instead of set as constants.

\begin{explainbox}
\textbf{Differences between \Rfunction{parse()} and \Rfunction{expression()}.} Function \Rfunction{parse()} takes as an argument a character string. This is very useful, as the character string can be created programmatically. When using \Rfunction{expression()} this is not possible, except for the substitution at execution time of the values of variables into the expression. See the help pages for both functions.

Function \Rfunction{expression()} accepts its arguments without any delimiters.
Function \Rfunction{parse()} takes a single character string as the argument to be parsed, in which case quotation marks within the string need to be \emph{escaped} (using \code{\backslash"} where a literal \code{"} is desired). In both cases, a character string can be embedded using one of the functions \Rfunction{plain()}, \Rfunction{italic()}, \Rfunction{bold()} or \Rfunction{bolditalic()}, which also affect the font used. The argument to these functions needs to be a character string delimited by quotation marks if it is not to be parsed.

When using \Rfunction{expression()}, bare quotation marks can be embedded,

<<>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(expression(x[1]*" test"))
@

while in the case of \Rfunction{parse()} they need to be \emph{escaped},

<<>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]*\" test\""))
@

and in some cases they will be enclosed within a formatting function.

<<>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]*italic(\" test\")"))
@

Some additional remarks: if \Rfunction{expression()} is passed multiple arguments, it returns a vector of expressions. Where \Rfunction{ggplot()} expects a single value as an argument, as in the case of axis labels, only the first member of the vector will be used.

<<>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(expression(x[1], " test"))
@

Depending on the location within an expression, spaces may be ignored, or may be illegal. To juxtapose elements without adding space use \code{*}; to explicitly insert white space, use \code{\textasciitilde}. As shown above, spaces are accepted within quoted text. Consequently, the following alternatives can also be used.

<<>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]~~~~\"test\""))
@

<<>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]~~~~plain(test)"))
@

However, unquoted white space is discarded.

<<>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(parse(text = "x[1]*plain( test)"))
@

Finally, it can be surprising that trailing zeros in numeric values appearing within an expression are dropped.
\end{explainbox}

Function \Rfunction{paste()} was used above to insert values stored in a variable; functions \Rfunction{format()}, \Rfunction{sprintf()}, and \Rfunction{strftime()} allow the conversion of other values into character strings. These functions can be used when creating plots to generate suitable character strings for the \code{label} \emph{aesthetic} out of numeric, logical, date, time, and even character values. They can, for example, be used to create labels within a call to \code{aes()}.

<<>>=
sprintf("log(%.3f) = %.3f", 5, log(5))
sprintf("log(%.3g) = %.3g", 5, log(5))
@

\begin{playground}
Study the chunk above. If you are familiar with \langname{C} or \langname{C++}, function \Rfunction{sprintf()} will already be familiar to you; otherwise, study its help page.

Play with functions \Rfunction{format()}, \Rfunction{sprintf()}, and \Rfunction{strftime()}, formatting different types of data into character strings of different widths, with different numbers of digits, etc.
\end{playground}

It is also possible to substitute the value of variables or, in fact, the result of evaluation, into a new expression, allowing the construction of expressions on the fly. Such expressions are frequently used as labels in plots.
This is achieved through the use of \emph{quoting} and \emph{substitution}.

Function \Rfunction{bquote()} can be used to substitute variables or expressions enclosed in \code{.( )} by their value. Be aware that the argument to \Rfunction{bquote()} needs to be written as an expression; in this example a tilde, \code{\textasciitilde}, inserts a space between words. Furthermore, if the expressions include variables, these will be searched for in the environment rather than in \code{data}, except within a call to \code{aes()}.

<<>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  labs(title = bquote(Time~zone: .(Sys.timezone())),
       subtitle = bquote(Date: .(as.character(today())))
       )
@

In the case of \Rfunction{substitute()}, a named list can be passed as an argument.

<<>>=
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  labs(title = substitute(Time~zone: tz, list(tz = Sys.timezone())),
       subtitle = substitute(Date: date, list(date = as.character(today())))
       )
@

For example, substitution can be used to assemble an expression within a function, based on the arguments passed. One case of interest is to retrieve, from within a function, the name of the object passed as an argument.

<<>>=
deparse_test <- function(x) {
  print(deparse(substitute(x)))
}

a <- "saved in variable"

deparse_test("constant")
deparse_test(1 + 2)
deparse_test(a)
@

\index{plots!math expressions|)}

\section{Creating complex data displays}\label{sec:plot:composition}
\index{plots!modular construction|(}

The grammar of graphics\index{grammar of graphics}\index{plots!layers} allows one to build and test plots incrementally. In daily use, when creating a completely new plot, it is best to start with a simple design for the plot, \code{print()} this plot, and check that the output is as expected and the code error-free. Afterwards, one can gradually map additional \emph{aesthetics} and add \emph{geometries} and \emph{statistics}. The final steps are then to add \emph{annotations} and the text or expressions used for titles, and axis and key labels. Another approach is to start with an existing plot and modify it, e.g., by using the same plotting code with different \code{data} or mapping different variables. When reusing code for a different data set, scale \code{limits} and \code{names} are likely to need to be edited.

\begin{playground}
Build a graphically complex data plot of your interest, step by step. By step by step, I do not refer to using the grammar in the construction of the plot as earlier, but to taking advantage of this modularity to test intermediate versions in an iterative design process: first building up the complex plot in stages as a tool in debugging, and later using iteration in the process of improving the graphic design of the plot and its readability and effectiveness.
\end{playground}

\section{Creating sets of plots}\label{sec:plot:sets:of}
\index{plots!consistent styling}\index{plots!programmatic construction|(}
Plots to be presented at a given occasion or published as part of the same work need to be consistent in various respects: themes, scales and palettes, annotations, titles and captions. To guarantee this consistency, we need to build plots modularly and avoid repetition by assigning names to the ``modules'' that need to be used multiple times.

A simple version of this approach was used in many examples above, where a base plot was modified by the addition of different layers or scales.
\subsection{Saving plot layers and scales in variables}

When creating plots with \ggplot,\index{plots!reusing parts of} objects are composed using operator \code{+} to assemble the individual components. The functions that create plot layers, scales, etc.\ are constructors of objects, and the objects they return can be stored in variables and, once saved, added to multiple plots at a later time.

A plot can be saved to variable \code{p.base} and, e.g., the value returned by a call to function \code{labs()} into a different variable, \code{p.labels}.

<<>>=
p.base <- ggplot(data = mtcars,
                 aes(x = disp, y = mpg,
                     color = factor(cyl))) +
  geom_point()

p.labels <- labs(x = "Engine displacement",
                 y = "Fuel economy (miles/gallon)",
                 color = "Number of\ncylinders",
                 shape = "Number of\ncylinders")
@

\begin{warningbox}
When composing plots with the \code{+} operator, the left-hand-side operand must be a \code{"gg"} object. The right-hand-side operand is added to this \code{"gg"} plot object and the result is returned as a \code{"gg"} plot object.
\end{warningbox}

The final plot can be assembled from the objects saved to variables. This is useful when creating several plots that should have consistent labels. The same approach can be used with other components. Below, the objects are combined with additional components to create different versions of the same plot.

<<>>=
p.base
p.base + p.labels + theme_bw(16)
p.base + p.labels + theme_bw(16) + ylim(0, NA)
@

We can also save intermediate results.

<<>>=
p.log <- p.base + scale_y_log10(limits=c(8,55))
p.log + p.labels + theme_bw(16)
@

\subsection{Saving plot layers and scales in lists}

If the pieces to be put together do not include a \code{"gg"} object, they can be collected into a list and saved. When the list is added to a \code{"gg"} plot object, the members of the list are added one by one to the plot, respecting their order.

<<>>=
p.parts <- list(p.labels, theme_bw(16))
p1 + p.parts
@

\begin{playground}
Revise the code you wrote for the ``playground'' exercise in section \ref{sec:plot:composition}, but this time pre-building and saving groups of elements that you expect to be useful unchanged when composing a different plot of the same type, or a plot of a different type from the same data.
\end{playground}

\subsection{Using functions as building blocks}

The ``packaged'' plot parts sometimes need to adjust their behaviour at the time they are added to a plot. In this case a function that accepts the necessary arguments can be written, rather similarly to the example of creating a new theme by wrapping function \ggtheme{theme\_grey()} (see section \ref{sec:plot:theme:new} on page \pageref{sec:plot:theme:new}). These functions can return a \code{"gg"} object, a list of plot components, or a single plot component. The simplest use is to alter some defaults in existing constructor functions returning \code{"gg"} objects or layers. The ellipsis (\code{...}) allows passing named arguments to a nested function. In this case, every single argument passed by name to \code{bw\_ggplot()} will be copied as an argument to the nested call to \code{ggplot()}. Be aware that supplying arguments by position is possible only for parameters explicitly included in the definition of the wrapper function,

<<>>=
bw_ggplot <- function(...) {
  ggplot(...) +
    theme_bw()
}
@

which could be used as follows.
<<>>=
bw_ggplot(data = mtcars,
          aes(x = disp, y = mpg,
              color = factor(cyl))) +
  geom_point()
@

\index{plots!programmatic construction|)}
\index{plots!modular construction|)}

\section{Generating output files}\label{sec:plot:render}
\index{devices!output|see{graphic output devices}}
\index{plots!saving to file|see{plots, rendering}}
\index{graphic output devices|(}
\index{plots!rendering|(}
It is possible, when using \RStudio, to directly export the displayed plot to a file using a menu. However, if the file will have to be generated again at a later time, or a series of plots need to be produced with consistent format, it is best to include the commands to export the plot in the script.

In \Rlang,\index{plots!printing}\index{plots!saving}\index{plots!output to files} files are created by printing to different devices. Printing is directed to the currently open device, such as a window in \RStudio. Some devices produce screen output, others files. Devices depend on drivers. There are both devices that are part of \Rlang and additional ones defined in contributed packages.

Creating a file involves, in sequence, opening a device, printing and closing the device. In most cases the file remains locked until the device is closed.

For example, when rendering a plot to\index{plots!PDF output} PDF, Encapsulated PostScript, SVG or other vector graphics formats, the arguments passed to \code{width} and \code{height} are expressed in inches.

<<>>=
fig1 <- ggplot(data.frame(x = -3:3), aes(x = x)) +
  stat_function(fun = dnorm)
pdf(file = "fig1.pdf", width = 8, height = 6)
print(fig1)
dev.off()
@

For Encapsulated PostScript\index{plots!Postscript output} and SVG\index{plots!SVG output} output, we only need to substitute \code{pdf()} with \code{postscript()} or \code{svg()}, respectively.

<<>>=
postscript(file = "fig1.eps", width = 8, height = 6)
print(fig1)
dev.off()
@

In the case of graphics devices for\index{plots!bitmap output} file output in BMP, JPEG, PNG and TIFF bitmap formats, the arguments passed to \code{width} and \code{height} are expressed in pixels.

<<>>=
tiff(file = "fig1.tiff", width = 1000, height = 800)
print(fig1)
dev.off()
@
\index{plots!rendering|)}
\index{graphic output devices|)}

\begin{warningbox}
Some graphics devices are part of base \Rlang, and others are implemented in contributed packages. In some cases, there are multiple graphic devices available for rendering graphics in a given file format. These devices usually use different libraries, or have been designed with different aims. These alternative graphic devices can also differ in their function signature, i.e., have differences in the parameters and their names. In cases when rendering fails inexplicably, it can be worthwhile to switch to an alternative graphics device to find out if the problem is in the plot or in the rendering engine.
\end{warningbox}

\section{Debugging ggplots}

Package \pkgname{gginnards} provides an enhanced \code{str()} method and functions \code{num\_layers()}, \code{top\_layer()}, \code{bottom\_layer()}, and \code{mapped\_vars()}. It also defines \emph{geometries} and \emph{statistics} that, instead of creating a layer, print the data they receive through parameter \code{data}. These are simple functions that, even if dependent on \ggplot internals, are not prone to break easily with \ggplot updates.
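A sketch of the use of these helpers follows, on a small plot built here for the purpose (assuming \pkgname{gginnards} is attached, as elsewhere in this chapter).

<<>>=
p.debug <- ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)
num_layers(p.debug)  # how many layers does the plot contain?
mapped_vars(p.debug) # which variables are mapped to aesthetics?
@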
Package \pkgname{ggtrace} provides much more detailed and sophisticated approaches to explore the internals of \code{"gg"} plot objects. Package \ggplot itself gives access to some object components.

Of these tools, \gggeom{geom\_debug()} is probably the most intuitive to use, both on its own and as the \emph{geometry} used by \emph{statistics}.

<<>>=
ggplot(data = iris, mapping = aes(x = Petal.Length, y = Species)) +
  stat_summary(geom = "debug")
@

<<>>=
ggplot(data = iris, mapping = aes(x = Petal.Length)) +
  stat_bin(geom = "debug")
@

\section{Further reading}
An\index{further reading!grammar of graphics}\index{further reading!plotting} in-depth discussion of the many extensions to package \pkgname{ggplot2} is outside the scope of this book. Several books describe in detail the use of \pkgname{ggplot2}, with \citebooktitle{Wickham2016} \autocite{Wickham2016} being the one written by the main author of the package. For inspiration or worked out examples, the book \citebooktitle{Chang2018} \autocite{Chang2018} is an excellent reference. In-depth explanations of the technical aspects of \Rlang graphics are available in the book \citebooktitle{Murrell2019} \autocite{Murrell2019}.

<<>>=
try(detach(package:tidyverse))
try(detach(package:lubridate))
try(detach(package:ggbeeswarm))
try(detach(package:ggpmisc))
try(detach(package:gginnards))
try(detach(package:ggrepel))
try(detach(package:ggplot2))
try(detach(package:scales))
try(detach(package:dplyr))
try(detach(package:tibble))
try(detach(package:learnrbook))
@

<<>>=
knitter_diag()
R_diag()
other_diag()
@

% !Rnw root = appendix.main.Rnw

<<>>=
set_parent('r4p.main.Rnw')
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'scripts-chunk')
@

\chapter{Base R: ``Paragraphs'' and ``Essays''}\label{chap:R:scripts}
\index{scripts}

\begin{VF}
An \Rlang script is simply a text file containing (almost) the same commands that you would enter on the command line of \Rlang.

\VA{Jim Lemon}{\emph{Kickstarting R}}\nocite{LemonND}
\end{VF}

%\dictum[\href{https://cran.r-project.org/doc/contrib/Lemon-kickstart/}{Kickstarting R}]{An R script is simply a text file containing (almost) the same commands that you would enter on the command line of R.}\vskip2ex

\section{Aims of this chapter}

For those who have mainly used graphical user interfaces, understanding why and when scripts can help in communicating a certain data analysis protocol can be revelatory. As soon as a data analysis stops being trivial, describing the steps followed through a system of menus and dialogue boxes becomes extremely tedious.

Moreover, graphical user interfaces tend to be difficult to extend or improve in a way that keeps step-by-step instructions valid across program versions and operating systems.

Many times, exactly the same sequence of commands needs to be applied to different data sets, and scripts make both the implementation and the validation of such a requirement easy.
In this chapter, I will walk you through the use of \Rpgrm scripts, starting from an extremely simple script.

\section{Writing scripts}

In the \Rlang language, the closest match to a natural language essay is a script. A script is built from multiple interconnected code statements needed to complete a given task. Simple statements, equivalent to sentences, can be combined into compound statements, equivalent to natural language paragraphs. Frequently, we combine simple sequences of statements into the longer sequence of actions necessary to complete a task. The sequence is not necessarily linear, as branching and repetition are also available.

Scripts can vary from simple ones containing only a few code statements, to complex ones containing hundreds of code statements. In the rest of the present section I discuss how to write readable and reliable scripts and how to use them.

\subsection{What is a script?}\label{sec:script:what:is}
\index{scripts!definition}
A \textit{script} is a text file that contains (almost) the same commands that you would type at the \Rlang console prompt. A true script is not, for example, an MS-Word file where you have pasted or typed some \Rlang commands.

When typing commands/statements at the \Rlang console, we ``feed'' one line of text at a time. When we end the line by typing the enter key, the line of text is interpreted and evaluated. We then type the next line of text, which gets in turn interpreted and evaluated, and so on. In a script we write nearly the same text in an editor and save multiple lines containing commands into a text file. Interpretation takes place only later, when we \emph{source} the file as a whole into \Rlang.

A script file has the following characteristics.
\begin{itemize}
  \item The script is a plain text file, i.e., a file containing bytes that represent alphanumeric characters in a standardized character set like UTF-8 or ASCII.
  \item The text in the file contains valid \Rlang statements (including comments) and nothing else.
  \item Comments start at a \code{\#} and end at the end of the line.
  \item The \Rlang statements are in the file in the order in which they must be executed, respecting the line continuation rules of \Rlang.
  \item \Rlang scripts customarily have file names ending in \texttt{.r} or \texttt{.R}.
\end{itemize}

\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop, color = blue, fill = blue!15] {\textsl{Top = start}};
\node (stat2) [process, color = blue, fill = blue!15, below of=start] {\code{statement 1}};
\node (stat3) [process, color = blue, fill = blue!15, below of=stat2] {\code{statement 2}};
\node (continue) [startstop, color = blue, fill = blue!15, below of=stat3] {$\cdots$};
\node (stop) [startstop, color = blue, fill = blue!15, below of=continue] {\textsl{Bottom = end}};
\draw [arrow, color = blue] (start) -- (stat2);
\draw [arrow, color = blue] (stat2) -- (stat3);
\draw [arrow, color = blue] (stat3) -- (continue);
\draw [arrow, color = blue] (continue) -- (stop);
\end{tikzpicture}
\end{small}
  \caption[Code statements in a script.]{Diagram of a script showing sequentially evaluated code statements; \textcolor{blue}{$\cdots$} represents additional statements in the script.}\label{fig:script}
\end{figure}

The statements in the text file are read, interpreted and evaluated sequentially, from the start to the end of the file, as represented in the diagram (Figure \ref{fig:script}).
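As a minimal sketch, the complete contents of a valid script file could be as follows (the computation itself is made up).

\begin{shaded}
\footnotesize
\begin{verbatim}
# compute and display the area of a circle
radius <- 2.5
area <- pi * radius^2
print(area)
\end{verbatim}
\end{shaded}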
As we will see later in the chapter, code statements can be combined into larger statements and evaluated conditionally and/or repeatedly, which allows us to control the realised sequence of evaluated statements.

In addition to being valid, it is important that scripts are also understandable to humans; consequently, a clear writing style and consistent adherence to it are important.

It is good practice to write scripts so that they are self-contained. To make a script self-contained, one must include code to load the packages used, load or import data from files, perform the data analysis, and display and/or save the results of the analysis. Such scripts can be used to apply the same analysis algorithm to other data by reading data from a different file, and/or to reproduce the same analysis at a later time using the same data. Such scripts document all the steps used for the analysis.

<<>>=
show.results <- FALSE
@

\subsection{How do we use a script?}\label{sec:script:using}
\index{scripts!sourcing}

A script can be ``sourced'' using function \Rfunction{source()}. If a text file called \texttt{my.first.script.r} contains the text
\begin{shaded}
\footnotesize
\begin{verbatim}
# this is my first R script
print(3 + 4)
\end{verbatim}
\end{shaded}

it can be sourced by typing at the \Rpgrm console

<<eval=FALSE>>=
source("my.first.script.r")
@

Execution of the statements in the file makes \Rlang display \code{[1] 7} at the console, below the command we typed. The commands themselves are not shown (by default the sourced file is not \emph{echoed} to the console) and the results of computations are not printed unless one includes explicit \Rfunction{print()} commands in the script.

Scripts can be run either by sourcing them into an open \Rlang session, or at the operating system command prompt (see section \ref{sec:intro:using:R} on page \pageref{sec:intro:using:R}). In \RStudio, the script in the currently active editor tab can be sourced using the ``source'' button. The drop-down menu of this button has three entries: ``Source'', which sources the script quietly into the \Rlang console; ``Source with echo'', which shows the code as it is run; and ``Source as local job'', which uses a new instance of \Rlang in the background. In the last case, the \Rlang console remains free for other uses while the script is running.

When a script is \emph{sourced}, the output can be saved to a text file instead of being shown in the console. It is also easy to call \Rpgrm with the \Rlang script file as an argument directly at the operating system shell or command-interpreter prompt, and obviously also from shell scripts. The next two examples show commands entered at the OS shell command prompt rather than at the \Rlang command prompt.

\begin{shaded}
\footnotesize
\begin{verbatim}
Rscript my.first.script.r
\end{verbatim}
\end{shaded}

You can open an operating system's \emph{shell} from the Tools menu in \RStudio, to run this command. The output will be printed to the shell console. If you would like to save the output to a file, use output redirection using the operating system's syntax.

\begin{shaded}
\footnotesize
\begin{verbatim}
Rscript my.first.script.r > my.output.txt
\end{verbatim}
\end{shaded}
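From within an \Rlang session, a similar effect can be obtained with function \Rfunction{sink()}, which diverts console output to a file; passing \code{echo = TRUE} to \Rfunction{source()} additionally shows each statement as it is run. A sketch:

<<eval=FALSE>>=
sink("my.output.txt")                     # divert output to a file
source("my.first.script.r", echo = TRUE)  # echo statements as they run
sink()                                    # restore output to the console
@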
While developing or debugging a script, one usually wants to run (or \emph{execute}) one or a few statements at a time. This can be done in \RStudio using the ``run'' button, after either positioning the cursor on the line to be executed, or selecting the text to be run (the selected text can be part of a line, a whole line, or a group of lines, as long as it is syntactically valid). The keyboard shortcut Ctrl-Enter is equivalent to pressing the ``run'' button.

\subsection{How to write a script}\label{sec:script:writing}
\index{scripts!writing}

As with any type of writing, different approaches may be preferred by different \Rlang users. In general, the approach used, or the mix of approaches, will also depend on how confident one is that the statements will work as expected: one may already know the best approach, or one may still be exploring different alternatives.

Three approaches are listed below. They can all result in equally good code, but as work in progress they differ. In the first approach, the script as a whole is likely to contain bugs until fully tested. In the second approach, only the most recently added statements are likely to contain bugs. In the third approach, the script contains at all times only valid \Rlang code, even if incomplete. The third approach also has the advantage that code remains in the \Rpgrm console \emph{History} and can be retrieved later, e.g., after comparison against an alternative statement.
\begin{description}
  \setlength{\itemsep}{1pt}
  \setlength{\parskip}{0pt}
  \setlength{\parsep}{0pt}
\item[If one is very familiar with similar problems] One would just create a new text file, write the whole thing in the editor, and then test it. This is rather unusual.
\item[If one is moderately familiar with the problem] One would write the script as above, but test it step by step as one is writing it, i.e., running parts of the script as one adds them. This is the approach I use most frequently.
\item[If one is mostly playing around] Then, if one is using \RStudio, one can type statements at the console prompt. As you should know by now, everything you run at the console is saved to the ``History''. In \RStudio, the History is displayed in its own pane, and in this pane one can select any previous statement(s) and, by clicking on a single icon, copy and paste them to either the \Rlang console prompt or the cursor position in the editor pane. In this way one can build a script by copying and pasting from the history to the script file the bits that have worked as intended.
\end{description}

\begin{playground}
By now you should be familiar enough with \Rlang to be able to write your own script.%
\begin{enumerate}
  \setlength{\itemsep}{1pt}
  \setlength{\parskip}{0pt}
  \setlength{\parsep}{0pt}
  \item Create a new \Rpgrm script (in \RStudio, from the File menu, leftmost ``+'' icon, or by typing ``Ctrl + Shift + N'').
  \item Save the file as \texttt{my.second.script.r}.
  \item Use the editor pane in \RStudio to type some \Rpgrm commands and comments.
  \item \emph{Run} individual commands.
  \item \emph{Source} the whole file.
\end{enumerate}
\end{playground}

\subsection{The need to be understandable to people}\label{sec:script:readability}
\index{scripts!readability}

It is not enough for program code to be understood by the computer and to return the correct answer. Both large programs and small scripts have to be readable to humans, and the intention of the code understandable. In most cases \Rlang code will be maintained, reused and modified over time.
In many cases it serves to document a given computation and to make it possible to reproduce it.

When one writes a script, it is either because one wants to document what has been done or because one plans to use it again in the future. In the first case, other persons will read it, and in the second case, one rarely remembers all the details. Thus, spending time and effort on the writing style, paying special attention to the following recommendations, is important.
\begin{itemize}
  \setlength{\itemsep}{1pt}
  \setlength{\parskip}{0pt}
  \setlength{\parsep}{0pt}
  \item Avoid the unusual. People using a certain programming language tend to follow some implicit or explicit rules of style; style includes the \textit{indentation} of statements and the \textit{capitalization} of variable and function names. As a minimum, try to be consistent with yourself.
  \item Use meaningful names for variables and any other objects. What is meaningful depends on the context. Depending on common use, a single letter may be more meaningful than a long word. However, self-explanatory names are usually better: e.g., using \code{n.rows} and \code{n.cols} is much clearer than using \code{n1} and \code{n2} when dealing with a matrix of data. Probably \code{number.of.rows} and \code{number.of.columns} would make the script verbose, and take longer to type without gaining anything in return. Sometimes, short textual explanations in comments (ignored by \Rlang) are needed to achieve readability for humans.
  \item Make the words in names visible: traditionally in \Rlang one would use dots to separate the words and use only lower case. Some years ago, it became possible to use underscores. The use of underscores is quite common nowadays because it is ``safer'', as in some situations a dot may have a special meaning. Names like \code{NumCols}, using ``camel case'', are only infrequently used in \Rlang programming but are common in other languages like \pascallang.
\end{itemize}

The \emph{Tidyverse style guide} for writing \Rlang code (\url{https://style.tidyverse.org/}) provides more detailed ``rules''. However, more important than strictly following a published guideline is to be consistent in the style used, be it by an individual, a team of programmers or data analysts, or the members of an organization. In the current book, I have not followed this guide in all respects, instead following in some cases the style used in the \Rlang documentation. However, I have attempted to be consistent.

\begin{playground}
Here is an example of bad style in a script. Edit the code in the chunk below so that it becomes easier to read.

<<eval=FALSE>>=
a <- 2 # height
b <- 4 # length
C <-
a *
b
C -> variable
print(
"area: ", variable
)
@
\end{playground}

The points discussed above already help a lot. However, one can go further in achieving the goal of human readability by interspersing explanations and code ``chunks'', and by using all the facilities of typesetting, even formatted maths formulas and equations, within the listing of the script. Furthermore, by automatically building a typeset report that includes both the code and the results of the calculations, one ensures that they match each other. This greatly contributes to data analysis reproducibility, which is becoming a widespread requirement both in academia and in industry.

This approach is called literate programming\index{literate programming} and was first proposed by \citeauthor{Knuth1984a} (\citeyear{Knuth1984a}) through his \pgrmname{WEB} system.
In the case of \Rpgrm programming, the first support for literate programming was in \pkgname{Sweave}, which has been superseded by \pkgname{knitr} \autocite{Xie2013}. This package supports the use of \Markdown or \Latex\ \autocite{Lamport1994} as the markup language for the textual contents, and also formats and applies syntax highlighting to code. \langname{Rmarkdown} is an extension to \Markdown that makes it easier to include \Rlang code in documents (see \url{http://rmarkdown.rstudio.com/}). It is the basis of \Rlang packages that support typesetting large and complex documents (\pkgname{bookdown}), web sites (\pkgname{blogdown}), package documentation sites (\pkgname{pkgdown}) and slides for presentations \autocite{Xie2016,Xie2018}. \Quarto, which provides an enhanced version of \Rmarkdown, is supported from \Rlang by package \pkgname{quarto}, which calls the separate \pgrmname{quarto} executable. The use of \pkgname{knitr} and \pkgname{quarto} is very well integrated into the \RStudio IDE.
The generation of typeset reports is outside the scope of the book, but it is an important skill to learn. It is well described in the books and web sites cited.

\subsection{Debugging scripts}\label{sec:script:debug}
\index{scripts!debugging}

The use of the word \emph{bug} to describe a problem in computer hardware and software started in 1947, when a real bug, more precisely a moth, got between the contacts of a relay in an electromechanical computer causing it to malfunction, and Grace Hopper described the first computer \emph{bug}. The use of the term bug in engineering predates its use in computer science, and consequently, the first use of bug in computing caught on easily.

A suitable quotation from a letter written by Thomas Alva Edison in 1878 \autocite[as given by][]{Hughes2004}:
\begin{quotation}
  It has been just so in all of my inventions. The first step is an intuition, and comes with a burst, then difficulties arise--this thing gives out and [it is] then that ``Bugs''--as such little faults and difficulties are called--show themselves and months of intense watching, study and labor are requisite before commercial success or failure is certainly reached.
\end{quotation}

The quoted paragraph above makes clear that only very exceptionally does any new design fully succeed. The same applies to \Rlang scripts as well as to any other non-trivial piece of computer code. From this it logically follows that testing and debugging are fundamental steps in the development of \Rlang scripts and packages. Debugging, as an activity, is outside the scope of this book. However, a clear programming style and good documentation are indispensable for efficient testing and reuse.

Even for scripts used to analyze a single data set, we need to be confident that the algorithms and their implementation are valid, and able to return correct results. This is true for scientific reports, expert reports and any data analysis related to the assessment of compliance with legislation or regulations. Of course, even in cases when we are not required to demonstrate validity, say for decision making purely internal to a private organization, we will still want to avoid costly mistakes.

The first step in producing reliable computer code is to accept that any code that we write needs to be tested and, if possible, validated. Another important step is to make sure that input is validated within the script and a suitable error produced for bad input (including valid input values falling outside the range that can be reliably handled by the script).
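The following sketch shows such a check at the top of a script, using function \Rfunction{stopifnot()} on a made-up input vector; evaluation stops with an error message if any condition is not met.

<<eval=FALSE>>=
weights <- c(10.2, 8.7, 12.1) # made-up input to be validated
# each condition must evaluate to TRUE for execution to continue
stopifnot(is.numeric(weights),
          length(weights) > 0,
          all(is.finite(weights)),
          all(weights > 0))
@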
If during testing, or during normal use, a wrong value is returned by a calculation, or no value (e.g., the script crashes or triggers a fatal error), debugging consists in finding the cause of the problem. The cause can be a mistake in the implementation of an algorithm, or a flaw in the algorithm itself. However, many apparent \emph{bugs} are caused by bad or missing handling of special cases like invalid input values, rounding errors, division by zero, etc., in which a program crashes instead of elegantly issuing a helpful error message.

Diagnosing the source of bugs is, in most cases, like detective work. One uses hunches based on common sense and experience to try to locate the lines of code causing the problem. One follows different \emph{leads} until the case is solved. In most cases, at the very bottom we rely on some sort of divide-and-conquer strategy. For example, we may check the values returned by intermediate calculations until we locate the earliest code statement producing a wrong value. Another common case is when some input values trigger a bug. In such cases it is frequently best to start by testing whether different ``cases'' of input lead to errors/crashes or not. Boundary input values are usually the telltale ones: e.g., for numbers, zero, negative and positive values, very large values, very small values, missing values (\code{NA}), vectors of length zero (\code{numeric()}), etc.

\begin{warningbox}
  \textbf{Error messages} When debugging, keep in mind that in some cases a single bug can lead to a whole cascade of error messages. Do also keep in mind that typing mistakes, originating when code is entered through the keyboard, can wreak havoc in a script: usually there is little correspondence between the number of error messages and the seriousness of the bug triggering them. When several errors are triggered, start by reading the error message printed first, as later errors can be an indirect consequence of earlier ones.
\end{warningbox}

Special tools, called debuggers, are available, and they help enormously. Debuggers allow one to step through the code, executing one statement at a time, and allow inspection of the objects present in the \Rlang environment. It is even possible to execute additional statements at the \Rpgrm console, e.g., to modify the value of a variable, while execution is paused. An \Rlang debugger is available within \RStudio and also through the \Rlang console.

When writing your first scripts, you will manage perfectly well, and learn more, by running the script one line at a time, and when needed temporarily inserting \code{print()} statements to ``look'' at how the value of variables changes at each step. A debugger allows a lot more control, as one can ``step in'' and ``step out'' of function definitions, and set and unset break points where execution will stop. However, using a debugger is not as simple as using \code{print()}.
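As a sketch of what a debugger adds, a break point can be set by inserting a call to \Rfunction{browser()} in the code; in this made-up function, execution pauses at the call and control passes to the console.

<<eval=FALSE>>=
my_function <- function(x) {
  browser() # execution pauses here: 'n' steps, 'c' continues, 'Q' quits
  sqrt(x) + 1
}
my_function(4)
@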
If you get stuck trying to find the cause of a bug, do extend your search both to the most trivial of possible causes, and later on to the least likely ones (such as a bug in a package installed from \CRAN or in \Rlang itself). Of course, when suspecting a bug in code you have not written, it is wise to very carefully read the documentation, as the ``bug'' may be just a misunderstanding of what a certain piece of code is expected to do. Also keep in mind that, as discussed on page \pageref{sec:intro:net:help}, you will be able to find online already-answered questions to many of your likely problems and doubts. For example, searching with Google for the text of an error message is usually well rewarded. Most important to remember is that bugs do pop up frequently in newly written code, and occasionally in old code. Nobody is immune to them: not the code you write, the packages you use, or \Rlang itself.

\section{Compound statements}\label{sec:script:compound:statement}
\index{compound code statements}\index{simple code statements}

Individual statements can be grouped into \emph{compound statements} by enclosing them in curly braces (Figure \ref{fig:compound:statement}). Conceptually, it is like putting these statements into a box that allows us to operate with them as an anonymous whole.

\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.7cm]
\node (start) [startstop] {\ldots};
\node (enc) [enclosure, color = blue, fill = blue!5, below of=start, yshift=-0.75cm] {\ };
\node (stat2) [process, color = blue, fill = blue!15, below of=start] {\code{statement A}};
\node (stat3) [process, color = blue, fill = blue!15, below of=stat2, yshift=+0.2cm] {\code{statement B}};
\node (stop) [startstop, below of=stat3] {\ldots};
\draw [arrow, color = blue] (start) -- (stat2);
\draw [arrow, color = blue] (stat2) -- (stat3);
\draw [arrow, color = blue] (stat3) -- (stop);
\draw [arrow, color = black] (start) -- (enc);
\draw [arrow, color = black] (enc) -- (stop);
\end{tikzpicture}
\end{small}
  \caption[Compound code statement]{Diagram of a compound code statement: a grouping of statements that in some contexts behaves as a single statement. In the diagram, statements A and B have been grouped into a compound statement.}\label{fig:compound:statement}
\end{figure}

<<>>=
print("...")
{
  print("A")
  print("B")
}
print("...")
@

The grouping of the two middle statements above is of no consequence, as it does not alter sequential evaluation. In the example above, only side effects are of interest. In the example below, the value returned by a compound statement is that returned by the last statement evaluated within it. Individual statements can be separated by an end of line, as above, or by a semicolon (\code{;}), as shown below for two statements, each of them implementing an arithmetic operation.

<<>>=
{1 + 2; 3 + 4}
@

The example above demonstrates that only the value returned by the compound statement as a whole is displayed automatically at the \Rlang console, i.e., the implicit call to \code{print()} is applied to the compound statement. Thus, even though both statements were evaluated, we only see the result returned by the second one.

\begin{playground}
Nesting is also possible. Before running the compound statement below, try to predict the value it will return; then run the code and compare your prediction to the value returned.

<<>>=
{1 + 2; {a <- 3 + 4; a + 1}}
@
\end{playground}

Grouping is of little use by itself. It becomes useful together with control-of-execution constructs, when defining functions, and in similar cases where we need to treat a group of code statements as if they were a single statement.
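As a sketch of things to come, in the chunk below a compound statement forms the branch of an \code{if} construct, and is evaluated, as a whole, only when the condition is \code{TRUE}.

<<>>=
x <- 2
if (x > 0) {
  # the two statements in the braces behave as a single statement
  message("computing square root")
  print(sqrt(x))
}
@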
We will see several examples of the use of compound statements in the current chapter and in chapter \ref{chap:R:functions} on page \pageref{chap:R:functions}.

\section{Function calls}
\index{functions!call}
We will describe functions in detail, and how to create new ones, in chapter \ref{chap:R:functions}. We have already been using functions since chapter \ref{chap:R:as:calc}. Functions are structurally \Rlang statements, in most cases compound statements, that use formal parameters as placeholders. When one calls a function, one passes arguments for the different parameters (or placeholder names), and the (compound) statement forming the \emph{body} of the function is evaluated after ``replacing'' the placeholders by the values passed as arguments.

In the first example we have two statements. The first one computes $\log_{10}(100)$ by calling function \code{log10()} with \code{100} as argument and stores the returned value in variable \code{a}. In the second statement we pass variable \code{a} as argument to \code{print()} and as a side effect the value 2 is displayed.

<<>>=
a <- log10(100)
print(a)
@

Function calls can be nested. The example above can be rewritten as follows.

<<>>=
print(log10(100))
@

The difference is that we avoid the explicit creation of a variable. Whether this is an advantage or not depends on whether we use variable \code{a} in later statements or not.

Statements with more levels of nesting than shown above become very difficult to read, so alternative notations can help.

\section{Data pipes}\label{sec:script:pipes}
\index{pipes!base R|(}
\index{pipe operator}
\index{chaining statements with \emph{pipes}}
Pipes have been at the core of shell scripting in \osname{Unix} since early stages of its design \autocite{Kernigham1981} as well as in \osname{Linux} distributions. Within an OS, pipes are chains of small programs or ``tools'' that each carry out a single well-defined task (e.g., \code{ed}, \code{sed}, \code{grep}, \code{more}, etc.). Data such as text is described as flowing from a source into a sink through a series of steps, at each of which a specific transformation takes place. In \osname{Unix} and \osname{Linux} shells like \pgrmname{sh} or \pgrmname{bash}, sinks and sources are files, but in \osname{Unix} and \osname{Linux} files are an abstraction that includes all devices and connections for input or output, including physical ones such as terminals and printers.

<<eval=FALSE>>=
stdin | grep("abc") | more
@

How can \emph{pipes} exist within a single \Rlang script? When chaining functions into a pipe, data is passed between them through temporary \Rlang objects stored in memory, which are created and destroyed automatically. Conceptually there is little difference between \osname{Unix} shell pipes and pipes in \Rlang scripts, but the implementations are different.

What do pipes achieve in \Rlang scripts? They relieve us from the responsibility of creating and deleting the temporary objects and of enforcing the sequential execution of the different steps. Pipes usually improve the readability of scripts by allowing more concise code.

Since 2021, starting from version 4.1.0, \Rlang includes a native pipe operator (\Roperator{\textbar >}) as part of the language. Subsequently, the placeholder (\code{\_}) was implemented in version 4.2.0 and its functionality expanded in version 4.3.0.
Another two implementations of pipes, which have been available as \Rlang extensions for some years in packages \pkgnameNI{magrittr} and \pkgnameNI{wrapr}, are described in chapter \ref{chap:R:data} on page \pageref{chap:R:data}.

I describe \Rlang's pipe syntax based on \Rpgrm 4.3.0. I start by showing the same operations coded using nested function calls, using explicit saving of intermediate values in temporary objects, and using the pipe operator.

Nested function calls are concise, but difficult to read when the depth of nesting increases.

<<>>=
sum(sqrt(1:10))
@

Saving intermediate results explicitly results in clear but verbose code.

<<>>=
data.in <- 1:10
data.tmp <- sqrt(data.in)
sum(data.tmp)
rm(data.tmp) # clean up!
@

A pipe using operator \Roperator{\textbar >} makes the data flow clear and keeps the code concise.

<<>>=
1:10 |> sqrt() |> sum()
@

We can assign the result of the computation to a variable, most elegantly using the \Roperator{->} operator on the \emph{rhs} of the pipe.

<<>>=
1:10 |> sqrt() |> sum() -> my_rhs.var
my_rhs.var
@

We can also use the \Roperator{<-} operator on the \emph{lhs} of the pipe, i.e., for assignments a pipe behaves as a compound statement.

<<>>=
my_lhs.var <- 1:10 |> sqrt() |> sum()
my_lhs.var
@

Formally, the \Roperator{\textbar >} operator from base \Rlang takes two operands, just like operator \code{+} does. The value returned by the \emph{lhs} (left-hand side) operand, which can be any \Rlang expression, is passed as argument to the function-call operand on the \emph{rhs} (right-hand side). The called function must accept at least one argument. This default syntax, which implicitly passes the argument by position to the first parameter of the function, would limit which functions could be used in a pipe construct. However, it is also possible to pass the piped argument explicitly by name to any parameter of the function on the \emph{rhs}, using an underscore (\code{\_}) as placeholder.

<<>>=
1:10 |> sqrt(x = _) |> sum(x = _)
@

The placeholder can be also used with extraction operators.

<<>>=
1:10 |> sqrt(x = _) |> _[2:8] |> sum(x = _)
@

\begin{explainbox}
Base \Rlang functions like \Rfunction{subset()} have a signature that is natural for use in pipes, as the piped value is implicitly passed as argument to their first formal parameter, while others like \Rfunction{assign()} do not. For example, when calling function \code{assign()} to save a value using a name available as a character string, we would like to pass the piped value as argument to parameter \code{value}, which is not the first. In such cases we can use \code{\_} as a placeholder and pass it by name.

<<>>=
obj.name <- "data.out"
1:10 |> sqrt() |> sum() |> assign(x = obj.name, value = _)
@

Alternatively, we can define a wrapper function with the desired order for the formal parameters. This approach can be worthwhile when the same function is called repeatedly within a script.

<<>>=
value_assign <- function(value, x, ...) {
  assign(x = x, value = value, ...)
}
obj.name <- "data.out"
1:10 |> sqrt() |> sum() |> value_assign(obj.name)
@

\end{explainbox}

In general, whenever we use temporary variables to store values that are passed as arguments only once, we can nest or chain the statements, making the saving of intermediate results into a temporary variable implicit instead of explicit. Examples of some useful idioms follow.
Addition of computed variables to a data frame using \Rfunction{within()} (see section \ref{sec:calc:df:with} on page \pageref{sec:calc:df:with}) and selecting rows with \Rfunction{subset()} (see section \ref{sec:calc:df:subset} on page \pageref{sec:calc:df:subset}) are combined in our first simple example. For clarity, we use the \code{\_} placeholder to indicate the value returned by the preceding function in the pipe.

<<>>=
data.frame(x = 1:10, y = rnorm(10)) |>
  within(data = _,
         {
           x4 <- x^4
           is.large <- x^4 > 1000
         }) |>
  subset(x = _, is.large)
@

\begin{playground}
Without the \code{\_} placeholder and with a more compact layout, the code above becomes that shown below. Compare the code below to that above to work out how the code was simplified.

<<>>=
data.frame(x = 1:10, y = rnorm(10)) |>
  within({x4 <- x^4; is.large <- x^4 > 1000}) |>
  subset(is.large)
@
\end{playground}

\Rfunction{subset()} can also be used to select variables or columns from data frames and matrices.

<<>>=
data.frame(x = 1:10, y = rnorm(10)) |>
  within(data = _,
         {
           x4 <- x^4
           is.large <- x^4 > 1000
         }) |>
  subset(x = _, is.large, select = -x)
@

<<>>=
data.frame(x = 1:10, y = rnorm(10)) |>
  within(data = _,
         {
           x4 <- x^4
           is.large <- x^4 > 1000
         }) |>
  subset(x = _, select = c(y, x4))
@

<<>>=
data.frame(group = factor(rep(c("T1", "T2", "Ctl"), each = 4)),
           y = rnorm(12)) |>
  subset(x = _, group %in% c("T1", "T2")) |>
  aggregate(data = _, y ~ group, mean)
@

The extraction operators are accepted on the \emph{rhs} of a pipe only starting from \Rpgrm 4.3.0. With these versions \code{\_[["y"]]}, as shown below, as well as its equivalent \code{\_\$y}, can be used. Function \Rfunction{getElement()}, used as \code{getElement("y")}, being a normal function, can be used in situations where operators are not accepted, like on the \emph{rhs} of \Roperator{|>} in older versions of \Rlang.

<<>>=
data.frame(group = factor(rep(c("T1", "T2", "Ctl"), each = 4)),
           y = rnorm(12)) |>
  subset(x = _, group %in% c("T1", "T2")) |>
  aggregate(data = _, y ~ group, mean) |>
  _[["y"]]
@

Additional functions designed to be used in pipes are available through packages as described in chapter \ref{chap:R:data}.

\begin{playground}
  In the last three examples, in which of the function calls is the explicit use of the placeholder needed, and in which ones is it optional? Hint: edit the code, removing the parameter name, \code{=}, and \code{\_}, and test whether the edited code works and returns the same value as before.
\end{playground}
\index{pipes!base R|)}

\section{Conditional evaluation}\label{sec:script:flow:control}
\index{control of execution flow}
By default \Rlang statements in a script are evaluated (or executed) in the sequence they appear in the script \textit{listing} or text. We give the name \emph{control of execution constructs} to those special statements that allow us to alter this default sequence, by either skipping or repeatedly evaluating individual statements. The statements whose evaluation is controlled can be either simple or compound. Some of the control of execution flow statements function like \emph{ON-OFF switches} for program statements. Others allow statements to be executed repeatedly while or until a condition is met, or until all members of a list or a vector are processed.

These \emph{control of execution constructs} can be also used at the \Rlang console, but it is usually awkward to do so as they can extend over several lines of text.
In simple scripts, the \emph{flow of execution} can be fixed and linear from the first to the last statement in the script. However, \emph{control of execution constructs} are a crucial part of most useful scripts. As we will see next, a compound statement can include multiple simple or nested compound statements. \Rpgrm has two types of \emph{if}\index{conditional statements} statements, non-vectorized and vectorized.

\subsection[Non-vectorized \texttt{if}, \texttt{else} and \texttt{switch}]{Non-vectorized \code{if}, \code{else} and \code{switch}}\label{sec:script:if}
\qRcontrol{if ()}\qRcontrol{if () \ldots\ else}%

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (dec1) [decision, color = blue, fill = blue!15, below of=start, yshift=-0.3cm] {\code{if (<condition>)}};
\node (stat2) [process, color = blue, fill = blue!15, right of=dec1, xshift=3.2cm] {\code{<statement 1>}};
\node (stat3) [process, below of=dec1, yshift=-0.5cm] {\code{<statement 2>}};
\node (stop) [startstop, below of=stat3] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {\code{TRUE}} (stat2);
\draw [arrow, color=blue] (dec1) -- node[anchor=west] {\code{FALSE}} (stat3);
\draw [arrow] (stat2) |- (stat3);
\draw [arrow] (stat3) -- (stop);
\end{tikzpicture}
\end{small}
  \caption[Flowchart for \code{if} construct.]{Flowchart for \code{if} construct.}\label{fig:if:diagram}
\end{figure}

The \code{if} construct ``decides,'' depending on a \code{logical} value, whether the next code statement is executed (if \code{TRUE}) or skipped (if \code{FALSE}) (Figure \ref{fig:if:diagram}). The flow chart shows how \code{if} works: \code{<statement 1>} is either evaluated or skipped depending on the value of \code{<condition>}, while \code{<statement 2>} is always evaluated.\label{flowchart:if}

The usefulness of \emph{if} statements stems from the possibility of computing the \code{logical} value used as \code{<condition>} with comparison operators (see section \ref{sec:calc:comparison} on page \pageref{sec:calc:comparison}) and logical operators (see section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}).

We start with toy examples demonstrating how \emph{if} statements work. Later we will see examples closer to real use cases. Here \Rcontrol{if} controls the evaluation or not of the simple statement \code{print("Hello!")}.

\begin{explainbox}
We use the name \emph{flag} for a \code{logical} variable set manually, preferably near the top of the script. Real flags were used in railways to indicate to trains whether to stop or continue at stations, and which route to follow at junctions. Use of \code{logical} flags in scripts is most useful when switching between two behaviours that depend on multiple separate statements.
\end{explainbox}

<<>>=
flag <- TRUE
if (flag) print("Hello!")
@

\begin{playground}
Play with the code above by changing the value assigned to variable \code{flag} to \code{FALSE}, \code{NA}, and \code{logical(0)}.

In the example above we use variable \code{flag} as the \emph{condition}.

Nothing in the \Rlang language prevents this condition from being a \code{logical} constant. Explain why \code{if (FALSE)} in the syntactically correct statement below is of no practical use.
<<>>=
if (FALSE) print("Hello!")
@
\end{playground}

Conditional execution is much more useful than what could be expected from the previous examples, because the statement whose execution is being controlled can be a compound statement of almost any length or complexity. A very simple example follows, with a compound statement containing two statements, each one a call to function \code{print()} with a different argument.

<<>>=
printing <- TRUE
if (printing) {
  print("A")
  print("B")
}
@

\begin{warningbox}
The condition passed as an argument to \code{if}, enclosed in parentheses, can be anything yielding a \Rclass{logical} vector of length one. As this condition is \emph{not} vectorized, a longer vector will trigger an \Rlang warning or error depending on \Rlang's version.
\end{warningbox}

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (dec1) [decision, color = blue, fill = blue!15, below of=start, yshift=-0.5cm] {\code{if (<condition>) else}};
\node (stat2) [process, color = blue, fill = blue!15, left of=dec1, xshift=-3.2cm] {\code{<statement 1>}};
\node (stat3) [process, color = blue, fill = blue!15, right of=dec1, xshift=3.2cm] {\code{<statement 2>}};
\node (stat4) [process, below of=dec1, yshift=-0.5cm] {\code{<statement 3>}};
\node (stop) [startstop, below of=stat4] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {\code{TRUE}} (stat2);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {\code{FALSE}} (stat3);
\draw [arrow] (stat2) |- (stat4);
\draw [arrow] (stat3) |- (stat4);
\draw [arrow] (stat4) -- (stop);
\end{tikzpicture}
\end{small}
  \caption[Flowchart for \code{if \ldots\ else} construct.]{Flowchart for \code{if \ldots\ else} construct.}\label{fig:if:else:diagram}
\end{figure}

The \code{if \ldots\ else \ldots} construct ``decides,'' depending on a \code{logical} value, which of two code statements is executed (Figure \ref{fig:if:else:diagram}). The flow chart shows how it works: either \code{<statement 1>} or \code{<statement 2>} is evaluated and the other skipped, depending on the value of \code{<condition>}, while \code{<statement 3>} is always evaluated.\label{flowchart:if:else}

<<>>=
a <- 10
if (a < 0) print("'a' is negative") else print("'a' is not negative")
print("This is always printed")
@

As can be seen above, the statement immediately following \code{if} is executed if the condition returns \code{TRUE}, and that following \code{else} is executed if the condition returns \code{FALSE}. Statements after the conditionally executed \code{if} and \code{else} statements are always executed, independently of the value returned by the condition.

\begin{playground}
Play with the code in the chunk above by assigning different numeric vectors to \code{a}.
\end{playground}

<<>>=
show.results <- TRUE
if (show.results) eval.if.4 <- c(1:4) else eval.if.4 <- FALSE
# eval.if.4
show.results <- FALSE
if (show.results) eval.if.4 <- c(1:4) else eval.if.4 <- FALSE
# eval.if.4
@

\begin{explainbox}
Do you still remember the rules about continuation lines?

<<>>=
# 1
a <- 1
if (a < 0) print("'a' is negative") else print("'a' is not negative")
@

Why does the statement below (not evaluated here) trigger an error while the one above does not?

<<eval=FALSE>>=
# 2 (not evaluated here)
if (a < 0) print("'a' is negative")
else print("'a' is not negative")
@

How do the continuation line rules apply when we add curly braces as shown below?
<<>>=
# 1
a <- 1
if (a < 0) {
  print("'a' is negative")
} else {
  print("'a' is not negative")
}
@

In the example above, we enclosed a single statement between each pair of curly braces, but as these braces create compound statements, multiple statements could have been enclosed between each pair.
\end{explainbox}

\begin{playground}
Play with the use of conditional execution, with both simple and compound statements, and also think how to combine \code{if} and \code{else} to select among more than two options.
\end{playground}

In \Rlang, the value returned by any compound statement is the value returned by the last simple statement executed within the compound one. This means that we can assign the value returned by an \code{if} and \code{else} statement to a variable. This style is less frequently used, but occasionally can result in easier-to-understand scripts.\label{chunk:if:assignment}

<<>>=
a <- 1
my.message <-
  if (a < 0) "'a' is negative" else "'a' is not negative"
print(my.message)
@

\begin{explainbox}
If the condition statement returns a value of a class other than \code{logical}, \Rlang will attempt to convert it into a logical. This is sometimes used instead of a comparison to zero, as the conversion from \code{integer} yields \code{TRUE} for all integers except zero. The code below illustrates a rather frequently used idiom for checking if there is something available to display.
<<>>=
message <- "abc"
if (length(message)) print(message)
@
\end{explainbox}

\begin{advplayground}
Study the conversion rules between \Rclass{numeric} and \Rclass{logical} values, run each of the statements below, and explain the output based on how type conversions are interpreted, remembering the difference between \emph{floating-point numbers} as implemented in computers and \emph{real numbers} ($\mathbb{R}$) as defined in mathematics (see page \pageref{box:integer:float}).

% chunk contains intentional error-triggering examples
<<eval=FALSE>>=
if (0) print("hello")
if (-1) print("hello")
if (0.01) print("hello")
if (1e-300) print("hello")
if (1e-323) print("hello")
if (1e-324) print("hello")
if (1e-500) print("hello")
if (as.logical("true")) print("hello")
if (as.logical(as.numeric("1"))) print("hello")
if (as.logical("1")) print("hello")
if ("1") print("hello")
@

Hint: if you need to refresh your understanding of the type conversion rules, see section \ref{sec:calc:type:conversion} on page \pageref{sec:calc:type:conversion}.
\end{advplayground}

\begin{figure}
  \centering
\begin{small}\label{flowchart:switch}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (dec1) [decision, color = blue, fill = blue!15, below of=start, yshift=-0.4cm] {\code{switch(<expr>)}};
\node (stat2) [process, color = blue, fill = blue!15, below of=dec1, xshift=3.4cm] {\code{<statement 1>}};
\node (stat3) [process, color = blue, fill = blue!15, below of=stat2] {\code{<statement 2>}};
\node (stat4) [process, color = blue, fill = blue!15, below of=stat3] {\code{<statement 3>}};
\node (stat5) [process, color = blue, fill = blue!15, below of=stat4] {\code{<statement 4>}};
\node (stat6) [process, below of=stat5, xshift=3.3cm] {\code{<next statement>}};
\node (stop) [startstop, below of=stat6] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) |- node[anchor=north west] {\code{<case 1>}} (stat2);
\draw [arrow, color=blue] (dec1) |- node[anchor=north west] {\code{<case 2>}} (stat3);
\draw [arrow, color=blue] (dec1) |- node[anchor=north west] {\code{<case 3>}} (stat4);
\draw [arrow, color=blue] (dec1) |- node[anchor=north west] {\code{<case 4>}} (stat5);
\draw [arrow] (stat2) -| (stat6);
\draw [arrow] (stat3) -| (stat6);
\draw [arrow] (stat4) -| (stat6);
\draw [arrow] (stat5) -| (stat6);
\draw [arrow] (stat6) -- (stop);
\end{tikzpicture}
\end{small}
  \caption{Flowchart for a \code{switch} construct.}\label{fig:switch:diagram}
\end{figure}

In addition to \Rcontrol{if} and \Rcontrol{if \ldots\ else}, there is in \Rlang a \Rcontrol{switch} statement (Figure \ref{fig:switch:diagram}). It can be used to select among \emph{cases}, or several alternative statements, based on an expression evaluating to a \code{numeric} or a \code{character} value of length equal to one.

A \Rcontrol{switch} statement returns a value, just like \code{if} does; the value returned by the \Rcontrol{switch} statement is that returned by one of the alternative statements. Each of the alternative statements is called a \textit{case}. The value passed as argument to \Rcontrol{switch} functions as an index selecting one of them.

In the first example we use a \code{character} variable as the condition, cases that are named, and a final unlabelled case as the default in case of no match. In real use, a computed value or user input would be used in place of \code{my.object}. As with the \code{logical} argument to \code{if}, the \code{character} string value passed as argument must be a vector of length one.

<<>>=
my.object <- "two"
b <- switch(my.object,
            one = 1,
            two = 1 / 2,
            four = 1 / 4,
            0
)
b
@

Multiple condition values can share the same statement.

<<>>=
my.object <- "two"
b <- switch(my.object,
            one =, uno = 1,
            two =, dos = 1 / 2,
            four =, cuatro = 1 / 4,
            0
)
b
@

\begin{playground}
  Do play with the use of the switch statement. Look at the documentation for \code{switch()} using \code{help(switch)} and study the examples at the end of the help page. Explore what happens if you set \code{my.object <- "ten"}, \code{my.object <- "three"}, \code{my.object <- NA\_character\_} or \code{my.object <- character()}. Then remove the \code{, 0} default value, and repeat.
\end{playground}

When the expression used as a condition returns a value that is not a \code{character}, it will be interpreted as an \code{integer} index. In this case no names are used for the cases, which are selected by position, and no default case can be set: when the index does not match any case, \code{NULL} is returned invisibly.
<<>>=
my.number <- 2
b <- switch(my.number,
            1,
            1 / 2,
            1 / 4,
            0
)
b
@

\begin{playground}
  Continue playing with the use of the switch statement. Explore what happens if you set \code{my.number <- 10}, \code{my.number <- 3}, \code{my.number <- NA} or \code{my.number <- numeric()}. Then remove the last case, \code{0}, and repeat.
\end{playground}

\begin{explainbox}
The statements for the cases in a \Rcontrol{switch()} statement can be compound statements, as in the case of \code{if}, and they can even be used for a side effect. For example, the example above can be edited to print a message when the default value is returned.

<<>>=
my.object <- "ten"
b <- switch(my.object,
            one = 1,
            two = 1 / 2,
            three = 1 / 4,
            {print("No match! Using default"); 0}
)
b
@
\end{explainbox}

\begin{explainbox}
  The \Rcontrol{switch} statement can substitute for chained \code{if \ldots\ else} statements when all the conditions can be described by constant values or distinct values returned by the same test. The advantage is more concise and readable code. The equivalent of the first \Rcontrol{switch} example above, when written using \code{if \ldots\ else}, becomes longer. Given how terse code using \Rcontrol{switch} is, those not yet familiar with its use may find the more verbose style used below easier to understand. On the other hand, with numerous cases a \Rcontrol{switch} statement is easier to read and understand.

<<>>=
my.object <- "two"
if (my.object == "one") {
  b <- 1
} else if (my.object == "two") {
  b <- 1 / 2
} else if (my.object == "four") {
  b <- 1 / 4
} else {
  b <- 0
}
b
@

\end{explainbox}

\begin{advplayground}
  Consider another alternative approach, the use of a named vector to map values. In most of the examples above the code for the cases is a constant value or an operation among constant values. Implement one of these examples using a named vector instead of a \Rcontrol{switch} statement.
\end{advplayground}

\subsection[Vectorized \texttt{ifelse()}]{Vectorized \code{ifelse()}}
\index{vectorized ifelse}
Vectorized \emph{ifelse} is a peculiarity of the \Rlang language, but very useful for writing concise code that may execute faster than logically equivalent non-vectorized code.
Vectorized conditional execution is coded by means of \emph{function} \Rcontrol{ifelse()} (written as a single word). This function takes three arguments: a \code{logical} vector, usually the result of a test (parameter \code{test}); an expression to use for \code{TRUE} cases (parameter \code{yes}); and an expression to use for \code{FALSE} cases (parameter \code{no}). At each index position along the vectors, the value included in the returned vector is taken from \code{yes} if the corresponding member of the \code{test} logical vector is \code{TRUE} and from \code{no} if the corresponding member of \code{test} is \code{FALSE}. All three arguments can be any \Rlang statement returning the required vectors.

The flow chart for \Rcontrol{ifelse()} is similar to that for \code{if \ldots\ else} shown on page \pageref{flowchart:if:else}, but applied in parallel to the individual members of vectors; e.g., the condition expression evaluated at index position \code{1} controls which value will be present in the returned vector at index position \code{1}, and so on.

It is customary to pass arguments to \code{ifelse} by position. We give a first example with named arguments to clarify the use of the function.
<<>>=
my.test <- c(TRUE, FALSE, TRUE, TRUE)
ifelse(test = my.test, yes = 1, no = -1)
@

In practice, the most common idiom is to pass to \code{test} the result of a comparison computed on the fly. In the example below we compute the absolute values of a vector, equivalent to those returned by \Rlang function \code{abs()}.

<<>>=
nums <- -3:+3
ifelse(nums < 0, -nums, nums)
@

\begin{warningbox}
In the case of \Rcontrol{ifelse()}, the length of the returned value is determined by the length of the logical vector passed as an argument to its first formal parameter (named \code{test})! A frequent mistake is to use a condition that returns a \code{logical} vector of length one, expecting that it will be recycled because arguments passed to the other formal parameters (named \code{yes} and \code{no}) are longer. However, no recycling will take place, resulting in a returned value of length one, with the remaining elements of the vectors passed to \code{yes} and \code{no} being discarded. Do try this by yourself, using logical vectors of different lengths. You can start with the examples below, making sure you understand why the returned values are what they are.

<<>>=
ifelse(TRUE, 1:5, -5:-1)
ifelse(FALSE, 1:5, -5:-1)
ifelse(c(TRUE, FALSE), 1:5, -5:-1)
ifelse(c(FALSE, TRUE), 1:5, -5:-1)
ifelse(c(FALSE, TRUE), 1:5, 0)
@
\end{warningbox}

\begin{playground}
Some additional examples to play with, with a few surprises. Study the examples below until you understand why the returned values are what they are. In addition, create your own examples to test other possible cases. In other words, play with the code until you fully understand how \code{ifelse()} statements work.

<<>>=
a <- 1:10
ifelse(a > 5, 1, -1)
ifelse(a > 5, a + 1, a - 1)
ifelse(any(a > 5), a + 1, a - 1) # tricky
ifelse(logical(0), a + 1, a - 1) # even more tricky
ifelse(NA, a + 1, a - 1) # as expected
@
Hint: if you need to refresh your understanding of \code{logical} values and Boolean algebra see section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}.
\end{playground}

\begin{advplayground}
Write, using \Rcontrol{ifelse()}, a single statement to combine numbers from the two vectors \code{a} and \code{b} into a result vector \code{d}, based on whether the corresponding value in vector \code{c} is the character \code{"a"} or \code{"b"}. Then print vector \code{d} to make the result visible.

<<>>=
a <- -10:-1
b <- +1:10
c <- c(rep("a", 5), rep("b", 5))
# your code
@

If you do not understand how the three vectors are built, or you cannot guess the values they contain by reading the code, print them, and play with the arguments, until you understand what each parameter does. Also use \code{help(rep)} and/or \code{help(ifelse)} to access the documentation.
\end{advplayground}

\begin{advplayground}
Continuing from the playground above, test the behaviour of \Rcontrol{ifelse()} with \code{NA}, \code{NULL} and \code{logical()} passed as arguments to \code{test}. Also test the behaviour when only some members of a logical vector are not available (\code{NA}).
\end{advplayground}

\section{Iteration}
\index{loops|seealso{iteration}}
We give the name \emph{iteration} to the process of repetitive execution of a program statement---e.g., \emph{computed by iteration}. We use the same word, \emph{iteration}, to name each one of these repetitions of the execution of a statement---e.g., \emph{the second iteration}.
Iteration constructs make it possible to ``decide'' at run time the number of iterations, i.e., when execution breaks out of the loop and continues at the next statement in the script. Iteration can be used to apply the same computations to the different members of a vector or list (this section), but also to apply different functions to members of a vector, matrix, list or data frame (section \ref{sec:R:faces:of:loops} on page \pageref{sec:R:faces:of:loops}).

In \Rlang three types of iteration loops are available: \Rloop{for}, \Rloop{while} and \Rloop{repeat} constructs. They differ in the origin of the values they iterate over, and in the type of test used to terminate iteration. When the same algorithm can be implemented with more than one of these constructs, using the least flexible of them usually results in easier-to-understand code.

In \Rlang, explicit loops as described in this section can in some cases be replaced by calls to \emph{apply} functions (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}) or by vectorized functions and operators (see page \pageref{par:calc:vectorized:opers}). The choice among these approaches affects readability and performance (see section \ref{sec:loops:slow} on page \pageref{sec:loops:slow}).

\subsection[\texttt{for} loops]{\code{for} loops}

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (entry) [below of=start, color = blue, yshift=0.5cm]{$\bullet$};
\node (dec1) [decision, color = blue, fill = blue!15, below of=entry, yshift=0.3cm] {\code{for (<list>)}};
\node (stat2) [process, color = blue, fill = blue!15, right of=dec1, xshift=3.55cm] {\code{<statement>}};
\node (stat3) [process, below of=dec1, yshift=-0.5cm] {\code{<next statement>}};
\node (stop) [startstop, below of=stat3] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {\textsl{continue}} (stat2);
\draw [arrow, color=blue] (dec1) -- node[anchor=west] {\textsl{break}} (stat3);
\draw [arrow, color = blue] (stat2) |- (entry);
\draw [arrow, color = blue] (entry) -- (dec1);
\draw [arrow] (stat3) -- (stop);
\end{tikzpicture}
\end{small}
  \caption{Flowchart for a \code{for} iteration loop.}\label{fig:for:loop:diagram}
\end{figure}

The\index{for loop}\index{iteration!for loop}\qRloop{for} most frequently used type of loop is a \code{for} loop. These loops work in \Rlang by ``walking through'' a list or vector of values to act upon (Figure \ref{fig:for:loop:diagram}). Within a \qRloop{for} loop these values are available, sequentially, one at a time, through a variable that functions as a placeholder. The implicit test for the end of the vector or list takes place at the top of the construct, before the loop statement is evaluated. The flow chart has the shape of a \emph{loop}, as the execution can be directed to an earlier position in the sequence of statements, allowing the same section of code to be evaluated multiple times, each time with a new value assigned to the placeholder variable.

In the diagram above the argument to \code{for()} is shown as \code{<list>} but it can also be a \code{vector} of any mode. Objects of most classes derived from \code{list} or from an atomic vector can also fulfil the same role. The extraction operation with a numeric index must be supported by objects of the class passed as argument.
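A minimal sketch of this flexibility (the list members are arbitrary examples): the placeholder variable takes, in turn, each member of a list, whatever its class.

<<>>=
for (x in list(1:3, "abc", TRUE)) {
  print(class(x))
}
@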
Similarly to \code{if} constructs, only one statement is controlled by \Rloop{for}; however, this statement can be a compound statement enclosed in braces \verb|{ }| (see pages \pageref{sec:script:compound:statement} and \pageref{sec:script:if}).

<<>>=
b <- 0 # variable needs to be set to a valid numeric value!
for (a in 1:5) b <- b + a
b
@

Here the statement \code{b <- b + a} is executed five times, with placeholder variable \code{a} sequentially taking each of the values 1, 2, 3, 4, and 5 in the anonymous vector \code{1:5}. The name used as placeholder has to fulfil the same requirements as an ordinary \Rlang variable name. The list or vector following \code{in} can contain any valid \Rlang objects, as long as the code statements in the loop body can handle them.

\begin{warningbox}
In a \code{for()} loop construct, the vector or list over which iteration takes place is fixed when the loop starts: even when it is stored in a variable, assignments to this variable by the code statement within the \code{for} loop do not alter the sequence of iterations.
\end{warningbox}

A\index{for loop!unrolled} loop can be ``unrolled'' into a linear sequence of statements. Let's work through the \code{for} loop above.

<<>>=
b <- 0
# start of loop
# first iteration
a <- 1
b <- b + a
# second iteration
a <- 2
b <- b + a
# third iteration
a <- 3
b <- b + a
# fourth iteration
a <- 4
b <- b + a
# fifth iteration
a <- 5
b <- b + a
# end of loop
b
@

The operation implemented in this example is a very frequent one, the sum of a vector, so base \Rlang provides a function optimized for efficiently computing it.

<<>>=
sum(1:5)
@

\begin{warningbox}
It is important to note that a list or vector of length zero is a valid argument to \code{for()}: it triggers no error, but the statements in the loop body are skipped.

<<>>=
b <- 0
for (a in numeric()) b <- b + a
print(b)
@
\end{warningbox}

By printing variable \code{b} at each iteration, the partial results can be observed. Braces are needed to form a compound statement from the two simple statements, so that \code{print(b)} is also executed at each iteration.

<<>>=
a <- c(1, 4, 3, 6, 8)
for(x in a) {
  b <- x * 2
  print(b)
}
@

\begin{warningbox}
The iteration constructs \Rloop{for}, \Rloop{while}, and \code{repeat} always silently return \code{NULL}, which is a different behaviour than that of \code{if}.

<<>>=
b <- for(x in a) x*2
x
b
@

Thus, as shown in earlier examples of \Rloop{for} loops, computed values need to be assigned to one or more variables within the loop so that they are not lost.
\end{warningbox}

While in the examples above the code directly walked through the values in the vector, an alternative approach is to walk through a sequence of indices, using the extraction operator \Roperator{[ ]} to access the values in vectors or lists. This approach makes it possible to walk through more than one list or vector. In the example below, elements of vectors \code{a} and \code{b} are accessed concurrently, \code{a} providing the input and \code{b} used to store the corresponding computed value.\label{chunk:for:example}

<<>>=
b <- numeric() # an empty vector
for(i in seq(along.with = a)) {
  b[i] <- a[i]^2
}
b
@

\begin{playground}\label{box:play:forloop}
Adding calls to \code{print()} makes visible the values taken by variables \code{i}, \code{a}, and \code{b} at each iteration. Try to understand where these values come from at each iteration, by playing with the code and modifying it.
<<>>=
b <- numeric() # an empty vector
for(i in seq(along.with = a)) {
  b[i] <- a[i]^2
  print(i)
  print(a)
  print(b)
}
b
@

The same approach of adding calls to \code{print()} can be used for debugging any code that does not return the expected results.
\end{playground}

Above I used \code{seq(along.with = a)} to build a numeric vector containing a sequence of the same length as vector \code{a}. Using this \emph{idiom} ensures that a vector, in this example \code{a}, with length zero will be handled correctly, with \code{numeric(0)} assigned to \code{b}.

\begin{advplayground}
Run the examples below and explain why the two approaches are equivalent only when the length of \code{A} is one or more. Find the answer by assigning to \code{A} vectors of different lengths, including zero (using \code{A <- numeric(0)}).

<<eval=FALSE>>=
# assign a numeric vector to variable A
B <- numeric(length(A))
for(i in seq(along.with = A)) {
  B[i] <- A[i]^2
}
B

C <- numeric(length(A))
for(i in 1:length(A)) {
  C[i] <- A[i]^2
}
C
@

\end{advplayground}

\begin{explainbox}
Using \code{seq(along.with = A)}, or its equivalent \code{seq\_along(A)}, as above creates a sequence of integers in \code{i} that indexes all members of \code{A} in the ``walk-through''. There is no requirement in \Rlang for this: including only some of the valid indexes, or including them in arbitrary order, is possible if needed; however, this is rarely the case. On exit from the loop, the iterator \code{i} remains accessible and contains its value at the last iteration.
\end{explainbox}

Vectorization usually results in the simplest and fastest code, as shown below (see section \ref{sec:loops:slow} on page \pageref{sec:loops:slow}). However, not all \Rloop{for} loops can be replaced by vectorized statements.

<<>>=
b <- a^2
b
@

\begin{explainbox}
\Rloop{for} loops as described above, in the absence of errors, have statically predictable behaviour. The compound statement in the loop will be executed once for each member of the vector or list. Special cases may require the alteration of the normal flow of execution in the loop. Two cases are easy to deal with: one is stopping iteration early with a call to \Rloop{break()}, and another is jumping ahead to the next iteration with a call to \Rloop{next()}. The example below shows the use of these two functions: we ignore negative values contained in \code{a}, and exit or break out of the loop when the accumulated sum \code{b} exceeds 100.

<<>>=
b <- 0
a <- -10:100
idxs <- seq_along(a)
for(i in idxs) {
  if (a[i] < 0) next()
  b <- b + a[i]
  if (b > 100) break()
}
b
i
a[i]
@

Hint: if you find the code in the example above difficult to understand, insert \code{print()} statements and run it again, inspecting how the values of \code{a}, \code{b}, \code{idxs} and \code{i} behave within the loop.

In \Rloop{for} loops the use of \Rcontrol{break()} and \Rcontrol{next()} should be reserved for exceptional conditions. When the \Rloop{for} construct is not flexible enough for the computations being implemented, using a \Rloop{while} or a \Rloop{repeat} loop is preferable.
\end{explainbox}

\subsection[\texttt{while} loops]{\code{while} loops}

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (entry) [below of=start, color = blue, yshift=0.5cm]{$\bullet$};
\node (dec1) [decision, color = blue, fill = blue!15, below of=entry, yshift=0.3cm] {\code{while (<condition>)}};
\node (stat2) [process, color = blue, fill = blue!15, right of=dec1, xshift=3.3cm] {\code{<statement>}};
\node (stat3) [process, below of=dec1, yshift=-0.5cm] {\code{<next statement>}};
\node (stop) [startstop, below of=stat3] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {\code{TRUE}} (stat2);
\draw [arrow, color=blue] (dec1) -- node[anchor=west] {\code{FALSE}} (stat3);
\draw [arrow, color = blue] (stat2) |- (entry);
\draw [arrow, color = blue] (entry) -- (dec1);
\draw [arrow] (stat3) -- (stop);
\end{tikzpicture}
\end{small}
  \caption{Flowchart for a \code{while} iteration loop.}\label{fig:while:loop:diagram}
\end{figure}

\Rloop{while} loops\index{iteration!while loop} are more flexible than \code{for} loops (Figure \ref{fig:while:loop:diagram}). Instead of walking through a list or vector, iteration is controlled by a logical condition of length one, just like in \code{if}. Differently to \code{if}, the controlled statement is executed repeatedly as long as the condition remains \code{TRUE}.

<<>>=
a <- 2
while (a < 50) {
  print(a)
  a <- a^2
}
print(a)
@

\begin{warningbox}
To ensure that a \code{while} loop is exited instead of looping forever, the condition, \code{a < 50} in the example above, must depend on a value that is modified by the controlled statement, like \code{a} in this case.
\end{warningbox}

\begin{playground}
Make sure that you understand why the final value of \code{a} is larger than 50.
\end{playground}

\begin{explainbox}
The statements above can be simplified by nesting the assignment inside a call to \code{print()}.

<<>>=
a <- 2
print(a)
while (a < 50) print(a <- a^2)
@

In \Rlang, statements like \code{c <- 1:5} return \emph{invisibly} (with no implicit call to \code{print()}) the value assigned. This makes possible \emph{chained} assignments to several variables within a single statement, as in the example below, as well as using an assignment statement as an argument to a function or operator.

<<>>=
a <- b <- c <- 1:5
a
@
\end{explainbox}

\begin{advplayground}
Explain why a second \code{print(a)} has been added before \code{while()}. Hint: experiment if necessary.
\end{advplayground}

As with \code{for} loops, we can use an index variable in a \Rloop{while} loop to walk through vectors and lists. The difference is that we have to update the index values explicitly in our own code. As an example, below is the code example for \code{for} from page \pageref{chunk:for:example} rewritten using \Rloop{while}.

<<>>=
a <- c(1, 4, 3, 6, 8)
b <- numeric() # an empty vector
i <- 1
while(i <= length(a)) {
  b[i] <- a[i]^2
  print(b)
  i <- i + 1
}
b
@

\begin{explainbox}
\Rloop{while} loops as described above will terminate when the condition tested is \code{FALSE}. In cases that require stopping iteration based on an additional test condition within the compound statement, we can call \Rloop{break()} in the body of an \code{if} or \code{else} statement within the \code{while} statement. As in the case of \code{for} loops, it is good to use \Rloop{break()} only for exceptional conditions.
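A minimal sketch (the bound of 100 is an arbitrary choice): the \code{while} condition caps the number of iterations, while \Rloop{break()} provides the exceptional early exit once the accumulated sum exceeds 100.

<<>>=
b <- 0
i <- 0
while (i < 1000) {
  i <- i + 1
  b <- b + i
  if (b > 100) break() # exceptional early exit
}
i
b
@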
\end{explainbox}

\subsection[\texttt{repeat} loops]{\code{repeat} loops}

\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (entry) [below of=start, color = blue, yshift=0.5cm]{$\bullet$};
\node (dec1) [process, color = blue, fill = blue!15, below of=start, yshift=-0.3cm] {\code{repeat}};
\node (stat2) [process, color = blue, fill = blue!15, right of=dec1, xshift=3.3cm] {\code{<statement>}};
\node (stat3) [process, below of=stat2, yshift=-0.1cm] {\code{<next statement>}};
\node (stop) [startstop, below of=stat3] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {} (stat2);
\draw [arrow, color=blue] (stat2) |- node[anchor=south east] {\textsl{continue}} (entry);
\draw [arrow, color=blue] (stat2) -- node[anchor=west] {\code{break()}} (stat3);
\draw [arrow, color = blue] (entry) -- (dec1);
\draw [arrow] (stat3) -- (stop);
\end{tikzpicture}
\end{small}
  \caption{Flowchart for a \code{repeat} iteration loop.}\label{fig:repeat:loop:diagram}
\end{figure}

The \Rloop{repeat}\index{iteration!repeat loop} construct is the most flexible, as iteration only stops with a call to \Rcontrol{break()}. One or more calls to \Rcontrol{break()} can be located anywhere within the compound statement that forms the body of the loop (Figure \ref{fig:repeat:loop:diagram}).

<<>>=
a <- 2
repeat{
  print(a)
  if (a > 50) break()
  a <- a^2
}
@

\begin{playground}
Try to explain why the example above returns the values it does. Use the approach of adding \code{print()} statements, as described on page \pageref{box:play:forloop}.
\end{playground}

When \code{repeat} loop constructs contain more than one call to \Rcontrol{break()}, each within a different \code{if} or \code{else} statement, indentation and/or comments can be used to highlight this infrequent use case in the listing.

\begin{advplayground}
  Explain why a \Rloop{repeat} construct is equivalent to a \Rloop{while} construct with the test condition set equal to the \code{logical} constant \code{TRUE}.
\end{advplayground}

\subsection{Nesting of loops}\label{sec:nested:loops}
\index{iteration!nesting of loops}\index{nested iteration loops}\index{loops!nested}

All the execution-flow control statements seen above can be nested, as syntactically they are themselves statements. I show an example with two \code{for} loops used to walk through rows and columns of a \code{matrix} constructed as follows.

<<>>=
A <- matrix(1:50, nrow = 10)
A
@

The nested loops below compute the sum for each row of the matrix. In the example below, the value of \code{i} changes for each iteration of the outer loop. The value of \code{j} changes for each iteration of the inner loop, and the inner loop is run in full for each iteration of the outer loop. The inner loop index \code{j} changes fastest.

<<>>=
row.sum <- numeric()
for (i in 1:nrow(A)) {
  row.sum[i] <- 0
  for (j in 1:ncol(A))
    row.sum[i] <- row.sum[i] + A[i, j]
}
print(row.sum)
@

\begin{warningbox}
The nested loops above work correctly with any two-dimensional matrix with at least one column and one row, but \emph{crash} with a matrix of zero extent (e.g., \code{matrix(numeric())}). Thus it is good practice to enclose the \Rloop{for} loop in an \Rcontrol{if} statement as protection. For the example above a suitable \code{logical} condition is \code{!is.null(dim(A)) \&\& !any(dim(A) == 0)}.
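A sketch of this protection applied to the nested loops above (returning an empty vector from the \code{else} branch is an assumption about the behaviour wanted for matrices of zero extent):

<<>>=
if (!is.null(dim(A)) && !any(dim(A) == 0)) {
  row.sum <- numeric()
  for (i in 1:nrow(A)) {
    row.sum[i] <- 0
    for (j in 1:ncol(A)) {
      row.sum[i] <- row.sum[i] + A[i, j]
    }
  }
} else {
  row.sum <- numeric() # fallback for matrices of zero extent
}
row.sum
@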
\end{warningbox}

\begin{advplayground}
1) Modify the code in the last chunk above so that it sums the values only in the first three columns of \code{A}, and 2) modify the same code so that it sums the values only in the last three rows of \code{A}.

Does the code you wrote work as expected when the number of rows in \code{A} is different from \Sexpr{nrow(A)}, and also when the number of columns in \code{A} is different from \Sexpr{ncol(A)}? What would happen if \code{A} had fewer than three columns? Try to think first what to expect based on the code you wrote. Then create matrices of different sizes and test your code. After that, if necessary, try to improve the code, so that wrong results are never returned.
\end{advplayground}

\section[Apply functions]{\emph{Apply} functions}\label{sec:data:apply}

\emph{Apply}\index{apply functions}\index{loops!faster alternatives} functions' role is similar to that of the iteration loops discussed above. One could say that apply functions ``walk along'' a vector, a list, or a dimension of a matrix or an array, calling a function with each member of the collection as argument. Notation is more concise than in \code{for} constructs. However, apply functions can be used only when the operations to be applied are \emph{independent}---i.e., the results from one iteration are not used in another iteration.

\begin{warningbox}
Conceptually, \code{for}, \code{while} and \code{repeat} loops are interpreted as controlling a sequential evaluation of program statements. In contrast, \Rlang's \emph{apply} functions are, conceptually, thought of as evaluating a function in parallel for each of the different members of their input. So, while in loops the results of earlier iterations through a loop can be stored in variables and used in subsequent iterations, this is not possible in the case of \emph{apply} functions.
\end{warningbox}

The different \emph{apply} functions in base \Rlang differ in the class of the values they accept for their \code{X} parameter, the class of the object they return and/or the class of the value returned by the applied function. \Rloop{lapply()}, \Rloop{vapply()} and \Rloop{sapply()} expect a \code{vector} or \code{list} as an argument passed through \code{X}. \Rloop{lapply()} always returns a \code{list}; \Rloop{vapply()} always \emph{simplifies} its returned value into an atomic vector, matrix or array matching the template passed to its \code{FUN.VALUE} parameter, while \Rloop{sapply()} does the simplification according to the argument passed to its \code{simplify} parameter. All these \emph{apply} functions can be used to apply an \Rlang function that returns a value of the same or a different class as its argument. In the case of \Rloop{apply()} and \Rloop{lapply()} not even the length of the values returned for each member of the collection passed as an argument needs to be consistent. Function \Rloop{apply()} is used to apply a function to the elements along one dimension of an object that has two or more \emph{dimensions}, returning an array, a list or a vector, depending on the size and on the consistency in length and class of the values returned by the applied function.

\subsection{Applying functions to vectors, lists and data frames}

I exemplify the use of \Rloop{lapply()}, \Rloop{sapply()} and \Rloop{vapply()}. Below, they are used to apply function \Rfunction{log()} to each member of a \code{numeric} vector. This is a function defined in \Rlang itself, but user-defined functions and functions imported from packages can be applied identically.
How to define new functions and packages is the subject of chapter \ref{chap:R:functions} (on page \pageref{chap:R:functions}).

\begin{warningbox}
The individual member objects in the list or vector passed as argument to parameter \code{X} of \textit{apply} functions are passed as a positional argument to the first formal parameter of the applied function, i.e., only some \Rlang functions can be passed as argument to \code{FUN}.
\end{warningbox}

<<>>=
set.seed(123456) # so that vct1 does not change
vct1 <- runif(6) # A short vector as input to keep output short
str(vct1)
@

<<>>=
z <- lapply(X = vct1, FUN = log)
str(z)
@

The code above calls \code{log()} once with each of the six members of \code{vct1} as argument and collects the returned values into a list, hence the \code{l} in \Rloop{lapply()}.

<<>>=
z <- sapply(X = vct1, FUN = log)
str(z)
@

The code above calls \code{log()} as in the previous example but collects the returned values into a vector, i.e., by default it \emph{simplifies} the list into a vector or matrix when possible, hence the \code{s} in \Rloop{sapply()}. Simplification can also be skipped, in which case a list is returned, as with \Rloop{lapply()} above.

<<>>=
z <- sapply(X = vct1, FUN = log, simplify = FALSE)
str(z)
@

\Rloop{vapply()} always returns a vector (an example is shown further below), hence the \code{v} in its name. The computed results are the same using \Rloop{lapply()}, \Rloop{sapply()} or \Rloop{vapply()}, but the class and structure of the objects returned can differ, as well as how numbers are printed.

Function \Rfunction{log()} has a second parameter named \code{base} that can be passed an argument to override the default base ($e$) used to compute natural logarithms. Additional arguments like this can be passed by name, using the name of the parameter in the function passed as argument to \code{FUN}, in this case \code{base}.

<<>>=
z <- sapply(X = vct1, FUN = log, base = 10)
str(z)
@

\begin{explainbox}
Anonymous functions can be defined (see section \ref{sec:script:functions} on page \pageref{sec:script:functions}) and directly passed as argument to \code{FUN} without the need of separately assigning them to a name.

<<>>=
z <- sapply(X = vct1, FUN = function(x) {log10(x + 1)})
str(z)
@
\end{explainbox}

As explained in section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}, class \code{data.frame} is derived from class \code{list}. The columns in a data frame are equivalent to members of a list, and functions can thus be applied to columns. The data frame \code{cars} from package \pkgname{datasets} contains data for speed and stopping distance of cars, stored in two columns or member variables named \code{speed} and \code{dist}. The members of the returned \code{numeric} vector, containing the computed means, are named accordingly.

<<>>=
sapply(X = cars, FUN = mean)
@

\begin{explainbox}
Here is a possible way of obtaining means and standard deviations of member vectors. The argument passed to \code{FUN.VALUE} provides a template for the type of the returned value and its organization into rows and columns. Notice that the rows in the output are now named according to the names in \code{FUN.VALUE}.

A function that returns a numeric vector of length 2, containing mean and standard deviation, can be defined by calling existing functions (see section \ref{sec:script:functions} on page \pageref{sec:script:functions}).
<<>>=
mean_and_sd <-
  function(x, na.rm = FALSE) {
    c(mean = mean(x, na.rm = na.rm), sd = sd(x, na.rm = na.rm))
  }
@

and \Rloop{vapply()} is used to apply it to each member vector of the list. The argument passed to \code{FUN.VALUE} serves as a template for the values returned by function \code{mean\_and\_sd()}.

<<>>=
values <- vapply(X = cars,
                 FUN = mean_and_sd,
                 FUN.VALUE = c(mean = 0, sd = 0),
                 na.rm = TRUE)
class(values)
values
@
\end{explainbox}

\begin{playground}
  Apply function \code{mean\_and\_sd()} defined above to the data frame \code{cars} from \pkgname{datasets}. The aim is to obtain the mean and standard deviation for each numeric column.
\end{playground}

\begin{advplayground}
Obtain the summary of dataset \code{airquality} with function \Rfunction{summary()}, but in addition, write code with an \emph{apply} function to count the number of non-missing values in each column. Hint: using \code{sum()} on a \code{logical} vector returns the count of \code{TRUE} values, as \code{TRUE} and \code{FALSE} are transparently converted into \code{numeric} 1 and 0, respectively, when \code{logical} values are used in arithmetic expressions.
\end{advplayground}

In the examples above the \emph{apply} functions were used to ``reduce'' the data by applying summary functions. In the next code chunk \code{lapply()} is used to construct the \code{list} of five vectors \code{ls1}, using a vector of five numbers as the argument passed to parameter \code{X}. As above, additional \emph{named} arguments are relayed to each call of \code{rnorm()}.

<<>>=
set.seed(123456)
ls1 <- lapply(X = c(v1 = 2, v2 = 5, v3 = 3, v4 = 1, v5 = 4),
              FUN = rnorm, mean = 10, sd = 1)
# names(ls1) <- paste("vect", 1:length(ls1))
str(ls1)
@

In addition to functions returning pseudo-random draws from different probability distributions, constructors for objects of various classes can be used similarly.

\subsection{Applying functions to matrices and arrays}

Matrices and arrays have two or more dimensions and, contrary to data frames, they are not a special kind of one-dimensional list. In \Rlang the dimensions of a matrix, rows and columns, over which a function is applied are called \emph{margins} (see section \ref{sec:matrix:array}, and Figure \ref{fig:matrix:margins} on page \pageref{fig:matrix:margins}). The argument passed to parameter \code{MARGIN} determines \emph{over} which margin the function will be applied. Arrays can have many dimensions (see Figure \ref{fig:array:margins} on page \pageref{fig:array:margins}), and consequently more margins. In the case of arrays with more than two dimensions, it is possible and can be useful to apply functions over multiple margins at once.

\begin{warningbox}
The individual \emph{slices} of the matrix or array passed as argument to parameter \code{X} of \textit{apply} functions are passed as a positional argument to the first formal parameter of the applied function, i.e., only some \Rlang functions can be passed as argument to \code{FUN}.
\end{warningbox}

Matrix \code{mat1} constructed here will be used in examples. Adding names helps with understanding, both here and when using matrices in real data analysis situations.

<<>>=
mat1 <- matrix(rnorm(6, mean = 10, sd = 1), ncol = 2)
mat1 <- round(mat1, digits = 1)
dimnames(mat1) <- # add row and column names
  list(paste("row", 1:nrow(mat1)), paste("col", 1:ncol(mat1)))
mat1
@

Column (or row) means of matrices can be easily computed with \Rfunction{apply()}.
However, in contrast to the other \emph{apply} functions, an argument must be passed to parameter \code{MARGIN}.

<>=
apply(mat1, MARGIN = 2, FUN = mean)
@

\begin{playground}
Edit the example above so that it computes row means instead of column means.
\end{playground}

\begin{advplayground}
As described above, we can pass arguments by name to the applied function. Can you guess why the parameter names of \emph{apply} functions are fully in uppercase, something very unusual for \Rlang coding style?
\end{advplayground}

If the function applied returns a value of the same length as its input, then the dimensions of the value returned by \Rloop{apply()} are the same as those of its input. Using the identity function \Rfunction{I()}, which returns its argument unchanged, facilitates the comparison of output against input.

<>=
z <- apply(X = mat1, MARGIN = 2, FUN = I)
dim(z)
z
@

When \code{MARGIN = 1} is passed, as below, instead of \code{MARGIN = 2} as above, rows and columns are transposed in the returned value!

<>=
z <- apply(X = mat1, MARGIN = 1, FUN = I)
dim(z)
z
@

The next, more realistic, example applies function \Rfunction{summary()}, which returns a value usually shorter than its input, but longer than one. Both for column summaries (\code{MARGIN = 2}) and row summaries (\code{MARGIN = 1}), a matrix is returned. Each column, a numeric vector in this example, contains the vector returned by a call to \Rfunction{summary()}. Column and row names from \code{mat1} are preserved, as well as the names in the value returned by \Rfunction{summary()}.

<>=
z <- apply(X = mat1, MARGIN = 2, FUN = summary)
z
@

<>=
z <- apply(X = mat1, MARGIN = 1, FUN = summary)
z
@

In all examples above, we have used ordinary functions. Binary operators in \Rlang are functions with two formal parameters which can be called using infix notation in expressions, i.e., \code{a + b}. By back-quoting their names they can be called using the same syntax as ordinary functions, and consequently also passed to the \code{FUN} parameter of \emph{apply} functions. A toy example, equivalent to the vectorized operation \code{vct1 + 5}, follows. By enclosing operator \Roperator{+} in back ticks (\code{`}) and passing a constant by name to its second formal parameter (\code{e2 = 5}), operator \Roperator{+} behaves like an ordinary function (see section \ref{sec:operator:functions} on page \pageref{sec:operator:functions}).

<>=
set.seed(123456) # so that vct1 does not change
vct1 <- runif(10)
z <- sapply(X = vct1, FUN = `+`, e2 = 5)
str(z)
@

\section{Functions that replace loops}\label{sec:vectorized:functions}

\begin{table}
  \caption[Functions that replace loops]{\Rlang functions that can substitute for iteration loops. They accept vectors as argument, except for \Rfunction{rowSums()}, \Rfunction{colSums()}, \Rfunction{rowMeans()}, and \Rfunction{colMeans()}, which accept \code{matrix} objects as argument.
Only functions that return a value with the same dimensions as their input are vectorized in the sense used in this book.\vspace{1ex}}\label{tab:vectorized:functions}
  \centering
\noindent
\begin{tabular}{lll}
  \toprule
  Function & Computation & Returned class, length \\
  \midrule
  \Rfunction{sum()}\strut & $\sum_{i=1}^n x_i$ & \code{numeric}, $1$ \\
  \Rfunction{rowSums()}\strut & $\sum_{j=1}^l x_{ij}$ & \code{numeric}, $n$ \\
  \Rfunction{colSums()}\strut & $\sum_{i=1}^n x_{ij}$ & \code{numeric}, $l$ \\
  \Rfunction{mean()}\strut & $\sum_{i=1}^n x_i / n$ & \code{numeric}, $1$ \\
  \Rfunction{rowMeans()}\strut & $\sum_{j=1}^l x_{ij} / l$ & \code{numeric}, $n$ \\
  \Rfunction{colMeans()}\strut & $\sum_{i=1}^n x_{ij} / n$ & \code{numeric}, $l$ \\
  \Rfunction{prod()}\strut & $\prod_{i=1}^n x_i$ & \code{numeric}, $1$ \\
  \Rfunction{cumsum()}\strut & $\sum_{i=1}^1 x_i, \cdots, \sum_{i=1}^j x_i, \cdots, \sum_{i=1}^n x_i$ & \code{numeric}, $n_\mathrm{out} = n_\mathrm{in}$ \\
  \Rfunction{cumprod()}\strut & $\prod_{i=1}^1 x_i, \cdots, \prod_{i=1}^j x_i, \cdots, \prod_{i=1}^n x_i$ & \code{numeric}, $n_\mathrm{out} = n_\mathrm{in}$ \\
  \Rfunction{cummax()}\strut & cumulative maximum & \code{numeric}, $n_\mathrm{out} = n_\mathrm{in}$ \\
  \Rfunction{cummin()}\strut & cumulative minimum & \code{numeric}, $n_\mathrm{out} = n_\mathrm{in}$ \\
  \Rfunction{runmed()}\strut & running median & \code{numeric}, $n_\mathrm{out} = n_\mathrm{in}$ \\
  \Rfunction{diff()}\strut & $x_2 - x_1, \cdots, x_i - x_{i-1}, \cdots, x_n - x_{n-1}$ & \code{numeric}, $n_\mathrm{out} = n_\mathrm{in}-1$ \\
  \Rfunction{diffinv()}\strut & inverse of \code{diff()} & \code{numeric}, $n_\mathrm{out} = n_\mathrm{in}+1$ \\
  \Rfunction{factorial()}\strut & $x!$ & \code{numeric}, $n_\mathrm{out} = n_\mathrm{in}$ \\
  \Rfunction{rle()}\strut & run-length encoding & \code{rle}, $n_\mathrm{out} < n_\mathrm{in}$ \\
  \Rfunction{inverse.rle()}\strut & run-length decoding & \code{vector}, $n_\mathrm{out} > n_\mathrm{in}$ \\
  \bottomrule
\end{tabular}
\end{table}

\Rlang provides several functions that can be used to avoid writing iterative loops. The most frequently used are taken for granted: \Rfunction{mean()}, \Rfunction{var()} (variance), \Rfunction{sd()} (standard deviation), \Rfunction{max()}, and \Rfunction{min()}. Replacing code that implements an iterative algorithm by a single function call simplifies a script and can make it easier to understand. These functions are written in \Clang and compiled, so even when iterative algorithms are used, they are fast (see section \ref{sec:loops:slow} on page \pageref{sec:loops:slow}). Table \ref{tab:vectorized:functions} lists several functions from base \Rlang that implement iterative algorithms. All these functions take a vector of arbitrary length as their first argument, except for the row and column summaries, which take a matrix, and \Rfunction{inverse.rle()}, which takes an object of class \code{rle}.\vspace{2ex}

\begin{playground}
  Build a \code{numeric} vector such as \code{x <- c(1, 9, 6, 4, 3)} and pass it as argument to the functions in Table \ref{tab:vectorized:functions}. Do the corresponding computations manually for the functions you find most relevant, trying to understand what values they calculate.
\end{playground}
\index{loops!faster alternatives|)}

\section{The multiple faces of loops}\label{sec:R:faces:of:loops}

\ilAdvanced\ In this advanced section I describe some uses of \Rlang loops that help with writing concise scripts.
As these make heavy use of functions, if you are reading the book sequentially, you should skip this section and return to it after reading chapters \ref{chap:R:functions} and \ref{chap:R:statistics}.

In the same way as we can assign names to \code{numeric}, \code{character} and other types of objects, we can assign names to functions and expressions. We can also create lists of functions and/or expressions. The \Rlang language has a very consistent grammar, with all lists and vectors behaving in the same way. The implication of this is that we can assign different functions or expressions to a given name, and consequently it is possible to write loops over lists of functions or expressions.

In this first example we use a \emph{character vector of function names}, and use function \Rfunction{do.call()} to call them, as it accepts both character strings and function names as argument to its first parameter. We obtain a numeric vector with named members, with names matching the function names.

<>=
vct1 <- rnorm(10)
results <- numeric()
fun.names <- c("mean", "max", "min")
for (f.name in fun.names) {
  results[[f.name]] <- do.call(f.name, list(vct1))
}
results
@

When traversing a \emph{list of functions} in a loop, we face the problem that we cannot access the original names of the functions, as what is stored in the list are the definitions of the functions. In this case, we can hold the function definitions in the loop variable (\code{f} in the chunk below) and call the functions by use of the function call notation (\code{f()}). We obtain a numeric vector with anonymous members.

<>=
results <- numeric()
funs <- list(mean, max, min)
for (f in funs) {
  results <- c(results, f(vct1))
}
results
@

We can use a named list of functions to gain full control of the naming of the results. We obtain a numeric vector with named members, with names matching the names given to the list members.

<>=
results <- numeric()
funs <- list(average = mean, maximum = max, minimum = min)
for (f in names(funs)) {
  results[[f]] <- funs[[f]](vct1)
}
results
@

Next is an example using model formulas. We use a loop to fit three models, obtaining a list of fitted models. We cannot pass this list of fitted models to \Rfunction{anova()}, as it expects each fitted model as a separate nameless argument to its \code{\ldots} parameter. We can get around this problem using function \Rfunction{do.call()} to call \Rfunction{anova()}. Function \Rfunction{do.call()} passes the members of the list passed as its second argument as individual arguments to the function being called, using their names if present. \Rfunction{anova()} expects nameless arguments, so we need to remove the names present in \code{results}.

<>=
my.data <- data.frame(x = 1:10, y = 1:10 + rnorm(10, 1, 0.1))
results <- list()
models <- list(linear = y ~ x, linear.orig = y ~ x - 1, quadratic = y ~ x + I(x^2))
for (m in names(models)) {
  results[[m]] <- lm(models[[m]], data = my.data)
}
str(results, max.level = 1)
do.call(anova, unname(results))
@

If we had no further use for \code{results}, we could simply build a list with nameless members by using positional indexing.
<>=
results <- list()
models <- list(y ~ x, y ~ x - 1, y ~ x + I(x^2))
for (i in seq(along.with = models)) {
  results[[i]] <- lm(models[[i]], data = my.data)
}
str(results, max.level = 1)
do.call(anova, results)
@

\section{Iteration when performance is important}\label{sec:loops:slow}
\index{vectorization}\index{recycling of arguments}\index{iteration}\index{loops!faster alternatives|(}
When working with large data sets, or many smaller data sets, one frequently needs to take performance into account. In \Rlang, explicit \Rloop{for}, \Rloop{while} and \Rloop{repeat} loops are considered slow to run. Vectorized operations are, in general, comparatively faster. As vectorization (see page \pageref{par:calc:vectorized:opers}) usually also makes code simpler, it is good style to use vectorization whenever possible. Depending on the case, loops can be replaced using vectorized arithmetic operators, \emph{apply} functions (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}) and functions implementing frequently used operations (see section \ref{sec:vectorized:functions} on page \pageref{sec:vectorized:functions}). Improved performance needs to be balanced against the effort invested in writing faster code, as in most cases our own time is more valuable than computer running time. However, using vectorized operators and optimized functions becomes nearly effortless once one is familiar with them.\qRloop{for}

To demonstrate the magnitude of the differences in performance that can be expected, I used as a first case the computation of the differences between successive numbers in a vector, applied to vectors of lengths ranging from 10 to 100 million numbers (Figure \ref{fig:diff:benchmarks}). In relative terms the difference in computation time between loops and vectorization was huge for vectors of up to 1\,000 numbers (near $\times 500$), but the total times were very short ($5 \times 10^{-3}$\,s vs.\ $10 \times 10^{-6}$\,s). For these vectors, pre-allocation of a vector to collect the results made almost no difference, and vectorization with the extraction operator \Roperator{[]} together with the subtraction operator \Roperator{-} was fastest. There seems to be a significant overhead for explicit loops, as the running time was nearly independent of the length of these short vectors.

For vectors of 10\,000 or more numbers there was only a very small advantage in using function \Rfunction{diff()} over using vectorized arithmetic and extraction operators. For \Rloop{while} and \Rloop{for} loops pre-allocation of the vector to collect results made an important difference ($\times 2$ to $\times 3$), larger in the case of \Rloop{for}. However, vectorized operators and function \Rfunction{diff()} remained nearly $\times 10$ faster than the fastest explicit loop. For the longer vectors the time increased almost linearly with their length, with similar slopes for the different approaches. Because of the computation used for this example, \emph{apply} functions could not be used.
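Such comparisons are easy to reproduce on a small scale. Below is a minimal sketch, not part of the benchmark scripts used for the figure, that times one explicit loop against \Rfunction{diff()} using base \Rlang's \Rfunction{system.time()}; the vector length is arbitrary.

<>=
a <- rnorm(1e6) # an arbitrary, fairly long numeric vector

# explicit for loop, pre-allocating the result vector
system.time({
  b1 <- numeric(length(a) - 1)
  for (i in seq(along.with = b1)) {
    b1[i] <- a[i + 1] - a[i]
  }
})

# vectorized function diff()
system.time(b2 <- diff(a))

all.equal(b1, b2) # both approaches return the same values
@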
\begin{figure}
\centering

<>=
opts_chunk$set(opts_fig_wide_square)
@

<>=
library(scales)
library(ggplot2)
library(patchwork)

load("benchmarks.pantera.Rda")

fig.seconds <-
  ggplot(summaries,
         aes(x = size, y = median*1e-3,
             color = loop, shape = loop)) +
  geom_point() +
  geom_line() +
  scale_x_log10(name = "Vector length (n)",
                breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_y_log10(name = "Time (s)",
                breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_color_discrete(name = "Iteration\napproach") +
  scale_shape(name = "Iteration\napproach", solid = TRUE) +
  expand_limits(y = 1e-6) +
  theme_bw(14)

fig.rel <-
  ggplot(rel.summaries,
         aes(x = size, y = median, color = loop, shape = loop)) +
  geom_point() +
  geom_line() +
  scale_x_log10(name = "Vector length (n)",
                breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_y_log10(name = "Time (relative to shortest)",
                breaks = c(1, 2, 5, 10, 20, 50, 100, 200, 500, 1000)) +
  scale_color_discrete(name = "Iteration\napproach") +
  scale_shape(name = "Iteration\napproach", solid = TRUE) +
  theme_bw(14)

print(fig.seconds / fig.rel + plot_layout(guides = "collect"))
@

<>=
opts_chunk$set(opts_fig_narrow)
@

\caption[Benchmark results for running differences.]{Benchmark results for different approaches to computing running differences in numeric (double) vectors of different lengths. The data in this figure were obtained in a computer with a 12-year-old Xeon E3-1235 CPU with four cores, 32\,GB of RAM, Windows 10 and \Rpgrm 4.3.1.}\label{fig:diff:benchmarks}
\end{figure}

\begin{explainbox}
The chunks below show the code for the six approaches compared in Figure \ref{fig:diff:benchmarks}, where \code{a} is a numeric vector of varying length constructed with function \code{rnorm()}.

<>=
b <- numeric() # do not pre-allocate memory
i <- 1
while (i < length(a)) {
  b[i] <- a[i+1] - a[i]
  i <- i + 1
}
@

<>=
b <- numeric(length(a)-1) # pre-allocate memory
i <- 1
while (i < length(a)) {
  b[i] <- a[i+1] - a[i]
  i <- i + 1
}
@

<>=
b <- numeric() # do not pre-allocate memory
for(i in seq(along.with = a[-1])) { # n - 1 iterations
  b[i] <- a[i+1] - a[i]
}
@

<>=
b <- numeric(length(a)-1) # pre-allocate memory
for(i in seq(along.with = b)) {
  b[i] <- a[i+1] - a[i]
}
@

<>=
# vectorized using extraction operators
b <- a[2:length(a)] - a[1:(length(a) - 1)]
@

<>=
# vectorized function diff()
b <- diff(a)
@
\end{explainbox}

In nested iteration loops it is most important to vectorize or otherwise enhance the performance of the innermost loop, as it is the one executed most frequently. The code for nested loops (used as example in section \ref{sec:nested:loops} on page \pageref{sec:nested:loops}) can be edited to remove the explicit use of \Rloop{for} loops. I assessed the performance of different approaches by collecting timings for square \code{matrix} objects with dimensions (rows $\times$ columns) ranging from $10 \times 10$, size = $10^2$, to $10\,000 \times 10\,000$, size = $10^8$ (Figure \ref{fig:rowsums:benchmarks}).

In this second case, pre-allocation of memory for the vector collecting the results did not enhance performance, in good agreement with the benchmarks from the first example, as the length of this vector was at most 10\,000. The two nested loops always took the longest to run irrespective of the size of matrix \code{A}.
A single loop over rows using a call to \Rfunction{sum()} for each row improved performance compared to nested loops, most clearly for large matrices. This approach was outperformed by \Rfunction{apply()} only for small matrices, from which we can infer that \Rfunction{apply()} has a much smaller overhead than an explicit \Rloop{for} loop. \Rfunction{rowSums()} was between $\times 5$ and $\times 20$ faster than the second fastest approach, depending on the size of the matrix.

\begin{figure}
\centering

<>=
opts_chunk$set(opts_fig_wide_square)
@

<>=
#library(scales)
#library(ggplot2)
#library(patchwork)

load("benchmarks-rowSums-pantera.Rda")

fig.seconds <-
  ggplot(summaries,
         aes(x = size, y = median*1e-3, color = loop, shape = loop)) +
  geom_point() +
  geom_line() +
  scale_x_log10(name = "Matrix size (n)",
                breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_y_log10(name = "Time (s)",
                breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_color_discrete(name = "Iteration\napproach") +
  scale_shape(name = "Iteration\napproach", solid = TRUE) +
  expand_limits(y = 1e-6) +
  theme_bw(14)

fig.rel <-
  ggplot(rel.summaries,
         aes(x = size, y = median, color = loop, shape = loop)) +
  geom_point() +
  geom_line() +
  scale_x_log10(name = "Matrix size (n)",
                breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_y_log10(name = "Time (relative to shortest)",
                breaks = c(1, 2, 5, 10, 20, 50, 100, 200, 500, 1000)) +
  scale_color_discrete(name = "Iteration\napproach") +
  scale_shape(name = "Iteration\napproach", solid = TRUE) +
  theme_bw(14)

print(fig.seconds / fig.rel + plot_layout(guides = "collect"))
@

<>=
opts_chunk$set(opts_fig_narrow)
@

\caption[Benchmark results for row sums.]{Benchmark results for different approaches to computing row sums of square numeric (double) matrices of different sizes. The data in this figure were obtained in a computer with a 12-year-old Xeon E3-1235 CPU with four cores, 32\,GB of RAM, Windows 10 and \Rpgrm 4.3.1.}\label{fig:rowsums:benchmarks}
\end{figure}

\begin{explainbox}
The chunks below show the code for the approaches compared in Figure \ref{fig:rowsums:benchmarks}, where \code{A} was a numeric matrix constructed with function \code{rnorm()}. The starting point is the nested \Rloop{for} loops, with the inner loop adding up the values within each row.

<>=
row.sum <- numeric(nrow(A)) # pre-allocate memory
for (i in 1:nrow(A)) {
  row.sum[i] <- 0
  for (j in 1:ncol(A)) {
    row.sum[i] <- row.sum[i] + A[i, j]
  }
}
@

The inner \Rloop{for} loop can be replaced by function \code{sum()}, which returns the sum of a vector. Within the loop, \code{A[i, ]} extracts whole rows, one at a time.

<>=
row.sum <- numeric(nrow(A)) # faster
for (i in 1:nrow(A)) {
  row.sum[i] <- sum(A[i, ])
}
@

The\index{apply functions} outer loop can be replaced by a call to \Rfunction{apply()} (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}).

<>=
row.sum <- apply(A, MARGIN = 1, sum) # MARGIN=1 indicates rows
@

Calculating row sums is a frequent operation; thus, \Rlang provides a built-in function for this.

<>=
rowSums(A)
@

The simplest way of measuring the execution time of an \Rlang expression is to use function \Rfunction{system.time()}. Package \pkgname{microbenchmark}, used for the benchmarks shown in Figures \ref{fig:diff:benchmarks} and \ref{fig:rowsums:benchmarks}, provides finer time resolution.
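As a minimal sketch of their use (assuming package \pkgname{microbenchmark} is installed; the matrix size is arbitrary):

<>=
A <- matrix(rnorm(10000), nrow = 100) # an arbitrary 100 x 100 matrix
system.time(rowSums(A)) # coarse timing of a single evaluation
# finer resolution, each expression evaluated 100 times
library(microbenchmark)
microbenchmark(apply(A, MARGIN = 1, FUN = sum), rowSums(A), times = 100)
@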
\end{explainbox}

As in these examples the computations in the body of the loop are very simple, the overhead of the iterative loops strongly affects the total computation time in these benchmarks. When the computations at each iteration are time consuming, the overhead of using explicit iteration loops gets diluted. Thus, removing the explicit use of iteration is most helpful when it is easy to implement vectorized arithmetic or to find optimized functions.

\begin{warningbox}
  The timings in Figures \ref{fig:diff:benchmarks} and \ref{fig:rowsums:benchmarks} are only valid for the specific computer configuration, operating system and \Rpgrm version that I used. They provide only an approximate guide to what can be expected in different conditions. The scripts used are included in package \pkgname{learnrbook} in case readers wish to run them on their computers. As replication is used, the total run time for the script is relatively long.
\end{warningbox}

\begin{explainbox}
  You may be wondering: how do the faster approaches manage to avoid the overhead of iteration? Of course, they do not really avoid iteration, but the loops in functions written in \Clang, \Cpplang, or \langname{FORTRAN} are compiled into machine code as part of \Rpgrm itself or when package binaries are created. In simpler words, the time required to convert and optimize the code written in these languages into machine code is spent during compilation, usually before we download and install \Rlang or packages. Instead, a loop coded in \Rlang is translated into machine code each time we source our script, and in some cases for each iteration in a loop. The \Rlang interpreter does some compilation into virtual machine code as a preliminary stage, which helps improve performance.
\end{explainbox}

The examples in this section use numbers and arithmetic operations, but vectorization and \emph{apply} functions can also be used with vectors of other modes, such as vectors of \code{character} strings or \code{logical} values.

With modern computer processors, or CPUs, splitting the tasks across multiple cores for concurrent execution can enhance performance. To some extent this happens invisibly due to optimizations in the translation into machine code. Explicit approaches are available in package \pkgname{parallel}, included in the \Rlang distribution, and in contributed packages such as \pkgname{future}. Parallelization is also possible across interconnected computers. However, how to enhance performance based on parallel or distributed execution is beyond the scope of this book.

\section{Object names as character strings}

In\index{object names}\index{object names!as character strings} all assignment examples before this section, we have used object names written literally in the code expressions. In other words, the names are ``decided'' as part of the code, rather than at run time. In scripts or packages, the object name to be assigned may need to be decided at run time and, consequently, be available only as a character string stored in a variable. In this case, function \Rfunction{assign()} must be used instead of the operators \code{<-} or \code{->}. The statements below demonstrate its use.

First using a \code{character} constant.

<>=
assign("a", 9.99)
a
@
Next using a \code{character} value stored in a variable.

<>=
name.of.var <- "b"
assign(name.of.var, 9.99)
b
@

The two toy examples above do not demonstrate why one may want to use \Rfunction{assign()}.
Common situations where we may want to use character strings to store (future or existing) object names are 1) when we allow users to provide names for objects either interactively or as \code{character} data, 2) when in a loop we traverse a vector or list of object names, or 3) when we construct object names at run time from multiple character strings based on data or settings. A common case is when we import data from a text file and want to name the object according to the name of the file on disk, or a character string read from the header at the top of the file.

Another case is when \code{character} values are the result of a computation.

<>=
for (i in 1:5) {
  assign(paste("square_of_", i, sep = ""), i^2)
}
ls(pattern = "square_of_*")
@

The complementary operation of \emph{assigning} a name to an object is to \emph{get} an object when we have its name available as a character string. The corresponding function is \Rfunction{get()}.

<>=
get("a")
get("b")
@

If we have available a character vector containing object names and we want to create a list containing these objects, we can use function \Rfunction{mget()}. In the example below we use function \code{ls()} to obtain a character vector of object names matching a specific pattern and then collect all these objects into a list.

<>=
obj_names <- ls(pattern = "square_of_*")
obj_lst <- mget(obj_names)
str(obj_lst)
@

\begin{advplayground}
Think of possible uses of functions \Rfunction{assign()}, \Rfunction{get()} and \Rfunction{mget()} in scripts you use or could use to analyze your own data (or data from other sources). Write a script to implement this, and iteratively test and revise this script until the result produced by the script matches your expectations.
\end{advplayground}

\section{Clean-up}

Sometimes we need to make sure that clean-up code is executed even if the execution of a script or function is aborted by the user or as a result of an error condition. A typical example is a script that temporarily sets a disk folder as the working directory or uses a file as temporary storage. Function \Rfunction{on.exit()} can be used to record that a user-supplied expression needs to be executed when the current function, or a script, exits. Function \Rfunction{on.exit()} can also make code easier to read, as it keeps creation and clean-up next to each other in the body of a function or in the listing of a script.

<>=
file.create("temp.file")
on.exit(file.remove("temp.file"))
# code that makes use of the file goes here
@

Function \Rfunction{library()} attaches the namespace of the loaded packages, and in some special cases one may want to detach them at the end of a script. We can use \Rfunction{detach()} similarly as with attached \code{data.frame} objects (see page \pageref{par:calc:attach}). As an example, we detach the packages used in section \ref{sec:loops:slow}. It is important to remember that the order in which they can be detached is determined by their interdependencies.

<>=
detach(package:patchwork)
detach(package:ggplot2)
detach(package:scales)
@

\section{Further reading}
For\index{further reading!the R language} further reading on the aspects of \Rlang discussed in the current chapter, I suggest the books \citetitle{Matloff2011} \autocite{Matloff2011} and \citetitle{Wickham2019} \autocite{Wickham2019}.
>>>>>>> Stashed changes
diff --git a/R.stats.rnw b/R.stats.rnw
index bd131a6e..e61eca68 100644
--- a/R.stats.rnw
+++ b/R.stats.rnw
@@ -1,4 +1,3 @@
-<<<<<<< Updated upstream
% !Rnw root = appendix.main.Rnw
<>=
opts_chunk$set(opts_fig_narrow_square)
@@ -1072,7 +1071,7 @@

Function \Rfunction{nls()} is \Rlang's workhorse for fitting non-linear models. In cases when algorithms exist for ``guessing'' suitable starting values, \Rlang provides a mechanism for packaging the \Rlang function to be fitted together with the \Rlang function generating the starting values. These functions go by the name of \emph{self-starting functions} and relieve the user from the burden of guessing and supplying suitable starting values. The\index{self-starting functions} self-starting functions available in \Rlang are \code{SSasymp()}, \code{SSasympOff()}, \code{SSasympOrig()}, \code{SSbiexp()}, \code{SSfol()}, \code{SSfpl()}, \code{SSgompertz()}, \code{SSlogis()}, \code{SSmicmen()}, and \code{SSweibull()}. Function \code{selfStart()} can be used to define new ones. All these functions can be used when fitting models with \Rfunction{nls} or \Rfunction{nlme}. Please, check the respective help pages for details.

\begin{warningbox}
In calls to \Rfunction{nls()}, the rhs of the model \code{formula} is a function call. The names of its arguments, if not present in \code{data}, are assumed to be parameters to be fitted. Below, a named function is first defined and then used on the rhs of the model formula.
\end{warningbox}

As an example, the Michaelis-Menten equation\index{Michaelis-Menten equation} describing reaction kinetics\index{chemical reaction kinetics} in biochemistry and chemistry is fitted to the \Rdata{Puromycin} data set. The mathematical formulation is given by

@@ -1700,1706 +1699,3 @@
knitter_diag()
R_diag()
other_diag()
@
-=======
% !Rnw root = appendix.main.Rnw
<>=
opts_chunk$set(opts_fig_narrow_square)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'functions-chunk')
rm(plot)
@

\chapter{Base R: ``Verbs'' and ``Nouns'' for Statistics}\label{chap:R:statistics}

\begin{VF}
The purpose of computing is insight, not numbers.

\VA{Richard W. Hamming}{\emph{Numerical Methods for Scientists and Engineers}, 1987}\nocite{Hamming1987}
\end{VF}

\section{Aims of this chapter}

This chapter aims to give the reader an introduction to the approach used in base \Rlang for the computation of statistical summaries, the fitting of models to observations and tests of hypotheses. This chapter does \emph{not} explain data analysis methods, statistical principles or experimental designs. There are many good books on the use of \Rpgrm for different kinds of statistical analyses (see further reading on page \pageref{sec:stat:further:reading}) but most of them tend to focus on specific statistical methods rather than on the commonalities among them. Although base \R's model fitting functions target specific statistical procedures, they use a common approach to model specification and to returning the computed estimates and test outcomes. This approach, also followed by many contributed extension packages, can be considered part of the philosophy behind the \Rlang language.
In this chapter you will become familiar with the approaches used in \Rlang for calculating statistical summaries, generating (pseudo-)random numbers, sampling, fitting models and carrying out tests of significance. We will use linear correlation, \emph{t}-test, linear models, generalized linear models, non-linear models and some simple multivariate methods as examples. The focus is on how to specify statistical models, contrasts and observations, how to access different components of the objects returned by the corresponding fit and summary functions, and how to use these extracted components in further computations or for customized printing and formatting.

%\emph{At present I use several examples adapted from the help pages for the functions described. I may revise this before publication.}
\section{Statistical summaries}
\index{summaries!statistical|(}
Data analysis and statistics being the main focus of the \Rlang language, it provides functions for both simple and complex calculations, going from means and variances to fitting very complex models. Table \ref{tab:stat:summaries} lists some frequently used functions. All these methods accept numeric vectors and/or matrices as arguments. In addition, function \Rfunction{quantile()} can be used to simultaneously compute multiple arbitrary quantiles for a vector of observations, and method \Rfunction{summary()} produces a summary that depends on the class of the argument passed to it. Please, see section \ref{sec:functions:sem} on page \pageref{sec:functions:sem} for how to define your own functions.

\begin{table}
  \centering
  \caption[Simple statistical summaries.]{Frequently used simple statistical summaries and the corresponding \Rlang functions.\vspace{1ex}}\label{tab:stat:summaries}
  \begin{tabular}{llll}
    \toprule
    Function & Symbol & Formulation & Name \\
    \midrule
    \Rfunction{mean()} & $\bar{x}$ & $\sum x / n$ & mean \\
    \Rfunction{var()} & $s^2$ & $\sum (x_i - \bar{x})^2 / (n - 1)$ & sample variance \\
    \Rfunction{sd()} & $s$ & $\sqrt[2]{s^2}$ & sample standard deviation \\
    \Rfunction{median()} & M or $\tilde{x}$ & & median \\
    \Rfunction{mad()} & MAD & median $|x_i - \tilde{x}|$ & median absolute deviation \\
    \Rfunction{max()} & $x_\mathrm{max}$ & & maximum \\
    \Rfunction{min()} & $x_\mathrm{min}$ & & minimum \\
    \Rfunction{range()} & $x_\mathrm{min}, x_\mathrm{max}$ & & range \\
    \bottomrule
  \end{tabular}
\end{table}

By default, if the argument contains \code{NA}s these functions return \code{NA}. The logic behind this is that if one value exists but is unknown, the true result of the computation is also unknown (see page \pageref{par:special:values} for details on the role of \code{NA} in \Rlang). However, an additional parameter called \code{na.rm} allows us to override this default behavior by requesting any \code{NA} in the input to be removed (or discarded) before calculation.

<>=
x <- c(1:20, NA)
mean(x)
mean(x, na.rm = TRUE)
@

Function \Rfunction{mean()} can be used to compute the mean from all values, as in the example above, as well as trimmed means, i.e., means computed after discarding extreme values. The argument passed to parameter \code{trim} decides the fraction of the observations to discard at \emph{each extreme} of the vector of values after ordering them from smallest to largest.
<>=
x <- c(1:20, 100)
mean(x)
mean(x, trim = 0.05)
@

\begin{playground}
  In contrast to the use of other functions, I do not provide examples of the use of all the functions listed in Table \ref{tab:stat:summaries}. Construct \code{numeric} vectors with artificial data or use real data to play with the remaining functions. Study the help pages to learn about the different parameters and their uses.% Later in the book, only the output from certain examples will be shown, with the expectation, that other examples will be run by readers.
\end{playground}

Other more advanced functions are also available in \Rlang, such as \Rfunction{boxplot.stats()}, which computes the values needed to draw boxplots (see section \ref{sec:boxplot} on page \pageref{sec:boxplot}).

In many cases you will want to compute statistical summaries by group or treatment, in addition to, or instead of, summaries for a whole data set or vector. See section \ref{sec:calc:df:aggregate} on page \pageref{sec:calc:df:aggregate} for details on how to compute summaries of data stored in data frames using base \Rlang functions, and section \ref{sec:dplyr:manip} on page \pageref{sec:dplyr:manip} for alternative functions from contributed packages.
\index{summaries!statistical|)}

\section{Standard probability distributions}\label{sec:prob:dist}
\index{probability distributions!standard|(}%
\index{probability distributions!theoretical|see{--- standard}}%
\index{Normal distribution}%
Density, distribution functions, quantile functions and generation of pseudo-random values for several different standard (theoretical) probability distributions are part of the \Rlang language. Entering \code{help(Distributions)} at the \Rlang prompt will open a help page describing all the distributions available in base \Rlang. For each distribution the different functions contain the same ``root'' in their names: \code{norm} for the normal distribution, \code{unif} for the uniform distribution, and so on. The ``head'' of the name indicates the type of values returned: ``\code{d}'' for density, ``\code{q}'' for quantile, ``\code{r}'' for (pseudo-)random draws, and ``\code{p}'' for probability (Table \ref{tab:prob:funs}).

\begin{table}
  \centering
  \caption[Standard probability distributions]{Standard probability distributions in \Rlang. Partial list of base \Rlang functions related to probability distributions.
The full list can be obtained by executing the command \code{help(Distributions)}. For the multinomial distribution, base \Rlang provides only the density and random draws.\vspace{1ex}}\label{tab:prob:funs}

  \begin{tabular}{llllll}
    \toprule
    Distribution & Symbol & Density & $P$-value & Quantiles & Draws \\
    \midrule
    Normal & $N$ & \Rfunction{dnorm()} & \Rfunction{pnorm()} & \Rfunction{qnorm()} & \Rfunction{rnorm()} \\
    Student's & $t$ & \Rfunction{dt()} & \Rfunction{pt()} & \Rfunction{qt()} & \Rfunction{rt()}\\
    F & $F$ & \Rfunction{df()} & \Rfunction{pf()} & \Rfunction{qf()} & \Rfunction{rf()} \\
    binomial & $B$ & \Rfunction{dbinom()} & \Rfunction{pbinom()} & \Rfunction{qbinom()} & \Rfunction{rbinom()} \\
    multinomial & $M$ & \Rfunction{dmultinom()} & & & \Rfunction{rmultinom()} \\
    Poisson & & \Rfunction{dpois()} & \Rfunction{ppois()} & \Rfunction{qpois()} & \Rfunction{rpois()} \\
    chi-squared & $\chi^2$ & \Rfunction{dchisq()} & \Rfunction{pchisq()} & \Rfunction{qchisq()} & \Rfunction{rchisq()} \\
    lognormal & & \Rfunction{dlnorm()} & \Rfunction{plnorm()} & \Rfunction{qlnorm()} & \Rfunction{rlnorm()} \\
    uniform & & \Rfunction{dunif()} & \Rfunction{punif()} & \Rfunction{qunif()} & \Rfunction{runif()} \\
    \bottomrule
  \end{tabular}
\end{table}

Theoretical distributions are defined by mathematical functions that accept parameters controlling the exact shape and location. In the case of the Normal distribution, these parameters are the \emph{mean} (\code{mean}), controlling location, and the \emph{standard deviation} (\code{sd}), controlling the spread around the center of the distribution. The four different functions differ in which values are calculated (the unknowns) and which values are supplied as arguments (the known inputs).

In what follows we use the Normal distribution as an example but, apart from differences in their parameters, the functions for other theoretical distributions follow a similar naming pattern.

\subsection{Density from parameters}\label{sec:prob:dens}
\index{probability distributions!density from parameters}
To obtain a single point from the distribution curve we pass a vector of length one as an argument for \code{x}.
<>=
dnorm(x = 1.5, mean = 1, sd = 0.5)
@

To obtain multiple values we can pass a longer vector as an argument.

<>=
dnorm(x = seq(from = -1, to = 1, length.out = 5), mean = 1, sd = 0.5)
@

With 50 equally spaced values for $x$ we can plot a line (\code{type = "l"}) that shows that the 50 generated data points give the illusion of a continuous curve. We also add a point showing the value of the density at $x = 2$.

<>=
vct1 <- seq(from = -1, to = 3, length.out = 50)

df1 <- data.frame(x = vct1,
                  y = dnorm(x = vct1, mean = 1, sd = 1))
plot(y~x, data = df1, type = "l", xlab = "z", ylab = "f(z)")
points(x = 2, y = dnorm(x = 2, mean = 1, sd = 1))
@

\subsection{Probabilities from parameters and quantiles}\label{sec:prob:quant}
\index{probability distributions!probabilities from quantiles}

With a known quantile value, it is possible to look up the corresponding $P$-value from the Normal distribution, i.e., the area under the curve either to the right or to the left of a given value of \code{q} (by default, integrating the lower or left tail). When working with observations, the quantile, mean and standard deviation are in most cases computed from the same observations under the null hypothesis. In the example below, we use invented values for all parameters: \code{q}, the quantile; \code{mean}, the mean; and \code{sd}, the standard deviation.
<>=
pnorm(q = 2, mean = 1, sd = 1)
pnorm(q = 2, mean = 1, sd = 1, lower.tail = FALSE)
pnorm(q = 2, mean = 1, sd = 4, lower.tail = FALSE)
pnorm(q = c(2, 4), mean = 1, sd = 1, lower.tail = FALSE)
@

\begin{explainbox}
  In tests of significance, empirical $z$-values and $t$-values are computed by subtracting the ``expected'' mean (possibly a hypothesized theoretical value, the mean of a control condition used as reference, or the mean computed over all treatments under the assumption of no effect of treatments) from the observed mean for one group or raw quantile, and then dividing by the standard deviation. Consequently, the $P$-values corresponding to these empirical $z$-values and $t$-values need to be looked up in the standard distributions, i.e., using \code{mean = 0} and \code{sd = 1} when calling \Rfunction{pnorm()} and the central $t$-distribution when calling \Rfunction{pt()}. These frequently used values are the defaults.
\end{explainbox}

\subsection{Quantiles from parameters and probabilities}\label{sec:quant:prob}
\index{probability distributions!quantiles from probabilities}

The reverse computation from that in the previous section is to obtain the quantile corresponding to a known $P$-value or area under one of the tails of the distribution curve. These quantiles are equivalent to the values in the tables of precalculated quantiles used in earlier times to assess significance with statistical tests.

<>=
qnorm(p = 0.01, mean = 0, sd = 1)
qnorm(p = 0.05, mean = 0, sd = 1)
qnorm(p = 0.05, mean = 0, sd = 1, lower.tail = FALSE)
@

\begin{warningbox}
Quantile functions like \Rfunction{qnorm()} and probability functions like \Rfunction{pnorm()} always do computations based on a single tail of the distribution, even though it is possible to specify which tail we are interested in. If we are interested in obtaining simultaneous quantiles for both tails, we need to do this manually. If we are aiming at quantiles for $P = 0.05$, we need to find the quantile for each tail based on $P / 2 = 0.025$.

<>=
qnorm(p = 0.025, mean = 0, sd = 1)
qnorm(p = 0.025, mean = 0, sd = 1, lower.tail = FALSE)
@

We see above that in the case of a symmetric distribution like the Normal, the quantiles in the two tails differ only in sign. This is not the case for asymmetric distributions.

When calculating a $P$-value from a quantile in a test of significance, we need to first decide whether a two-sided or single-sided test is relevant, and in the case of a single-sided test, which tail is of interest. For a two-sided test we need to multiply the $P$-value from the relevant tail by 2; for a quantile in the upper tail we must also pass \code{lower.tail = FALSE}.

<>=
pnorm(q = 4, mean = 0, sd = 1, lower.tail = FALSE) * 2
@

\end{warningbox}

\subsection{``Random'' draws from a distribution}\label{sec:stat:random}
\index{random draws|see{probability distributions, pseudo-random draws}}\index{probability distributions!pseudo-random draws}

True random sequences can only be generated by physical processes. All ``pseudo-random'' sequences of numbers generated by computation are really deterministic, although they share some properties with true random sequences (e.g., in relation to autocorrelation).

It is possible to compute pseudo-random draws not only from a uniform distribution but also from the Normal, $t$, $F$ and other distributions. In each case, the probability with which different values are ``drawn'' approximates the probabilities set by the corresponding theoretical distribution.
Parameter \code{n} indicates the number of values to be drawn or, equivalently, the length of the vector returned (see section \ref{sec:plot:histogram} on page \pageref{sec:plot:histogram} for example plots).\qRfunction{rnorm()}\qRfunction{runif()}%

<>=
set.seed(12234)
rnorm(5)
rnorm(n = 10, mean = 10, sd = 2)
@

\begin{playground}
Edit the examples in sections \ref{sec:prob:quant}, \ref{sec:quant:prob} and \ref{sec:stat:random} to do computations based on different distributions, such as Student's \emph{t}, \emph{F} or uniform.
\end{playground}

\begin{explainbox}
\index{random numbers|see{pseudo-random numbers}}\index{pseudo-random numbers}
It is impossible to generate truly random sequences of numbers by means of a deterministic process such as a mathematical computation. ``Random numbers'' as generated by \Rpgrm and other computer programs are \emph{pseudo-random numbers}, long deterministic series of numbers that resemble random draws. Random number generation uses a \emph{seed} value that determines where in the series we start. The usual way of automatically setting the value of the seed is to take the milliseconds or a similar rapidly changing set of digits from the real-time clock of the computer. However, in cases when we wish to repeat a calculation using the same series of pseudo-random values, we can use \Rfunction{set.seed()} with an arbitrary integer as an argument to reset the generator to the same point in the underlying (deterministic) sequence.
\end{explainbox}

\begin{advplayground}
Execute the statement \code{rnorm(3)}\qRfunction{rnorm()} by itself several times, paying attention to the values obtained. Repeat the exercise, but now executing \code{set.seed(98765)}\qRfunction{set.seed()} immediately before each call to \code{rnorm(3)}, again paying attention to the values obtained. Next execute \code{set.seed(98765)}, followed by \code{c(rnorm(3), rnorm(3))}, and then execute \code{set.seed(98765)}, followed by \code{rnorm(6)}, and compare the output. Repeat the exercise using a different argument in the call to \code{set.seed()}. Analyze the results and explain how \code{set.seed()} affects the generation of pseudo-random numbers in \Rlang.
\end{advplayground}
\index{probability distributions!standard|)}

\section{Observed probability distributions}
\index{probability distributions!observed|(}%
\index{empirical probability distributions|see{probability distributions, observed}}%
It is common to estimate the value of the parameters of a standard distribution, like Student's $t$ or the Normal distribution, from observational data, assuming a priori the suitability of the distribution. If we compute the mean and standard deviation for a large sample, these two parameters define a specific Normal distribution curve. If we add the estimate of the degrees of freedom, $\nu = n - 1$, the three parameters define a specific $t$-distribution curve. Thus it is possible to use the functions described in section \ref{sec:prob:dist} on page \pageref{sec:prob:dist} in statistical inference.

\begin{explainbox}
Package \pkgname{mixtools} provides tools for fitting and analyzing \emph{mixture models}, such as a mix of two or more univariate Normal distributions. An example of its use could be to estimate the means and standard deviations for males and females in a dataset where gender was not recorded at the time of observation.
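A minimal sketch of such a fit follows, assuming package \pkgname{mixtools} is installed; the data are simulated heights drawn from two overlapping Normal distributions with invented parameter values.

<>=
library(mixtools)
set.seed(4321)
# simulated heights, two groups pooled without labels
heights <- c(rnorm(100, mean = 163, sd = 6), rnorm(100, mean = 176, sd = 7))
fit <- normalmixEM(heights, k = 2) # fit a mixture of two Normals
fit$mu     # estimated means of the two components
fit$sigma  # estimated standard deviations
fit$lambda # estimated mixing proportions
@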
\end{explainbox}

It is also possible to describe the observed shape of the distribution, or empirical distribution, for a data set without relying on a standard distribution. The fitted empirical distribution can later be used to compute probabilities, quantiles, and random draws, similarly as from standard distributions. This also allows statistical inference, using methods such as the bootstrap or some additive models.

Function \Rfunction{density()} computes kernel density estimates, using different methods. A curve is used to describe the shape, and the bandwidth determines how flexible this curve is. The curve is a flexible smoother that adapts to the observed shape. The object returned is a complex list that can be used to plot the estimated shape.

In the example below we estimate the empirical distribution of the waiting time in minutes between eruptions of the Old Faithful geyser at Yellowstone, a data set included in \Rlang.

<>=
d <- density(faithful$waiting, bw = "sj")
@

\begin{explainbox}
Using \Rfunction{str()} we can explore the structure of the object returned by function \Rfunction{density()}.

<>=
str(d)
@

The object saved as \code{d} is a \code{list} with seven members. The two numeric vectors, \code{x} and \code{y}, describe the estimated probability distribution and produce the curve in the plot below. The numerical bandwidth estimated using method \code{"sj"} is in \code{bw}, and the length of vector \code{faithful\$waiting}, the data used, is in \code{n}. Member \code{call} is the command used to call the function, and the remaining two members have self-explanatory names. The returned object belongs to class \Rclass{density}. The overall pattern is similar to, but simpler than, that of the model fitting functions that we will see later in the chapter. The class name of the object is the same as the name of the function that created it, and \code{call} provides a \emph{trace} of how the object was created. Other members facilitate the computation of derived quantities and plotting. Being a list, the individual members can be extracted by name.

<>=
d$n
@
\end{explainbox}

As a \Rmethod{plot()} method is available for class \Rclass{density}, we can easily produce a plot of the estimated empirical density distribution. In this case the estimate is a bimodal curve, with two maxima, and thus far from Normal.

<>=
plot(d)
@

Observed probability distributions, especially empirical ones, nowadays play a central role in data visualization, including 1D and 2D empirical density plots based on the use of functions like \Rfunction{density()}, as well as traditional histograms (see section \ref{sec:plot:density} on page \pageref{sec:plot:density} for examples of more elaborate and elegant plots).
\index{probability distributions!observed|)}

\section{``Random'' sampling}
\index{random sampling|see{pseudo-random sampling}}%
\index{pseudo-random sampling|(}%

In addition to drawing values from a theoretical distribution, we can draw values from an existing set or collection of values. We call this operation (pseudo-)random sampling. The draws can be done either with replacement or without replacement. In the first case, all draws are taken from the whole set of values, making it possible for a given value to be drawn more than once. In the default case of not using replacement, subsequent draws are taken from the values remaining after removing the values chosen in earlier draws.
<>=
sample(x = LETTERS)
sample(x = LETTERS, size = 12)
sample(x = LETTERS, size = 12, replace = TRUE)
@

In practice, pseudo-random sampling is useful when we need to select subsets of observations. One such case is assigning treatments to experimental units in an experiment or selecting persons to interview in a survey. Another use is in bootstrapping to estimate variation in parameter estimates using empirical distributions.

\begin{faqbox}{How to sample random rows from a data frame?}
As described in section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}, data frames are commonly used to store one observation per row. To sample a subset of rows we need to generate a random set of indices to use with the extraction operator (\Roperator{[ ]}). Here we sample four rows from data frame \code{cars} included in \Rlang. These data consist of stopping distances for cars moving at different speeds, as described in the documentation available by entering \code{help(cars)}.

<>=
cars[sample(x = 1:nrow(cars), size = 4), ]
@

\end{faqbox}

\begin{advplayground}
Consult the documentation of \Rfunction{sample()} and explain why the code below is equivalent to that in the example immediately above.

<>=
cars[sample(x = nrow(cars), size = 4), ]
@

\end{advplayground}
\index{pseudo-random sampling|)}%

\section{Correlation}\label{sec:stats:correlation}
\index{correlation|(}
Both parametric (Pearson's) and non-parametric robust (Spearman's and Kendall's) methods for the estimation of the (linear) correlation between pairs of variables are available in base \Rlang. The different methods are selected by passing arguments to a single function. While Pearson's method is based on the actual values of the observations, the non-parametric methods are based on the ordering or rank of the observations, and are consequently less affected by observations with extreme values.

\subsection{Pearson's $r$}
\index{correlation!parametric}
\index{correlation!Pearson}

Function \Rfunction{cor()} can be called with two vectors of the same length as arguments. In the case of the parametric Pearson method, we do not need to provide further arguments as this method is the default one. We use data set \code{cars}.

<>=
cor(x = cars$speed, y = cars$dist)
@

It is also possible to pass a data frame (or a matrix) as the only argument. When the data frame (or matrix) contains only two columns, the returned correlation estimate is equivalent to that of passing the two columns individually as vectors. The object returned is a $2 \times 2$ \code{matrix} instead of a vector of length one.

<>=
cor(cars)
@

When the data frame or matrix contains more than two numeric vectors, the returned value is a matrix of estimates of pairwise correlations between columns. Here we use \Rfunction{rnorm()}, described above, to create a long vector of pseudo-random values drawn from the Normal distribution and \Rfunction{matrix()} to convert it into a matrix with three columns (see page \pageref{sec:matrix:array} for details about \Rlang matrices).

<>=
mat1 <- matrix(rnorm(54), ncol = 3,
               dimnames = list(rows = 1:18, cols = c("A", "B", "C")))
cor(mat1)
@

\begin{playground}
Modify the code in the chunk immediately above to construct a matrix with six columns and then compute the correlations.
\end{playground}

While \Rfunction{cor()} returns an estimate of $r$, the correlation coefficient, \Rfunction{cor.test()} also computes the $t$-value, $P$-value, and confidence interval for the estimate.

<>=
cor.test(x = cars$speed, y = cars$dist)
@

Above we passed two numeric vectors as arguments, one to parameter \code{x} and one to parameter \code{y}. Alternatively, we can pass a data frame as argument to \code{data}, and a \emph{model formula} to parameter \code{formula}. The argument passed to \code{formula} determines which variables from \code{data} are to be used, and in which role. Briefly, the variable(s) to the left of the tilde (\code{~}) are response variables, and those to the right are independent variables. In the case of correlation, no assumption is made about cause and effect, and both variables appear to the right of the tilde. The code below is equivalent to that above. See section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas} for details on the use of model formulas and section \ref{sec:stat:mf} on page \pageref{sec:stat:mf} for examples of their use in model fitting.

<>=
cor.test(formula = ~ speed + dist, data = cars)
@

\begin{playground}
Functions \Rfunction{cor()} and \Rfunction{cor.test()} return \Rlang objects that, when using \Rlang interactively, get automatically ``printed'' on the screen. One should be aware that \Rfunction{print()} methods do not necessarily display all the information contained in an \Rlang object. This is almost always the case for complex objects like those returned by \Rlang functions implementing statistical tests. As with any \Rlang object, we can save the result of an analysis into a variable. As described in section \ref{sec:calc:lists} on page \pageref{sec:calc:lists} for lists, we can peek into the structure of an object with method \Rfunction{str()}. We can use \Rfunction{class()} and \Rfunction{attributes()} to extract further information. Run the code in the chunk below to discover what is actually returned by \Rfunction{cor()}.

<>=
mat1 <- cor(cars)
class(mat1)
attributes(mat1)
str(mat1)
@

Methods \Rfunction{class()}, \Rfunction{attributes()} and \Rfunction{str()} are very powerful tools that can be used when we are in doubt about the data contained in an object and/or how it is structured. Knowing the structure allows us to retrieve the data members directly from the object when predefined extractor methods are not available.
\end{playground}

\subsection{Kendall's $\tau$ and Spearman's $\rho$}
\index{correlation!non-parametric}
\index{correlation!Kendall}
\index{correlation!Spearman}

We use the same functions as for Pearson's $r$ but explicitly request the use of one of these methods by passing an argument.

<>=
cor(x = cars$speed, y = cars$dist, method = "kendall")
cor(x = cars$speed, y = cars$dist, method = "spearman")
@

Function \Rfunction{cor.test()}, described above, also allows the choice of method with the same syntax as shown for \Rfunction{cor()}.

\begin{playground}
Repeat the exercise in the playground immediately above, but now using non-parametric methods. How does the information stored in the returned \code{matrix} differ depending on the method, and how can we extract information about the method used for calculation of the correlation from the returned object?
\end{playground}
\index{correlation|)}

\section{$t$-test}\label{sec:stats:ttest}
\index{t-test@$t$-test|(}%
\index{Student's t-test@Student's $t$-test|see{$t$-test}}%
The $t$-test is based on Student's $t$-distribution. It can be applied to any parameter estimate for which its standard deviation is available and for which the $t$-distribution is a plausible assumption. It is most frequently used to compare an estimate of the mean against a constant value, or the estimate of a difference between two means against a target difference, usually no difference. In \Rlang these can be computed manually using functions \Rfunction{mean()}, \Rfunction{sd()}, and \Rfunction{pt()}, or with \Rfunction{t.test()}.

Although rarely presented in such a way, the $t$-test can be thought of as a special case of a linear model fit. Consistently with the functions used to fit models to observations, we can use a \emph{formula} to describe a $t$-test. A formula such as \code{y\,\char"007E\,x} is read as $y$ is explained by $x$. We use \emph{lhs} (left-hand side) and \emph{rhs} (right-hand side) to signify all terms to the left and right of the tilde (\code{\,\char"007E\,}), respectively. (See section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas} for a detailed discussion of model formulas, and section \ref{sec:stat:mf} on page \pageref{sec:stat:mf} for examples of their use in model fitting.)

<>=
df1 <- data.frame(some.size = c(rnorm(10, mean = 2.5), rnorm(10, mean = 2.0)),
                  group = factor(rep(c("A", "B"), each = 10)))
@

The formula \code{some.size\,\char"007E\,1} is read as ``the mean of variable \code{some.size} is explained by a constant value''. The value estimated from observations, $\bar{x}$, is compared against the value of $\mu$ set as the null hypothesis, where $\mu$ is the \emph{unknown} mean of the sampled population. By default, \Rfunction{t.test()} applies a two-sided test (\code{alternative = "two.sided"}) against \code{mu = 0}, but here we use \code{mu = 2} instead.

<>=
t.test(some.size ~ 1, mu = 2, data = df1)
@

The same test can be calculated step by step. In this case this approach is not needed, but it is useful when we have a parameter estimate (not just a mean) and its standard error available, as in model fits (see the advanced playground on page \pageref{box:stats:slope:ttest} for an example).

<>=
sem <- sqrt(var(df1$some.size) / nrow(df1))
t.value <- (mean(df1$some.size) - 2) / sem # Ho: mu = 2
p.value <- pt(abs(t.value), df = nrow(df1) - 1, lower.tail = FALSE) * 2 # two tails
signif(c(t = t.value, df = nrow(df1) - 1, P = p.value), 4) # 4 digits
@

The same function, with a different formula, tests for the difference between the means of two groups or treatments, $H_o: {\mu}_A - {\mu}_B = 0$. We read the formula \code{some.size\,\char"007E\,group} as ``differences in \code{some.size} are explained by factor \code{group}''. The difference between the means for the two groups is estimated and compared against the hypothesis. (In this case, the value of the argument passed to \code{mu}, zero by default, describes this difference.) By default the variances in the two groups are not assumed to be equal,

<>=
t.test(some.size ~ group, data = df1)
@

and with \code{var.equal = TRUE} the variances of groups A and B are pooled.

<>=
t.test(some.size ~ group, var.equal = TRUE, data = df1)
@

The $t$-test serves as an example of how statistical tests are usually carried out in \Rlang.
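For instance, the same formula interface can be used with \Rfunction{var.test()}, a sketch comparing the variances of the two groups in \code{df1}, the simulated data from above:

<>=
var.test(some.size ~ group, data = df1)
@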
-Table \ref{tab:stats:tests} lists \Rlang functions for frequently used statistical tests.
-
-\begin{table}
-  \caption[Statistical tests]{\Rlang functions implementing frequently used statistical tests. Student's $t$-test and correlation tests are described on pages \pageref{sec:stats:ttest} and \pageref{sec:stats:correlation}, respectively.}\vspace{1ex}\label{tab:stats:tests}
-
-  \centering\noindent
-  \begin{tabular}{ll}
-    \toprule
-    Statistical test & Function name \\
-    \midrule
-    Student's $t$-test (1 and 2 samples) & \code{t.test()} \\
-    Wilcoxon rank sum and signed rank tests & \code{wilcox.test()} \\
-    Kolmogorov-Smirnov tests & \code{ks.test()} \\
-    Correlation tests (Pearson, Kendall, Spearman) & \code{cor.test()} \\
-    $F$-test to compare two variances & \code{var.test()} \\
-    Fisher's exact test for count data & \code{fisher.test()} \\
-    Pearson's Chi-squared ($\chi^2$) test for count data & \code{chisq.test()} \\
-    Exact binomial test & \code{binom.test()} \\
-    Test of equal or given proportions & \code{prop.test()} \\
-    \bottomrule
-  \end{tabular}
-\end{table}
-
-\index{t-test@$t$-test|)}
-
-\section[Model fitting in R]{Model fitting in \Rlang}\label{sec:stat:mf}
-\index{models!fitting|(}
-The general approach to model fitting in \Rlang is to separate the actual fitting of a model from the inspection of the fitted model. A model fitting function minimally requires a description of the model to fit, as a model \code{formula}, and a data frame or vectors containing the data or observations to which the model is to be fitted. These functions in \Rlang return a model fit object. This object contains the data, the model formula, the call, and the results of fitting the model. Several methods are available for querying it. The diagram in Figure \ref{fig:model:fit:diagram} summarizes the approach used in \Rlang for data analysis based on fitted models.
-
-\begin{figure}
-  \centering
-\begin{small}
-\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
-\node (model) [tprocess] {\textsl{model specification}};
-\node (data) [tprocess, below of=model] {\textsl{observations}};
-\node (fitfun) [tprocess, right of=model, yshift=-0.7cm, xshift=2.5cm] {\textsl{fitting function}};
-\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm] {\textsl{fitted model}};
-\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\textsl{query methods}};
-\draw [arrow] (model) -- (fitfun);
-\draw [arrow] (data) -- (fitfun);
-\draw [arrow] (fitfun) -- (fm);
-\draw [arrow] (fm) -- (summary);
-\end{tikzpicture}
-\end{small}
-  \caption[Model fitting in \Rlang]{Model fitting in \Rlang is done in steps, and can be represented schematically as a flow of information.}\label{fig:model:fit:diagram}
-\end{figure}
-
-Models are described using model formulas such as \code{y\,\char"007E\,x}, which we read as $y$ is explained by $x$. We use \emph{lhs} (left-hand-side) and \emph{rhs} (right-hand-side) to signify all terms to the left and right of the tilde (\code{\,\char"007E\,}), respectively. Model formulas are used in different contexts: fitting of models, plotting, and tests like the $t$-test. The syntax of model formulas is consistent throughout base \Rlang and numerous independently developed packages. However, their use is not universal, and several packages extend the basic syntax to allow the description of specific types of models. As with most things in \Rlang, model formulas are objects and can be stored in variables.
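-
-For example, a formula can be created, stored in a variable, and queried like any other \Rlang object; a minimal sketch, with \code{my.formula} an arbitrary name:
-
-<>=
-my.formula <- dist ~ 1 + speed # a formula object; nothing is evaluated yet
-class(my.formula)
-all.vars(my.formula) # names of the variables mentioned in the formula
-@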
-See section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas} for a detailed discussion of model formulas.
-
-Although there is some variation, especially for fitted model classes defined in extension packages, in most cases the \textsl{query functions} grouped together in the rightmost box in the diagram include methods \Rfunction{summary()}, \Rfunction{anova()} and \Rfunction{plot()}, with several other methods such as \Rfunction{coef()}, \Rfunction{residuals()}, \Rfunction{fitted()}, \Rfunction{predict()}, \Rfunction{AIC()}, \Rfunction{BIC()} usually also available. Additional methods may be available. However, as model fit objects have a list-like structure, these and other components can be extracted or computed programmatically when needed. Consequently, the examples in this chapter can be adapted to the fitting of types of models not described here.
-
-\begin{explainbox}
-  Fitted model objects in \Rlang are self-contained and include a copy of the data to which the model was fit, as well as residuals and possibly even intermediate results of computations. Although this can make the size of these objects large, it allows querying and even updating them in the absence of the data in the current \Rlang workspace.
-\end{explainbox}
-
-\section{Fitting linear models}\label{sec:stat:LM}
-\index{models!linear|see{linear models}}
-\index{linear models|(}
-\index{LM|see{linear models}}
-
-Regression, analysis of variance (ANOVA) and analysis of covariance (ANCOVA) are all linear models, differing only in the type of explanatory variables included in the statistical model fitted. If all the explanatory variables in the fitted model are continuous, i.e., \code{numeric} vectors, the model is a regression model. If all explanatory variables are discrete, i.e., \code{factors}, the model is an ANOVA. Finally, if the model contains both \code{numeric} variables and \code{factors}, it is called an ANCOVA. As the fitting approach is the same in all cases, based on ordinary least squares (OLS), in \Rlang they are all implemented in function \Rfunction{lm()}.
-
-There is another meaning of ANOVA, referring only to the tests of significance rather than to an approach to model fitting. Consequently, and rather confusingly, results of tests of significance can, for regression, ANOVA and ANCOVA alike, be presented in an ANOVA table. In this second, stricter meaning, ANOVA means a test of significance based on the ratios between pairs of variances.
-
-\begin{warningbox}
-If you do not clearly remember the difference between numeric vectors and factors, or how they can be created, please, revisit chapter \ref{chap:R:as:calc} on page \pageref{chap:R:as:calc}.
-\end{warningbox}
-
-\begin{figure}
-  \centering
-\begin{small}
-\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
-\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
-\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
-\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
-\node (fitfun) [tprocess, right of=data, xshift=2.5cm, fill=blue!5] {\code{lm()}};
-\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
-\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{summary()}};
-\node (anova) [tprocess, color = black, below of=summary, yshift = 0.4cm] {\code{anova()}};
-\node (plot) [tprocess, color = black, above of=summary, yshift = -0.4cm] {\code{plot()}};
-\draw [arrow] (model) -- (fitfun);
-\draw [arrow] (data) -- (fitfun);
-\draw [arrow, dashed] (weights) -- (fitfun);
-\draw [arrow] (fitfun) -- (fm);
-\draw [arrow] (fm) -- (plot);
-\draw [arrow] (fm) -- (anova);
-\draw [arrow] (fm) -- (summary);
-\end{tikzpicture}
-\end{small}
-  \caption[Linear model fitting in \Rlang]{Linear model fitting in \Rlang is done in steps. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show a linear model fit. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{lm()}. Only the three most frequently used query methods are shown, while both response and explanatory variables are under \textsl{observations}. Dashed boxes and arrows are optional as defaults are provided.}\label{fig:lm:fit:diagram}
-\end{figure}
-
-Figure \ref{fig:lm:fit:diagram} shows the steps needed to fit a linear model and extract the estimates and test results. The observations are stored in a data frame, one case or event per row, with the values of both response and explanatory variables in its columns. The model formula is used to indicate which variables in the data frame are to be used and in which role: either response or explanatory and, when explanatory, how they contribute to the estimated response. The object containing the results from the fit is queried to assess validity and make conclusions or predictions.
-
-\begin{explainbox}
-Weights are multiplicative factors used to alter the \emph{weight} given to individual residuals when fitting a model to observations that are not equally informative. A frequent case is fitting a model to observations that are means of drastically different numbers of individual measurements. Some model fit functions compute the weights, but in most cases they are supplied as an argument to parameter \code{weights}. By default, \code{weights} have a value of \code{1} and thus do not affect the resulting model fit. When supplied or computed, the weights are saved to the model fit object.
-\end{explainbox}
-
-\subsection{Regression}\label{sec:stat:LM:regression}
-%\index{linear regression}
-\index{linear regression|see{linear models, linear regression}}%
-\index{linear models!linear regression|(}%
-In this section we continue using the \Rdata{cars} data set, which contains two \code{numeric} variables.
-
-A simple linear model $y = \alpha \cdot 1 + \beta \cdot x$, where $y$ corresponds to stopping distance (\code{dist}) and $x$ to initial speed (\code{speed}), is formulated in \Rlang as \code{dist \char"007E\ 1 + speed}.
-The fitted model object is assigned to variable \code{fm1} (a mnemonic for fitted-model one).\label{chunk:lm:models1}
-
-<>=
-fm1 <- lm(dist ~ 1 + speed, data = cars)
-class(fm1)
-@
-
-The next step is the diagnosis of the fit. Are the assumptions of the linear model procedure used reasonably close to being fulfilled? In \Rlang it is most common to use plots to this end. We show here only one of the plots normally produced. This quantile vs.\ quantile plot is used to assess how much the distribution of the residuals deviates from the assumed Normal distribution.
-
-<>=
-plot(fm1, which = 2)
-@
-
-In the case of a regression, calling \Rfunction{summary()} with the fitted model object as argument is most useful, as it provides a table of coefficient estimates and their standard errors. Remember that, as is the case for most \Rlang functions, the value returned by \Rfunction{summary()} is printed when we call this method at the \Rlang prompt.
-
-<>=
-summary(fm1)
-@
-
-The summary\index{linear models!summary table} is organized in sections. ``Call:'' shows \code{dist\ \char"007E\ 1 + speed}, the specification of the model fitted, plus the data used. ``Residuals:'' displays the extremes, quartiles and median of the residuals, or deviations between observations and the fitted line. ``Coefficients:'' contains estimates of the model parameters and their variation plus corresponding $t$-tests. In the last three lines there is information on the overall standard error and its degrees of freedom, and the overall coefficient of determination ($R^2$) and $F$-statistic.
-
-Replacing $\alpha$ and $\beta$ in $y = \alpha \cdot 1 + \beta \cdot x$ by the estimates for the intercept, $a = -17.6$, and slope, $b = 3.93$, we obtain an estimate for the regression line $y = -17.6 + 3.93 x$. However, given the nature of the problem, we \emph{know based on first principles} that stopping distance must be zero when speed is zero. This suggests that we should not estimate the value of $\alpha$ but instead set $\alpha = 0$, or in other words, fit the model $y = \beta \cdot x$.
-
-In \Rlang models, the intercept is included by default, so the model fitted above can be formulated as \code{dist\ \char"007E\ speed}---i.e., the missing \code{+ 1} does not change the model. To exclude the intercept we need to specify it as \code{dist\ \char"007E\ speed - 1} (or its equivalent \code{dist\ \char"007E\ speed + 0}), for a straight line passing through the origin ($x = 0$, $y = 0$). In the summary for this model there is an estimate for the slope but not for the intercept.
-
-<>=
-fm2 <- lm(dist ~ speed - 1, data = cars)
-summary(fm2)
-@
-
-The equation for \code{fm2} is $y = 2.91 x$. From the residuals, it can be seen that this model is inadequate, as a straight line does not follow the curvature of the cloud of observations.
-
-\begin{playground}
-You will now fit a second-degree polynomial\index{linear models!polynomial regression}\index{polynomial regression}, a different linear model: $y = \alpha \cdot 1 + \beta_1 \cdot x + \beta_2 \cdot x^2$. The function used is the same as for linear regression, \Rfunction{lm()}. We only need to alter the formulation of the model. The identity function \Rfunction{I()} is used to protect its argument from being interpreted as part of the model formula.
-Instead, its argument is evaluated beforehand and the result is used as, in this case, the second explanatory variable.\label{chunk:stats:fm3}
-
-<>=
-fm3 <- lm(dist ~ speed + I(speed^2), data = cars)
-plot(fm3, which = 3)
-summary(fm3)
-@
-
-The ``same'' fit using an orthogonal polynomial can be specified using function \Rfunction{poly()}. Polynomials of different degrees can be obtained by supplying as the second argument to \Rfunction{poly()} the corresponding positive integer value. In this case, the different terms of the polynomial are bundled together into a single term in the model formula.
-
-<>=
-fm3a <- lm(dist ~ poly(speed, 2), data = cars)
-summary(fm3a)
-@
-
-It is possible to compare two model fits using \Rfunction{anova()}, testing whether one of the models describes the data better than the other. It is important in this case to take into consideration the nature of the difference between the model formulas, most importantly whether they can be interpreted as nested---i.e., as a base model vs.\ the same model with additional terms.
-
-<>=
-anova(fm2, fm1)
-@
-
-Three or more models can also be compared in a single call to \Rfunction{anova()}. However, care is needed, as the order of the arguments matters.
-
-<>=
-anova(fm2, fm3, fm3a)
-anova(fm2, fm3a, fm3)
-@
-
-\label{par:stats:AIC}%
-\index{Akaike's An Information Criterion@Akaike's \emph{An Information Criterion}}%
-\index{AIC|see{\emph{An Information Criterion}}}%
-\index{Bayesian Information Criterion@\emph{Bayesian Information Criterion}}%
-\index{BIC|see{\emph{Bayesian Information Criterion}}}%
-\index{Schwarz's Bayesian Criterion@Schwarz's \emph{Bayesian criterion}|see{\emph{Bayesian Information Criterion}}}%
-Different criteria can be used to choose the ``best'' model: significance based on $P$-values or information criteria (AIC, BIC). AIC (Akaike's ``An Information Criterion'') and BIC (``Bayesian Information Criterion'' = SBC, ``Schwarz's Bayesian criterion'') penalize the resulting ``goodness'' based on the number of parameters in the fitted model. In the case of AIC and BIC, a smaller value is better, and values returned can be either positive or negative, in which case more negative is better. Estimates of these criteria are returned by \Rfunction{BIC()} and \Rfunction{AIC()}.
-
-<>=
-BIC(fm2, fm1, fm3, fm3a)
-AIC(fm2, fm1, fm3, fm3a)
-@
-
-Once you have run the code in the chunks above, you will be able to see that these three criteria do not necessarily agree on which is the ``best'' model. Find in the output the $P$-values and the BIC and AIC estimates for the different models, and conclude which model is favored by each of the three criteria. In addition you will notice that the two different formulations of the quadratic polynomial are equivalent.
-
-\end{playground}
-
-Additional query methods give easy access to different aspects of fitted models: \Rfunction{vcov()} returns the variance-covariance matrix, \Rfunction{coef()} and its alias \Rfunction{coefficients()} return the estimates for the fitted model coefficients, \Rfunction{fitted()} and its alias \Rfunction{fitted.values()} extract the fitted values, and \Rfunction{resid()} and its alias \Rfunction{residuals()} the corresponding residuals (or deviations) (Figure \ref{fig:lm:fit:query:more}). Less frequently used accessors are \Rfunction{getCall()}, \Rfunction{effects()}, \Rfunction{terms()}, \Rfunction{model.frame()} and \Rfunction{model.matrix()}.
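-
-A minimal sketch of these query methods applied to \code{fm1}, the linear regression fitted above:
-
-<>=
-coef(fm1)         # estimates of the fitted coefficients
-head(fitted(fm1)) # fitted values, first six only
-head(resid(fm1))  # residuals, first six only
-vcov(fm1)         # variance-covariance matrix of the estimates
-@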
-
-\begin{figure}
-  \centering
-\begin{small}
-\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
-\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
-\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
-\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
-\node (fitfun) [tprocess, right of=data, xshift=2.5cm, fill=blue!5] {\code{lm()}};
-\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
-\node (fitted) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{fitted()}};
-\node (resid) [tprocess, color = black, above of=fitted, yshift = -0.4cm] {\code{residuals()}};
-\node (coef) [tprocess, color = black, above of=resid, yshift = -0.4cm] {\code{coef()}};
-\node (aic) [tprocess, color = black, below of=fitted, yshift = 0.4cm] {\code{AIC()}; \code{BIC()}};
-\node (predict) [tprocess, color = black, below of=aic, yshift = 0.4cm] {\code{predict()}};
-\node (newdata) [tprocess, color = black, left of=predict, xshift = -2.4cm] {\textsl{expl.\ vars.} $\to$ \code{newdata}};
-\draw [arrow] (model) -- (fitfun);
-\draw [arrow] (data) -- (fitfun);
-\draw [arrow, dashed] (weights) -- (fitfun);
-\draw [arrow] (fitfun) -- (fm);
-\draw [arrow] (fm) -- (coef);
-\draw [arrow] (fm) -- (resid);
-\draw [arrow] (fm) -- (fitted);
-\draw [arrow] (fm) -- (aic);
-\draw [arrow] (fm) -- (predict);
-\draw [arrow] (newdata) -- (predict);
-\end{tikzpicture}
-\end{small}
-  \caption[Linear model fitting, more query methods]{Diagram including additional methods used to query fitted model objects, using linear models as an example. For other details please see the legend to Figure \ref{fig:lm:fit:diagram}.}\label{fig:lm:fit:query:more}
-\end{figure}
-
-\begin{playground}
-Familiarize yourself with these extraction and summary methods by reading their documentation, and use them to explore \code{fm1} fitted above or model fits to other data of your interest.
-\end{playground}
-
-\begin{explainbox}\label{box:LM:fit:object}
-The\index{linear models!structure of model fit object} objects returned by model fitting functions contain the full information, including the data to which the model was fit. Their structure resembles a nested list. In most cases the class of the objects returned by model fit functions agrees in name with the name of the model-fit function (\code{"lm"} in this example) but is not derived from \Rlang class \code{"list"}. The different functions described above either extract parts of the object or do additional calculations and formatting based on them. There are different specializations of these methods, which are called depending on the class of the model-fit object. (See section \ref{sec:methods} on page \pageref{sec:methods}.)
-
-<>=
-class(fm1)
-names(fm1)
-@
-
-The structure of model-fit objects is of interest only when the query or accessor functions do not provide the needed information, or when we are interested in learning how model fitting works in \Rlang. As with any other \Rlang object, \Rfunction{str()} shows the structure. Calling \Rfunction{str()} with \code{no.list = TRUE, give.attr = FALSE, vec.len = 2} as arguments restricts the output to an overview of the structure of \code{fm1}.
-
-<>=
-str(fm1, no.list = TRUE, give.attr = FALSE, vec.len = 2)
-@
-
-Member \code{call} contains the function call and arguments used to create object \code{fm1}.
-
-<>=
-str(fm1$call)
-@
-
-It is usual\index{linear models!structure of summary object} to only look at the values returned by \Rfunction{anova()} and \Rfunction{summary()} as implicitly displayed by \code{print()}. However, both \Rfunction{anova()} and \Rfunction{summary()} return complex list-like objects, containing some members not displayed by \code{print()}. Access to members of these objects can be necessary to use them in further calculations or to print them in a format different to the default.
-
-The class of the object returned by \code{anova()} does not depend on the class of the model fit object, while its structure does.
-
-<>=
-anova(fm1)
-@
-
-<>=
-class(anova(fm1))
-@
-
-<>=
-str(anova(fm1))
-@
-
-The class of the summary object depends on the class of the model fit object; \Rfunction{summary()} is a generic method with multiple specializations.
-
-<>=
-class(summary(fm1))
-@
-
-One case where we need to extract individual members is when adding annotations to plots. Another case is when writing reports, to programmatically include the computed values within the text. \Rfunction{str()} can be used to display the structure of the whole object in a simplified way, and that of a single member, as shown above.
-
-<>=
-str(summary(fm1), no.list = TRUE, give.attr = FALSE, vec.len = 2)
-@
-
-Extraction of members follows the usual \Rlang rules using \Roperator{\$}, \Roperator{[ ]}, or \Roperator{[[ ]]}.
-
-<>=
-summary(fm1)$adj.r.squared
-@
-
-The coefficient estimates in the summary are accompanied by estimates of the corresponding standard errors, \emph{t}-values and \emph{P}-values, while in the model object \code{fm1} these are not included.
-
-<>=
-coef(fm1)
-str(fm1$coefficients)
-print(summary(fm1)$coefficients)
-str(summary(fm1)$coefficients)
-@
-
-\end{explainbox}
-
-\begin{explainbox}\label{box:stats:slope:ttest}
-As\index{linear models!ad-hoc tests for parameters}\index{t-test@$t$-test|(}\index{calibration curves} an example of the use of values extracted from the \code{summary.lm} object, I show how to test whether the slope from a linear regression fit deviates significantly from a constant value other than the usual zero, which tests for the presence of an ``effect'' of the explanatory variable. When testing for deviations from a calibration by comparing two instruments, or an instrument and a reference, a null hypothesis of one for the slope tests for deviations from the true readings. In some cases, when comparing the effectiveness of interventions, we may be interested in testing whether a new approach surpasses that in current use by at least a specific margin. There exist practical situations where testing if a response exceeds a threshold is of interest.
-
-When using \Rfunction{anova()} and \Rfunction{summary()} the null hypothesis is no effect or no response. The equivalent test with a null hypothesis of slope = 1 is easy to implement if we consider how a $t$-value is calculated (see section \ref{sec:stats:ttest} on page \pageref{sec:stats:ttest}). To compute the \emph{t}-value we need an estimate of the slope, an estimate of its standard error, and the degrees of freedom. All these values are available as members of the summary object of a fitted model.
-
-<>=
-est.slope.value <- summary(fm1)$coefficients["speed", "Estimate"]
-est.slope.se <- summary(fm1)$coefficients["speed", "Std. Error"]
-degrees.of.freedom <- summary(fm1)$df[2]
-@
-
-The \emph{t}-value is computed based on the difference between the estimate for the slope and the null hypothesis, and the standard error. A probability is obtained based on the computed $t$-value, or quantile, and the $t$-distribution with matching degrees of freedom, with a call to \Rfunction{pt()} (see section \ref{sec:prob:dist} on page \pageref{sec:prob:dist}). For a two-tail test we multiply the one-tail $P$ estimate by two.
-
-<>=
-hyp.null <- 1
-t.value <- (est.slope.value - hyp.null) / est.slope.se
-p.value <- 2 * pt(q = t.value, df = degrees.of.freedom, lower.tail = FALSE)
-cat("slope =", signif(est.slope.value, 3),
-    "with s.e. =", signif(est.slope.se, 3),
-    "\nt.value =", signif(t.value, 3),
-    "and P-value =", signif(p.value, 3))
-@
-
-This example is for a linear model fitted with function \Rfunction{lm()}, but the same approach can be applied to other model fit procedures for which parameter estimates and their corresponding standard error estimates can be extracted or computed.
-\end{explainbox}
-
-\begin{advplayground}
-Check that, after replacing \code{hyp.null <- 1} by \code{hyp.null <- 0}, the computations above agree with the output of printing \code{summary()}.
-
-Modify the example above so as to test whether the intercept is significantly larger than 5 feet, doing a one-sided test.
-
-\end{advplayground}
-\index{t-test@$t$-test|)}
-
-Method \Rfunction{predict()} uses the fitted model together with new data for the independent variables to compute predictions. As \Rfunction{predict()} accepts new data as input, it allows interpolation and extrapolation to values of the independent variables not present in the original data. In the case of fits of linear and some other models, method \Rfunction{predict()} returns, in addition to the prediction, estimates of the confidence and/or prediction intervals. The new data must be stored in a data frame with columns using the same names for the explanatory variables as in the data used for the fit; a response variable is not needed and additional columns are ignored. (The explanatory variables in the new data can be either continuous or factors, but they must match in this respect those in the original data.)
-
-\begin{playground}
-Predict using both \code{fm1} and \code{fm2} the distance required to stop cars moving at 0, 5, 10, 20, 30, and 40~mph. Study the help page for the predict method for linear models (using \code{help(predict.lm)}). Explore the difference between \code{"prediction"} and \code{"confidence"} bands: why are they so different?
-\end{playground}
-\index{linear models!linear regression|)}%
-
-\subsection{Analysis of variance, ANOVA}\label{sec:anova}
-%\index{analysis of variance}
-\index{analysis of variance|see{linear models, analysis of variance}}%
-\index{ANOVA|see{linear models, analysis of variance}}%
-\index{linear models!analysis of variance|(}%
-In ANOVA the explanatory variable is categorical and, in \Rlang, must be a \code{factor} or \code{ordered} factor (see section \ref{sec:calc:factors} on page \pageref{sec:calc:factors}). As ANOVA is a linear model, the fitting approach is the same as for linear and polynomial regression (Figure \ref{fig:lm:fit:diagram}).
-The \Rdata{InsectSprays} data set, used in the next example, gives insect counts in plots sprayed with different insecticides.
-In these data, \code{spray} is a factor with six levels.%
-\label{xmpl:fun:lm:fm4}
-
-What determines that this is an ANOVA is that \code{spray}, the explanatory variable, is a \code{factor}.
-
-<>=
-data(InsectSprays)
-is.numeric(InsectSprays$spray)
-is.factor(InsectSprays$spray)
-levels(InsectSprays$spray)
-@
-
-By using a factor instead of a numeric vector, a different model matrix is built from an equivalent formula.\label{chunk:stat:fm4}
-
-<>=
-fm4 <- lm(count ~ spray, data = InsectSprays)
-@
-
-Diagnostic plots are obtained in the same way as for linear regression. We show only the quantile--quantile plot for simplicity, but during data analysis it is very important to check all the diagnostic plots. As many of the residuals deviate from the one-to-one line, we have to conclude that the residuals do not follow the Normal distribution, and a different approach to model fitting should be used (see section \ref{sec:stat:GLM} on page \pageref{sec:stat:GLM}).
-
-<>=
-plot(fm4, which = 2)
-@
-
-In ANOVA,\index{F-test@$F$-test} most frequently the interest is in testing hypotheses with function \Rfunction{anova()}, which implements the $F$-test for main effects of factors and their interactions. In this example, with a single explanatory variable, there is only one effect of interest, that of \code{spray}.
-
-<>=
-anova(fm4)
-@
-
-\index{models!contrasts|(}
-\begin{warningbox}
-Function \Rfunction{summary()} can be used to extract parameter estimates informing of the size of the effects, but meaningfully only by using contrasts different to the default ones.
-Function \Rfunction{aov()} is a wrapper on \Rfunction{lm()} that returns an object for which, by default, \Rfunction{summary()} displays output in the style of \Rfunction{anova()}, but even in this case it can be preferable to change the default contrasts (see \code{help(aov)}).
-
-The contrasts used affect the estimates returned by \Rfunction{coef()} and \Rfunction{summary()} applied to an ANOVA model fit. The default used in \Rlang, \Rfunction{contr.treatment}, is different to that used in \Slang, \Rfunction{contr.helmert}. With \Rfunction{contr.treatment} the first level of the factor (assumed to be a control) is used as reference for the estimation of coefficients and the testing of their significance. With \Rfunction{contr.helmert} the contrasts are of the second level with the first, the third with the average of the first two, and so on. These contrasts depend on the order of the factor levels. Instead, \code{contr.sum} uses as reference the mean of all levels, i.e., using as condition that the sum of the coefficient estimates is equal to zero. Obviously, what type of contrast is used changes what the coefficient estimates describe and, consequently, how the $P$-values should be interpreted.
-\end{warningbox}
-
-\begin{explainbox}
-The approach used by default for model fits and ANOVA calculations varies among programs. There exist different so-called ``types'' of sums of squares, usually called I, II, and III. In orthogonal designs the choice is of no consequence, but differences can be important for unbalanced designs, even leading to different conclusions. \Rlang's default, type~I, is usually considered to suffer milder problems than type~III, the default used by \pgrmname{SPSS} and \pgrmname{SAS}. In any case, for unbalanced data it is preferable to use the approach implemented in package \pkgname{nlme}.
-\end{explainbox}
-
-\begin{explainbox}
-The most straightforward way of setting a different default for contrasts, for a whole series of model fits, is by setting \Rlang option \code{contrasts}, whose current value we here only print.
-
-<>=
-options("contrasts")
-@
-
-The option is set to a named character vector of length two, with the first value, named \code{unordered}, giving the name of the function used when the explanatory variable is an unordered \code{factor} (created with \Rfunction{factor()}) and the second value, named \code{ordered}, giving the name of the function used when the explanatory variable is an \code{ordered} factor (created with \Rfunction{ordered()}).
-
-It is also possible to select the contrasts to be used in the call to \code{aov()} or \code{lm()}.
-
-<>=
-fm4trea <- lm(count ~ spray, data = InsectSprays,
-              contrasts = list(spray = contr.treatment))
-fm4sum <- lm(count ~ spray, data = InsectSprays,
-             contrasts = list(spray = contr.sum))
-@
-
-In \code{fm4trea} we used \Rfunction{contr.treatment()}, thus contrasts for individual treatments are computed against \code{Spray1}, taking it as the control or reference, as can be inferred from the generated contrasts matrix. For this reason, there is no row for \code{Spray1} in the summary table. Each of the rows \code{Spray2} to \code{Spray6} is a test comparing these treatments individually against \code{Spray1}.
-
-<>=
-contr.treatment(length(levels(InsectSprays$spray)))
-@
-
-<>=
-summary(fm4trea)
-@
-
-In \code{fm4sum} we used \Rfunction{contr.sum()}, with the sum constrained to be zero, thus the estimate for the last treatment level is determined by the sum of the previous ones, and is not tested for significance.
-
-<>=
-contr.sum(length(levels(InsectSprays$spray)))
-@
-
-<>=
-summary(fm4sum)
-@
-
-\end{explainbox}
-
-\begin{advplayground}
-  Explore how taking the last level as reference in \Rfunction{contr.SAS()}, instead of the first one as in \Rfunction{contr.treatment()}, affects the estimates. Reorder the levels of factor \code{spray} so that the test using \Rfunction{contr.SAS()} becomes equivalent to that obtained above with \Rfunction{contr.treatment()}. Consider why \Rfunction{contr.poly()} is the default for ordered factors and when \Rfunction{contr.helmert()} could be most useful.
-\end{advplayground}
-
-Contrasts, on the other hand, do not affect the table returned by \Rfunction{anova()}, as this table does not deal with the effects of individual factor levels. The overall estimates shown at the bottom of the summary table also remain unchanged. In other words, when using different contrasts, what changes is how the total variation explained by the fitted model is partitioned into components to be tested for specific contributions to the overall model fit.
-
-\begin{explainbox}
-\index{tests!adjusted P-values@{adjusted $P$-values}}%
-  Post-hoc tests with contrasts\index{tests!post-hoc} and multiple comparisons\index{tests!multiple comparisons} tests are most frequently applied after an ANOVA to test for differences among pairs of treatments or specific combinations of treatments. \Rlang function \Rfunction{TukeyHSD()} implements Tukey's HSD\index{tests!Tukey's HSD} (honestly significant difference) test for pairwise comparisons, applied to models fitted with \Rfunction{aov()}. Function \Rfunction{pairwise.t.test()} supports different correction methods for the $P$-values from simultaneous $t$-tests. Function \Rfunction{p.adjust()} applies adjustments to $P$-values, and can be used when the test procedure does not apply them.
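-
-A minimal sketch of the first two approaches, using the \Rdata{InsectSprays} data from above (\Rfunction{TukeyHSD()} is applied to a fit done with \Rfunction{aov()}, and \code{"holm"} is just one of the available adjustment methods):
-
-<>=
-TukeyHSD(aov(count ~ spray, data = InsectSprays))
-pairwise.t.test(InsectSprays$count, InsectSprays$spray,
-                p.adjust.method = "holm")
-@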
-The most comprehensive implementation of multiple comparisons is available in package \pkgname{multcomp}. Function \Rfunction{glht()} (general linear hypothesis testing) from this package supports different contrasts and adjustment methods.
-\end{explainbox}
-
-Contrasts and their interpretation are discussed in detail by \citeauthor{Venables2002} (\citeyear{Venables2002}) and \citeauthor{Crawley2012} (\citeyear{Crawley2012}).
-\index{models!contrasts|)}
-\index{linear models!analysis of variance|)}%
-
-\subsection{Analysis of covariance, ANCOVA}
-%\index{analysis of covariance}
-\index{analysis of covariance|see{linear models, analysis of covariance}}
-\index{linear models!analysis of covariance}
-\index{ANCOVA|see{linear models, analysis of covariance}}
-
-When a linear model includes both explanatory factors and continuous explanatory variables, we may call it \emph{analysis of covariance} (ANCOVA). The formula syntax is the same for all linear models and, as mentioned in previous sections, what determines the type of analysis is the nature of the explanatory variable(s). As the formulation remains the same, no specific example is given. The main difficulty of ANCOVA is in the selection of the covariate and the interpretation of the results of the analysis \autocite[e.g.][]{Smith1957}.
-\index{linear models|)}
-
-\subsection{Model update and selection}\label{sec:stat:update:step}
-\index{models!updating|(}
-Model fit objects can be updated, i.e., modified, because they contain not only the results of the fit but also the data to which the model was fit (see page \pageref{box:LM:fit:object}). Given that the call is also stored, all the information needed to recalculate the same fit is available. Method \Rfunction{update()} makes it possible to recalculate the fit with changes to the call, without passing again all the arguments in a new call to \Rfunction{lm()} (Figure \ref{fig:lm:update:diagram}). We can modify different arguments, including selecting part of the data by passing a new argument to formal parameter \code{subset}.
-
-\begin{figure}
-  \centering
-  \begin{small}
-\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
-\node (model) [tprocess] {\textsl{model}};
-\node (data) [tprocess, below of=model, yshift=-0.4cm] {\textsl{observations}};
-\node (fitfun) [tprocess, right of=model, yshift=-0.9cm, xshift=1cm] {\code{lm()}};
-\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.8cm] {\code{lm} \textsl{object} $\to$ \code{object}};
-\node (update) [tprocess, color = black, right of=fm, xshift = 1.8cm, fill=blue!5] {\code{update()}};
-\node (newmodel) [tprocess, above of=fm, fill=blue!5, yshift=-0.3cm] {\textsl{new model} $\to$ \code{formula}};
-\node (newdata) [tprocess, below of=fm, fill=blue!5, yshift=0.3cm] {\textsl{new observs.} $\to$ \code{data}};
-\node (newfm) [tprocess, color = black, right of=update, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
-\draw [arrow] (model) -- (fitfun);
-\draw [arrow] (data) -- (fitfun);
-\draw [arrow] (fitfun) -- (fm);
-\draw [arrow] (fm) -- (update);
-\draw [arrow] (newmodel) -- (update);
-\draw [arrow] (newdata) -- (update);
-\end{tikzpicture}
-\end{small}
-  \caption[Updating a fitted model]{Diagram showing the steps for updating a fitted model (in filled boxes) together with the previous steps in unfilled boxes.
-Please, see Figure \ref{fig:lm:fit:diagram} for other details.}\label{fig:lm:update:diagram}
-\end{figure}
-
-Method \Rfunction{update()} retrieves the call from the model fit object using \Rfunction{getCall()}, modifies it and, by default, evaluates it. The default \Rfunction{update()} method works as long as the model-fit object contains a member named \code{call} or a specialization of \Rfunction{getCall()} is available. Thus, method \Rfunction{update()} can be used with models fitted with other functions in addition to \Rfunction{lm()}.
-
-For the next example we recreate the model fit object \code{fm4} from page \pageref{chunk:stat:fm4}.
-
-<>=
-fm4 <- lm(count ~ spray, data = InsectSprays)
-anova(fm4)
-fm4a <- update(fm4, formula = log10(count + 1) ~ spray)
-anova(fm4a)
-@
-
-\begin{playground}
-Print \code{fm4\$call} and \code{fm4a\$call}. These two calls differ in the argument to \code{formula}. What other members have been updated in \code{fm4a} compared to \code{fm4}?
-\end{playground}
-
-In the chunk above we replaced the argument passed to \code{formula}. This is a frequent use but, for example, to fit the same model to a subset of the data, we can pass a suitable argument to parameter \code{subset}.
-
-<>=
-fm4b <- update(fm4, subset = !spray %in% c("A", "B"))
-anova(fm4b)
-@
-
-\begin{advplayground}
-When having many treatments with long names, which is not the case here, instead of listing the factor levels for which to subset the data, it can be convenient to use regular expressions for pattern matching (see section \ref{sec:calc:regex} on page \pageref{sec:calc:regex}). Run the code below, and investigate why \code{anova(fm4b)} and \code{anova(fm4c)} produce the same ANOVA table printout although the fitted model objects are not identical. You can use \code{str()} to explore whether any members differ between the two objects.
-
-<>=
-fm4c <- update(fm4, subset = !grepl("[AB]", spray))
-anova(fm4c)
-identical(fm4b, fm4c)
-@
-
-\end{advplayground}
-
-\begin{explainbox}
-Method \Rfunction{update()} plays an additional role when fitting is done by numerical approximation, as the previously computed estimates are used as the starting values for the numerical calculations required for fitting the updated model (see section \ref{sec:stat:NLS} on page \pageref{sec:stat:NLS} for an example). This can drastically decrease computation time, or even ease the task of finding suitable starting values for parameter estimates, by fitting increasingly more complex nested models.
-\end{explainbox}
-
-\index{models!updating|)}
-
-\index{models!stepwise selection|(}\index{linear models!stepwise model selection|(}
-Method \Rfunction{update()}, used together with \Rfunction{AIC()} (or \Rfunction{anova()}), gives us the tools to compare nested models and select one out of a group of them, as shown above. When comparing several models, doing the comparisons manually is tedious and, in scripts, it is in many cases difficult to write code that is flexible (or abstract) enough. Method \Rfunction{step()} automates the stepwise selection of nested models, such as selecting among polynomials of different degrees or deciding which variables to retain in multiple regression. After fitting a model, method \Rfunction{step()} is used to update this model using an automatic stopping criterion (Figure \ref{fig:lm:step:diagram}).
-
-\begin{figure}
-  \centering
-\begin{small}
-\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
-\node (model) [tprocess] {\textsl{model}};
-\node (data) [tprocess, below of=model, yshift=-0.4cm] {\textsl{observations}};
-\node (fitfun) [tprocess, right of=model, yshift=-0.9cm, xshift=1cm] {\code{lm()}};
-\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.8cm] {\code{lm} \textsl{object} $\to$ \code{object}};
-\node (step) [tprocess, color = black, right of=fm, xshift = 1.8cm, fill=blue!5] {\code{step()}};
-\node (newmodels) [tprocess, dashed, above of=fm, fill=blue!5, yshift=-0.3cm] {\textsl{new model(s)} $\to$ \code{scope}};
-\node (newfm) [tprocess, color = black, right of=step, xshift=1.5cm, fill=blue!5] {\code{lm} \textsl{object}};
-\draw [arrow] (model) -- (fitfun);
-\draw [arrow] (data) -- (fitfun);
-\draw [arrow] (fitfun) -- (fm);
-\draw [arrow] (fm) -- (step);
-\draw [arrow, dashed] (newmodels) -- (step);
-\draw [arrow] (step) -- (newfm);
-\end{tikzpicture}
-\end{small}
-  \caption[Stepwise model selection]{Diagram showing the steps used for stepwise model selection among nested models (in filled boxes) together with the previous steps in unfilled boxes. The range of models to select from can be set by the user. Please, see Figure \ref{fig:lm:fit:diagram} for other details.}\label{fig:lm:step:diagram}
-\end{figure}
-
-Stepwise model selection, either in the \emph{forward} direction from simpler to more complex models, in the \emph{backward} direction from more complex to simpler models, or in both directions, is implemented in base \Rlang's \emph{method} \Rfunction{step()} using Akaike's information criterion (AIC)\index{Akaike's An Information Criterion@Akaike's \emph{An Information Criterion}} as the selection criterion. Use of method \Rfunction{step()} from \Rlang is possible, for example, with \code{lm()} and \code{glm()} fits. AIC is described on page \pageref{par:stats:AIC}.
-
-For the next example we use \code{fm3} from page \pageref{chunk:stats:fm3}, a linear model for a polynomial regression. If, as shown here, no models are passed through formal parameter \code{scope}, the previously fitted model will be simplified, if possible. Method \Rfunction{step()} by default prints to the console a trace of the models tried and the corresponding AIC estimates.
-
-<>=
-fm3 <- lm(dist ~ speed + I(speed^2), data = cars)
-fm3a <- step(fm3)
-@
-
-Method \Rfunction{summary()} reveals the differences between the original and updated models.
-
-<>=
-summary(fm3)
-summary(fm3a)
-@
-
-If we pass a model with additional terms through parameter \code{scope}, this model will be taken as the most complex model to be assessed. If, instead of one model, we pass two nested models in a list and name them \code{lower} and \code{upper}, they will delimit the scope of the stepwise search. In the next example we see that first a backward search is done and term \code{speed} is removed, as its removal decreases the AIC. Subsequently a forward search is done, unsuccessfully, for a model with smaller AIC.
-
-<>=
-fm3b <-
-  step(fm3,
-       scope = dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4))
-summary(fm3b)
-@
-
-\begin{playground}
-Explain why the stepwise model selection in the code below differs from those in the two previous examples. Consult \code{help(step)} if necessary.
-
-<>=
-fm3c <-
-  step(fm3,
-       scope = list(lower = dist ~ speed,
-                    upper = dist ~ speed + I(speed^2) + I(speed^3) + I(speed^4)))
-summary(fm3c)
-@
-
-\end{playground}
-
-Functions \Rfunction{update()} and \Rfunction{step()} are \emph{convenience functions}, as they provide direct and/or simpler access to operations available through other functions or through the combined use of multiple functions.
-\index{linear models!stepwise model selection|)}\index{models!stepwise selection|)}
-
-\section{Generalized linear models}\label{sec:stat:GLM}
-\index{generalized linear models|(}\index{models!generalized linear|see{generalized linear models}}
-\index{GLM|see{generalized linear models}}
-
-Linear models make the assumption of normally distributed residuals. Generalized linear models, fitted with function \Rfunction{glm()}, are more flexible and allow the assumed distribution to be selected as well as the link function (defaults are as in \code{lm()}). Figure \ref{fig:glm:fit:diagram} shows that the steps used to fit a model with \Rfunction{glm()} are the same as with \code{lm()}, except that we can select the probability distribution assumed to describe the variation among observations. Frequently used probability distributions are the binomial and the Poisson (see \code{help(family)} for the variations and additional ones).\index{models!for binomial outcomes data}\index{models!for counts data}
-
-\begin{figure}
-  \centering
-\begin{small}
-\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
-\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
-\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
-\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}};
-\node (family) [tprocess, below of=weights, fill=blue!5, yshift = 0.4cm] {\textsl{distribution} $\to$ \code{family}};
-\node (fitfun) [tprocess, right of=data, xshift=2.5cm, yshift = -0.4cm,fill=blue!5] {\code{glm()}};
-\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{glm} \textsl{object}};
-\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{summary()}};
-\node (anova) [tprocess, color = black, below of=summary, yshift = 0.4cm] {\code{anova()}};
-\node (plot) [tprocess, color = black, above of=summary, yshift = -0.4cm] {\code{plot()}};
-\draw [arrow] (model) -- (fitfun);
-\draw [arrow] (data) -- (fitfun);
-\draw [arrow, dashed] (weights) -- (fitfun);
-\draw [arrow] (family) -- (fitfun);
-\draw [arrow] (fitfun) -- (fm);
-\draw [arrow] (fm) -- (plot);
-\draw [arrow] (fm) -- (anova);
-\draw [arrow] (fm) -- (summary);
-\end{tikzpicture}
-\end{small}
-  \caption[Generalized linear model fitting in \Rlang]{Generalized linear model fitting in \Rlang is done in steps similar to those used for linear models. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show a generalized linear model fit. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{glm()}. Only the three most frequently used query methods are shown, while both response and explanatory variables are under \textsl{observations}. Dashed boxes and arrows are optional as defaults are provided.}\label{fig:glm:fit:diagram}
-\end{figure}
-
-For count data, GLMs are preferred over LMs. In the example below we fit the same model as above, but assuming a quasi-Poisson distribution instead of the Normal.
-An argument passed to \code{family} selects the assumed error distribution. As described above, the \Rdata{InsectSprays} data set gives insect counts in plots sprayed with different insecticides, with \code{spray} a factor with six levels.
-
-<>=
-fm10 <- glm(count ~ spray, data = InsectSprays, family = quasipoisson)
-@
-
-Method \Rfunction{plot()}, as for linear-model fits, produces diagnostic plots. As above, we show the q-q plot of residuals. The Normal distribution assumed in the linear model fit was not a good approximation (section \ref{sec:anova} on page \pageref{sec:anova}), as count data are known to follow a different distribution. This becomes clear by comparing the quantile--quantile plot for \code{fm4} (page \pageref{sec:anova}) and the plot below for the model fit under the assumption of a quasi-Poisson distribution.
-
-<>=
-plot(fm10, which = 2)
-@
-
-The printout from the \Rfunction{anova()} method for GLM fits has some differences to that for LM fits. In \Rlang versions earlier than 4.4.0, no test statistics or $P$-values were computed unless requested by passing an argument to parameter \code{test}. In newer versions of \Rlang, either a chi-squared test or an $F$-test is computed by default, depending on whether the dispersion is fixed or free. We here use \code{"F"} as an argument to request an $F$-test.
-
-<>=
-anova(fm10, test = "F")
-@
-
-We can extract different components similarly as described for linear models (see section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}).
-
-<>=
-class(fm10)
-summary(fm10)
-head(residuals(fm10))
-head(fitted(fm10))
-@
-
-\begin{explainbox}
-If we use \code{str()} or \code{names()} we can see that there are some differences with respect to linear model fits. The returned object is of a different class and contains some members not present in linear models. Two of these have to do with the iterative approximation method used: \code{iter} contains the number of iterations used and \code{converged} reports the success or not in finding a solution.
-
-<>=
-class(fm10)
-names(fm10)
-fm10$converged
-fm10$iter
-@
-\end{explainbox}
-
-Methods \code{update()} and \code{step()}, described for \code{lm()} in section \ref{sec:stat:update:step} on page \pageref{sec:stat:update:step}, can also be used with models fitted with \code{glm()}.
-
-\index{generalized linear models|)}
-
-\section{Non-linear regression}\label{sec:stat:NLS}
-\index{non-linear models|(}%
-\index{models!non-linear|see{non-linear models}}%
-\index{NLS|see{non-linear models}}
-
-By \emph{non-linear} it is meant non-linear \emph{in the parameters} whose values are being estimated through fitting the model to observations. This is different from the shape of the function when plotted---i.e., polynomials of any degree are linear models. In contrast, the Michaelis-Menten equation used in chemistry and the Gompertz equation used to describe growth are models non-linear in their parameters.
-
-While analytical algorithms exist for finding estimates for the parameters of linear models, in the case of non-linear models the estimates are obtained by approximation. For analytical solutions, estimates can always be obtained (except in pathological cases affected by the limitations of floating point numbers described on page \pageref{box:integer:float}). For approximations obtained through iteration, cases when the algorithm fails to \emph{converge} onto an answer are relatively common.
-Iterative algorithms attempt to improve an initial guess for the values of the parameters to be estimated, a guess frequently supplied by the user. In each iteration the estimate obtained in the previous iteration is used as the starting value, and this process is repeated time after time. The expectation is that after a finite number of iterations the algorithm will converge onto a solution that ``cannot'' be improved further. In real life we stop iterating when the improvement in the fit is smaller than a certain threshold, or when no convergence has been achieved after a certain maximum number of iterations. In the first case, we usually obtain good estimates; in the second case, we do not obtain usable estimates and need to look for different ways of obtaining them.
-
-When convergence fails, the first thing to do is to try different starting values and, if this also fails, to switch to a different computational algorithm. These steps usually help, but not always. Good starting values are in many cases crucial, and in some cases ``guesses'' can be obtained using either graphical or analytical approximations.
-
-Function \Rfunction{nls()} is \Rlang's workhorse for fitting non-linear models. The steps for its use are similar to those for LM and GLM (Figure \ref{fig:nls:fit:diagram}). One difference is that starting values are needed, and another is in how the model to be fitted is specified: the user provides the names of the parameters and a model equation that includes in the \emph{rhs} a call to an \Rlang function.
-
-\begin{figure}
-  \centering
-\begin{small}
-\begin{tikzpicture}[node distance=1.4cm, scale=0.5]
-\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}};
-\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}};
-\node (weights) [tprocess, dashed, below of=data, yshift = 0.4cm, fill=blue!1] {\textsl{weights} $\to$ \code{weights}};
-\node (guess) [tprocess, below of=weights, fill=blue!5, yshift = 0.4cm] {\textsl{guesses} $\to$ \code{start}};
-\node (fitfun) [tprocess, right of=data, xshift=2.5cm, yshift = -0.4cm, fill=blue!5] {\code{nls()}};
-\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{nls} \textsl{object}};
-\node (summary) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{summary()}};
-\node (anova) [tprocess, color = black, below of=summary, yshift = 0.4cm] {\code{anova()}};
-\node (plot) [tprocess, color = black, above of=summary, yshift = -0.4cm] {\code{plot()}};
-\draw [arrow] (model) -- (fitfun);
-\draw [arrow] (data) -- (fitfun);
-\draw [arrow, dashed] (weights) -- (fitfun);
-\draw [arrow] (guess) -- (fitfun);
-\draw [arrow] (fitfun) -- (fm);
-\draw [arrow] (fm) -- (plot);
-\draw [arrow] (fm) -- (anova);
-\draw [arrow] (fm) -- (summary);
-\end{tikzpicture}
-\end{small}
-  \caption[Non-linear model fitting in \Rlang]{Non-linear model fitting in \Rlang is done in steps. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show a non-linear model fit. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{nls()}. Only the three most frequently used query methods are shown, while both response and explanatory variables are under \textsl{observations}.
-Dashed boxes and arrows are optional as defaults are provided.}\label{fig:nls:fit:diagram}
-\end{figure}
-
-In cases when algorithms exist for ``guessing'' suitable starting values, \Rlang provides a mechanism for packaging the \Rlang function to be fitted together with the \Rlang function generating the starting values. These functions go by the name of \emph{self-starting functions} and relieve the user from the burden of guessing and supplying suitable starting values. The\index{self-starting functions} self-starting functions available in \Rlang are \code{SSasymp()}, \code{SSasympOff()}, \code{SSasympOrig()}, \code{SSbiexp()}, \code{SSfol()}, \code{SSfpl()}, \code{SSgompertz()}, \code{SSlogis()}, \code{SSmicmen()}, and \code{SSweibull()}. Function \code{selfStart()} can be used to define new ones. All these functions can be used when fitting models with \Rfunction{nls()} or \Rfunction{nlme()}. Please, check the respective help pages for details.
-
-\begin{warningbox}
-In calls to \Rfunction{nls()}, the rhs of the model \code{formula} is a function call. The names of its arguments, if not present in \code{data}, are assumed to be parameters to be fitted. Below, a call to the named function \Rfunction{SSmicmen()} forms the rhs of the formula: \code{conc} is found in \code{data}, while \code{Vm} and \code{K} are taken as the parameters to be estimated.
-\end{warningbox}
-
-As an example, the Michaelis-Menten equation\index{Michaelis-Menten equation} describing reaction kinetics\index{chemical reaction kinetics} in biochemistry and chemistry is fitted to the \Rdata{Puromycin} data set. The mathematical formulation is given by
-
-\begin{equation}\label{eq:michaelis:menten}
-v = \frac{\mathrm{d} [P]}{\mathrm{d} t} = \frac{V_{\mathrm{max}} [S]}{K_{\mathrm{M}} + [S]}
-\end{equation}
-
-and is implemented in \Rlang under the name \Rfunction{SSmicmen()} as a self-starting function.
-
-<>=
-data(Puromycin)
-names(Puromycin)
-@
-
-<>=
-fm21 <- nls(rate ~ SSmicmen(conc, Vm, K), data = Puromycin,
-            subset = state == "treated")
-@
-
-As for other fitted models, we use query methods (see section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}).
-
-<>=
-class(fm21)
-summary(fm21)
-residuals(fm21)
-fitted(fm21)
-@
-
-\begin{explainbox}
-Methods \code{str()} and \code{names()} can reveal differences with respect to linear and generalized linear models. The fitted model object, of class \code{nls}, contains additional members but lacks others. Two members are related to the iterative approximation method used: \code{control}, containing nested members holding the iteration settings, and \code{convInfo} (convergence information), with nested members describing the outcome of the iterative algorithm.
-
-<>=
-class(fm21)
-names(fm21)
-@
-
-<>=
-fm21$convInfo
-@
-\end{explainbox}
-
-Method \code{update()}, described for \code{lm()} in section \ref{sec:stat:update:step} on page \pageref{sec:stat:update:step}, can also be used with models fitted with \code{nls()}. The reuse of previous estimates as starting guesses for updates is an important feature.
-
-\index{non-linear models|)}
-
-\section{Splines and local regression}\label{sec:stat:splines}
-\index{smoothing splines|(}\index{local polynomial regression|(}%
-\index{LOESS|see{local polynomial regression}}%
-The name ``spline'' derives from the tool used by draftsmen to draw smooth curves. Originally, a spline of soft wood was used as a flexible guide to draw arbitrary curves. Later the wood splines were replaced by a rod of flexible metal, such as lead, encased in plastic or similar material, but the original name persisted. In mathematics, splines are functions that describe smooth and flexible curves.
- -Most of the model fits given above as examples produce estimates for parameters that are interpretable in the real world: directly in the case of mechanistic models, like the estimates of reaction constants, or at least as a broad description of the relationship between two variables, as in the case of linear regression. In the case of polynomials with degree higher than 2, parameter estimates no longer directly describe features of the data. - -Splines take this a step further: their parameter estimates are of no practical interest. The interest resides in the overall shape and position of the predicted curve. Splines consist of knots (or connection points) joined by straight or curved fitted lines, i.e., they are \emph{piecewise} functions. The simplest splines are piecewise linear, given by chained straight-line segments connecting knots. - -In more complex splines the segments are polynomials, frequently cubic polynomials, that fulfil certain constraints at the knots: for example, that the slope or first derivative is the same for the two curve ``pieces'' at the knot where they are connected. This constraint ensures that the curve is smooth. In some cases similar constraints are imposed on higher-order derivatives, for example on the second derivative, to ensure that the curve of the first derivative is also smooth at the knots. - -Splines are used in free-hand drawing with computers to draw arbitrary smooth curves. They are also used for interpolation, in which case observations, assumed to be error-free, become the knots of a spline used to approximate intermediate values. Finally, splines can be used as models to be fitted to observations subject to random variation. In this case splines fulfil the role of smoothers, providing a curve that broadly describes a relationship among variables. - -Splines are frequently used as smooth curves in plots as described in section \ref{sec:plot:smoothers} on page \pageref{sec:plot:smoothers}. Function \Rfunction{spline()} is used for interpolation and function \Rfunction{smooth.spline()} for smoothing by fitting a cubic spline (a spline where the knots are connected by third-degree polynomials). Function \Rfunction{smooth.spline()} has a different user interface than the model fit functions described above, as it accepts only \code{numeric} vectors as arguments to its parameters \code{x} and \code{y} (Figure \ref{fig:spline:fit:diagram}). Additional parameters make it possible to override the defaults for the number of knots and to adjust the stiffness, or tendency towards a straight line. Differently to other fit functions, the \code{plot()} method produces a plot of the prediction. As no model formula is used, only one curve at a time is fitted and no statistical tests involving groups are possible. The most commonly used query functions are thus not the same as for linear and non-linear models. 
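- -As a minimal sketch of this user interface, using the \Rdata{cars} data set that is also used below, a spline can be fitted and then evaluated at arbitrary \code{x} values with \code{predict()}. (The object name \code{fs0} and the \code{x} values are arbitrary choices for illustration.) - -<<>>= -# fit a smoothing spline, then evaluate it at three new x values -fs0 <- smooth.spline(x = cars$speed, y = cars$dist) -predict(fs0, x = c(10, 15, 20)) -@ 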
- -\begin{figure} - \centering -\begin{small} -\begin{tikzpicture}[node distance=1.4cm, scale=0.5] -\node (x) [tprocess, fill=blue!5] {\textsl{obs.} $\to$ \code{x}}; -\node (y) [tprocess, below of=x, fill=blue!5] {\textsl{obs.} $\to$ \code{y}}; -\node (fitfun) [tprocess, right of=x, yshift=-0.7cm, xshift=2cm, fill=blue!5] {\code{smooth.spline()}}; -\node (fm) [tprocess, color = black, right of=fitfun, xshift=2.3cm, fill=blue!5] {\code{smooth.spline} \textsl{obj.}}; -\node (pred) [tprocess, color = black, right of=fm, xshift=2.3cm] {\code{predict()}}; -\node (fitted) [tprocess, color = black, above of=pred, yshift = -0.4cm] {\code{fitted()}}; -\node (resid) [tprocess, color = black, below of=pred, yshift = +0.4cm] {\code{residuals()}}; -\draw [arrow] (x) -- (fitfun); -\draw [arrow] (y) -- (fitfun); -\draw [arrow] (fitfun) -- (fm); -\draw [arrow] (fm) -- (pred); -\draw [arrow] (fm) -- (fitted); -\draw [arrow] (fm) -- (resid); -\end{tikzpicture} -\end{small} - \caption[Fitting smooth splines in \Rlang]{Fitting of smooth splines in \Rlang. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show the fitting of splines. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{smooth.spline()}. Only the three most frequently used query methods are shown, while response and explanatory variables are passed separately to \textsl{x} and \textsl{y}.}\label{fig:spline:fit:diagram} -\end{figure} - -<<>>= -fs1 <- smooth.spline(x = cars$speed, y = cars$dist) -print(fs1) -plot(fs1, type = "l") -points(x = cars$speed, y = cars$dist) -@ - -Function \Rfunction{loess()} implements \emph{local polynomial regression}. It fits a polynomial curve or surface (i.e., more than one explanatory variable can be included in the model formula) using locally weighted fitting. Its user interface is rather similar to that of \code{glm()}, with \code{formula}, \code{family} and \code{data} formal parameters (Figure \ref{fig:loess:fit:diagram}). Additional parameters control the ``stiffness'' of the fit, i.e., the extent of the local data used for fitting (how much weight is given to observations as a function of their distance). Whether the fit for individual explanatory variables is local or not can be controlled through parameter \code{parametric}. 
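- -As a minimal sketch, again based on the \Rdata{cars} data set used below, predictions from a fitted loess model can be computed with \code{predict()}. (The object name \code{floc0} and the \code{speed} values are arbitrary choices for illustration.) - -<<>>= -# fit a local polynomial regression, then predict at three new speeds -floc0 <- loess(dist ~ speed, data = cars) -predict(floc0, newdata = data.frame(speed = c(10, 15, 20))) -@ 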
- -\begin{figure} - \centering -\begin{small} -\begin{tikzpicture}[node distance=1.4cm, scale=0.5] -\node (model) [tprocess] {\textsl{model} $\to$ \code{formula}}; -\node (data) [tprocess, below of=model, yshift = 0.4cm] {\textsl{observations} $\to$ \code{data}}; -\node (weights) [tprocess, dashed, below of=data, fill=blue!1, yshift = 0.4cm] {\textsl{weights} $\to$ \code{weights}}; -\node (family) [tprocess, below of=weights, fill=blue!5, yshift = 0.4cm] {\textsl{distribution} $\to$ \code{family}}; -\node (fitfun) [tprocess, right of=data, xshift=2.5cm, yshift = -0.4cm, fill=blue!5] {\code{loess()}}; -\node (fm) [tprocess, color = black, right of=fitfun, xshift=1.5cm, fill=blue!5] {\code{loess} \textsl{object}}; -\node (pred) [tprocess, color = black, right of=fm, xshift=1.7cm] {\code{predict()}}; -\node (fitted) [tprocess, color = black, above of=pred, yshift = -0.4cm] {\code{fitted()}}; -\node (resid) [tprocess, color = black, below of=pred, yshift = +0.4cm] {\code{residuals()}}; -\draw [arrow] (model) -- (fitfun); -\draw [arrow] (data) -- (fitfun); -\draw [arrow, dashed] (weights) -- (fitfun); -\draw [arrow] (family) -- (fitfun); -\draw [arrow] (fitfun) -- (fm); -\draw [arrow] (fm) -- (pred); -\draw [arrow] (fm) -- (fitted); -\draw [arrow] (fm) -- (resid); -\end{tikzpicture} -\end{small} - \caption[Loess model fitting in \Rlang]{Loess model fitting in \Rlang is done in steps. Generic diagram from Figure \ref{fig:model:fit:diagram} redrawn to show local polynomial regression model fitting. Non-filled boxes are shared with fitting of other types of models, and filled ones are specific to \Rfunction{loess()}. Only the three most frequently used query methods are shown, while both response and explanatory variables are under \textsl{observations}. Dashed boxes and arrows are optional as defaults are provided.}\label{fig:loess:fit:diagram} -\end{figure} - -<<>>= -floc <- loess(dist ~ speed, data = cars) -class(floc) -summary(floc) -@ - -\begin{warningbox} - Function \Rfunction{anova()} can be used to compare two or more loess fits, but not on a single one. -\end{warningbox} - -\begin{explainbox} - Several modern approaches to data analysis, which do provide estimates of effects' significance and sizes, are based on the use of splines to describe the responses and even variance. Among them are additive models such as GAM and related methods \autocite[see][]{Wood2017} and functional data analysis (FDA) \autocite{Ramsay2009}. These methods are outside the scope of this book and are implemented in specialized extension packages. -\end{explainbox} -\index{smoothing splines|)}\index{local polynomial regression|)}\index{models!fitting|)}% - -\section{Model formulas}\label{sec:stat:formulas} -\index{model formulas|(}\index{models!specification|see{model formulas}}% -Model formulas, such as \code{y\,\char"007E\,x}, are widely used in \Rlang, both in model fitting as exemplified in previous sections of this chapter and in plotting when using base \Rlang \Rmethod{plot()} methods. - -\Rlang is consistent and flexible in how it treats various objects, to an extent that can be surprising to those familiar with other computer languages. Model formulas are objects of class \Rclass{formula} and mode \Rclass{call} and can be manipulated and stored similarly to objects of other classes. - -<<>>= -class(y ~ x) -mode(y ~ x) -@ - -Like any other \Rlang object, formulas can be assigned to variables and be members of lists and vectors. 
Consequently, the first linear model fit example from page \pageref{chunk:lm:models1} can be rewritten as follows. - -<<>>= -my.formula <- dist ~ 1 + speed -fm1 <- lm(my.formula, data = cars) -@ - -In some situations, e.g., calculation of correlations, models lacking a \emph{lhs} term (a term on the left-hand side of \code{\,\char"007E\,}) are used. At least one term must be present in the \emph{rhs} of model formulas, as an expression ending in \code{\,\char"007E\,} is syntactically incomplete. - -<<>>= -class(~ x + y) -mode(~ x + y) -is.empty.model(~ x + y) -@ - -\begin{explainbox} -Some details of \Rlang formulas can be important in advanced scripts. Two kinds of ``emptiness'' are possible for formulas. As with other classes, empty objects or vectors of length zero are valid and can be created with the class constructor. In the case of formulas there is an additional kind of emptiness, a formula describing a model with no explanatory terms on its \emph{rhs}. - -An ``empty'' object of class \Rclass{formula} can be created by a call to \code{formula()} with no arguments, similarly to how a numeric vector of length zero is created by the call \code{numeric()}. The last, commented out, statement in the code below triggers an error as the argument passed to \Rfunction{is.empty.model()} is of length zero. (This behaviour is not consistent with \Rclass{numeric} vectors of length zero; see for example the value returned by \code{is.finite(numeric())}.) - -<<>>= -class(formula()) -mode(formula()) -length(formula()) -# is.empty.model(formula()) -@ - -A model formula describing a model with no explanatory terms on the rhs is considered empty even if it is a valid object of class \Rclass{formula} and, thus, not missing. While \code{y\ \char"007E\ 1} describes a model with only an intercept (estimating $a = \bar{y}$), \code{y\ \char"007E\ 0}, or its equivalent \code{y\,\char"007E\,-1}, describes an empty model that cannot be fitted to data. - -<<>>= -class(y ~ 0) -mode(y ~ 0) -is.empty.model(y ~ 0) -is.empty.model(y ~ 1) -is.empty.model(y ~ x) -@ - -The value returned by \Rmethod{length()} on a single formula is not 1, as one might expect, but instead the number of components in the formula. On longer vectors of formulas, it does return the number of member formulas. Because of this, it is better to store model formulas in objects of class \Rclass{list} than in vectors, as \Rfunction{length()} consistently returns the expected value on lists. - -<<>>= -length(formula()) -length(y ~ 0) -length(y ~ 1) -length(y ~ x) -length(c(y ~ 1, y ~ x)) -length(list(y ~ 1)) -length(list(y ~ 1, y ~ x)) -@ - -As described above, \Rfunction{length()} applied to a single formula and to a list of formulas behaves differently. To call \Rfunction{length()} on each member of a list of formulas, we can use \code{sapply()} (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}). As function \Rfunction{is.empty.model()} is not vectorized, we also have to use \code{sapply()} with a list of formulas. - -<<>>= -sapply(list(y ~ 0, y ~ 1, y ~ x), length) -sapply(list(y ~ 0, y ~ 1, y ~ x), is.empty.model) -@ - -\end{explainbox} - -In the examples in previous sections we fitted simple models. More complex ones can be easily formulated using the same syntax. First of all, one can avoid the use of operator \code{*} and explicitly define all individual main effects and interactions using operators \code{+} and \code{:}\,. 
The syntax implemented in base \Rlang allows grouping by means of parentheses, so it is also possible to exclude some interactions by combining the use of \code{*} and parentheses. - -The same symbols as for arithmetic operators are used for model formulas. Within a formula, symbols are interpreted according to formula syntax. When we mean an arithmetic operation that could be interpreted as being part of the model formula, we need to ``protect'' it by means of the identity function \Rfunction{I()}. The next two examples define formulas for models with only one explanatory variable. With formulas like these, the explanatory variable will be computed on the fly when fitting the model to data. In the first case below we need to explicitly protect the addition of the two variables into their sum, because otherwise they would be interpreted as two separate explanatory variables in the model. In the second case, \Rfunction{log()} cannot be interpreted as part of the model formula, and consequently does not require additional protection; neither does the expression passed as its argument. - -<<>>= -y ~ I(x1 + x2) -y ~ log(x1 + x2) -@ - -\Rlang formula syntax allows alternative ways for specifying interaction terms. These provide ``abbreviated'' ways of entering formulas, which for complex experimental designs save typing and can improve clarity. As seen above, operator \code{*} saves us from having to explicitly indicate all the interaction terms in a full factorial model. - -<<>>= -y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3 -@ - -This formula can be replaced by a concise equivalent. - -<<>>= -y ~ x1 * x2 * x3 -@ - -When the model to be specified does not include all possible interaction terms, we can combine the concise notation with parentheses. Below, equivalent formulas are shown using concise and verbose notation. - -<<>>= -y ~ x1 + (x2 * x3) -y ~ x1 + x2 + x3 + x2:x3 -@ - -<<>>= -y ~ x1 * (x2 + x3) -y ~ x1 + x2 + x3 + x1:x2 + x1:x3 -@ - -The \code{\textasciicircum{}} operator provides a concise notation to limit the order of the interaction terms included in a formula. - -<<>>= -y ~ (x1 + x2 + x3)^2 -y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 -@ - -Operator \code{\%in\%} can also be used as a shortcut for including only some of all the possible interaction terms in a formula. - -<<>>= -y ~ x1 + x2 + x1 %in% x2 -@ - -\begin{playground} -Whether two of the model formulas above are equivalent or not can be investigated using function \Rfunction{terms()}. - -<<>>= -terms(y ~ x1 + (x2 * x3)) -terms(y ~ x1 * (x2 + x3)) -terms(y ~ (x1 + x2 + x3)^2) -terms(y ~ x1 + x2 + x1 %in% x2) -@ -\end{playground} - -\begin{advplayground} -For operator \code{\textasciicircum{}} to behave as expected, its first operand should be a formula with no interactions! Compare the result of expanding these two formulas with \Rfunction{terms()}. - -<<>>= -y ~ (x1 + x2 + x3)^2 -y ~ (x1 * x2 * x3)^2 -@ -\end{advplayground} - -\begin{advplayground} -Run the code examples below using the \Rdata{npk} data set from \Rlang. They demonstrate the use of different model formulas in ANOVA\index{analysis of variance!model formula}. Use these examples plus your own variations on the same theme to build your understanding of the syntax of model formulas. Based on the terms displayed in the ANOVA tables, first work out what models are being fitted in each case. In a second step, write each of the models using a mathematical formulation. Finally, think about how model choice may affect the conclusions from an analysis of variance. 
- -% runs fine but crashes LaTeX -<<>>= -data(npk) -anova(lm(yield ~ N * P * K, data = npk)) -anova(lm(yield ~ (N + P + K)^2, data = npk)) -anova(lm(yield ~ N + P + K + P %in% N + K %in% N, data = npk)) -anova(lm(yield ~ N + P + K + N %in% P + K %in% P, data = npk)) -@ -\end{advplayground} - -Nesting of factors in experiments using hierarchical designs such as split-plots or repeated measures results in the need to compute additional error terms, differing in their degrees of freedom. In a nested design with fixed effects, effects are tested based on different error terms depending on the design of an experiment, i.e., depending on the randomization of the assignment of treatments to experimental units. In base-\Rlang model formulas, nesting is described by explicit definition of error terms by means of \code{Error()} within the formula. - -The syntax described above does not support complex statistical models as implemented in extension packages. For example, fitting of linear mixed-effects (LME) models is nowadays the preferred approach for the analysis of data from experiments and surveys based on hierarchical designs. These methods are implemented in packages \pkgname{nlme} \autocite{Pinheiro2000} and \pkgname{lme4} \autocite{Bates2015} that define extensions to the model formula syntax. The extensions make it possible to describe nesting and to distinguish fixed and random effects. Packages implementing the fitting of additive models have needed other extensions to the formula syntax. Additive model methods are described by \citeauthor{Wood2017} (\citeyear{Wood2017}) and \citeauthor{Zuur2012} (\citeyear{Zuur2012}). Although the overall approach and syntax are followed in most contributed packages, different packages have extended the formula syntax in different ways. These extensions fall outside the scope of this book. - -\begin{warningbox} - \Rlang will accept any syntactically correct model formula, even when the results of the fit are not interpretable. It is \emph{the responsibility of the user to ensure that models are meaningful}\index{models!nesting of factors}. The most common, and dangerous, mistake is specifying, for factorial experiments, models that are missing lower-order terms. - - Fitting models like those below to data from an experiment based on a three-way factorial design should be avoided. In both cases simpler terms are missing, while higher-order interaction(s) that include the missing term are included in the model. Such models are not interpretable, as the variation from the missing term(s) ends up being ``disguised'' within the remaining terms, distorting their apparent significance and parameter estimates. - -<<>>= -y ~ A + B + A:B + A:C + B:C -y ~ A + B + C + A:B + A:C + A:B:C -@ - - In contrast to those above, the models below are interpretable, even if not ``full'' models (not including all possible interactions). - -<<>>= -y ~ A + B + C + A:B + A:C + B:C -y ~ (A + B + C)^2 -y ~ A + B + C + B:C -y ~ A + B * C -@ - -\end{warningbox} - -\begin{explainbox} -\index{model formulas!manipulation}\textbf{Manipulation of model formulas.} Because this is a book about the \Rlang language, it is pertinent to describe how formulas can be manipulated. Formulas, like any other \Rlang objects, can be saved in variables, including as members of lists. Why is this useful? For example, if we want to fit several different models to the same data, we can write a \code{for} loop that walks through a list of model formulas, as sketched below (see section \ref{sec:R:faces:of:loops} on page \pageref{sec:R:faces:of:loops}). 
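- -The chunk below is a minimal sketch of this approach, assuming the \Rdata{cars} data set; the object names \code{formulas} and \code{fms} and the three model formulas compared are arbitrary choices for illustration. - -<<>>= -# fit a series of alternative models described by a list of formulas -formulas <- list(dist ~ 1, dist ~ speed, dist ~ speed + I(speed^2)) -fms <- list() -for (f in formulas) { - fms[[format(f)]] <- lm(f, data = cars) -} -sapply(fms, AIC) # compare the fitted models by their AIC -@ 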
Obviously, user-defined functions can accept formulas as arguments, as \code{lm()} and other model fitting functions do. In addition, it is relatively simple for user code to programmatically create and edit \Rlang formulas, in the same way as functions \code{update()} and \code{step()} do under the hood. - -A conversion constructor is available with name \Rfunction{as.formula()}. It is useful when formulas are input interactively by the user or read from text files. With \Rfunction{as.formula()} we can convert a character string into a formula. - -<<>>= -as.formula("y ~ x") -@ - -As\index{model formulas!conversion from character strings} there are many functions for the manipulation of character strings available in base \Rlang and through extension packages, it is easiest to build model formulas as strings. We can use functions like \code{paste()} to assemble a formula as text, and then use \Rfunction{as.formula()} to convert it to an object of class \code{formula}, usable for fitting a model. - -<<>>= -paste("y", "x", sep = "~") |> as.formula() -@ - -For\index{model formulas!conversion into character strings} the reverse operation of converting a formula into a string, methods \code{as.character()} and \code{format()} are available. The first of these methods returns a character vector containing the components of the formula as individual strings, while \code{format()} returns a single character string with the formula formatted for printing. - -<<>>= -as.character(y ~ x) -format(y ~ x) -@ - -This conversion makes it possible to edit a formula as a character string. - -<<>>= -format(y ~ x) |> gsub("x", "x + z", x = _) |> as.formula() -@ - -It\index{model formulas!updating} is also possible to \emph{edit} formula objects with method \Rfunction{update()}. In the replacement formula, a dot can stand for either the left-hand side (lhs) or the right-hand side (rhs) of the existing formula. We can also remove terms, as can be seen below. In some cases the dot corresponding to the lhs can be omitted, but including it makes the syntax clearer. - -<<>>= -my.formula <- y ~ x1 + x2 -update(my.formula, . ~ . + x3) -update(my.formula, . ~ . - x1) -update(my.formula, . ~ x3) -update(my.formula, z ~ .) -update(my.formula, . + z ~ .) -@ - -As \Rlang provides high-level functions for model selection, editing model formulas is not very frequently needed for model fitting. -\end{explainbox} - -A model matrix of numeric dummy variables is used in the actual computations. This matrix can be derived from a model formula, a choice of contrasts, and the data for the explanatory variables using function \Rfunction{model.matrix()}. -\index{model formulas|)} - -\section{Time series}\label{sec:stat:time:series} -\index{time series|(} -Longitudinal data consist of repeated measurements, usually done over time, on the same experimental units. Longitudinal data, when replicated on several experimental units at each time point, are called repeated measurements, while when not replicated, they are called time series. Base \Rlang provides special support for the analysis of time series data, while repeated measurements can be analyzed with nested linear models, mixed-effects models, and additive models. - -Time series data are data collected in such a way that there is only one observation, possibly of multiple variables, available at each point in time. This brief section introduces only the most basic aspects of time-series analysis. 
In most cases time steps are of uniform duration and occur regularly, which simplifies data handling and storage. \Rlang not only provides methods for the analysis and manipulation of time series, but also a specialized class for their storage, \Rclass{"ts"}. Regular time steps allow more compact storage---e.g., a \code{ts} object does not need to store time values for each observation, but instead just two of the three values: start time, end time and step size. When analysing time-series data it is frequently necessary to convert time data stored in one of the special \Rlang classes for temporal data and to operate on them (see section \ref{sec:data:datetime} on page \pageref{sec:data:datetime}). - -By now, you will surely have guessed that to create an object of class \Rclass{"ts"} one needs to use a constructor called \Rfunction{ts()} or a conversion constructor called \Rfunction{as.ts()}, and that you can look up the arguments they accept by consulting help using \code{help(ts)}. - -<<>>= -my.ts <- ts(1:10, start = 2019, deltat = 1/12) -@ - -The \code{print()} method for \code{ts} objects is special, and adjusts the printout according to the time step or \code{deltat} of the series. - -<<>>= -print(my.ts) -@ - -The structure of the \code{ts} object is simple. Its mode is \code{numeric} but its class is \code{ts}. It is similar to a numeric vector with the addition of one attribute named \code{tsp} describing the time steps: a numeric vector of length 3, giving the start time, the end time, and the frequency of observations per unit of time. - -<<>>= -mode(my.ts) -class(my.ts) -is.ts(my.ts) -str(my.ts) -attributes(my.ts) -@ - -Data set \Rdata{nottem}, included in \Rlang, contains meteorological data for Nottingham. The annual cycle of mean air temperatures (in degrees Fahrenheit) as well as variation among years are clear when the data are plotted. - -\begin{explainbox} -Reexpression of the temperatures in the time series from degrees Fahrenheit into degrees Celsius can be achieved as in \code{numeric} vectors using vectorized arithmetic and recycling. - -<<>>= -nottem.celcius <- (nottem - 32) * 5 / 9 -@ -\end{explainbox} - -<<>>= -is.ts(nottem.celcius) -@ - -<<>>= -opts_chunk$set(opts_fig_wide) -@ - -<<>>= -plot(nottem.celcius) -@ - -\begin{playground} -Explore the structure of the \code{nottem.celcius} object (or the \code{nottem} object), and consider how and why it differs or not from that of the object \code{my.ts} that we created above. Similarly explore time series \code{austres}, another of the data sets included in \Rlang. - -<<>>= -str(nottem.celcius) -attributes(nottem.celcius) -@ - -\end{playground} - -Many time series of observations display cyclic variation at different frequencies. Outdoors, air temperature varies cyclically between day and night and through the year. Superimposed on these regular cycles there can be faster random variation and long-term trends. One approach to the analysis of time series data is to estimate the separate contribution of these components. - -An efficient approach to time series decomposition, based on LOESS\index{time series!decomposition}\index{STL|see{time series decomposition}} (see section \ref{sec:stat:splines} on page \pageref{sec:stat:splines}), is STL (Seasonal and Trend decomposition using Loess).\qRfunction{decompose()}\qRfunction{stl()} - -A seasonal window of 7 months, the minimum accepted, allows the extraction of the annual cycles and a long-term trend, leaving the unexplained variation as a remainder. 
In the plot, it is important to be aware that the scale limits differ among panels, being re-set for each panel. - -<<>>= -opts_chunk$set(opts_fig_wide_square) -@ - -<<>>= -nottem.stl <- stl(nottem.celcius, s.window = 7) -plot(nottem.stl) -@ - -It is interesting to explore the class and structure of the object returned by \Rfunction{stl()}, as we may want to extract components. We can see that the structure of this object is rather similar to model-fit objects of classes \code{lm} and \code{glm}. - -<<>>= -class(nottem.stl) -str(nottem.stl, no.list = TRUE, give.attr = FALSE, vec.len = 2) -@ - -As with other fit methods, method \Rfunction{summary()} is available. However, this method for class \code{stl} returns the \code{stl} object received as argument and displays a summary. In other words, it behaves similarly to \code{print()} methods with respect to the returned object, but produces a different printout than \code{print()} as its side effect. - -<<>>= -summary(nottem.stl) -@ - -\begin{playground} -Consult \code{help(stl)} and \code{help(plot.stl)} and create different plots and decompositions by passing different arguments to the formal parameters of these methods. - -Method \code{print()} shows the different components. Extract the seasonal component and plot it on its own against time. -\end{playground} - -In the Nottingham temperature time series the period of the variation is clearly annual, but for many time series an interesting feature to characterise is autocorrelation and its periodicity. Function \Rfunction{acf()} computes and plots the autocorrelation function (ACF) vs.\ the lag. The time series has monthly data, while the scale for lag in the plot below is in years. The autocorrelation is one at zero lag, and slightly less with a lag of one year, while it is negative between winter and summer temperatures. - -<<>>= -opts_chunk$set(opts_fig_narrow) -@ - -<<>>= -acf(nottem) -@ - -More advanced time-series analysis and forecasting methods are beyond the scope of this book, and mostly implemented in contributed packages. The textbook \citetitle{Hyndman2021} \autocite{Hyndman2021} is comprehensive, starting with an introduction to time series and continuing all the way to the description of modern forecasting methods, using \Rlang throughout. -\index{time series|)} - -\section{Multivariate statistics}\label{sec:stat:MV} -\index{multivariate statistics|(} -All the methods presented above are univariate: even if in some cases we considered multiple explanatory variables on the rhs of model formulas, the lhs contained at most one response variable. There are many different multivariate methods available, and a few of them are implemented in base \Rlang functions. The current section does not describe these methods in depth; it only provides a few simple examples of some of the frequently used ones. - -\subsection{Multivariate analysis of variance} -\index{multivariate analysis of variance|(} -\index{MANOVA|see{multivariate analysis of variance}} -Multivariate methods take into account several response variables simultaneously, as part of a single analysis. In practice it is usual to use contributed packages for multivariate data analysis in \Rlang, except for simple cases. We will look first at \emph{multivariate} ANOVA, or MANOVA. In the same way as \Rfunction{aov()} is a wrapper that internally uses \Rfunction{lm()}, \Rfunction{manova()} is a wrapper that internally uses \Rfunction{aov()}. 
- -Multivariate model formulas in base \Rlang require the use of column binding (\code{cbind()}) on the left-hand side (lhs) of the model formula. For the next examples we use the well-known \Rdata{iris} data set, containing size measurements for flowers of three species of \emph{Iris}. - -<<>>= -mmf2 <- manova(cbind(Petal.Length, Petal.Width) ~ Species, data = iris) -anova(mmf2) -summary(mmf2) -@ - -\begin{advplayground} -Modify the example above to use \code{aov()} instead of \code{manova()} and save the result to a variable named \code{mmf3}. -Use \code{class()}, \code{attributes()}, \code{names()}, \code{str()} and extraction of members to explore objects \code{mmf1}, \code{mmf2} and \code{mmf3}. Are they different? -\end{advplayground} - -\index{multivariate analysis of variance|)} - -\subsection{Principal components analysis}\label{sec:stat:PCA} -\index{principal components analysis|(}\index{PCA|see {principal components analysis}} - -Principal components analysis (PCA) is used to simplify a data set by combining variables with similar and ``mirror'' behavior into principal components. At a later stage, we frequently try to interpret these components in relation to known and/or assumed independent variables. Base \Rlang's function \Rfunction{prcomp()} computes the principal components and accepts additional arguments for centering and scaling. - -<<>>= -pc <- prcomp(iris[c("Sepal.Length", "Sepal.Width", - "Petal.Length", "Petal.Width")], - center = TRUE, - scale = TRUE) -@ - -By printing the returned object we can see the loadings of each variable in the principal components \code{PC1} to \code{PC4}. -<<>>= -class(pc) -pc -@ - -In the summary, the rows ``Proportion of Variance'' and ``Cumulative Proportion'' are most informative of the contribution of each principal component (PC) to explaining the variation among observations. - -<<>>= -summary(pc) -@ - -<<>>= -opts_chunk$set(opts_fig_wide_square) -@ - -Method \Rfunction{biplot()} produces a plot with one principal component (PC) on each axis, plus arrows for the loadings. - -<<>>= -biplot(pc) -@ - -<<>>= -opts_chunk$set(opts_fig_narrow) -@ - -Method \code{plot()} generates a bar plot of variances corresponding to the different components. - -<<>>= -plot(pc) -@ - -Visually more elaborate plots of the principal components and their loadings can be obtained using package \pkgnameNI{ggplot2} described in chapter \ref{chap:R:plotting} on page \pageref{chap:R:plotting}. Package \pkgnameNI{ggfortify} extends \pkgnameNI{ggplot2} so as to make it easy to plot principal components and their loadings. - -\begin{playground} -For growth and morphological data, a log-transformation can be suitable given that variance is frequently proportional to the magnitude of the values measured. We leave as an exercise to repeat the above analysis using transformed values for the dimensions of petals and sepals. How much does the use of transformations change the outcome of the analysis? -\end{playground} - -\begin{advplayground} -As for other fitted models, the object returned by function \Rfunction{prcomp()} is list-like with multiple components and belongs to a class of the same name as the function, not derived from class \code{"list"}. 
- -<<>>= -class(pc) -str(pc, max.level = 1) -@ -\end{advplayground} - -\index{principal components analysis|)} - -\subsection{Multidimensional scaling}\label{sec:stat:MDS} -\index{multidimensional scaling|(}\index{MDS|see {multidimensional scaling}} - -The aim of multidimensional scaling (MDS) is to visualize in 2D space the similarity between pairs of observations. The values for the observed variable(s) are used to compute a measure of distance among pairs of observations. The nature of the data will influence what distance metric is most informative. -For MDS we start with a matrix of distances among observations. We will use, for the example, distances in kilometers between geographic locations in Europe from data set \Rdata{eurodist}. - -<<>>= -loc <- cmdscale(eurodist) -@ - -We can see that the returned object \code{loc} is a \code{matrix}, with names for one of the dimensions. - -<<>>= -class(loc) -dim(loc) -dimnames(loc) -head(loc) -@ - -To make the code easier to read, two vectors are first extracted from the matrix and named \code{x} and \code{y}. We force the aspect ratio to equality (\code{asp = 1}) so that distances along both axes are comparable. - -<<>>= -opts_chunk$set(opts_fig_wide_square) -@ - -<<>>= -x <- loc[, 1] -y <- -loc[, 2] # change sign so North is at the top -plot(x, y, type = "n", asp = 1, - main = "cmdscale(eurodist)") -text(x, y, rownames(loc), cex = 0.6) -@ - -\begin{advplayground} - Find data on the mean annual temperature, mean annual rainfall and mean number of sunny days at each of the locations in the \code{eurodist} data set. Next, compute suitable distance metrics, for example, using function \Rfunction{dist()}. Finally, use MDS to visualize how similar the locations are with respect to each of the three variables. Devise a measure of distance that takes into account the three climate variables and use MDS to find how distant the different locations are. -\end{advplayground} - -\index{multidimensional scaling|)} - -\subsection{Cluster analysis}\label{sec:stat:cluster} -\index{cluster analysis|(} - -In cluster analysis, the aim is to group observations into discrete groups with maximal internal homogeneity and maximum group-to-group differences. In the next example we use function \Rfunction{hclust()} from the base-\Rlang package \pkgname{stats}. We use, as above, the \Rdata{eurodist} data, which directly provide distances. In other cases a matrix of distances between pairs of observations needs to be calculated first with function \Rfunction{dist()}, which supports several methods. - -<<>>= -hc <- hclust(eurodist) -print(hc) -@ - -<<>>= -plot(hc) -@ - -We can use \Rfunction{cutree()} to limit the number of clusters by directly passing as an argument the desired number of clusters or the height at which to cut the tree. - -<<>>= -cutree(hc, k = 5) -@ - -The object returned by \Rfunction{hclust()} contains details of the result of the clustering, which allows further manipulation and plotting. -<<>>= -str(hc) -@ - -\index{cluster analysis|)} - -%\subsection{Discriminant analysis}\label{sec:stat:DA} -%\index{discriminant analysis|(} -% -%In discriminant analysis the categories or groups to which objects belong are known \emph{a priori} for a training data set. The aim is to fit/build a classifier that will allow us to assign future observations to the different non-overlapping groups with as few mistakes as possible. 
-% -% -%\index{discriminant analysis|)} -\index{multivariate statistics|)} - -\section{Further reading}\label{sec:stat:further:reading} - -\Rlang\index{further reading!statistics with R} and its extension packages provide implementations of most known statistical methods. For some methods, alternative implementations exist in different packages. The present chapter only attempts to show how some of the most frequently used implementations are used, as this knowledge is frequently taken for granted in specialized books, several of which I list here. Two recent textbooks on statistics, following a modern approach and using \Rlang for examples, are \citetitle{Diez2019} \autocite{Diez2019} and \citetitle{Holmes2019} \autocite{Holmes2019}. They differ in the subjects emphasized, with the second one focusing more on genetic and molecular biology. Three examples of books introducing statistical computations in \Rlang are \citetitle{Dalgaard2008} \autocite{Dalgaard2008}, \citetitle{Everitt2010} \autocite{Everitt2010} and \citetitle{Zuur2009} \autocite{Zuur2009}. The book \citetitle{Mehtatalo2020} \autocite{Mehtatalo2020} presents both the statistical theory and code examples. The comprehensive \citebooktitle{Crawley2012} \autocite{Crawley2012} and the classic reference \citebooktitle{Venables2002} \autocite{Venables2002} both present statistical theory in parallel with \Rlang code examples. More specific books are also available, from which a few suggestions for further reading are \citebooktitle{Everitt2011} \autocite{Everitt2011}, \citebooktitle{Faraway2004} \autocite{Faraway2004}, \citebooktitle{Faraway2006} \autocite{Faraway2006}, \citebooktitle{Hyndman2021} \autocite{Hyndman2021}, \citebooktitle{James2013} \autocite{James2013}, \citebooktitle{Pinheiro2000} \autocite{Pinheiro2000} and \citebooktitle{Wood2017} \autocite{Wood2017}. 
- -<>= -knitter_diag() -R_diag() -other_diag() -@ ->>>>>>> Stashed changes diff --git a/appendixes.prj b/appendixes.prj index f07e411d..5f2cf8b0 100644 --- a/appendixes.prj +++ b/appendixes.prj @@ -1,189 +1,88 @@ -<<<<<<< Updated upstream -36 Patch Control - -1 -1 -1 -using-r-main-crc.Rnw -30 -17 -6 - -rbooks.bib -BibTeX -1049586 2 878 52 878 55 38 38 1429 959 1 1 979 646 -1 -1 0 0 21 0 0 21 1 0 55 878 0 -1 0 -references.bib -BibTeX -1049586 2 219 39 219 42 76 76 1467 997 1 1 784 748 -1 -1 0 0 23 0 0 23 1 0 42 219 0 -1 0 -R.intro.Rnw -TeX:RNW -17838075 2 -1 25768 -1 25771 228 228 1214 1286 1 1 694 646 -1 -1 0 0 30 -1 -1 30 1 0 25771 -1 0 -1 0 -preface.Rnw -TeX:RNW -1060859 2 -1 14693 -1 14696 266 266 1252 1324 1 1 1144 1258 -1 -1 0 0 18 -1 -1 18 1 0 14696 -1 0 -1 0 -R.data.containers.Rnw -TeX:RNW -17838075 2 -1 83677 -1 83680 190 190 1324 1242 1 1 739 1292 -1 -1 0 0 31 -1 -1 31 3 0 83680 -1 1 192 -1 2 192 -1 0 -1 0 -R.scripts.Rnw -TeX:RNW -17838075 2 -1 75273 -1 75200 152 152 1138 1210 1 1 169 646 -1 -1 0 0 31 -1 -1 31 3 0 75200 -1 1 58097 -1 2 65929 -1 0 -1 0 -R.stats.rnw -TeX:RNW -286273531 0 -1 118669 -1 118674 418 418 1404 1476 1 1 949 680 -1 -1 0 0 31 -1 -1 31 1 0 118674 -1 0 -1 0 -R.data.io.Rnw -TeX:RNW -17838075 2 -1 75305 -1 75308 494 494 1480 1522 1 1 799 646 -1 -1 0 0 31 -1 -1 31 1 0 75308 -1 0 -1 0 -R.data.Rnw -TeX:RNW -17838075 1 -1 56413 -1 56430 342 342 1328 1400 1 1 1279 442 -1 -1 0 0 31 -1 -1 31 1 0 56430 -1 0 -1 0 -using-r-main-crc.Rnw -TeX:RNW:UTF-8 -134217730 0 190 13 190 17 6074 -1 6522 208 1 1 409 646 1 906 254 255 -1 0 0 33 1 0 17 190 0 -1 0 -R.plotting.Rnw -TeX:RNW -17838075 2 -1 40487 -1 40490 380 380 1366 1438 1 1 739 476 -1 -1 0 0 31 -1 -1 31 1 0 40490 -1 0 -1 0 -using-r-main-crc.ind -TeX:AUX:UNIX -1159154 7 824 33 824 39 308 308 1684 1215 1 0 715 646 -1 -1 0 0 17 0 0 17 1 0 39 824 0 0 0 -usingr.sty -TeX:STY -1060850 2 148 2 148 5 190 190 1176 1248 1 0 205 646 -1 -1 0 0 25 0 0 25 1 0 5 148 0 0 0 -abbrev.sty -TeX:STY -1060850 0 0 1 0 1 88 88 1471 1018 1 0 145 0 -1 -1 0 0 2 0 0 2 1 0 1 0 0 0 0 -R.functions.Rnw -TeX:RNW -17838075 0 -1 11716 -1 11747 456 456 1442 1484 1 1 1624 -1462 -1 -1 0 0 31 -1 -1 31 1 0 11747 -1 0 -1 0 -R.as.calculator.Rnw -TeX:RNW -17838075 1 -1 93302 -1 93316 190 190 1324 1242 1 1 1129 544 -1 -1 0 0 31 -1 -1 31 3 0 93316 -1 1 122277 -1 2 16955 -1 0 -1 0 -R.learning.Rnw -TeX:RNW -17838075 0 -1 12615 -1 12622 304 304 1290 1362 1 1 1549 646 -1 -1 0 0 40 -1 -1 40 1 0 12622 -1 0 -1 0 -using-r-main-crc.log -DATA:UNIX -307331314 2 7544 16 7544 19 440 440 1840 1347 1 0 415 646 -1 -1 0 0 117 0 0 117 1 0 19 7544 0 0 0 -using-r-main-crc.tex -TeX -269496315 2 -1 4070 -1 4073 0 0 1400 907 1 1 409 1292 -1 -1 0 0 49 -1 -1 49 1 0 4073 -1 0 -1 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_Public.dat -DATA -273678578 0 0 1 0 1 264 264 1647 1229 1 0 145 0 -1 -1 0 0 100 0 0 100 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_DataTableInfo.dat -DATA -273678578 0 0 1 0 1 308 308 1691 1273 1 0 145 0 -1 -1 0 0 107 0 0 107 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_Status.dat -DATA -273678578 0 0 1 0 1 352 352 1735 1317 1 0 145 0 -1 -1 0 0 100 0 0 100 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_TableHour.dat -DATA -273678578 0 0 1 0 1 396 396 1779 1361 1 0 145 0 -1 -1 0 0 103 0 0 103 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki 
Tower_TableDay.dat -DATA -273678578 0 0 1 0 1 440 440 1823 1405 1 0 145 0 -1 -1 0 0 102 0 0 102 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_CPIStatus.dat -DATA -273678578 0 0 1 0 1 0 0 1383 930 1 0 145 0 -1 -1 0 0 103 0 0 103 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_TableMinute.dat -DATA -273678578 0 0 1 0 1 44 44 1427 974 1 0 145 0 -1 -1 0 0 105 0 0 105 1 0 1 0 0 0 0 -knitr.sty -TeX:STY -269496306 0 29 1 29 1 380 380 2073 1301 1 0 145 918 -1 -1 0 0 45 0 0 45 1 0 1 29 0 0 0 -R.as.calculator.tex -TeX -269496315 0 -1 0 -1 0 76 76 1769 997 1 1 148 0 -1 -1 0 0 -1 -1 -1 -1 1 0 0 -1 0 0 0 -krantz.cls -TeX:STY:UNIX -269594610 1 194 1 194 36 76 76 1062 1134 1 0 619 625 -1 -1 0 0 3 0 0 3 1 0 36 194 0 0 0 -:\Users\Public\Desktop\Lenovo ThinkColor.lnk -DATA -273744114 0 0 1 2 1 190 190 1549 1111 1 0 129 50 -1 -1 0 0 -1 -1 -1 -1 1 0 1 2 0 0 0 - -*using-r-main-crc.Rnw -> -*rbooks.bib -*references.bib -*preface.Rnw -*R.learning.Rnw -*R.intro.Rnw -*R.as.calculator.Rnw -*R.data.containers.Rnw -*R.scripts.Rnw -*R.functions.Rnw -*R.stats.Rnw -*R.data.Rnw -*R.plotting.Rnw -*R.data.io.Rnw -< -======= 41 Patch Control 1 1 1 using-r-main-crc.Rnw -22 -16 -8 +28 +20 +19 rbooks.bib -BibTeX -1049586 2 878 52 878 55 38 38 1429 959 1 1 848 783 -1 -1 0 0 21 0 0 21 1 0 55 878 0 -1 0 +BibTeX:UNIX +1147890 0 959 1 959 1 38 38 1429 959 1 1 146 1334 -1 -1 0 0 1 1 1 1 1 0 1 959 0 -1 0 references.bib -BibTeX -1049586 2 219 39 219 42 76 76 1467 997 1 1 679 783 -1 -1 0 0 23 0 0 23 1 0 42 219 0 -1 0 +BibTeX:UNIX +1147890 0 0 1 0 1 76 76 1467 997 1 1 146 0 -1 -1 0 0 1 1 1 1 1 0 1 0 0 -1 0 R.intro.Rnw TeX:RNW -17838075 2 -1 36492 -1 36496 228 228 1214 1286 1 1 1017 1363 -1 -1 0 0 30 -1 -1 30 1 0 36496 -1 0 -1 0 +17837307 0 -1 0 -1 0 228 228 1358 1175 1 1 146 0 -1 -1 0 0 -1 -1 -1 -1 1 0 0 -1 0 -1 0 preface.Rnw -TeX:RNW -1060859 2 -1 13719 -1 13723 266 266 1252 1324 1 1 1212 870 -1 -1 0 0 18 -1 -1 18 1 0 13723 -1 0 -1 0 +TeX:RNW:UNIX +1159163 0 326 34 -1 14696 266 266 1252 1324 1 1 614 928 -1 -1 0 0 18 -1 -1 18 1 0 14696 -1 0 -1 0 R.data.containers.Rnw +TeX:RNW:UNIX +17936379 0 -1 82947 -1 82947 190 190 1324 1242 1 1 146 812 -1 -1 0 0 82947 -1 -1 82947 3 0 82947 -1 1 0 -1 2 0 -1 0 -1 0 +R.data.containers-2.Rnw TeX:RNW -17838075 2 -1 83677 -1 83680 190 190 1324 1242 1 1 640 928 -1 -1 0 0 31 -1 -1 31 3 0 83680 -1 1 192 -1 2 192 -1 0 -1 0 +17837307 0 -1 0 -1 0 266 266 1396 1213 1 1 146 0 -1 -1 0 0 -1 -1 -1 -1 1 0 0 -1 0 -1 0 R.scripts.Rnw +TeX:RNW:UNIX +17936379 0 -1 110983 -1 0 152 152 1138 1210 1 1 146 0 -1 -1 0 0 110983 -1 -1 110983 3 0 0 -1 1 0 -1 2 68 411 0 -1 0 +R.scripts-2.Rnw TeX:RNW -17838075 0 -1 98190 -1 98669 152 152 1138 1210 1 1 341 232 -1 -1 0 0 31 -1 -1 31 3 0 98669 -1 1 58097 -1 2 65929 -1 0 -1 0 +17837307 0 -1 0 -1 0 304 304 1434 1251 1 1 146 0 -1 -1 0 0 -1 -1 -1 -1 1 0 0 -1 0 -1 0 R.stats.rnw +TeX:RNW:UNIX +17936379 0 -1 127497 -1 127497 418 418 1404 1476 1 1 146 1334 -1 -1 0 0 127497 -1 -1 127497 1 0 127497 -1 0 -1 0 +R.stats-2.rnw TeX:RNW -17838075 0 1703 169 -1 118674 418 418 1404 1476 1 1 835 1392 -1 -1 0 0 31 -1 -1 31 1 0 118674 -1 0 -1 0 +17837307 0 -1 0 -1 0 342 342 1472 1289 1 1 146 0 -1 -1 0 0 -1 -1 -1 -1 1 0 0 -1 0 -1 0 R.data.io.Rnw -TeX:RNW -17838075 2 -1 75305 -1 75308 494 494 1480 1522 1 1 692 899 -1 -1 0 0 31 -1 -1 31 1 0 75308 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 75244 -1 75244 494 494 1480 1522 1 1 146 899 -1 -1 0 0 31 -1 -1 31 1 0 75244 -1 0 -1 0 R.data.Rnw -TeX:RNW -286273531 0 -1 13452 
-1 17241 342 342 1328 1400 1 1 406 638 -1 -1 0 0 31 -1 -1 31 1 0 17241 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 1618 66 -1 12290 342 342 1328 1400 1 1 939 203 -1 -1 0 0 31 -1 -1 31 1 0 12290 -1 0 -1 0 using-r-main-crc.Rnw TeX:RNW:UTF-8 -134217730 0 190 13 2 19 6074 -1 6522 208 1 1 380 58 1 906 254 255 -1 0 0 33 1 0 19 2 0 -1 0 +134217730 0 190 13 197 1 6074 -1 6522 208 1 1 146 290 1 906 254 255 -1 0 0 33 1 0 1 197 0 -1 0 R.plotting.Rnw -TeX:RNW -17838075 2 -1 40487 -1 40490 380 380 1366 1438 1 1 640 928 -1 -1 0 0 31 -1 -1 31 1 0 40490 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 249401 -1 249401 380 380 1366 1438 1 1 146 1334 -1 -1 0 0 249401 -1 -1 249401 1 0 249401 -1 0 -1 0 usingr.sty TeX:STY -1060850 1 93 21 93 31 190 190 1176 1248 1 0 515 986 -1 -1 0 0 25 0 0 25 1 0 31 93 0 0 0 +1158386 0 0 1 0 1 190 190 1176 1248 1 0 125 0 -1 -1 0 0 1 1 1 1 1 0 1 0 0 0 0 +usingr-2.sty +TeX +1060091 0 -1 0 -1 11649 152 152 1282 1099 1 0 125 1305 -1 -1 0 0 -1 -1 -1 -1 1 0 11649 -1 0 0 0 abbrev.sty TeX:STY -1060850 0 0 1 0 1 88 88 1471 1018 1 0 125 0 -1 -1 0 0 2 0 0 2 1 0 1 0 0 0 0 +1060850 0 0 1 0 1 88 88 1471 1018 1 0 125 -5829 -1 -1 0 0 2 0 0 2 1 0 1 0 0 0 0 R.functions.Rnw -TeX:RNW -17838075 0 -1 45813 -1 45813 456 456 1442 1484 1 1 952 957 -1 -1 0 0 31 -1 -1 31 1 0 45813 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 11716 -1 11747 456 456 1442 1484 1 1 562 -19111 -1 -1 0 0 31 -1 -1 31 1 0 11747 -1 0 -1 0 R.as.calculator.Rnw -TeX:RNW -17838075 1 -1 93302 -1 93316 190 190 1324 1242 1 1 978 928 -1 -1 0 0 31 -1 -1 31 3 0 93316 -1 1 122277 -1 2 16955 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 93302 -1 93316 190 190 1324 1242 1 1 159 928 -1 -1 0 0 31 -1 -1 31 3 0 93316 -1 1 146577 -1 2 58 486 0 -1 0 R.learning.Rnw TeX:RNW -17838075 0 -1 15791 -1 15794 304 304 1290 1362 1 1 718 812 -1 -1 0 0 40 -1 -1 40 1 0 15794 -1 0 -1 0 -:\Program Files\MiKTeX\tex\latex\unicode-math\unicode-math.sty -TeX:STY:LaTeX3:UNIX -269593842 7 40 1 39 1 228 228 1358 1175 1 0 125 1131 -1 -1 0 0 -1 -1 -1 -1 1 0 1 39 0 0 0 +17837307 0 -1 0 -1 309 190 190 1320 1137 1 1 770 174 -1 -1 0 0 -1 -1 -1 -1 1 0 309 -1 0 -1 0 using-r-main-crc.ind TeX:AUX:UNIX -269594610 7 824 33 824 39 308 308 1684 1215 1 0 619 783 -1 -1 0 0 17 0 0 17 1 0 39 824 0 0 0 +269594610 0 824 33 824 39 308 308 1684 1215 1 0 619 638 -1 -1 0 0 17 0 0 17 1 0 39 824 0 0 0 +R.intro-2.Rnw +TeX:RNW +286272763 0 -1 0 -1 0 228 228 1358 1175 1 1 146 0 -1 -1 0 0 -1 -1 -1 -1 1 0 0 -1 0 -1 0 +R.learning-2.Rnw +TeX:RNW +286272763 0 -1 0 -1 0 190 190 1320 1137 1 1 146 0 -1 -1 0 0 -1 -1 -1 -1 1 0 0 -1 0 -1 0 using-r-main-crc.log DATA:UNIX 307331314 2 7544 16 7544 19 440 440 1840 1347 1 0 415 646 -1 -1 0 0 117 0 0 117 1 0 19 7544 0 0 0 +using-r-main-crc.tex +TeX +269496315 2 -1 4070 -1 4073 0 0 1400 907 1 1 409 1292 -1 -1 0 0 49 -1 -1 49 1 0 4073 -1 0 -1 0 knitr.sty TeX:STY 269496306 0 29 1 29 1 380 380 2073 1301 1 0 145 918 -1 -1 0 0 45 0 0 45 1 0 1 29 0 0 0 @@ -210,4 +109,3 @@ TeX:STY:UNIX *R.plotting.Rnw *R.data.io.Rnw < ->>>>>>> Stashed changes diff --git a/appendixes.prj.bak b/appendixes.prj.bak index f07e411d..f3b6b6e6 100644 --- a/appendixes.prj.bak +++ b/appendixes.prj.bak @@ -1,189 +1,76 @@ -<<<<<<< Updated upstream -36 Patch Control - -1 -1 -1 -using-r-main-crc.Rnw -30 -17 -6 - -rbooks.bib -BibTeX -1049586 2 878 52 878 55 38 38 1429 959 1 1 979 646 -1 -1 0 0 21 0 0 21 1 0 55 878 0 -1 0 -references.bib -BibTeX -1049586 2 219 39 219 42 76 76 1467 997 1 1 784 748 -1 -1 0 0 23 0 0 23 1 0 42 219 0 -1 0 -R.intro.Rnw -TeX:RNW -17838075 2 -1 25768 -1 25771 228 228 1214 1286 1 1 694 646 -1 -1 0 0 30 -1 -1 
30 1 0 25771 -1 0 -1 0 -preface.Rnw -TeX:RNW -1060859 2 -1 14693 -1 14696 266 266 1252 1324 1 1 1144 1258 -1 -1 0 0 18 -1 -1 18 1 0 14696 -1 0 -1 0 -R.data.containers.Rnw -TeX:RNW -17838075 2 -1 83677 -1 83680 190 190 1324 1242 1 1 739 1292 -1 -1 0 0 31 -1 -1 31 3 0 83680 -1 1 192 -1 2 192 -1 0 -1 0 -R.scripts.Rnw -TeX:RNW -17838075 2 -1 75273 -1 75200 152 152 1138 1210 1 1 169 646 -1 -1 0 0 31 -1 -1 31 3 0 75200 -1 1 58097 -1 2 65929 -1 0 -1 0 -R.stats.rnw -TeX:RNW -286273531 0 -1 118669 -1 118674 418 418 1404 1476 1 1 949 680 -1 -1 0 0 31 -1 -1 31 1 0 118674 -1 0 -1 0 -R.data.io.Rnw -TeX:RNW -17838075 2 -1 75305 -1 75308 494 494 1480 1522 1 1 799 646 -1 -1 0 0 31 -1 -1 31 1 0 75308 -1 0 -1 0 -R.data.Rnw -TeX:RNW -17838075 1 -1 56413 -1 56430 342 342 1328 1400 1 1 1279 442 -1 -1 0 0 31 -1 -1 31 1 0 56430 -1 0 -1 0 -using-r-main-crc.Rnw -TeX:RNW:UTF-8 -134217730 0 190 13 190 17 6074 -1 6522 208 1 1 409 646 1 906 254 255 -1 0 0 33 1 0 17 190 0 -1 0 -R.plotting.Rnw -TeX:RNW -17838075 2 -1 40487 -1 40490 380 380 1366 1438 1 1 739 476 -1 -1 0 0 31 -1 -1 31 1 0 40490 -1 0 -1 0 -using-r-main-crc.ind -TeX:AUX:UNIX -1159154 7 824 33 824 39 308 308 1684 1215 1 0 715 646 -1 -1 0 0 17 0 0 17 1 0 39 824 0 0 0 -usingr.sty -TeX:STY -1060850 2 148 2 148 5 190 190 1176 1248 1 0 205 646 -1 -1 0 0 25 0 0 25 1 0 5 148 0 0 0 -abbrev.sty -TeX:STY -1060850 0 0 1 0 1 88 88 1471 1018 1 0 145 0 -1 -1 0 0 2 0 0 2 1 0 1 0 0 0 0 -R.functions.Rnw -TeX:RNW -17838075 0 -1 11716 -1 11747 456 456 1442 1484 1 1 1624 -1462 -1 -1 0 0 31 -1 -1 31 1 0 11747 -1 0 -1 0 -R.as.calculator.Rnw -TeX:RNW -17838075 1 -1 93302 -1 93316 190 190 1324 1242 1 1 1129 544 -1 -1 0 0 31 -1 -1 31 3 0 93316 -1 1 122277 -1 2 16955 -1 0 -1 0 -R.learning.Rnw -TeX:RNW -17838075 0 -1 12615 -1 12622 304 304 1290 1362 1 1 1549 646 -1 -1 0 0 40 -1 -1 40 1 0 12622 -1 0 -1 0 -using-r-main-crc.log -DATA:UNIX -307331314 2 7544 16 7544 19 440 440 1840 1347 1 0 415 646 -1 -1 0 0 117 0 0 117 1 0 19 7544 0 0 0 -using-r-main-crc.tex -TeX -269496315 2 -1 4070 -1 4073 0 0 1400 907 1 1 409 1292 -1 -1 0 0 49 -1 -1 49 1 0 4073 -1 0 -1 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_Public.dat -DATA -273678578 0 0 1 0 1 264 264 1647 1229 1 0 145 0 -1 -1 0 0 100 0 0 100 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_DataTableInfo.dat -DATA -273678578 0 0 1 0 1 308 308 1691 1273 1 0 145 0 -1 -1 0 0 107 0 0 107 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_Status.dat -DATA -273678578 0 0 1 0 1 352 352 1735 1317 1 0 145 0 -1 -1 0 0 100 0 0 100 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_TableHour.dat -DATA -273678578 0 0 1 0 1 396 396 1779 1361 1 0 145 0 -1 -1 0 0 103 0 0 103 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_TableDay.dat -DATA -273678578 0 0 1 0 1 440 440 1823 1405 1 0 145 0 -1 -1 0 0 102 0 0 102 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_CPIStatus.dat -DATA -273678578 0 0 1 0 1 0 0 1383 930 1 0 145 0 -1 -1 0 0 103 0 0 103 1 0 1 0 0 0 0 -:\Users\aphalo_2\Documents\RAnalyses\RFR-Viikki-field\data-2023-09-12\Viikki Tower_TableMinute.dat -DATA -273678578 0 0 1 0 1 44 44 1427 974 1 0 145 0 -1 -1 0 0 105 0 0 105 1 0 1 0 0 0 0 -knitr.sty -TeX:STY -269496306 0 29 1 29 1 380 380 2073 1301 1 0 145 918 -1 -1 0 0 45 0 0 45 1 0 1 29 0 0 0 -R.as.calculator.tex -TeX -269496315 
0 -1 0 -1 0 76 76 1769 997 1 1 148 0 -1 -1 0 0 -1 -1 -1 -1 1 0 0 -1 0 0 0 -krantz.cls -TeX:STY:UNIX -269594610 1 194 1 194 36 76 76 1062 1134 1 0 619 625 -1 -1 0 0 3 0 0 3 1 0 36 194 0 0 0 -:\Users\Public\Desktop\Lenovo ThinkColor.lnk -DATA -273744114 0 0 1 2 1 190 190 1549 1111 1 0 129 50 -1 -1 0 0 -1 -1 -1 -1 1 0 1 2 0 0 0 - -*using-r-main-crc.Rnw -> -*rbooks.bib -*references.bib -*preface.Rnw -*R.learning.Rnw -*R.intro.Rnw -*R.as.calculator.Rnw -*R.data.containers.Rnw -*R.scripts.Rnw -*R.functions.Rnw -*R.stats.Rnw -*R.data.Rnw -*R.plotting.Rnw -*R.data.io.Rnw -< -======= 41 Patch Control 1 1 1 using-r-main-crc.Rnw -22 -16 -8 +24 +18 +2 rbooks.bib -BibTeX -1049586 2 878 52 878 55 38 38 1429 959 1 1 848 783 -1 -1 0 0 21 0 0 21 1 0 55 878 0 -1 0 +BibTeX:UNIX +1147890 0 959 1 959 1 38 38 1429 959 1 1 146 1334 -1 -1 0 0 1 1 1 1 1 0 1 959 0 -1 0 references.bib -BibTeX -1049586 2 219 39 219 42 76 76 1467 997 1 1 679 783 -1 -1 0 0 23 0 0 23 1 0 42 219 0 -1 0 +BibTeX:UNIX +1147890 0 0 1 0 1 76 76 1467 997 1 1 146 0 -1 -1 0 0 1 1 1 1 1 0 1 0 0 -1 0 R.intro.Rnw -TeX:RNW -17838075 2 -1 36492 -1 36496 228 228 1214 1286 1 1 1017 1363 -1 -1 0 0 30 -1 -1 30 1 0 36496 -1 0 -1 0 +TeX:RNW:UNIX +286371835 0 -1 25616 -1 25619 228 228 1214 1286 1 1 341 10643 -1 -1 0 0 31 0 0 31 1 0 25619 -1 0 -1 0 preface.Rnw -TeX:RNW -1060859 2 -1 13719 -1 13723 266 266 1252 1324 1 1 1212 870 -1 -1 0 0 18 -1 -1 18 1 0 13723 -1 0 -1 0 +TeX:RNW:UNIX +1159163 0 326 34 -1 14696 266 266 1252 1324 1 1 614 928 -1 -1 0 0 18 -1 -1 18 1 0 14696 -1 0 -1 0 R.data.containers.Rnw -TeX:RNW -17838075 2 -1 83677 -1 83680 190 190 1324 1242 1 1 640 928 -1 -1 0 0 31 -1 -1 31 3 0 83680 -1 1 192 -1 2 192 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 82948 -1 82948 190 190 1324 1242 1 1 146 928 -1 -1 0 0 31 -1 -1 31 3 0 82948 -1 1 192 -1 2 192 -1 0 -1 0 R.scripts.Rnw -TeX:RNW -17838075 0 -1 98190 -1 98669 152 152 1138 1210 1 1 341 232 -1 -1 0 0 31 -1 -1 31 3 0 98669 -1 1 58097 -1 2 65929 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 75205 -1 75205 152 152 1138 1210 1 1 458 928 -1 -1 0 0 31 -1 -1 31 3 0 75205 -1 1 77491 -1 2 68 1976 0 -1 0 R.stats.rnw -TeX:RNW -17838075 0 1703 169 -1 118674 418 418 1404 1476 1 1 835 1392 -1 -1 0 0 31 -1 -1 31 1 0 118674 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 118590 -1 118595 418 418 1404 1476 1 1 302 1363 -1 -1 0 0 31 -1 -1 31 1 0 118595 -1 0 -1 0 R.data.io.Rnw -TeX:RNW -17838075 2 -1 75305 -1 75308 494 494 1480 1522 1 1 692 899 -1 -1 0 0 31 -1 -1 31 1 0 75308 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 75244 -1 75244 494 494 1480 1522 1 1 146 899 -1 -1 0 0 31 -1 -1 31 1 0 75244 -1 0 -1 0 R.data.Rnw -TeX:RNW -286273531 0 -1 13452 -1 17241 342 342 1328 1400 1 1 406 638 -1 -1 0 0 31 -1 -1 31 1 0 17241 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 1618 66 -1 12290 342 342 1328 1400 1 1 939 203 -1 -1 0 0 31 -1 -1 31 1 0 12290 -1 0 -1 0 using-r-main-crc.Rnw TeX:RNW:UTF-8 -134217730 0 190 13 2 19 6074 -1 6522 208 1 1 380 58 1 906 254 255 -1 0 0 33 1 0 19 2 0 -1 0 +134217730 0 190 13 197 1 6074 -1 6522 208 1 1 146 290 1 906 254 255 -1 0 0 33 1 0 1 197 0 -1 0 R.plotting.Rnw -TeX:RNW -17838075 2 -1 40487 -1 40490 380 380 1366 1438 1 1 640 928 -1 -1 0 0 31 -1 -1 31 1 0 40490 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 40413 -1 40416 380 380 1366 1438 1 1 588 928 -1 -1 0 0 31 -1 -1 31 1 0 40416 -1 0 -1 0 +using-r-main-crc.ind +TeX:AUX:UNIX +1159154 0 824 33 824 39 308 308 1684 1215 1 0 619 638 -1 -1 0 0 17 0 0 17 1 0 39 824 0 0 0 usingr.sty TeX:STY -1060850 1 93 21 93 31 190 190 1176 1248 1 0 515 986 -1 -1 0 0 25 0 0 25 1 0 31 93 0 0 0 +1158386 0 0 1 0 1 
190 190 1176 1248 1 0 125 0 -1 -1 0 0 1 1 1 1 1 0 1 0 0 0 0 +usingr-2.sty +TeX +1060091 0 -1 0 -1 11649 152 152 1282 1099 1 0 125 1305 -1 -1 0 0 -1 -1 -1 -1 1 0 11649 -1 0 0 0 abbrev.sty TeX:STY -1060850 0 0 1 0 1 88 88 1471 1018 1 0 125 0 -1 -1 0 0 2 0 0 2 1 0 1 0 0 0 0 +1060850 0 0 1 0 1 88 88 1471 1018 1 0 125 -5829 -1 -1 0 0 2 0 0 2 1 0 1 0 0 0 0 R.functions.Rnw -TeX:RNW -17838075 0 -1 45813 -1 45813 456 456 1442 1484 1 1 952 957 -1 -1 0 0 31 -1 -1 31 1 0 45813 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 11716 -1 11747 456 456 1442 1484 1 1 562 -19111 -1 -1 0 0 31 -1 -1 31 1 0 11747 -1 0 -1 0 R.as.calculator.Rnw -TeX:RNW -17838075 1 -1 93302 -1 93316 190 190 1324 1242 1 1 978 928 -1 -1 0 0 31 -1 -1 31 3 0 93316 -1 1 122277 -1 2 16955 -1 0 -1 0 +TeX:RNW:UNIX +17936379 0 -1 93302 -1 93316 190 190 1324 1242 1 1 159 928 -1 -1 0 0 31 -1 -1 31 3 0 93316 -1 1 146577 -1 2 58 486 0 -1 0 R.learning.Rnw TeX:RNW -17838075 0 -1 15791 -1 15794 304 304 1290 1362 1 1 718 812 -1 -1 0 0 40 -1 -1 40 1 0 15794 -1 0 -1 0 -:\Program Files\MiKTeX\tex\latex\unicode-math\unicode-math.sty -TeX:STY:LaTeX3:UNIX -269593842 7 40 1 39 1 228 228 1358 1175 1 0 125 1131 -1 -1 0 0 -1 -1 -1 -1 1 0 1 39 0 0 0 -using-r-main-crc.ind -TeX:AUX:UNIX -269594610 7 824 33 824 39 308 308 1684 1215 1 0 619 783 -1 -1 0 0 17 0 0 17 1 0 39 824 0 0 0 +17837307 0 -1 0 -1 309 190 190 1320 1137 1 1 770 174 -1 -1 0 0 -1 -1 -1 -1 1 0 309 -1 0 -1 0 +R.learning-2.Rnw +TeX:RNW +286272763 0 -1 0 -1 0 190 190 1320 1137 1 1 146 0 -1 -1 0 0 -1 -1 -1 -1 1 0 0 -1 0 -1 0 using-r-main-crc.log DATA:UNIX 307331314 2 7544 16 7544 19 440 440 1840 1347 1 0 415 646 -1 -1 0 0 117 0 0 117 1 0 19 7544 0 0 0 +using-r-main-crc.tex +TeX +269496315 2 -1 4070 -1 4073 0 0 1400 907 1 1 409 1292 -1 -1 0 0 49 -1 -1 49 1 0 4073 -1 0 -1 0 knitr.sty TeX:STY 269496306 0 29 1 29 1 380 380 2073 1301 1 0 145 918 -1 -1 0 0 45 0 0 45 1 0 1 29 0 0 0 @@ -210,4 +97,3 @@ TeX:STY:UNIX *R.plotting.Rnw *R.data.io.Rnw < ->>>>>>> Stashed changes diff --git a/rbooks.bib b/rbooks.bib index de30b9c8..d119707f 100644 --- a/rbooks.bib +++ b/rbooks.bib @@ -1,4 +1,3 @@ -<<<<<<< Updated upstream @book{Beckerman2012, address = {Oxford}, author = {Beckerman, Andrew P. and Petchey, Owen L.}, @@ -958,964 +957,3 @@ @Book{Wickham2023 modificationdate = {2023-09-28T18:57:43}, subtitle = {Organize, Test, Document, and Share Your Code}, } -======= -@book{Beckerman2012, -address = {Oxford}, -author = {Beckerman, Andrew P. and Petchey, Owen L.}, -isbn = {0199601623}, -keywords = {R}, -mendeley-tags = {R}, -pages = {128}, -publisher = {Oxford University Press}, -title = {{Getting Started with R: An introduction for biologists}}, -year = {2012} -} -@book{Bolker2008, -author = {Bolker, Benjamin M}, -edition = {508}, -isbn = {0691125228}, -keywords = { ecological models,R}, -publisher = {Princeton University Press}, -title = {{Ecological Models and Data in R}}, -year = {2008} -} -@book{Borcard2011, -author = {Borcard, Daniel and Gillet, Francois and Legendre, Pierre}, -isbn = {1441979751}, -pages = {312}, -publisher = {Springer}, -title = {{Numerical Ecology with R}}, -year = {2011} -} -@misc{Brooms2010, -author = {Brooms, A. 
C.}, -booktitle = {Journal of Applied Statistics}, -doi = {10.1080/02664760903075531}, -isbn = {9780387747309}, -issn = {0266-4763}, -number = {12}, -pages = {2121--2121}, -pmid = {15159452}, -title = {{Data Manipulation with R}}, -volume = {37}, -year = {2010} -} -@book{Chambers2009, -author = {Chambers, John}, -isbn = {0387759352}, -pages = {498}, -publisher = {Springer}, -title = {{Software for Data Analysis: Programming with R (Statistics and Computing)}}, -year = {2009} -} -@book{Chang2013, -address = {Sebastopol}, -author = {Chang, Winston}, -edition = {1-2}, -isbn = {9781449316952}, -keywords = {Graphics,Plotting,R,ggplot2}, -pages = {413}, -publisher = {O'Reilly Media}, -title = {{R Graphics Cookbook}}, -year = {2013} -} -@book{Crawley2007, -author = {Crawley, Michael J.}, -isbn = {0470510242}, -pages = {950}, -publisher = {Wiley}, -title = {{The R Book}}, -year = {2007} -} -@book{Crawley2002, -address = {Chichester}, -author = {Crawley, Michael J.}, -isbn = {0-471-56040-5}, -keywords = {R,S-Plus,Statistics,statistics}, -mendeley-tags = {R,Statistics}, -pages = {x + 761}, -publisher = {Wiley}, -title = {{Statistical Computing: An Introduction to Data Analysis using \{S\}-Plus}}, -year = {2002} -} -@book{Crawley2012, -author = {Crawley, Michael J.}, -isbn = {0470973927}, -pages = {1076}, -publisher = {Wiley}, -title = {{The R Book}}, -year = {2012} -} -@book{Crawley2005, -author = {Crawley, Michael J.}, -isbn = {0470022981}, -pages = {342}, -publisher = {Wiley}, -title = {{Statistics: An Introduction using R}}, -year = {2005} -} -@book{Cryer2009, -author = {Cryer, Jonathan D. and Chan, Kung-Sik}, -isbn = {0387759581}, -pages = {508}, -publisher = {Springer}, -title = {{Time Series Analysis: With Applications in R}}, -year = {2009} -} -@book{Dalgaard2002, -address = {New York}, -author = {Dalgaard, P}, -isbn = {0 387 95475 9}, -keywords = {R,textbook,statistics}, -pages = {xv + 267}, -publisher = {Springer}, -series = {Statistics and Computing}, -title = {{Introductory Statistics with R}}, -year = {2002} -} -@book{Dalgaard2008, -author = {Dalgaard, Peter}, -isbn = {0387790543}, -keywords = {R}, -mendeley-tags = {R}, -pages = {380}, -publisher = {Springer}, -title = {{Introductory Statistics with R}}, -year = {2008} -} -@book{Eddelbuettel2013, -author = {Eddelbuettel, Dirk}, -isbn = {1461468671}, -keywords = {C++,R,Rcpp,programming}, -mendeley-tags = {C++,R,Rcpp,programming}, -pages = {248}, -publisher = {Springer}, -title = {{Seamless R and C++ Integration with Rcpp}}, -year = {2013} -} -@book{Everitt2010, -author = {Everitt, Brian S. 
and Hothorn, Torsten}, -edition = {2}, -isbn = {1420079336}, -keywords = {R,handbook,statistics}, -pages = {376}, -publisher = {Chapman \& Hall/CRC}, -title = {{A Handbook of Statistical Analyses Using R}}, -year = {2010} -} -@book{Everitt2011, -author = {Everitt, Brian and Hothorn, Torsten}, -isbn = {1441996494}, -pages = {288}, -publisher = {Springer}, -title = {{An Introduction to Applied Multivariate Analysis with R}}, -year = {2011} -} -@book{Faraway2004, -address = {Boca Raton, FL}, -annote = {ISBN 1-584-88425-8}, -author = {Faraway, Julian James}, -keywords = {R, linear models, statistics,S}, -pages = {240}, -publisher = {Chapman \& Hall/CRC}, -title = {{Linear Models with R}}, -year = {2004} -} -@book{Faraway2006, -author = {Faraway, Julian James}, -isbn = {158488424X}, -issn = {00319155}, -keywords = {GLM,Rbook,analysis of variance,mixed effects,nonpararametric regression model,regression}, -pages = {345}, -publisher = {Chapman \& Hall/CRC}, -title = {{Extending the linear model with R: generalized linear, mixed effects and nonparametric regression models}}, -year = {2006} -} -@book{Fox2002, -address = {Thousand Oaks, CA, USA}, -annote = {ISBN 0-761-92279-2}, -author = {Fox, John}, -keywords = {S,regression,statistics,R}, -publisher = {Sage Publications}, -title = {{An \{R\} and \{S-Plus\} Companion to Applied Regression}}, -year = {2002} -} -@book{Fox2010, -author = {Fox, John and Weisberg, Harvey Sanford}, -isbn = {141297514X}, -pages = {472}, -publisher = {SAGE Publications, Inc}, -title = {{An R Companion to Applied Regression}}, -year = {2010} -} -@book{Gandrud2013, -author = {Gandrud, Christopher}, -isbn = {1466572841}, -pages = {294}, -publisher = {Chapman and Hall/CRC}, -title = {{Reproducible Research with R and RStudio}}, -year = {2013} -} -@book{Gentleman2008, -author = {Gentleman, Robert}, -isbn = {1420063677}, -pages = {328}, -publisher = {Chapman and Hall/CRC}, -title = {{R Programming for Bioinformatics}}, -year = {2008} -} -@book{Hahne2008, -author = {Hahne, Florian and Huber, Wolfgang and Gentleman, Robert and Falcon, Seth}, -isbn = {0387772391}, -pages = {284}, -publisher = {Springer}, -title = {{Bioconductor Case Studies (Use R!)}}, -year = {2008} -} -@book{Hyndman2008, -author = {Hyndman, Rob and Koehler, Anne B. and Ord, J. Keith and Snyder, Ralph D.}, -isbn = {3540719164}, -pages = {362}, -publisher = {Springer}, -title = {{Forecasting with Exponential Smoothing: The State Space Approach}}, -year = {2008} -} -@article{Ihaka1996, -author = {Ihaka, R and Gentleman, R}, -journal = {J. Comput. Graph. Stat.}, -keywords = {R,graphics,software,statistics}, -pages = {299--314}, -title = {{R: A Language for Data Analysis and Graphics}}, -volume = {5}, -year = {1996} -} -@book{Keen2010, -author = {Keen, Kevin J}, -isbn = {1584880872}, -pages = {489}, -publisher = {Chapman and Hall/CRC}, -title = {{Graphics for Statistics and Data Analysis with R}}, -year = {2010} -} -@book{Loo2012, -address = {Birmingham, Mumbai}, -author = {der Loo, MPJ Van and de Jonge, Edwin}, -edition = {1}, -pages = {126}, -publisher = {Packt Publishing}, -title = {{Learning RStudio for R Statistical Computing}}, -year = {2012} -} -@book{Maindonald2010, -author = {Maindonald, John and Braun, W. 
John}, -isbn = {0521762936}, -pages = {552}, -publisher = {Cambridge University Press}, -title = {{Data Analysis and Graphics Using R: An Example-Based Approach}}, -year = {2010} -} -@book{Matloff2011, -author = {Matloff, Norman}, -isbn = {1593273843}, -pages = {400}, -publisher = {No Starch Press}, -title = {{The Art of R Programming: A Tour of Statistical Software Design}}, -year = {2011} -} - -@Book{Murrell2019, - title = {R Graphics}, - publisher = {Chapman and Hall/CRC}, - year = {2019}, - author = {Paul Murrell}, - address = {Portland}, - edition = {3}, - isbn = {1498789056}, - location = {Portland}, - pagetotal = {423}, -} -@book{Murrell2011, -author = {Murrell, Paul}, -isbn = {1439831769}, -pages = {546}, -edition = {2}, -publisher = {Chapman and Hall/CRC}, -title = {R Graphics}, -year = {2011} -} -@book{Murrell2005, -address = {Boca Raton, FL}, -author = {Murrell, Paul}, -isbn = {1-584-88486-X}, -keywords = {R,graphics,software}, -pages = {301}, -publisher = {Chapman and Hall/CRC}, -title = {{R Graphics}}, -year = {2005} -} - -@book{Petris2009, -author = {Petris, Giovanni and Petrone, Sonia and Campagnoli, Patrizia}, -isbn = {0387772375}, -pages = {268}, -publisher = {Springer}, -title = {{Dynamic Linear Models with R (Use R!)}}, -year = {2009} -} -@book{Pinheiro2000, -address = {New York}, -author = {Pinheiro, J C and Bates, D M}, -booktitle = {Mixed-Effects Models in S and S-Plus}, -keywords = { LME, NLME, S, S-Plus, linear mixed effects, mixed effects, non-linear mixed effects,R}, -publisher = {Springer}, -title = {{Mixed-Effects Models in S and S-Plus}}, -year = {2000} -} -@book{Ritz2009a, -author = {Ritz, Christian and Streibig, Jens Carl}, -isbn = {0387096159}, -pages = {148}, -publisher = {Springer}, -title = {{Nonlinear Regression with R}}, -year = {2009} -} -@book{Robert2009, -author = {Robert, Christian and Casella, George}, -isbn = {1441915753}, -pages = {306}, -publisher = {Springer}, -title = {{Introducing Monte Carlo Methods with R}}, -year = {2009} -} -@book{Sarkar2008, -author = {Sarkar, Deepayan}, -edition = {1}, -isbn = {0387759689}, -keywords = {R,software,statistics}, -pages = {268}, -publisher = {Springer}, -title = {{Lattice: Multivariate Data Visualization with R}}, -year = {2008} -} -@book{Soetaert, -author = {Soetaert, Karline and Cash, Jeff and Mazzia, Francesca}, -isbn = {3642280692}, -keywords = {R,differential equations}, -mendeley-tags = {R,differential equations}, -publisher = {Springer}, -title = {{Solving Differential Equations in R}}, -} -@book{Stanton2013, -author = {Stanton, Jeffrey}, -edition = {Version 3}, -pages = {196}, -publisher = {Syracuse University}, -title = {{An Introduction to Data Science}}, -year = {2013} -} -@book{Tattar2013, -address = {Birmingham, Mumbai}, -author = {Tattar, Prabhanjan Narayanachar}, -edition = {1}, -isbn = {9781849519441}, -pages = {345}, -publisher = {Packt Publishing}, -title = {{R Statistical Application Development by Example Beginner's Guide}}, -year = {2013} -} -@book{Teetor2011, -address = {Sebastopol}, -author = {Teetor, P}, -edition = {1}, -isbn = {9780596809157}, -keywords = {R}, -mendeley-tags = {R}, -pages = {436}, -publisher = {O'Reilly Media}, -title = {{R Cookbook}}, -year = {2011} -} -@book{Venables2000, -address = {New York}, -author = {Venables, W N and Ripley, B D}, -isbn = {0 387 98966 8}, -keywords = {R,S,S-Plus,programming,statistics}, -pages = {x + 264}, -publisher = {Springer}, -series = {Statistics and Computing}, -title = {{S Programming}}, -year = {2000} -} -@book{Venables1999, 
-address = {New York}, -author = {Venables, W N and Ripley, B D}, -edition = {3rd}, -isbn = {0 387 98825 4}, -pages = {x + 501}, -publisher = {Springer}, -series = {Statistics and Computing}, -title = {Modern Applied Statistics with {S-PLUS}}, -year = {1999} -} -@book{Venables2002, -address = {New York}, -author = {Venables, William N and Ripley, Brian D}, -edition = {4th}, -isbn = {0-387-95457-0}, -publisher = {Springer}, -title = {Modern Applied Statistics with {S}}, -year = {2002} -} -@book{Verzani2004, -author = {Verzani, John}, -isbn = {1584884509}, -pages = {432}, -publisher = {Chapman \& Hall/CRC}, -title = {Using {R} for Introductory Statistics}, -year = {2004} -} -@book{Wickham2009, -author = {Wickham, Hadley}, -isbn = {0387981403}, -pages = {224}, -publisher = {Springer}, -shorttitle = {ggplot2}, -title = {ggplot2: Elegant Graphics for Data Analysis}, -year = {2009} -} -@book{Xie2013, -author = {Xie, Yihui}, -isbn = {1482203537}, -pages = {216}, -publisher = {Chapman and Hall/CRC}, -title = {Dynamic Documents with R and knitr}, -series = {The R Series}, -year = {2013} -} - -@book{Zuur2009, -author = {Zuur, Alain F. and Ieno, Elena N. and Meesters, Erik}, -edition = {1}, -isbn = {0387938362}, -keywords = {R,introduction,tutorial}, -pages = {236}, -publisher = {Springer}, -title = {{A Beginner's Guide to R}}, -year = {2009} -} -@book{Zuur2007, -author = {Zuur, Alain and Ieno, Elena N. and Smith, Graham M.}, -isbn = {0387459677}, -pages = {672}, -publisher = {Springer}, -title = {{Analysing Ecological Data (Statistics for Biology and Health)}}, -year = {2007} -} -@book{Zuur2009b, -address = {New York}, -author = {Zuur, Alain and Ieno, Elena N. and Walker, Neil and Saveliev, Anatoly A. and Smith, Graham M.}, -isbn = {978-0-387-87457-9}, -pages = {574}, -publisher = {Springer}, -title = {Mixed Effects Models and Extensions in Ecology with {R}}, -year = {2009} -} -@book{Wickham2015, - title={R Packages}, - author={Wickham, H.}, - isbn={9781491910542}, - year={2015}, - publisher={O'Reilly Media} -} -@book{Wickham2014advanced, - title={Advanced R}, - author={Wickham, H.}, - isbn={9781466586970}, - series={The R Series}, - year={2014}, - publisher={Chapman and Hall/CRC} -} -@Book{Hillebrand2015, - author = {Julian Hillebrand and Maximilian H. Nierhoff}, - title = {Mastering RStudio: Develop, Communicate, and Collaborate with R}, - year = {2015}, - publisher = {Packt Publishing}, - pagetotal = {348}, - isbn = {9781783982554}, -} - - -@Book{Horton2015, - author = {Horton, Nicholas}, - title = {Using R and RStudio for data management, statistical analysis, and graphics}, - year = {2015}, - publisher = {CRC Press}, - isbn = {9781482237368}, - address = {Boca Raton}, -} - - -@Book{vanderLoo2012, - author = {van der Loo, Mark P.J. 
and de Jonge, Edwin}, - title = {Learning {RStudio} for {R} Statistical Computing}, - year = {2012}, - edition = {1}, - publisher = {Packt Publishing}, - isbn = {9781782160601}, - pages = {126}, - address = {Birmingham}, -} - - -@Book{Gandrud2013, - author = {Gandrud, Christopher}, - title = {Reproducible Research with R and RStudio}, - series = {The R Series}, - year = {2013}, - publisher = {Chapman and Hall/CRC}, - isbn = {1466572841}, - pages = {294}, -} - - -@Book{Wickham2015, - Title = {R Packages}, - Author = {Wickham, H.}, - Year = {2015}, - ISBN = {9781491910542}, - Publisher = {O'Reilly Media}, -} - -@Book{Wickham2014, - Title = {Advanced R}, - Author = {Wickham, H.}, - Year = {2014}, - ISBN = {9781466586970}, - Publisher = {Chapman and Hall/CRC}, - Series = {The R Series}, -} - -@Book{Hyndman2014, - Title = {Forecasting: principles and practice}, - Author = {Hyndman, Rob}, - Year = {2014}, - Address = {Heathmont, Vic}, - Eprint = {https://www.otexts.org/book/fpp}, - ISBN = {9780987507105}, - Publisher = {OTexts}, - Url = {https://www.otexts.org/book/fpp}, -} - -@Book{Somasundaram2013, - author = {Ravishankar Somasundaram}, - title = {Git}, - year = {2013}, - publisher = {Packt Publishing}, - isbn = {1849517525}, - pagetotal = {180}, -} - -@Book{Chang2013, - Title = {R Graphics Cookbook}, - Author = {Chang, Winston}, - Year = {2013}, - Address = {Sebastopol}, - Edition = {1-2}, - ISBN = {9781449316952}, - Pages = {413}, - Publisher = {O'Reilly Media}, -} - -@Book{Beckerman2012, - Title = {Getting Started with R: An introduction for biologists}, - Series = {Oxford Biology}, - Author = {Beckerman, Andrew P. and Petchey, Owen L.}, - Year = {2012}, - Address = {Oxford}, - ISBN = {0199601623}, - Pages = {128}, - Publisher = {Oxford University Press}, -} - -@Book{Crawley2012, - Title = {The R Book}, - Author = {Crawley, Michael J.}, - Year = {2012}, - ISBN = {0470973927}, - Pages = {1076}, - Publisher = {Wiley}, -} - -@Book{Allerhand2011, - author = {Allerhand, Mike}, - title = {A Tiny Handbook of R}, - year = {2011}, - publisher = {Springer}, - isbn = {978-3-642-17980-8}, -} - -@Book{Teetor2011, - Title = {R Cookbook}, - Author = {Teetor, P}, - Year = {2011}, - Address = {Sebastopol}, - Edition = {1}, - ISBN = {9780596809157}, - Pages = {436}, - Publisher = {O'Reilly Media}, -} - - -@Book{Swicegood2010, - author = {Travis Swicegood}, - title = {Pragmatic Guide to Git}, - year = {2010}, - publisher = {Pragmatic Programmers}, - isbn = {978-1-934356-72-2}, -} - - -@Book{Horton2015a, - title = {A Student's Guide to R}, - year = {2015}, - author = {Nicholas J. Horton and Randall Pruim and Daniel T. Kaplan}, - edition = {1.2}, - pagetotal = {119}, - url = {https://github.com/ProjectMOSAIC/LittleBooks/blob/master/README.md}, - urldate = {2016-07-17}, -} - - -@Book{Paradis2005, - title = {R for Beginners}, - year = {2005}, - author = {Emmanuel Paradis}, - date = {2005}, - location = {Montpellier}, - pagetotal = {76}, - url = {https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf}, - urldate = {2016-07-17}, -} - - -@Book{Wickham2016, - author = {Hadley Wickham and Carson Sievert}, - title = {ggplot2: Elegant Graphics for Data Analysis}, - year = {2016}, - edition = {2}, - publisher = {Springer}, - isbn = {978-3-319-24277-4}, - pagetotal = {XVI + 260}, -} - -@Book{Cotton2016, - author = {Richard James Cotton}, - title = {Testing R Code}, - year = {2016}, - series = {The R Series}, - publisher = {Chapman and Hall/CRC}, - isbn = {1498763650}, -} - -@Book{Chambers2016, - author = {John M. 
Chambers}, - title = {Extending R}, - year = {2016}, - series = {The R Series}, - publisher = {Chapman and Hall/CRC}, - isbn = {1498775713}, -} - -@Book{Peng2022, - author = {Roger D. Peng}, - title = {R Programming for Data Science}, - date = {2022}, - publisher = {Leanpub}, - pagetotal = {182}, - url = {https://leanpub.com/rprogramming}, - urldate = {2023-07-27}, -} - -@book{Venables2000, -address = {New York}, -author = {Venables, W N and Ripley, B D}, -isbn = {0 387 98966 8}, -keywords = {R,S,S-Plus,programming,statistics}, -pages = {x + 264}, -publisher = {Springer}, -series = {Statistics and Computing}, -title = {{S Programming}}, -year = {2000} -} - -@Book{Becker1984, - title = {S: An Interactive Environment for Data Analysis and Graphics}, - publisher = {Chapman and Hall/CRC}, - year = {1984}, - author = {R. A. Becker and J. M. Chambers}, - isbn = {0-534-03313-X}, -} - -@Book{Becker1988, - title = {The New {S} Language: A Programming Environment for Data Analysis and Graphics}, - publisher = {Chapman \& Hall}, - year = {1988}, - author = {Richard A. Becker and John M. Chambers and Allan R. Wilks}, - isbn = {0-534-09192-X}, -} - - -@Book{Burns2012, - title = {Tao Te Programming}, - year = {2012}, - author = {Patrick Burns}, - isbn = {9781291130454}, - publisher = {Lulu}, - timestamp = {2017-07-27}, -} - - -@Book{Burns1998, - author = {Patrick Burns}, - title = {S Poetry}, - year = {1998}, - groups = {S and R}, - keywords = {R, programming, style, clarity}, -} - - -@Book{Burns2011, - title = {The R Inferno}, - year = {2011}, - author = {Patrick Burns}, - url = {http://www.burns-stat.com/pages/Tutor/R_inferno.pdf}, - urldate = {2017-07-27}, -} - - -@Book{Bentley1988, - author = {Bentley, Jon}, - title = {More Programming Pearls: Confessions of a Coder}, - year = {1988}, - publisher = {Addison Wesley}, - isbn = {0201118890}, - pagetotal = {224}, -} - -@Book{Bentley1986, - author = {Jon Louis Bentley}, - title = {Programming Pearls}, - year = {1986}, - publisher = {Addison-Wesley}, - isbn = {0201500191}, - pagetotal = {195}, - address = {Reading, Massachusetts}, -} - -@Book{Burchell2016, - title = {The Hitchhiker’s Guide to Ggplot2 in R}, - publisher = {Leanpub}, - year = {2016}, - author = {Jodie Burchell and Mauricio Vargas}, - isbn = {978-956-362-693-3}, - keywords = {Plotting, R, ggplot2}, - pagetotal = {237}, - timestamp = {2017-08-11}, - urldate = {2019-07-31}, -} - -@book{Crawley2012, -author = {Crawley, Michael J.}, -isbn = {0470973927}, -pages = {1076}, -publisher = {Wiley}, -title = {{The R Book}}, -year = {2012} -} - - -@Book{Xie2018, - title = {R Markdown}, - publisher = {Chapman and Hall/CRC}, - year = {2018}, - author = {Yihui Xie and J. J. 
Allaire and Garrett Grolemund}, - isbn = {1138359335}, - pagetotal = {304} -} - - -@Book{Wood2017, - author = {Wood, Simon N.}, - title = {Generalized Additive Models}, - year = {2017}, - publisher = {Chapman and Hall/CRC}, - isbn = {1498728332}, - pagetotal = {476}, - ean = {9781498728331}, - timestamp = {2018-12-27}, -} - - -@Book{Wickham2017, - title = {R for Data Science}, - publisher = {O'Reilly UK Ltd.}, - year = {2017}, - author = {Hadley Wickham and Garrett Grolemund}, - isbn = {1491910399}, - ean = {9781491910399}, - timestamp = {2019-07-31}, -} - - -@Book{Wickham2019, - title = {Advanced R}, - publisher = {Chapman and Hall/CRC}, - year = {2019}, - author = {Wickham, Hadley}, - edition = {2}, - isbn = {0815384572}, - ean = {9780815384571}, - pagetotal = {588}, - timestamp = {2019-07-31}, -} - - -@Book{Chang2018, - title = {R Graphics Cookbook}, - publisher = {O'Reilly UK Ltd.}, - year = {2018}, - author = {Chang, Winston}, - edition = {2}, - isbn = {1491978600}, - ean = {9781491978603}, - timestamp = {2019-07-31}, -} - - -@Book{Diez2019, - title = {OpenIntro Statistics}, - year = {2019}, - author = {David Diez and Mine Cetinkaya-Rundel and Christopher D. Barr}, - edition = {4}, - pagetotal = {422}, - url = {https://www.openintro.org/stat/os4.php}, - urldate = {2022-11-20}, -} - - -@Book{Holmes2019, - title = {Modern Statistics for Modern Biology}, - publisher = {Cambridge University Press}, - year = {2019}, - author = {Susan Holmes and Wolfgang Huber}, - isbn = {1108705294}, - ean = {9781108705295}, - pagetotal = {382}, -} - - -@Article{Smith1957, - author = {Smith, H. F.}, - title = {Interpretation of adjusted treatment means and regressions in analysis of covariance}, - journal = {Biometrics}, - year = {1957}, - volume = {13}, - pages = {281--308}, -} - - -@Book{Kernighan1999, - title = {The Practice of Programming}, - publisher = {Addison Wesley}, - year = {1999}, - author = {Brian W. Kernighan and Rob Pike}, - isbn = {020161586X}, - ean = {9780201615869}, - pagetotal = {288}, -} - -@Book{Ramsay2009, - author = {Ramsay, James}, - publisher = {Springer-Verlag New York}, - title = {Functional Data Analysis with R and MATLAB}, - isbn = {9780387981840}, - creationdate = {2023-06-03T14:21:04}, - date = {2009}, - modificationdate = {2023-06-03T14:21:04}, - pages = {214}, -} - -@Book{Wickham2023, - author = {Wickham, Hadley and Cetinkaya-Rundel, Mine and Grolemund, Garrett}, - publisher = {O'Reilly Media}, - title = {R for Data Science}, - isbn = {9781492097402}, - date = {2023}, - subtitle = {Import, Tidy, Transform, Visualize, and Model Data}, -} - -@Online{Ihaka1998, - author = {Ross Ihaka}, - creationdate = {2023-09-04T22:30:01}, - date = {1998}, - groups = {Read!, S and R}, - keywords = {R history}, - modificationdate = {2023-09-04T22:36:06}, - note = {Interface Symposium on Computer Science and Statistics}, - organization = {The University of Auckland}, - pubstate = {Draft}, - subtitle = {A Draft of a Paper for Interface 98}, - title = {R : Past and Future History}, - url = {https://www.stat.auckland.ac.nz/~ihaka/downloads/Interface98.pdf}, -} - -@Book{Zuur2012, - author = {Alain F. 
Zuur}, - publisher = {Highland Statistics}, - title = {A Beginner's Guide to Generalized Additive Models with R}, - edition = {1}, - isbn = {9780957174122}, - creationdate = {2023-09-26T18:05:28}, - date = {2012}, - location = {Newburgh}, - modificationdate = {2023-09-26T18:10:28}, - pagetotal = {194}, -} - -@Book{Mehtatalo2020, - author = {Mehtätalo, Lauri and Lappi, Juha}, - publisher = {Taylor \& Francis Group}, - location = {Boca Raton}, - title = {Biometry for Forestry and Environmental Data with Examples in {R}}, - isbn = {9781498711487}, - creationdate = {2023-09-26T20:37:05}, - date = {2020}, - modificationdate = {2023-09-26T20:38:01}, - pagetotal = {411}, -} - -@Book{James2013, - author = {James, Gareth and Witten, Daniela and Hastie, Trevor and Tibshirani, Robert}, - title = {An Introduction to Statistical Learning: with Applications in R}, - isbn = {978-1461471370}, - pages = {426}, - publisher = {Springer}, - groups = {S and R}, - keywords = {R}, - mendeley-tags = {R}, - owner = {aphalo}, - timestamp = {2014.10.06}, - year = {2013}, -} - -@Book{Wickham2023, - author = {Wickham, Hadley and Bryan, Jenny}, - publisher = {O'Reilly Media, Incorporated}, - title = {R Packages Organize, Test, Document, and Share Your Code}, - isbn = {9781098134945}, - creationdate = {2023-09-28T18:57:43}, - date = {2023}, - modificationdate = {2023-09-28T18:57:43}, - subtitle = {Organize, Test, Document, and Share Your Code}, -} ->>>>>>> Stashed changes diff --git a/references.bib b/references.bib index f7fac87b..a8b6dfae 100644 --- a/references.bib +++ b/references.bib @@ -1,4 +1,3 @@ -<<<<<<< Updated upstream @Article{Anscombe1973, Title = {Graphs in Statistical Analysis}, Author = {F. J. Anscombe}, @@ -499,505 +498,3 @@ @Article{Wickham2010 timestamp = {2020-11-29}, year = {2010}, } -======= -@Article{Anscombe1973, - Title = {Graphs in Statistical Analysis}, - Author = {F. J. Anscombe}, - Journal = {The American Statistician}, - Year = {1973}, - Doi = {10.2307/2682899}, - Month = {feb}, - Number = {1}, - Pages = {17}, - Volume = {27}, -} - - -@Article{Knuth1984a, - author = {Knuth, Donald Ervin}, - title = {Literate programming}, - journal = {The Computer Journal}, - year = {1984}, - volume = {27}, - number = {2}, - pages = {97--111}, -} - - -@Book{Lamport1994, - author = {Lamport, L.}, - title = {\LaTeX: a document preparation system}, - year = {1994}, - language = {English}, - edition = {2}, - publisher = {Addison-Wesley}, - isbn = {0-201-52983-1}, - pages = {272}, - address = {Reading}, - booktitle = {LaTeX: a document preparation system}, -} - - -@Book{Xie2016, - author = {Yihui Xie}, - title = {bookdown: Authoring Books and Technical Documents with R Markdown}, - year = {2016}, - publisher = {Chapman and Hall/CRC}, - isbn = {9781138700109}, -} - - -@Book{Wickham2017, - title = {R for Data Science}, - publisher = {O'Reilly}, - year = {2017}, - author = {Wickham, Hadley and Grolemund, Garrett}, - isbn = {978-1-4919-1039-9}, - date = {2017-01-11}, -} - -@Book{Peng2017, - title = {Mastering Software Development in R}, - publisher = {Leanpub}, - year = {2017}, - author = {Roger D. Peng and Sean Kross and Brooke Anderson}, - date = {2017-01-05}, - timestamp = {2017-02-11}, - url = {https://leanpub.com/msdr}, -} - - -@Book{Cleveland1985, - title = {The Elements of Graphing Data}, - publisher = {Wadsworth, Inc.}, - year = {1985}, - author = {William S. 
Cleveland}, - isbn = {978-0534037291}, -} - - -@Book{Bentley1986, - title = {Programming Pearls}, - publisher = {Addison-Wesley Publishing Company}, - year = {1986}, - author = {Jon Louis Bentley}, - address = {Reading, Massachusetts}, - isbn = {0201500191}, - booktitle = {Programming Pearls}, - pages = {195}, -} - -@Book{Bentley1988, - title = {More Programming Pearls: Confessions of a Coder}, - publisher = {Addison Wesley Pub Co Inc}, - year = {1988}, - author = {Bentley, Jon Louis}, - isbn = {0201118890}, - date = {1988-01-11}, - ean = {9780201118896}, - pagetotal = {224}, -} - - -@Book{Tufte1983, - author = {Tufte, E. R.}, - title = {The Visual Display of Quantitative Information}, - year = {1983}, - publisher = {Graphics Press}, - isbn = {0-9613921-0-X}, - pagetotal = {197}, - address = {Cheshire, CT}, -} - - -@Book{Kernigham1981, - author = {Kernigham, B. W. and Plauger, P. J.}, - title = {Software Tools in Pascal}, - year = {1981}, - publisher = {Addison-Wesley Publishing Company}, - pagetotal = {366}, - address = {Reading, Massachusetts}, - booktitle = {Software Tools in Pascal}, - groups = {Imported}, - keywords = {0-BOOK PEDRO}, - owner = {aphalo}, - papyrus = {P2731}, - timestamp = {2014.10.05}, -} - -@Book{Rosenblatt1993, - author = {Rosenblatt, B.}, - title = {Learning the Korn Shell}, - year = {1993}, - language = {English}, - publisher = {O'Reilly and Associates}, - isbn = {1-56592-054-6}, - pages = {337}, - address = {Sebastopol}, - booktitle = {Learning the Korn Shell}, - groups = {Imported}, - keywords = {0-BOOK PEDRO; COMPUTERS-; KORN SHELL; PROGRAMMING LANGUAGE; SHELL; UNIX}, - owner = {aphalo}, - papyrus = {P1854}, - timestamp = {2014.10.05}, -} - - -@Article{Wickham2014a, - author = {Hadley Wickham}, - title = {Tidy Data}, - year = {2014}, - volume = {59}, - number = {10}, - month = sep, - issn = {1548-7660}, - url = {http://www.jstatsoft.org/v59/i10}, - accepted = {2014-05-09}, - bibdate = {2014-05-09}, - coden = {JSSOBK}, - day = {12}, - journal = {Journal of Statistical Software}, - keywords = {R, data objects, data frame, dplyr}, - owner = {aphalo}, - submitted = {2013-02-20}, - timestamp = {2015.07.12}, -} - - -@Article{Burns1998, - author = {P J Burns}, - title = {S poetry}, - year = {1998}, - publisher = {Lulu.com}, - isbn = {9781471045523}, - pages = {429}, - groups = {S and R}, - keywords = {R, programming, style, clarity}, -} - - -@Article{Aphalo2006, - author = {Pedro J Aphalo and Risto Rikala}, - title = {Spacing of silver birch seedlings grown in containers of equal size affects their morphology and its variability.}, - year = {2006}, - volume = {26}, - number = {9}, - pages = {1227--1237}, - doi = {10.1093/treephys/26.9.1227}, - groups = {Imported}, - journal = {Tree physiology}, -} - -@Article{Sagi2017, - author = {Kazutoshi Sagi and Kristell Pérot and Donal Murtagh and Yvan Orsolini}, - title = {Two mechanisms of stratospheric ozone loss in the Northern Hemisphere, studied using data assimilation of Odin/SMR atmospheric observations}, - journal = {Atmospheric Chemistry and Physics}, - year = {2017}, - volume = {17}, - pages = {1791--1803}, - doi = {10.5194/acp-17-1791-2017}, -} - - -@InProceedings{Leisch2002, - author = {Friedrich Leisch}, - title = {Dynamic generation of statistical reports using literate data analysis}, - booktitle = {Proceedings in Computational Statistics}, - year = {2002}, - editor = {W. Härdle and B. 
Rönz}, - pages = {575-580}, - publisher = {Physika Verlag}, - eventtitle = {Compstat 2002}, - isbn = {3-7908-1517-9}, - location = {Heidelberg, Germany}, - timestamp = {2018-01-11}, -} - - -@Book{Gandrud2015, - author = {Christopher Gandrud}, - title = {Reproducible Research with R and R Studio}, - year = {2015}, - edition = {2}, - series = {Chapman \& Hall/CRC The R Series)}, - publisher = {Chapman and Hall/CRC}, - isbn = {1498715370}, - pagetotal = {323}, -} - - -@Book{Kernigham1981, - author = {Kernigham, B. W. and Plauger, P. J.}, - title = {Software Tools in Pascal}, - year = {1981}, - publisher = {Addison-Wesley}, - pages = {366}, - address = {Reading, Massachusetts}, - booktitle = {Software Tools in Pascal}, - groups = {Imported}, -} - -@Article{Johnson2011, - author = {Kenneth A. Johnson and Roger S. Goody}, - title = {The Original Michaelis Constant: Translation of the 1913 Michaelis{\textendash}Menten Paper}, - journal = {Biochemistry}, - year = {2011}, - volume = {50}, - pages = {8264--8269}, - doi = {10.1021/bi201284u}, -} - -@Book{Hughes2004, - author = {Hughes, Thomas P.}, - title = {American Genesis}, - year = {2004}, - publisher = {The University of Chicago Press}, - isbn = {0226359271}, - pagetotal = {530} -} - - -@Article{Zachry2004, - author = {Mark Zachry and Charlotte Thralls}, - title = {An Interview with Edward R. Tufte}, - journal = {Technical Communication Quarterly}, - year = {2004}, - volume = {13}, - number = {4}, - pages = {447--462}, - month = {oct}, - doi = {10.1207/s15427625tcq1304_5}, - publisher = {Informa {UK} Limited}, - timestamp = {2019-07-27}, -} - - -@Book{Newham2005, - title = {Learning the bash Shell}, - publisher = {O'Reilly UK Ltd.}, - year = {2005}, - author = {Cameron Newham and Bill Rosenblatt}, - isbn = {0596009658}, - date = {2005-06-01}, - ean = {9780596009656}, - pagetotal = {352}, -} - - -@Article{Boas1981, - author = {Boas, Ralph P}, - title = {Can we make mathematics intelligible?}, - journal = {The American Mathematical Monthly}, - year = {1981}, - volume = {88}, - number = {10}, - pages = {727--731}, - publisher = {Taylor \& Francis}, - timestamp = {2020-02-07}, -} - - -@Book{Kernighan1999, - title = {The Practice of Programming}, - publisher = {Addison Wesley}, - year = {1999}, - author = {Brian W. Kernighan and Rob Pike}, - isbn = {020161586X}, - date = {1999-11-01}, - ean = {9780201615869}, - pagetotal = {288}, - timestamp = {2019-10-12}, -} - - -@Article{Aiken1964, - author = {Howard Aiken and A. G. Oettinger and T. C. Bartee}, - title = {Proposed automatic calculating machine}, - journal = {{IEEE} Spectrum}, - year = {1964}, - volume = {1}, - number = {8}, - pages = {62--69}, - month = {aug}, - doi = {10.1109/mspec.1964.6500770}, - publisher = {Institute of Electrical and Electronics Engineers ({IEEE})}, - timestamp = {2020-02-07}, -} - - -@Online{LemonND, - author = {Jim Lemon}, - title = {Kickstarting R}, - timestamp = {2020-02-07}, - url = {https://cran.r-project.org/doc/contrib/Lemon-kickstart/kr_intro.html}, - urldate = {2020-02-07}, -} - - -@Book{Hamming1987, - title = {Numerical Methods for Scientists and Engineers}, - publisher = {Dover Publications Inc.}, - year = {1987}, - author = {Hamming, Richard W.}, - isbn = {0486652416}, - ean = {9780486652412}, - pagetotal = {752}, - timestamp = {2020-02-07}, -} - -@Book{Aho1994, - title = {Foundations of Computer Science: C Edition}, - publisher = {W. H. Freeman}, - year = {1994}, - author = {Alfred V. Aho and Jeffrey D. 
Ullman}, - isbn = {0716782847}, - pagetotal = {786}, - timestamp = {2020-02-07}, -} - - -@Book{Aho1992, - title = {Foundations of computer science}, - publisher = {Computer Science Press}, - year = {1992}, - author = {Alfred V. Aho and Jeffrey D. Ullman}, - isbn = {0716782332}, -} - - -@Article{Zachry2004, - author = {Mark Zachry and Charlotte Thralls}, - title = {An Interview with Edward R. Tufte}, - journal = {Technical Communication Quarterly}, - year = {2004}, - volume = {13}, - number = {4}, - pages = {447--462}, - month = {oct}, - doi = {10.1207/s15427625tcq1304_5}, -} - -@Book{Zimmer1985, - author = {Zimmer, J. A.}, - publisher = {McGraw-Hill}, - title = {Abstraction for Programmers}, - year = {1985}, - address = {New York}, - isbn = {0070728321}, - booktitle = {Abstraction for Programmers}, - date = {1985}, - groups = {Imported}, - pages = {251}, -} - -@Article{Wirth1974, - author = {Wirth, N.}, - journal = {Computing Surveys}, - title = {On the Composition of Well-Structured Programs}, - year = {1974}, - number = {4}, - pages = {247-259}, - volume = {6}, - doi = {10.1145/356635.356639}, - keywords = {0-REPRINT PEDRO}, -} - -@Book{Coplien1999, - author = {Coplien, James O.}, - publisher = {Addison-Wesley}, - title = {Multi-paradigm design for C++}, - isbn = {0201824671}, - date = {1999}, - pages = {280}, -} - -@Book{Adams1987, - author = {James L. Adams}, - publisher = {Penguin Books Ltd.}, - title = {Conceptual Blockbusting: A Guide to Better Ideas}, - year = {1987}, - isbn = {9780140098426}, - timestamp = {2017-10-11}, -} - -@Book{Hall1997, - author = {Hall, Joseph N. and Schwartz, Randal L.}, - publisher = {Addison-Wesley}, - title = {Effective Perl Programming}, - isbn = {9780201419757}, - date = {1997}, - pagetotal = {288}, - subtitle = {Writing Better Programs with Perl}, -} - -@Book{Wirth1976, - author = {Wirth, N.}, - title = {Algorithms + Data Structures = Programs}, - year = {1976}, - publisher = {Prentice-Hall}, - pages = {366}, - address = {Englewood Cliffs}, - booktitle = {Algorithms + Data Structures = Programs}, -} - -@Book{Koponen2019, - author = {Koponen, Juuso and Hildén, Jonatan}, - title = {Data visualization handbook}, - isbn = {9789526074498}, - publisher = {Aalto University}, - address = {Espoo, Finland}, - timestamp = {2020-12-18}, - year = {2019}, -} - -@Article{Bates2015, - title = {Fitting Linear Mixed-Effects Models Using {lme4}}, - author = {Douglas Bates and Martin M{\"a}chler and Ben Bolker and Steve Walker}, - journal = {Journal of Statistical Software}, - year = {2015}, - volume = {67}, - number = {1}, - pages = {1--48}, - doi = {10.18637/jss.v067.i01}, - } - -@Book{Hyndman2021, - author = {Hyndman, R. 
and Athanasopoulos, G.},
- publisher = {OTexts},
- title = {Forecasting: principles and practice},
- edition = {3},
- creationdate = {2023-09-26T19:52:08},
- date = {2021},
- location = {Melbourne, Australia},
- modificationdate = {2023-09-26T20:01:10},
-}
-
-@Article{Ram2019,
- author = {Karthik Ram and Carl Boettiger and Scott Chamberlain and Noam Ross and Maelle Salmon and Stefanie Butland},
- title = {A Community of Practice Around Peer Review for Long-Term Research Software Sustainability},
- number = {2},
- pages = {59--65},
- volume = {21},
- creationdate = {2023-09-28T17:27:54},
- date = {2019-03},
- doi = {10.1109/mcse.2018.2882753},
- journaltitle = {Computing in Science {\&}amp$\mathsemicolon$ Engineering},
- modificationdate = {2023-09-28T17:27:54},
- publisher = {Institute of Electrical and Electronics Engineers ({IEEE})},
-}
-
-@Article{Wickham2010,
- author = {Hadley Wickham},
- title = {A Layered Grammar of Graphics},
- doi = {10.1198/jcgs.2009.07098},
- number = {1},
- pages = {3--28},
- volume = {19},
- groups = {S and R},
- journal = {Journal of Computational and Graphical Statistics},
- month = {jan},
- publisher = {Informa {UK} Limited},
- timestamp = {2020-11-29},
- year = {2010},
-}
->>>>>>> Stashed changes
diff --git a/usingr-2.sty b/usingr-2.sty
new file mode 100644
index 00000000..afdb6736
--- /dev/null
+++ b/usingr-2.sty
@@ -0,0 +1,195 @@
+\NeedsTeXFormat{LaTeX2e}
+\ProvidesPackage{usingr}[2022/10/20]
+
+\RequirePackage{booktabs}
+
+\RequirePackage{xspace}
+
+\RequirePackage{xcolor}
+
+%\usepackage{amsmath,amssymb,amsthm}
+\RequirePackage{unicode-math}
+
+\RequirePackage{fontspec}
+\setmainfont{Lucida Bright OT}
+\setsansfont{Lucida Sans OT}
+\setmonofont{Lucida Console DK}[Scale=1.05]
+\setmathfont{Lucida Bright Math OT}
+\linespread{1.1} % increase line spacing as we use in-line math and code
+
+%\newcommand\tetxtilde{\char"007E}
+% We set up some symbol fonts
+%\newfontfamily\wingdingsfont{Wingdings}
+%\newcommand\wingdings[1]{{\wingdingsfont\symbol{#1}}}
+%
+%\newfontfamily\wingdingsfontdos{Wingdings2}
+%\newcommand\wingdingsdos[1]{{\wingdingsfontdos\symbol{#1}}}
+%
+%\newfontfamily\wingdingsfonttres{Wingdings3}
+%\newcommand\wingdingstres[1]{{\wingdingsfonttres\symbol{#1}}}
+
+%\newfontfamily\meteoconsfont{Meteocons}
+%\newcommand\meteocons[1]{{\meteoconsfont\symbol{#1}}}
+%\newcommand\meteosun{\metecons{"0042}}
+%\newcommand\meteosolidsun{\metecons{"0031}}
+
+\newfontfamily\typiconsfont{Typicons}[Scale=MatchUppercase]
+\newcommand\typicons[1]{{\typiconsfont\symbol{#1}}}
+\newcommand\typiadvn{\typicons{"E137}}
+\newcommand\typiattn{\typicons{"E04E}}
+\newcommand\Attention[1]{\marginpar{\centering\colorbox{orange}{\Large\textcolor{white}{\typicons{"E137}}}}\index{#1}}
+\newcommand\ilAttention{\noindent\colorbox{orange}{\Large\textcolor{white}{\typicons{"E137}}}\xspace}
+\newcommand\Advanced[1]{\marginpar{\centering\colorbox{brown}{\Large\textcolor{white}{\typicons{"E04E}}}}\index{#1}}
+\newcommand\ilAdvanced{\noindent\colorbox{brown}{\Large\textcolor{white}{\typicons{"E04E}}}\xspace}
+\newcommand\noticestd[1]{{\noticestdfont\symbol{#1}}}
+\newcommand\playicon{\noindent{\Large\colorbox{violet}{\textcolor{white}{\typicons{"E098}}}}\xspace}
+\newcommand\advplayicon{\noindent{\colorbox{purple}{\textcolor{white}{\Large\typicons{"E098}\typicons{"E04E}}}}\xspace}
+
+%\newfontfamily\noticestdfont{Notice2Std}
+%\newcommand\noticestd[1]{{\noticestdfont\symbol{#1}}}
+%\newcommand\playicon{\noindent{\large\colorbox{violet}{\textcolor{white}{\noticestd{"0055}}}}\xspace}
+%\newcommand\advplayicon{\noindent{\colorbox{purple}{\textcolor{white}{{\large\noticestd{"0055}}{\Large\typicons{"E04E}}}}}\xspace}
+
+\newfontfamily\modpictsfont{ModernPictograms}[Scale=MatchUppercase]
+\newcommand\modpicts[1]{{\modpictsfont\symbol{#1}}}
+\newcommand\infoicon{\noindent{\Large\colorbox{blue}{\textcolor{white}{\modpicts{"003D}}}}\xspace}
+\newcommand\faqicon{\noindent{\Large\colorbox{darkgray}{\textcolor{white}{\modpicts{"003F}}}}\xspace}
+
+\newcommand{\langname}[1]{\textsf{#1}\index{#1@\textsf{#1}}\index{languages!#1@\textsf{#1}}\xspace}
+\newcommand{\langnameNI}[1]{\textsf{#1}\xspace}
+\newcommand{\pgrmname}[1]{\textsf{#1}\index{#1@\textsf{#1}}\index{programmes!#1@\textsf{#1}}\xspace}
+\newcommand{\pgrmnameNI}[1]{\textsf{#1}\xspace}
+\newcommand{\pgrmnameTwo}[2]{\textsf{#1}\index{#2@\textsf{#1}}\index{programmes!#2@\textsf{#1}}\xspace}
+\newcommand{\pkgname}[1]{`\textsf{#1}'\index{#1@\textsf{`#1'}}\index{packages!#1@\textsf{`#1'}}\xspace}
+\newcommand{\pkgnameNI}[1]{`\textsf{#1}'\xspace}
+\newcommand{\osname}[1]{\textsf{#1}\index{#1@\textsf{#1}}\index{operating systems!#1@\textsf{#1}}\xspace}
+\newcommand{\osnameNI}[1]{\textsf{#1}\xspace}
+
+\newcommand{\ggplot}{\pkgname{ggplot2}}
+\newcommand{\ggspectra}{\pkgname{ggspectra}}
+\newcommand{\ggmap}{\pkgname{ggmap}}
+\newcommand{\ggtern}{\pkgname{ggtern}}
+\newcommand{\ggrepel}{\pkgname{ggrepel}}
+\newcommand{\ggsignif}{\pkgname{ggsignif}}
+\newcommand{\ggpmisc}{\pkgname{ggpmisc}}
+\newcommand{\ggpp}{\pkgname{ggpp}}
+\newcommand{\cowplot}{\pkgname{cowplot}}
+\newcommand{\scales}{\pkgname{scales}}
+\newcommand{\plyr}{\pkgname{plyr}}
+\newcommand{\dplyr}{\pkgname{dplyr}}
+\newcommand{\tydyr}{\pkgname{tidyr}}
+\newcommand{\readr}{\pkgname{readr}}
+\newcommand{\xts}{\pkgname{xts}}
+\newcommand{\Hmisc}{\pkgname{Hmisc}}
+\newcommand{\viridis}{\pkgname{viridis}}
+
+\newcommand{\R}{\textsf{R}}
+%\newcommand{\Rpgrm}{\pgrmname{R}}
+\newcommand{\Rpgrm}{\textsf{R}\xspace} % does not create index entry
+\newcommand{\RStudio}{\pgrmname{RStudio}}
+\newcommand{\git}{\pgrmname{git}}
+\newcommand{\Quarto}{\pgrmname{Quarto}}
+
+\newcommand{\Rlang}{\textsf{R}\xspace} % does not create index entry
+\newcommand{\Slang}{\langname{S}}
+\newcommand{\Splang}{\langname{S-Plus}}
+\newcommand{\Clang}{\langname{C}}
+\newcommand{\Cpplang}{\langname{C++}}
+\newcommand{\javalang}{\langname{Java}}
+\newcommand{\pythonlang}{\langname{Python}}
+\newcommand{\pascallang}{\langname{Pascal}}
+\newcommand{\perllang}{\langname{Perl}}
+\newcommand{\Markdown}{\langname{Markdown}}
+\newcommand{\Rmarkdown}{\langname{R markdown}}
+\newcommand{\Latex}{{\LaTeX\xspace}\index{Latex@{\LaTeX}}\index{languages!Latex@{\LaTeX}}}
+
+\newcommand{\stackoverflow}{\textsf{StackOverflow}\index{Stackoverflow@\textsf{StackOverflow}}\xspace}
+\newcommand{\CRAN}{\textsf{CRAN}\index{CRAN@\textsf{CRAN}}\xspace}
+\newcommand{\GitHub}{\textsf{GitHub}\index{GitHub@\textsf{GitHub}}\xspace}
+\newcommand{\Bitbucket}{\textsf{Bitbucket}\index{Bitbucket@\textsf{Bitbucket}}\xspace}
+\newcommand{\RForge}{\textsf{R-Forge}\index{R-Forge@\textsf{R-Forge}}\xspace}
+
+% index entry
+\newcommand{\indexfaq}[1]{\index[faqindex]{#1}}
+
+% text and index entry
+\newcommand{\Rclass}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{classes and modes!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\Rfunction}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\Rmethod}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\Roperator}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{operators!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\Rloop}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{control of execution!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\Rcontrol}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{control of execution!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\Rconst}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{constant and special values!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\Rscoping}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{names and their scope!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\Rdata}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{data objects!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+
+% only index entry "quiet version" to create index entries for code chunks
+\newcommand{\qRclass}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{classes and modes!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\qRfunction}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\qRmethod}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\qRoperator}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{operators!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\qRloop}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{control of execution!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\qRcontrol}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{control of execution!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\qRconst}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{constant and special values!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\qRscoping}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{names and their scope!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+
+\newcommand{\gggeom}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\ggposition}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\ggstat}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\ggtheme}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\ggscale}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+\newcommand{\ggcoordinate}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
+
+%\newcommand{\gggeom}[1]{\code{#1}\index{plots!geometries!#1@\texttt{#1}}\xspace}
+%\newcommand{\ggstat}[1]{\code{#1}\index{plots!statistics!#1@\texttt{#1}}\xspace}
+%\newcommand{\ggtheme}[1]{\code{#1}\index{plots!themes!#1@\texttt{#1}}\xspace}
+%\newcommand{\ggscale}[1]{\code{#1}\index{plots!scales!#1@\texttt{#1}}\xspace}
+%\newcommand{\ggcoordinate}[1]{\code{#1}\index{plots!coordinates!#1@\texttt{#1}}\xspace}
+
+% R commands embedded in main text
+%\definecolor{codeshadecolor}{rgb}{0.969, 0.969, 0.900} % same fill as used by knitr
+\definecolor{codeshadecolor}{rgb}{0.984, 0.984, 0.984} % same fill as now used in chunks
+\newcommand{\code}[1]{\texttt{\addfontfeature{Scale = 0.89}{\setlength{\fboxsep}{0.05pt}\colorbox{codeshadecolor}{#1\vphantom{tp}}}}}
+
+% at the moment using `shaded' and code chunks in same box results in the last text paragraph being discarded.
+% this is most likely because knitr uses the same environment and we then end with a shaded environment
+% nested within another one.
+
+\newenvironment{leftbarc}[1][black]{%
+ \def\FrameCommand{\removelastskip\hspace{-6pt}{\color{#1}\vrule width 2pt} \hspace{3pt}}%
+ \MakeFramed {\advance\hsize-\width \FrameRestore}}%
+ {\removelastskip\endMakeFramed\removelastskip}
+
+\OuterFrameSep=\parskip
+
+% playgrounds and advanced playgrounds are numbered sharing a counter reset by chapter
+\newlength{\iconsep}
+\setlength{\iconsep}{.5em}
+\newcounter{playground}[chapter]
+\renewcommand{\theplayground}{\arabic{chapter}.\arabic{playground}}
+\newenvironment{playground}[1]{\begin{leftbarc}[violet]\addtocounter{playground}{1}\playicon\ \textbf{\color{violet}\theplayground}\hspace{\iconsep}#1}{\end{leftbarc}}
+
+\newenvironment{advplayground}[1]{\begin{leftbarc}[purple]\addtocounter{playground}{1}\advplayicon\ \textbf{\color{purple}\theplayground}\hspace{\iconsep}#1}{\end{leftbarc}}
+
+\newenvironment{warningbox}[1]{\begin{leftbarc}[orange]\ilAttention\hspace{\iconsep}#1}{\end{leftbarc}}
+
+\newenvironment{explainbox}[1]{\begin{leftbarc}[brown]\ilAdvanced\hspace{\iconsep}#1}{\end{leftbarc}}
+
+%\newenvironment{infobox}[1]{\begin{leftbarc}[blue]\infoicon\hspace{\iconsep}#1}{\end{leftbarc}}
+
+\newenvironment{faqbox}[2]{\begin{leftbarc}[darkgray]\faqicon\hspace{\iconsep}\textbf{#1}\indexfaq{#1}\\#2}{\end{leftbarc}}
+\newenvironment{faqboxNI}[2]{\begin{leftbarc}[darkgray]\faqicon\hspace{\iconsep}\textbf{#1}\\#2}{\end{leftbarc}}
+
+\newcommand{\citebooktitle}[1]{\citetitle{#1}}
+%\newcommand{\citebooktitle}[1]{\emph{\citetitle{#1}}}
+
+\definecolor{warningcolor}{rgb}{0.5, 0.4, 0}
+
+% this is to reduce spacing above verbatim, which is used by knitr
+% to show R's printed output
+% and above and below call outs
+\usepackage{etoolbox}
+\makeatletter
+\preto{\@verbatim}{\topsep=-4pt \partopsep=-4pt}
+\preto{\alltt}{\removelastskip}
+\makeatother
diff --git a/usingr.sty b/usingr.sty
index f63b8e35..a6eaa8cb 100644
--- a/usingr.sty
+++ b/usingr.sty
@@ -1,4 +1,3 @@
-<<<<<<< Updated upstream
 \NeedsTeXFormat{LaTeX2e}
 \ProvidesPackage{usingr}[2022/10/20]

@@ -194,200 +193,3 @@
 \preto{\@verbatim}{\topsep=-4pt \partopsep=-4pt}
 \preto{\alltt}{\removelastskip}
 \makeatother
-=======
-\NeedsTeXFormat{LaTeX2e}
-\ProvidesPackage{usingr}[2022/10/20]
-
-\RequirePackage{booktabs}
-
-\RequirePackage{xspace}
-
-\RequirePackage{xcolor}
-
-%\usepackage{amsmath,amssymb,amsthm}
-\RequirePackage{unicode-math}
-
-\RequirePackage{fontspec}
-\setmainfont{Lucida Bright OT}
-\setsansfont{Lucida Sans OT}
-\setmonofont{Lucida Console DK}[Scale=1.05]
-\setmathfont{Lucida Bright Math OT}
-\linespread{1.1} % increase line spacing as we use in-line math and code
-
-%\newcommand\tetxtilde{\char"007E}
-% We set up some symbol fonts
-%\newfontfamily\wingdingsfont{Wingdings}
-%\newcommand\wingdings[1]{{\wingdingsfont\symbol{#1}}}
-%
-%\newfontfamily\wingdingsfontdos{Wingdings2}
-%\newcommand\wingdingsdos[1]{{\wingdingsfontdos\symbol{#1}}}
-%
-%\newfontfamily\wingdingsfonttres{Wingdings3}
-%\newcommand\wingdingstres[1]{{\wingdingsfonttres\symbol{#1}}}
-
-%\newfontfamily\meteoconsfont{Meteocons}
-%\newcommand\meteocons[1]{{\meteoconsfont\symbol{#1}}}
-%\newcommand\meteosun{\metecons{"0042}}
-%\newcommand\meteosolidsun{\metecons{"0031}}
-
-\newfontfamily\typiconsfont{Typicons}[Scale=MatchUppercase]
-\newcommand\typicons[1]{{\typiconsfont\symbol{#1}}}
-\newcommand\typiadvn{\typicons{"E137}}
-\newcommand\typiattn{\typicons{"E04E}}
-\newcommand\Attention[1]{\marginpar{\centering\colorbox{orange}{\Large\textcolor{white}{\typicons{"E137}}}}\index{#1}}
-\newcommand\ilAttention{\noindent\colorbox{orange}{\Large\textcolor{white}{\typicons{"E137}}}\xspace}
-\newcommand\Advanced[1]{\marginpar{\centering\colorbox{brown}{\Large\textcolor{white}{\typicons{"E04E}}}}\index{#1}}
-\newcommand\ilAdvanced{\noindent\colorbox{brown}{\Large\textcolor{white}{\typicons{"E04E}}}\xspace}
-\newcommand\noticestd[1]{{\noticestdfont\symbol{#1}}}
-\newcommand\playicon{\noindent{\Large\colorbox{violet}{\textcolor{white}{\typicons{"E098}}}}\xspace}
-\newcommand\advplayicon{\noindent{\colorbox{purple}{\textcolor{white}{\Large\typicons{"E098}\typicons{"E04E}}}}\xspace}
-
-%\newfontfamily\noticestdfont{Notice2Std}
-%\newcommand\noticestd[1]{{\noticestdfont\symbol{#1}}}
-%\newcommand\playicon{\noindent{\large\colorbox{violet}{\textcolor{white}{\noticestd{"0055}}}}\xspace}
-%\newcommand\advplayicon{\noindent{\colorbox{purple}{\textcolor{white}{{\large\noticestd{"0055}}{\Large\typicons{"E04E}}}}}\xspace}
-
-\newfontfamily\modpictsfont{ModernPictograms}[Scale=MatchUppercase]
-\newcommand\modpicts[1]{{\modpictsfont\symbol{#1}}}
-\newcommand\infoicon{\noindent{\Large\colorbox{blue}{\textcolor{white}{\modpicts{"003D}}}}\xspace}
-\newcommand\faqicon{\noindent{\Large\colorbox{darkgray}{\textcolor{white}{\modpicts{"003F}}}}\xspace}
-
-\newcommand{\langname}[1]{\textsf{#1}\index{#1@\textsf{#1}}\index{languages!#1@\textsf{#1}}\xspace}
-\newcommand{\langnameNI}[1]{\textsf{#1}\xspace}
-\newcommand{\pgrmname}[1]{\textsf{#1}\index{#1@\textsf{#1}}\index{programmes!#1@\textsf{#1}}\xspace}
-\newcommand{\pgrmnameNI}[1]{\textsf{#1}\xspace}
-\newcommand{\pgrmnameTwo}[2]{\textsf{#1}\index{#2@\textsf{#1}}\index{programmes!#2@\textsf{#1}}\xspace}
-\newcommand{\pkgname}[1]{`\textsf{#1}'\index{#1@\textsf{`#1'}}\index{packages!#1@\textsf{`#1'}}\xspace}
-\newcommand{\pkgnameNI}[1]{`\textsf{#1}'\xspace}
-\newcommand{\osname}[1]{\textsf{#1}\index{#1@\textsf{#1}}\index{operating systems!#1@\textsf{#1}}\xspace}
-\newcommand{\osnameNI}[1]{\textsf{#1}\xspace}
-
-\newcommand{\ggplot}{\pkgname{ggplot2}}
-\newcommand{\ggspectra}{\pkgname{ggspectra}}
-\newcommand{\ggmap}{\pkgname{ggmap}}
-\newcommand{\ggtern}{\pkgname{ggtern}}
-\newcommand{\ggrepel}{\pkgname{ggrepel}}
-\newcommand{\ggsignif}{\pkgname{ggsignif}}
-\newcommand{\ggpmisc}{\pkgname{ggpmisc}}
-\newcommand{\ggpp}{\pkgname{ggpp}}
-\newcommand{\cowplot}{\pkgname{cowplot}}
-\newcommand{\scales}{\pkgname{scales}}
-\newcommand{\plyr}{\pkgname{plyr}}
-\newcommand{\dplyr}{\pkgname{dplyr}}
-\newcommand{\tydyr}{\pkgname{tidyr}}
-\newcommand{\readr}{\pkgname{readr}}
-\newcommand{\xts}{\pkgname{xts}}
-\newcommand{\Hmisc}{\pkgname{Hmisc}}
-\newcommand{\viridis}{\pkgname{viridis}}
-
-\newcommand{\R}{\textsf{R}}
-%\newcommand{\Rpgrm}{\pgrmname{R}}
-\newcommand{\Rpgrm}{\textsf{R}\xspace} % does not create index entry
-\newcommand{\RStudio}{\pgrmname{RStudio}}
-\newcommand{\git}{\pgrmname{git}}
-\newcommand{\Quarto}{\pgrmname{Quarto}}
-
-\newcommand{\Rlang}{\textsf{R}\xspace} % does not create index entry
-\newcommand{\Slang}{\langname{S}}
-\newcommand{\Splang}{\langname{S-Plus}}
-\newcommand{\Clang}{\langname{C}}
-\newcommand{\Cpplang}{\langname{C++}}
-\newcommand{\javalang}{\langname{Java}}
-\newcommand{\pythonlang}{\langname{Python}}
-\newcommand{\pascallang}{\langname{Pascal}}
-\newcommand{\perllang}{\langname{Perl}}
-\newcommand{\Markdown}{\langname{Markdown}}
-\newcommand{\Rmarkdown}{\langname{R markdown}}
-\newcommand{\Latex}{{\LaTeX\xspace}\index{Latex@{\LaTeX}}\index{languages!Latex@{\LaTeX}}}
-
-\newcommand{\stackoverflow}{\textsf{StackOverflow}\index{Stackoverflow@\textsf{StackOverflow}}\xspace}
-\newcommand{\CRAN}{\textsf{CRAN}\index{CRAN@\textsf{CRAN}}\xspace}
-\newcommand{\GitHub}{\textsf{GitHub}\index{GitHub@\textsf{GitHub}}\xspace}
-\newcommand{\Bitbucket}{\textsf{Bitbucket}\index{Bitbucket@\textsf{Bitbucket}}\xspace}
-\newcommand{\RForge}{\textsf{R-Forge}\index{R-Forge@\textsf{R-Forge}}\xspace}
-
-% index entry
-\newcommand{\indexfaq}[1]{\index[faqindex]{#1}}
-
-% text and index entry
-\newcommand{\Rclass}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{classes and modes!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\Rfunction}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\Rmethod}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\Roperator}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{operators!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\Rloop}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{control of execution!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\Rcontrol}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{control of execution!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\Rconst}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{constant and special values!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\Rscoping}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{names and their scope!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\Rdata}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{data objects!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-
-% only index entry "quiet version" to create index entries for code chunks
-\newcommand{\qRclass}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{classes and modes!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\qRfunction}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\qRmethod}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\qRoperator}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{operators!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\qRloop}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{control of execution!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\qRcontrol}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{control of execution!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\qRconst}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{constant and special values!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\qRscoping}[1]{\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{names and their scope!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-
-\newcommand{\gggeom}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\ggposition}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\ggstat}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\ggtheme}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\ggscale}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-\newcommand{\ggcoordinate}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}
-
-%\newcommand{\gggeom}[1]{\code{#1}\index{plots!geometries!#1@\texttt{#1}}\xspace}
-%\newcommand{\ggstat}[1]{\code{#1}\index{plots!statistics!#1@\texttt{#1}}\xspace}
-%\newcommand{\ggtheme}[1]{\code{#1}\index{plots!themes!#1@\texttt{#1}}\xspace}
-%\newcommand{\ggscale}[1]{\code{#1}\index{plots!scales!#1@\texttt{#1}}\xspace}
-%\newcommand{\ggcoordinate}[1]{\code{#1}\index{plots!coordinates!#1@\texttt{#1}}\xspace}
-
-% R commands embedded in main text
-%\definecolor{codeshadecolor}{rgb}{0.969, 0.969, 0.900} % same fill as used by knitr
-\definecolor{codeshadecolor}{rgb}{0.984, 0.984, 0.984} % same fill as now used in chunks
-\newcommand{\code}[1]{\texttt{\addfontfeature{Scale = 0.89}{\setlength{\fboxsep}{0.05pt}\colorbox{codeshadecolor}{#1\vphantom{tp}}}}}
-
-% at the moment using `shaded' and code chunks in same box results in the last text paragraph being discarded.
-% this is most likely because knitr uses the same environment and we then end with a shaded environment
-% nested within another one.
-
-\newenvironment{leftbarc}[1][black]{%
- \def\FrameCommand{\removelastskip\hspace{-6pt}{\color{#1}\vrule width 2pt} \hspace{3pt}}%
- \MakeFramed {\advance\hsize-\width \FrameRestore}}%
- {\removelastskip\endMakeFramed\removelastskip}
-
-\OuterFrameSep=\parskip
-
-% playgrounds and advanced playgrounds are numbered sharing a counter reset by chapter
-\newlength{\iconsep}
-\setlength{\iconsep}{.5em}
-\newcounter{playground}[chapter]
-\renewcommand{\theplayground}{\arabic{chapter}.\arabic{playground}}
-\newenvironment{playground}[1]{\begin{leftbarc}[violet]\addtocounter{playground}{1}\playicon\ \textbf{\color{violet}\theplayground}\hspace{\iconsep}#1}{\end{leftbarc}}
-
-\newenvironment{advplayground}[1]{\begin{leftbarc}[purple]\addtocounter{playground}{1}\advplayicon\ \textbf{\color{purple}\theplayground}\hspace{\iconsep}#1}{\end{leftbarc}}
-
-\newenvironment{warningbox}[1]{\begin{leftbarc}[orange]\ilAttention\hspace{\iconsep}#1}{\end{leftbarc}}
-
-\newenvironment{explainbox}[1]{\begin{leftbarc}[brown]\ilAdvanced\hspace{\iconsep}#1}{\end{leftbarc}}
-
-%\newenvironment{infobox}[1]{\begin{leftbarc}[blue]\infoicon\hspace{\iconsep}#1}{\end{leftbarc}}
-
-\newenvironment{faqbox}[2]{\begin{leftbarc}[darkgray]\faqicon\hspace{\iconsep}\textbf{#1}\indexfaq{#1}\\#2}{\end{leftbarc}}
-\newenvironment{faqboxNI}[2]{\begin{leftbarc}[darkgray]\faqicon\hspace{\iconsep}\textbf{#1}\\#2}{\end{leftbarc}}
-
-\newcommand{\citebooktitle}[1]{\citetitle{#1}}
-%\newcommand{\citebooktitle}[1]{\emph{\citetitle{#1}}}
-
-\definecolor{warningcolor}{rgb}{0.5, 0.4, 0}
-
-% this is to reduce spacing above verbatim, which is used by knitr
-% to show R's printed output
-% and above and below call outs
-\usepackage{etoolbox}
-\makeatletter
-\preto{\@verbatim}{\topsep=-4pt \partopsep=-4pt}
-\preto{\alltt}{\removelastskip}
-\makeatother
->>>>>>> Stashed changes
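The usingr.sty retained above (duplicated verbatim in the new working copy usingr-2.sty) defines the semantic markup used throughout the book sources: \code typesets inline R code on a shaded background, the \Rfunction/\Rclass family additionally writes entries to the named indexes, and playground, advplayground, warningbox, explainbox and faqbox draw the colour-ruled call-out boxes via leftbarc. The following is a minimal usage sketch only, not the book's actual preamble (which is not part of this patch): it assumes the commercial Lucida OT and the Typicons/ModernPictograms icon fonts are installed, loads framed explicitly (in the real build knitr's output supplies \MakeFramed and \OuterFrameSep), and declares the named indexes with imakeidx, which is just one possible setup. Compile with XeLaTeX or LuaLaTeX, since the package requires fontspec and unicode-math.

\documentclass{book}
\usepackage{framed}   % supplies \MakeFramed, \FrameRestore and \OuterFrameSep, used by leftbarc
\usepackage{imakeidx} % one way to provide the named indexes that \Rfunction etc. write to
\makeindex[name=rindex]
\makeindex[name=rcatsidx]
\makeindex[name=cloudindex]
\makeindex[name=faqindex]
\usepackage{usingr}

\begin{document}
\chapter{Demo}

A list is created with \Rfunction{list()}; the call is typeset as inline
code, \code{list(x = 1:3)}, and indexed under ``functions and methods''.

\begin{playground}{Replace \code{list()} by \code{c()} and compare the results.}
This box carries the number 1.1: the counter is shared with advanced
playgrounds and is reset at the start of each chapter.
\end{playground}

\begin{warningbox}{An orange-ruled warning flagged with a Typicons icon.}
\end{warningbox}

\printindex[rindex]
\end{document}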