
Commit

Revise Introduction and Preface. Complete the tagging of R functions, and operators and start the tagging of R classes and modes.
aphalo committed Apr 10, 2017
1 parent 85b0241 commit 1bf3003
Showing 34 changed files with 283,171 additions and 274,365 deletions.
64 changes: 32 additions & 32 deletions R.as.calculator.Rnw

Large diffs are not rendered by default.

18 changes: 9 additions & 9 deletions R.data.Rnw
@@ -554,7 +554,7 @@ thiamin.df <- read.spss(file = "data/thiamin.sav", to.data.frame = TRUE)
head(thiamin.df)
@

Another example, for a Systat file saved on an PC more than 20 years ago.
Another example, for a Systat file saved on an PC more than 20 years ago, and rea.

<<foreign-03>>=
my_systat.df <- read.systat(file = "data/BIRCH1.SYS")
@@ -565,7 +565,7 @@ The functions in \pkgname{foreign} can return data frames, but not always this i

\subsubsection[haven]{\pkgname{haven}}

The recently released package \pkgname{haven} is less ambitious in scope, providing read and write functions for only three file formats: \pgrmname{SAS}, \pgrmname{Stata} and \pgrmname{SPSS}. On the other hand \pkgname{haven} provides flexible ways to convert the different labelled values that cannot be directly mapped to normal R modes. They also decode dates and times according to the idiosyncrasies of each of these file formats. The returned \code{tibble} objects in cases when the imported file contained labelled values needs some further work from the user before obtaining `normal' data-frame-compatible \code{tibble} objects.
The recently released package \pkgname{haven} is less ambitious in scope, providing read and write functions for only three file formats: \pgrmname{SAS}, \pgrmname{Stata} and \pgrmname{SPSS}. On the other hand \pkgname{haven} provides flexible ways to convert the different labelled values that cannot be directly mapped to normal R modes. They also decode dates and times according to the idiosyncrasies of each of these file formats. The returned \Rclass{tibble} objects in cases when the imported file contained labelled values needs some further work from the user before obtaining `normal' data-frame-compatible \Rclass{tibble} objects.

We here use function \Rfunction{read\_sav()} to import here a \code{.sav} file saved by a recent version of \pgrmname{SPSS}.

@@ -633,7 +633,7 @@ head(latitude)

The \code{time} vector is rather odd, as it contains only month data as these are long-term averages. From the metadata we can infer that they correspond to the months of the year, and we directly generate these, instead of attempting a conversion.

We construct a \code{tibble} object with PET values for one grid point, we can take advantage of \emph{recycling} or short vectors.
We construct a \Rclass{tibble} object with PET values for one grid point, we can take advantage of \emph{recycling} or short vectors.

<<ncdf4-03>>=
pet.tb <-
@@ -645,7 +645,7 @@ pet.tb <-
pet.tb
@
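
As a hedged illustration of the recycling of short vectors mentioned above, the sketch below uses base R's `data.frame()` rather than the tibble constructor. The object name `pet_demo.df`, the coordinates and the PET values are all made up for the example and are not read from the NetCDF file. The length-one longitude and latitude are repeated to match the 12 months.

```r
# Hypothetical values for illustration only; nothing here comes from a NetCDF file.
months <- month.abb                  # 12 month abbreviations built into R
pet_values <- seq(10, 120, by = 10)  # made-up PET values, one per month

# The length-one 'lon' and 'lat' values are recycled to fill all 12 rows.
pet_demo.df <- data.frame(month = months,
                          lon = 24.75,
                          lat = 60.25,
                          pet = pet_values)
nrow(pet_demo.df)
```

Recycling a single value this way avoids building the repeated coordinate vectors by hand with `rep()`.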

If we want to read in several grid points, we can use several different approaches. In this example we take all latitudes along one longitude. Here we avoid using loops altogether when creating a \emph{tidy} \code{tibble} object. However, because of how the data is stored, we needed to transpose the intermediate array before conversion into a vector.
If we want to read in several grid points, we can use several different approaches. In this example we take all latitudes along one longitude. Here we avoid using loops altogether when creating a \emph{tidy} \Rclass{tibble} object. However, because of how the data is stored, we needed to transpose the intermediate array before conversion into a vector.

<<ncdf4-04>>=
pet2.tb <-
@@ -692,7 +692,7 @@ latitude <- var.get.nc(meteo_data.nc, "lat")
head(latitude)
@

We construct a \code{tibble} object with values for midday UV Index for 26 days. For convenience, we convert the strings into R's datetime objects.
We construct a \Rclass{tibble} object with values for midday UV Index for 26 days. For convenience, we convert the strings into R's datetime objects.

<<netcdf-03>>=
uvi.tb <-
@@ -1005,10 +1005,10 @@ Hadley Wickham, together with collaborators, has developed a set of R tools for

\subsection{Better data frames}

Package \pkgname{tibble} defines an improved class \code{tibble} that can be used in place of data frames. Changes are several, including differences in default behaviour of both constructors and methods. Objects of class \code{tibble} can non-the-less be used as arguments for most functions that expect data frames as input.
Package \pkgname{tibble} defines an improved class \Rclass{tibble} that can be used in place of data frames. Changes are several, including differences in default behaviour of both constructors and methods. Objects of class \Rclass{tibble} can non-the-less be used as arguments for most functions that expect data frames as input.

\begin{infobox}
In their first incarnation, the name for \code{tibble} was \code{data\_frame} (with a dash instead of a dot). The old name is still recognized, but it is better to only use \Rfunction{tibble()} to avoid confusion. One should be aware that although the constructor \Rfunction{tibble()} and conversion function \Rfunction{as.tibble()}, as well as the test \Rfunction{is.tibble()} use the name \code{tibble}, the class attribute is named \code{tbl}.
In their first incarnation, the name for \Rclass{tibble} was \code{data\_frame} (with a dash instead of a dot). The old name is still recognized, but it is better to only use \Rfunction{tibble()} to avoid confusion. One should be aware that although the constructor \Rfunction{tibble()} and conversion function \Rfunction{as.tibble()}, as well as the test \Rfunction{is.tibble()} use the name \Rclass{tibble}, the class attribute is named \code{tbl}.

<<tibble-info-01>>=
my.tb <- tibble(numbers = 1:3)
@@ -1042,7 +1042,7 @@ is.tibble(my.df)
show_classes(my.df)
@

Tibbles are data frames---or more formally class \code{tibble} is derived from class \code{data.frame}. However, data frames are not tibbles.
Tibbles are data frames---or more formally class \Rclass{tibble} is derived from class \code{data.frame}. However, data frames are not tibbles.

<<tibble-03>>=
my.tb <- tibble(codes = c("A", "B", "C"), numbers = 1:3, integers = 1L:3L)
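
The one-way relationship between the two classes can be sketched with base R alone. This is an illustration under an assumption: instead of calling the constructor from package tibble, the hypothetical object `fake.tb` is given by hand the class vector that tibbles carry, so the example runs without the package installed.

```r
# Base R only: inherits() tests class membership through the class attribute.
my.df <- data.frame(numbers = 1:3)

# Hypothetical stand-in for a tibble: same data, with the class vector
# c("tbl_df", "tbl", "data.frame") assigned by hand for illustration.
fake.tb <- structure(my.df, class = c("tbl_df", "tbl", "data.frame"))

inherits(fake.tb, "data.frame")  # the tibble-like object is also a data frame
inherits(my.df, "tbl")           # a plain data frame is not a tibble
```

Because `"data.frame"` is the last element of the class vector, methods written for data frames remain available as a fallback when no tibble-specific method exists.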
@@ -1211,7 +1211,7 @@ Function \Rfunction{rename()} to rename columns---requires the use of \Rfunction
rename(long_iris, dim = dimension)
@

The first advantage a user sees of these functions is the completeness of the set of operations supported and the symmetry and consistency among the different functions. A second advantage is that almost all the functions are defined not only for objects of class \code{tibble}, but also for objects of class \code{data.table} and for accessing SQL based databases with the same syntax. The functions are also optimized for fast performance.
The first advantage a user sees of these functions is the completeness of the set of operations supported and the symmetry and consistency among the different functions. A second advantage is that almost all the functions are defined not only for objects of class \Rclass{tibble}, but also for objects of class \code{data.table} and for accessing SQL based databases with the same syntax. The functions are also optimized for fast performance.

\subsection{Group-wise manipulations}

2 changes: 1 addition & 1 deletion R.friends.Rnw
@@ -136,7 +136,7 @@ class(a)
str(a)
@

Then we use base R's function \code{lapply()} to apply a user-defined R function to the elements of the Java array, obtaining as returned value an R array.
Then we use base R's function \Rfunction{lapply()} to apply a user-defined R function to the elements of the Java array, obtaining as returned value an R array.

<<rjava-03>>=
b <- sapply(a,
20 changes: 10 additions & 10 deletions R.functions.Rnw
@@ -15,7 +15,7 @@ The aim of this chapter is to introduce some of the frequently used function ava

\section{Loading data}
\index{data!loading data sets}
To start with, we need some data to run the examples. Here we use \code{cars}, a data set included in base R. How to read or import ``foreign'' data is discussed in R's documentation in \emph{R Data Import/Export}, and in this book, in Chapter \ref{chap:R:data} starting on page \pageref{chap:R:data}. In general \code{data()} is used load R objects saved in a file format used by R. Text files con be read with functions \code{scan()}, \code{read.table()}, \code{read.csv()} and their variants. It is also possible to `import' data saved in files of \textit{foreign} formats, defined by other programs. Packages such as 'foreign', 'readr', 'readxl', 'RNetCDF', 'jsonlite', etc.\ allow importing data from other statistic and data analysis applications and from standard data exchange formats. It is also good to keep in mind that in R urls are accepted as arguments to the \code{file} argument (see Chapter \ref{chap:R:data} starting on page \pageref{chap:R:data} for details and examples on how to import data from different ``foreign'' formats and sources).
To start with, we need some data to run the examples. Here we use \code{cars}, a data set included in base R. How to read or import ``foreign'' data is discussed in R's documentation in \emph{R Data Import/Export}, and in this book, in Chapter \ref{chap:R:data} starting on page \pageref{chap:R:data}. In general \Rfunction{data()} is used load R objects saved in a file format used by R. Text files con be read with functions \Rfunction{scan()}, \Rfunction{read.table()}, \Rfunction{read.csv()} and their variants. It is also possible to `import' data saved in files of \textit{foreign} formats, defined by other programs. Packages such as 'foreign', 'readr', 'readxl', 'RNetCDF', 'jsonlite', etc.\ allow importing data from other statistic and data analysis applications and from standard data exchange formats. It is also good to keep in mind that in R urls are accepted as arguments to the \code{file} argument (see Chapter \ref{chap:R:data} starting on page \pageref{chap:R:data} for details and examples on how to import data from different ``foreign'' formats and sources).

In the examples of the present chapter we use data included in R, as R objects, which can be loaded with function \code{data}. \code{cars} is a data frame.

@@ -25,7 +25,7 @@ data(cars)

\section{Looking at data}
\index{data!exploring at the console}
There are several functions in \langname{R} that let us obtain different `views' into objects. Function \code{print()} is useful for small data sets, or objects. Especially in the case of large data frames, we need to explore them step by step. In the case of named components, we can obtain their names, with \code{names()}. If a data frame contains many rows of observations, \code{head()} and \code{tail()} allow us to easily restrict the number of rows printed. Functions \code{nrow()} and \code{ncol()} return the number of rows and columns in the data frame (but are not applicable to lists). As earlier mentioned, \code{str()}, outputs is abbreviated but in a way that preserves the structure of the object.
There are several functions in \langname{R} that let us obtain different `views' into objects. Function \Rfunction{print()} is useful for small data sets, or objects. Especially in the case of large data frames, we need to explore them step by step. In the case of named components, we can obtain their names, with \Rfunction{names()}. If a data frame contains many rows of observations, \Rfunction{head()} and \Rfunction{tail()} allow us to easily restrict the number of rows printed. Functions \Rfunction{nrow()} and \Rfunction{ncol()} return the number of rows and columns in the data frame (but are not applicable to lists). As earlier mentioned, \Rfunction{str()}, outputs is abbreviated but in a way that preserves the structure of the object.
<<exploring-dfs-1>>=
class(cars)
nrow(cars)
@@ -37,7 +37,7 @@ str(cars)
@

\begin{playground}
Look up the help pages for \code{head()} and \code{tail()}, and edit the code above to print only the first line, or only the last line of \code{cars}, respectively. As a second exercise print the 25 topmost rows of \code{cars}.
Look up the help pages for \Rfunction{head()} and \Rfunction{tail()}, and edit the code above to print only the first line, or only the last line of \code{cars}, respectively. As a second exercise print the 25 topmost rows of \code{cars}.
\end{playground}

Data frames consist of columns of equal length (see Chapter \ref{chap:R:as:calc}, section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames} for details). The different columns of a data frame can contain data of different modes (e.g.\ numeric, factor and/or character).
@@ -54,22 +54,22 @@ The statement above returns a vector of character strings, with the mode of each
Data set \code{airquality} contains data from air quality measurements in New York, and, being included in the \Rpgrm distribution, can be loaded with \code{data(airquality)}. Load it, and repeat the steps above, to learn what variables are included, their modes, the number of rows, etc.
\end{playground}

There is in \Rpgrm a function called \code{summary()}, which can be used to obtain a suitable summary from objects of most classes. We can also use \code{sapply()} or \code{lapply()} to apply any suitable function to individual columns.
There is in \Rpgrm a function called \Rfunction{summary()}, which can be used to obtain a suitable summary from objects of most classes. We can also use \Rfunction{sapply()} or \Rfunction{lapply()} to apply any suitable function to individual columns. See section \ref{sec:data:apply} on page \pageref{sec:data:apply} for details about R's \emph{apply} functions.
<<exploring-dfs-3>>=
summary(cars)
sapply(cars, range)
@
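
The difference between the two \emph{apply} functions named above lies in the shape of the returned value, which the following sketch makes visible. The object names `ranges.mat`, `ranges.ls` and `ranges.chk` are chosen here for illustration, and `vapply()` is included as an addition that the text above does not mention.

```r
data(cars)

# sapply() simplifies when it can: two values per column give a 2 x 2 matrix.
ranges.mat <- sapply(cars, range)
# lapply() never simplifies: the result is a list with one element per column.
ranges.ls <- lapply(cars, range)

is.matrix(ranges.mat)
is.list(ranges.ls)

# vapply() is a stricter variant: the expected type and length of each
# result are declared, and a mismatch triggers an error.
ranges.chk <- vapply(cars, range, FUN.VALUE = numeric(2))
dim(ranges.chk)
```

In scripts, `lapply()` or `vapply()` are often the safer choice because the type of the returned object does not depend on the data.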

\begin{playground}
Obtain the summary of \code{airquality} with function \code{summary}, but in addition, write code with an \emph{apply} function to count the number of non-missing values in each column.
Obtain the summary of \code{airquality} with function \Rfunction{summary()}, but in addition, write code with an \emph{apply} function to count the number of non-missing values in each column.
\end{playground}

\section{Plotting}
\index{plots!base R graphics}
The base \langname{R}'s generic function \code{plot()} can be used to plot different data. It is a generic function that has suitable methods for different kinds of objects (see section \ref{sec:script:objects:classes:methods} on page \pageref{sec:script:objects:classes:methods} for a brief introduction to objects, classes and methods). In this section we only very briefly demonstrate the use of the most common base \langname{R}'s graphics functions. They are well described in the book \citetitle{Murrell2011} \autocite{Murrell2011}. We will not describe either the Trellis and Lattice approach to plotting \autocite{Sarkar2008}. We describe in detail the use of the grammar of graphics and plotting with package \ggplot in Chapter \ref{chap:R:plotting} from page \pageref{chap:R:plotting} onwards.

<<plot-2>>=
plot(dist ~ speed, data=cars)
plot(dist ~ speed, data = cars)
@

\section{Fitting linear models}
@@ -80,7 +80,7 @@ One important thing to remember is that model `formulas' are used in different c

\subsection{Regression}
\index{linear regression}
The R function \code{lm} is used next to fit linear models. If the explanatory variable is continuous, the fit is a regression. In the example below, \code{speed} is a numeric variable (floating point in this case). In the ANOVA table calculated for the model fit, in this case a linear regression, we can see that the term for \code{speed} has only one degree of freedom (df) for the denominator.
The R function \Rfunction{lm()} is used next to fit linear models. If the explanatory variable is continuous, the fit is a regression. In the example below, \code{speed} is a numeric variable (floating point in this case). In the ANOVA table calculated for the model fit, in this case a linear regression, we can see that the term for \code{speed} has only one degree of freedom (df) for the denominator.

We first fit the model and save the output as \code{fm1} (A name I invented to remind myself that this is the first fitted-model in this chapter.%
\label{xmpl:fun:lm:fm1}
@@ -95,7 +95,7 @@ The next step is diagnosis of the fit. Are assumptions of the linear model proce
plot(fm1, which = 2)
@

In the case of a regression, calling \code{summary()} with the fitted model object as argument is most useful as it provides a table of coefficient estimates and their errors. \code{anova()} applied to the same fitted object, returns the ANOVA table.
In the case of a regression, calling \Rfunction{summary()} with the fitted model object as argument is most useful as it provides a table of coefficient estimates and their errors. \Rfunction{anova()} applied to the same fitted object, returns the ANOVA table.

<<models-1b>>=
summary(fm1) # we inspect the results from the fit
@@ -114,7 +114,7 @@ anova(fm2)
We now fit a second-degree polynomial.

<<models-3>>=
fm3 <- lm(dist ~ speed + I(speed^2), data=cars) # we fit a model, and then save the result
fm3 <- lm(dist ~ speed + I(speed^2), data = cars) # we fit a model, and then save the result
plot(fm3, which = 3) # we produce diagnosis plots
summary(fm3) # we inspect the results from the fit
anova(fm3) # we calculate an ANOVA
@@ -184,7 +184,7 @@ When a linear model includes both explanatory factors and continuous explanatory
\index{generalized linear models}
\index{GLM|see{generalized linear models}}

Linear models make the assumption of normally distributed residuals. Generalized linear models are more flexible, and allow the assumed distribution to be selected as well as the link function.
Linear models make the assumption of normally distributed residuals. Generalized linear models, fitted with function \Rfunction{glm()} are more flexible, and allow the assumed distribution to be selected as well as the link function.
For the analysis of the \code{InsectSpray} data set, above (section \ref{sec:anova} on page \pageref{sec:anova}) the Normal distribution is not a good approximation as count data deviates from it. This was visible in the quantile--quantile plot above.

For count data GLMs provide a better alternative. In the example below we fit the same model as above, but we assume a quasi-Poisson distribution instead of the Normal.
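
A minimal sketch of such a fit follows, under stated assumptions: the base R data set is named `InsectSprays`, with columns `count` and `spray`, and the object name `fm.glm` is invented here. The diff does not show the author's own chunk, so this is not necessarily the code used in the book.

```r
data(InsectSprays)

# Same model formula as for a one-way ANOVA, but with a quasi-Poisson
# family instead of the Normal distribution assumed by lm().
fm.glm <- glm(count ~ spray, data = InsectSprays, family = quasipoisson)

anova(fm.glm, test = "F")    # F-test, as the dispersion is estimated
summary(fm.glm)$dispersion   # estimated dispersion parameter
```

The default link function for the quasi-Poisson family is the logarithm, so the coefficients are on the log scale of the expected counts.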