Expand pipes, subset() and revise model fitting
aphalo committed Nov 19, 2022
1 parent 19baa25 commit b15e498
Showing 146 changed files with 45,874 additions and 39,179 deletions.
39 changes: 31 additions & 8 deletions R.as.calculator.Rnw
@@ -1961,24 +1961,46 @@ subset(a.df, x > 3)
What is the behavior of \code{subset()} when the condition is \code{NA}? Find the answer by writing code to test this, for a case where tests for different rows return \code{NA}, \code{TRUE} and \code{FALSE}.
\end{playground}

When calling functions that return a vector, data frame, or other structure, the square brackets can be appended to the rightmost parenthesis of the function call, in the same way as to the name of a variable holding the same data.
When calling functions that return a vector, data frame, or other structure, the extraction operators \Roperator{[ ]}, \Roperator{[[ ]]}, or \Roperator{\$} can be appended to the rightmost parenthesis of the function call, in the same way as to the name of a variable holding the same data.

<<data-frames-5>>=
subset(a.df, x > 3)[ , -3]
subset(a.df, x > 3)$x
subset(a.df, x > 3)[ , "x", drop = FALSE]
subset(a.df, x > 3)[ , "x"]
@

None of the examples in the last three code chunks alter the original data frame \code{a.df}. We can store the returned value using a new name if we want to preserve \code{a.df} unchanged, or we can assign the result to \code{a.df}, deleting in the process, the previously stored value.
\begin{advplayground}
When do extraction operators applied to data frames return a vector or factor, and when do they return a data frame?
\end{advplayground}

\begin{warningbox}
In the example above, the names in the expression passed as the second argument to \code{subset()} are first searched for within \code{a.df} and, if not found, searched for in the environment. There being no variable \code{A} in the data frame \code{a.df}, vector \code{A} from the environment is silently used in the expression, resulting in a returned data frame with no rows.
\begin{explainbox}
In the case of \Rfunction{subset()} we can select columns directly as shown below, while for most other functions, extraction using operators \Roperator{[ ]}, \Roperator{[[ ]]} or \Roperator{\$} is needed.

<<data-frames-5a>>=
subset(a.df, x > 3, select = 2)
subset(a.df, x > 3, select = x)
subset(a.df, x > 3, select = "x")
@
\end{explainbox}
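
The \code{select} parameter of \Rfunction{subset()} also accepts bare names with a minus sign to drop columns, and ranges of adjacent columns. The chunk below is a minimal sketch; it assumes a small hypothetical data frame with columns \code{x}, \code{y} and \code{z} standing in for \code{a.df}, whose definition is not shown in this section.

```r
# Hypothetical stand-in for a.df (the original definition is not shown here)
a.df <- data.frame(x = 1:6, y = letters[1:6], z = (1:6)^2)

# drop one column by its bare name
no.y.df <- subset(a.df, select = -y)
colnames(no.y.df)

# keep a range of adjacent columns, given as bare names
xy.df <- subset(a.df, select = x:y)
colnames(xy.df)

# combine row selection and column selection in a single call
big.x.df <- subset(a.df, x > 3, select = c(x, z))
big.x.df
```

In every case the returned value is a data frame, and \code{a.df} itself is left unchanged.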

None of the examples in the last four code chunks alters the original data frame \code{a.df}. We can store the returned value using a new name if we want to preserve \code{a.df} unchanged, or we can assign the result to \code{a.df}, overwriting in the process the previously stored value.

\begin{warningbox}
In the examples above, the names in the expression passed as the second argument to \code{subset()} were searched for within \code{a.df} and found. However, names not found in the data frame are searched for in the environment. There being no variable \code{A} in the data frame \code{a.df}, vector \code{A} from the environment is silently used in the chunk below, resulting in a returned data frame with no rows, as \code{A > 3} returns \code{FALSE}.

<<data-frames-5b>>=
A <- 1
subset(a.df, A > 3)
@

The use of \Rfunction{subset()} is convenient, but more prone to result in bugs compared to directly using the extraction operator \code{[ ]}. This same ``cost'' to achieving convenience applies to functions like \Rfunction{attach()} and \Rfunction{with()} described below. The longer time that a script is expected to be used, adapted and reused, the more careful we should be when using any of these functions. An alternative way of avoiding excessive verbosity is to keep the names of data frames short.
This also applies to the expression passed as argument to parameter \code{select}, here shown as a way of selecting columns based on names stored in a character vector.

<<data-frames-5c>>=
columns <- c("x", "z")
subset(a.df, select = columns)
@

The use of \Rfunction{subset()} is convenient, but more prone to bugs than directly using the extraction operator \code{[ ]}. This same ``cost'' of convenience applies to functions like \Rfunction{attach()} and \Rfunction{with()} described below. The longer a script is expected to be used, adapted and reused, the more careful we should be when using any of these functions. An alternative way of avoiding excessive verbosity is to keep the names of data frames short.
\end{warningbox}
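
The difference in failure mode can be sketched as follows, again assuming a hypothetical stand-in for \code{a.df} with columns \code{x}, \code{y} and \code{z}. With \Rfunction{subset()} the misspelled name is silently resolved in the environment, while extraction of the column by name with \code{[ ]} triggers an error that is easy to notice. (Note that \code{a.df\$A} would not help here, as \code{\$} returns \code{NULL} for a missing column instead of an error.)

```r
# Hypothetical stand-in for a.df (the original definition is not shown here)
a.df <- data.frame(x = 1:6, y = letters[1:6], z = (1:6)^2)
A <- 1  # a variable in the environment, not a column of a.df

# subset() silently falls back on A from the environment
nrow(subset(a.df, A > 3))

# explicit extraction of a non-existent column signals an error instead;
# tryCatch() is used here only to capture and display the message
result <- tryCatch(a.df[a.df[ , "A"] > 3, ],
                   error = function(e) conditionMessage(e))
result
```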

A frequently used way of deleting a column by name from a data frame is to assign \code{NULL} to it---i.e., in the same way as members are deleted from \code{list}s. This approach modifies \code{a.df} in place.
@@ -1991,13 +2013,14 @@ colnames(aa.df)
@

\begin{explainbox}
Alternatively, we can use negative indexing to remove columns from a copy of a data frame. In this example we remove a single column. As base \Rlang does not support negative indexing by name, we need to find the numerical index of the column to delete.
Alternatively, we can use negative indexing to remove columns from a copy of a data frame. In this example we remove a single column. As base \Rlang does not support negative indexing by name with the extraction operator, we need to find the numerical index of the column to delete. (See the examples above using \code{subset()} with bare names to delete columns.)

<<data-frames-6a>>=
a.df[ , -which(colnames(a.df) == "y")]
@

Instead of using the equality test, we can use the operator \code{\%in\%} or function \code{grepl()} to delete multiple columns in a single statement.
Instead of using the equality test, we can use the operator \code{\%in\%} or function \code{grepl()} to create a \code{logical} vector that can be used to delete or select multiple columns in a single statement.

\end{explainbox}
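
As a minimal sketch of this approach, the chunk below builds such \code{logical} vectors. It assumes a hypothetical stand-in for \code{a.df} with columns \code{x}, \code{y} and \code{z}; the name \code{to.delete} is also introduced only for illustration.

```r
# Hypothetical stand-in for a.df (the original definition is not shown here)
a.df <- data.frame(x = 1:6, y = letters[1:6], z = (1:6)^2)

# logical vector: TRUE for the columns to delete
to.delete <- colnames(a.df) %in% c("x", "y")
# drop = FALSE keeps the result a data frame even if a single column remains
a.df[ , !to.delete, drop = FALSE]

# the same idea with a regular expression matching the names "x" or "y"
a.df[ , !grepl("^[xy]$", colnames(a.df)), drop = FALSE]

# to select instead of delete, use the logical vector without negation
a.df[ , to.delete]
```

Using \code{drop = FALSE} is a defensive choice: without it, a result with a single column would be simplified into a vector.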

\begin{playground}
59 changes: 4 additions & 55 deletions R.data.Rnw
@@ -220,58 +220,7 @@ tibble(a = 1:5, b = 5:1, c = list("a", 1:2, 0:3, letters[1:3], letters[3:1]))

\section{Data pipes}\label{sec:data:pipes}
\index{chaining statements with \emph{pipes}|(}
The first obvious difference between scripts using some of the new grammars is the frequent use of \emph{pipes}. This is, however, mostly a question of preference, as pipes can be used equally well with base \Rlang functions. Pipes have been at the core of shell scripting in \osname{Unix} since the early stages of its design \autocite{Kernigham1981}. Within an OS, pipes are chains of small programs or ``tools'' that each carry out a single well-defined task (e.g., \code{ed}, \code{sed}, \code{grep}, \code{more}, etc.). Data such as text is described as flowing from a source into a sink through a series of steps, at each of which a specific transformation takes place. In \osname{Unix}, sinks and sources are files, but files as an abstraction include all devices and connections for input or output, including physical ones such as terminals and printers. The connection between steps in the pipe is usually implemented by means of buffers held in memory by the operating system, rather than temporary files on disk.

<<pipes-x01,engine="bash",eval=FALSE>>=
# input.txt is a placeholder file name
cat input.txt | grep "abc" | more
@

How can \emph{pipes} exist within a single \Rlang script? When chaining functions into a pipe, data is passed between them through temporary \Rlang objects stored in memory, which are created and destroyed automatically. Conceptually there is little difference between Unix shell pipes and pipes in \Rlang scripts, but the implementations are different.

What do pipes achieve in \Rlang scripts? They relieve the user from the responsibility of creating and deleting the temporary objects and of enforcing the sequential execution of the different steps. Pipes usually improve readability of scripts by allowing more concise code.

Currently, two main implementations of pipes are available as \Rlang extensions, in packages \pkgnameNI{magrittr} and \pkgnameNI{wrapr}, and since version 4.1.0 \Rlang has pipes as part of the language.


\subsection{Base R}
\index{pipes!base R|(}
\index{pipe operator}
We first describe \Rlang's pipe syntax as of \Rlang 4.2.0.
We start with a toy example, first written using separate steps and traditional \Rlang syntax,

<<pipes-x02>>=
data.in <- 1:10
data.tmp <- sqrt(data.in)
data.out <- sum(data.tmp)
rm(data.tmp) # clean up!
@

next using nested function calls, still using traditional \Rlang syntax,

<<pipes-x03>>=
data.out <- sum(sqrt(data.in))
@

and finally written as a pipe using \Roperator{|>}, the chaining operator from current \Rlang.

<<pipes-x03a>>=
data.in |> sqrt() |> sum() -> data.out
@

\begin{explainbox}
The \Roperator{|>} operator from base \Rlang takes two operands. The value returned by the \emph{lhs} (left-hand side) operand, which can be any \Rlang expression, is passed by default as the first argument to the \emph{rhs} operand, which must be a function accepting at least one argument. Consequently, in using this simple syntax, the function in the \emph{rhs} must have a suitable signature for the pipe to work. However, it is possible to pass piped arguments to a function by name to any parameter, including the first one, using an underscore (\code{\_}) as placeholder.

Some base \Rlang functions like \code{subset()} have a signature that is natural for use in pipes, as the piped value is implicitly passed as argument to their first formal parameter. With other functions, like \code{assign()}, in many uses we would like to pass the piped value as argument to a parameter other than the first. In such cases we can use \code{\_} as a placeholder and pass it by name. Alternatively, we can define a wrapper function with the desired order for the formal parameters.

<<pipes-box-wrapper>>=
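value_assign <- function(value, x, ...) {
  # 'value' comes first so that the piped value can be passed implicitly.
  # Pass 'envir' through '...'; otherwise assign() writes to this function's frame.
  assign(x = x, value = value, ...)
}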
@

\end{explainbox}
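
The two ways of passing the piped value can be sketched as below. The data frame \code{xy.df} and the name \code{fit} are hypothetical, introduced only for this illustration, and the \code{\_} placeholder assumes \Rlang 4.2.0 or later.

```r
# Hypothetical data, introduced only for this illustration
xy.df <- data.frame(x = 1:6, y = (1:6) * 2 + 1)

# implicit passing: the piped value becomes the first argument of subset()
n.rows <- xy.df |> subset(x > 3) |> nrow()
n.rows

# explicit passing by name: lm() expects the data frame in parameter 'data',
# which is not its first parameter, so the placeholder _ is used
fit <- xy.df |> lm(y ~ x, data = _)
coef(fit)
```

The placeholder can only be used as a named argument, which is why \code{data = \_} is written out in full.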

\index{pipes!base R|)}
The first obvious difference between scripts using some of the new grammars is the frequent use of \emph{pipes}. This is, however, mostly a question of preference, as pipes can be used equally well with base \Rlang functions. Since version 4.1.0, \Rlang includes the pipe operator \Roperator{|>}, described in section \ref{sec:script:pipes} on page \pageref{sec:script:pipes}. Here we describe other, earlier implementations of pipes, and the differences between these and \Rlang's pipe operator.

\subsection{\pkgname{magrittr}}
\index{pipes!tidyverse|(}
@@ -325,7 +274,7 @@ data.in %>% sqrt %>% sum -> data.out
\end{explainbox}

\begin{warningbox}
In some situations the semantics of the operator \Roperator{\%>\%} from package \pkgname{magrittr} can behave unexpectedly. One example is attempting to use \Rfunction{assign()} in a pipe.
In some situations the operator \Roperator{\%>\%} from package \pkgname{magrittr} can behave unexpectedly. One example is attempting to use \Rfunction{assign()} in a pipe. This, as seen before, works as expected with \Rlang's operator \Roperator{|>}.

<<pipes-x06>>=
data.in |> assign(x = "data4.out", value = _)
@@ -378,13 +327,13 @@ In conclusion, \Rlang syntax for expressions is preserved when using the dot-pip
data.in %.>% (.^2 + sqrt(. + 1))
@

Under-the-hood, the implementations of operators \Roperator{|>} and \Roperator{\%>\%} and \Roperator{\%.>\%} are different, with \Roperator{|>} expected to have the best performance, followed by \Roperator{\%.>\%} and \Roperator{\%>\%} being slowest. As implementations evolve performance depends on versions. However, \Roperator{|>} being part of \Rlang is likely to remain the fastest.
Under the hood, the implementations of operators \Roperator{|>}, \Roperator{\%>\%} and \Roperator{\%.>\%} are different, with \Roperator{|>} expected to have the best performance, followed by \Roperator{\%.>\%}, and with \Roperator{\%>\%} being the slowest. As implementations evolve, performance may vary among versions. However, \Roperator{|>}, being part of \Rlang, is likely to remain the fastest.
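
One plausible reason for this difference is that \Roperator{|>} is handled by the parser, which rewrites the pipe into ordinary nested function calls before evaluation, while \Roperator{\%>\%} and \Roperator{\%.>\%} are functions called at run time. The sketch below, which assumes \Rlang 4.1.0 or later and uses the hypothetical names \code{piped} and \code{nested}, compares the parsed expressions without evaluating them.

```r
# quote() returns the parsed, unevaluated expression
piped <- quote(data.in |> sqrt() |> sum())
nested <- quote(sum(sqrt(data.in)))

# display the expression the parser built from the pipe
piped

# compare it with the explicitly nested call
identical(piped, nested)
```

This comparison says nothing about actual timings, which would have to be measured for the versions of \Rlang and of the packages in use.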

Being part of the \Rlang language, \Roperator{|>} will remain available and backwards compatible, while packages could be abandoned or redesigned by their maintainers. For this reason, it is preferable to use \Roperator{|>} in scripts or code expected to be reused, unless compatibility with \Rlang versions earlier than 4.2.0 is required.

In the rest of the book, when possible, we will use \emph{R's pipes} and otherwise \emph{dot pipes}, and we will avoid implicit (``invisible'') passing of arguments in examples to make them easier to understand. In most cases the examples can be easily rewritten using operator \Roperator{\%>\%}.

Pipes can make scripts visually more compact than the use of assignments of intermediate results to temporary variables. What makes pipes most convenient is the availability of classes, functions, and methods defined in \pkgnameNI{tidyr}, \pkgnameNI{dplyr}, and other packages from the \pkgname{tidyverse}. Debugging pipes usually requires dividing them, with one approach being the insertion of calls to \Rfunction{print()}. This is possible, because \Rfunction{print()} returns its input invisibly in addition to displaying it.
What makes pipes most convenient is the availability of classes, functions, and methods defined in \pkgnameNI{tidyr}, \pkgnameNI{dplyr}, and other packages from the \pkgname{tidyverse}. Debugging pipes usually requires dividing them, with one approach being the insertion of calls to \Rfunction{print()}. This is possible, because \Rfunction{print()} returns its input invisibly in addition to displaying it.

<<pipes-x08>>=
data.in |> print() |> sqrt() |> print() |> sum() |> print() -> data10.out