Last tweaks before upload of a new revision to Leanpub.
Showing 7 changed files with 1,130 additions and 867 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,170 @@ | ||
% !Rnw root = appendix.main.Rnw | ||
|
||
<<echo=FALSE, include=FALSE>>= | ||
opts_chunk$set(opts_fig_wide) | ||
opts_knit$set(concordance=TRUE) | ||
@ | ||
|
||
\chapter{If and when R needs help}\label{chap:R:performance} | ||
|
||
\dictum[Patrick J. Burns (1998) S Poetry. \url{http://www.burns-stat.com/documents/books/s-poetry/}]{Improving the efficiency of your S functions can be well worth some effort.\ \ldots But remember that large efficiency gains can be made by using a better algorithm, not just by coding the same algorithm better.} | ||
|
||
\section{Aims of this chapter} | ||
|
||
In this final chapter I highlight what are, in my opinion, the limitations and advantages of using \langname{R} as a scripting language in data analysis, and briefly describe alternative approaches that can help overcome performance bottlenecks in R code.
|
||
\section{Packages used in this chapter} | ||
|
||
<<eval=FALSE>>= | ||
install.packages(learnrbook::pkgs_ch_performance) | ||
@ | ||
|
||
To execute the examples listed in this chapter you first need to load the following packages from the library:
|
||
<<message=FALSE>>= | ||
library(Rcpp) | ||
library(inline) | ||
# library(rPython) | ||
library(rJava) | ||
@ | ||
|
||
\section{R's limitations and strengths} | ||
|
||
\subsection{Optimizing R code} | ||
|
||
Some\index{performance}\index{code!optimization} constructs like \code{for} and \code{while} loops execute slowly in \langname{R}, as they are interpreted. Byte compiling and Just-In-Time (JIT) compiling of loops (enabled by default in R >= 3.4.0) should decrease this burden in the future. However, base R as well as some packages define several \emph{apply} functions. As many of these are implemented in compiled code, written in \langname{C} or \langname{C++}, using them instead of explicit loops can provide a major improvement in performance while keeping the user's code fully written in R. Pre-allocating memory, rather than growing a vector or array at each iteration, can also help. One little-known problem is related to the consistency tests done when `growing' data frames. If we add variables one by one to a large data frame, the overhead is in many cases huge. This can usually be avoided by assembling the object as a list, and converting it into a data frame only once it has been assembled.
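
The approaches described above can be sketched as follows. This is only an illustration under simple assumptions: the function names \code{growVec}, \code{preallocVec} and \code{vectorizedVec} are made up for the example, and the size of the speed-up will depend on the R version and the size of the data.

<<perf-prealloc-sketch>>=
n <- 1e4

# Grow the result by one element per iteration (slow for large n)
growVec <- function(n) {
  x <- numeric(0)
  for (i in seq_len(n)) x <- c(x, i^2)
  x
}

# Pre-allocate the whole vector, then fill it in place
preallocVec <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) x[i] <- i^2
  x
}

# Avoid the explicit loop altogether
vectorizedVec <- function(n) seq_len(n)^2

# All three versions should return the same values
stopifnot(isTRUE(all.equal(growVec(n), preallocVec(n))),
          isTRUE(all.equal(preallocVec(n), vectorizedVec(n))))

# Assemble the columns in a list and convert to a data frame only once
cols <- list()
for (name in c("a", "b", "c")) {
  cols[[name]] <- seq_len(n)
}
df <- as.data.frame(cols)
dim(df)
@
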
|
||
You may ask: how can I know where in the code the performance bottleneck is\index{code!performance}? During the early years of R, this was quite a difficult task. Nowadays, there are good code profiling\index{code!profiling} and code benchmarking\index{code!benchmarking} tools, which are, in the most recent version, integrated into the \pgrmname{RStudio} IDE. Profiling consists in measuring how much of the total runtime of a test is spent in different functions, or even lines of code. Benchmarking consists in timing the execution of alternative versions of some piece of code, to decide which one should be preferred.
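
Both tasks can be tried out with functions from base R alone, as in the minimal sketch below. The function \code{loopSum} is a deliberately slow stand-in written for this illustration; the timings obtained will vary between computers, and the profiling step assumes that R was built with profiling support, which is the case for the usual binary distributions.

<<perf-benchmark-sketch>>=
x <- runif(1e6)

# A deliberately slow sum using an explicit loop
loopSum <- function(x) {
  s <- 0
  for (xi in x) s <- s + xi
  s
}

# Benchmarking: time alternative versions of the same computation
system.time(for (k in 1:20) loopSum(x))
system.time(for (k in 1:20) sum(x))

# Profiling: record where the run time is spent
prof.file <- tempfile()
Rprof(prof.file)
for (k in 1:20) loopSum(x)
Rprof(NULL)
head(summaryRprof(prof.file)$by.self)
@
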
|
||
There are some rules of style\index{code!writing style}, and common sense, that should always be applied to develop good quality program code. However, as in most cases high performance comes at the cost of a more complex program or algorithm, optimizations should be applied only to the parts of the code that are limiting overall performance. Usually, even when the requirement of high performance is known in advance, it is in most cases best to start with a simple implementation of a simple algorithm. Get this first solution working reliably, and use it as a reference both for performance and for accuracy of returned results while attempting optimization.
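
As a small illustration of this workflow, the sketch below keeps a simple implementation as the reference and checks a faster version against it. Both functions, \code{cummeanRef} and \code{cummeanFast}, are hypothetical examples written for this purpose.

<<perf-reference-sketch>>=
# Simple, obviously correct reference implementation
cummeanRef <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- mean(x[1:i])
  out
}

# Candidate optimized version using vectorized functions
cummeanFast <- function(x) cumsum(x) / seq_along(x)

x <- c(2, 4, 6, 8)

# Accuracy: the optimized version must agree with the reference
stopifnot(isTRUE(all.equal(cummeanRef(x), cummeanFast(x))))

# Performance: time both versions on a larger input
y <- runif(5000)
system.time(cummeanRef(y))
system.time(cummeanFast(y))
@
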
|
||
The book \citetitle{Matloff2011} \autocite{Matloff2011} is very good at presenting the use of the R language and how to profit from its peculiar features to write concise and efficient code.
Studying the book \citetitle{Wickham2014advanced} \autocite{Wickham2014advanced} will give you a deep understanding of the R language, its limitations, and good and bad approaches to its use. If you aim at writing R packages, then \citetitle{Wickham2015} \autocite{Wickham2015} will guide you on how to write your own packages using modern tools. Finally, any piece of software benefits from thorough and consistent testing, and R packages and scripts are no exception. Building a set of test cases enormously simplifies code maintenance, as they help detect unintended changes in program behaviour \autocite{Wickham2015,Cotton2016}.
|
||
\subsection{Using the best tool for each job} | ||
|
||
Optimizing \langname{R} code for performance can often yield more than an order of magnitude decrease in runtime. In many cases this is enough, and the most cost-effective solution. There are both packages and functions in base R that, if properly used, can make a huge difference in performance. In addition, efforts in recent years to optimize the overall performance of R itself have been successful. Some of the packages with enhanced performance have been described in earlier chapters, as they are easy enough to use and also have an easy-to-learn user interface. Other packages, like \pkgname{data.table}, although achieving very fast execution, incur the cost of a user interface and a behaviour that are alien to the ``normal way of working'' with R.
|
||
Sometimes, the best available tools for a certain job have not been implemented in R but are available in other languages. Alternatively, the algorithms or the size of the data are such that performance is poor when implemented in the R language, and can be improved by using a compiled language.
|
||
\subsection{R is great, but not always best} | ||
|
||
One extremely important feature leading to the success of \langname{R} is extensibility\index{R!extensibility}, not only by writing packages in R itself, but also by allowing the development of packages containing functions written in other computer languages. The beauty of the package loading mechanism is that, even if \pgrmname{R} itself is written in \langname{C} and compiled into an executable, packages containing interpreted \langname{R} code, and also compiled \langname{C}, \langname{C++}, \langname{FORTRAN}, or other languages, or calling libraries written in \langname{Java}, \langname{Python}, etc.\ can be loaded and unloaded at runtime.
|
||
The most common reason for using compiled code is the availability of libraries written in \langname{FORTRAN}, \langname{C} and \langname{C++} that are well tested and optimized for performance. This is frequently the case for numerical calculations and time-consuming data manipulations like image analysis. In such cases the R code in packages is just a wrapper (or ``glue'') to allow the functions in the library to be called from R.
|
||
In other cases we diagnose a performance bottleneck and decide to write a few functions, within a package otherwise written in R, in a compiled language like \langname{C++}. In such cases it is a good idea to use benchmarking, as the use of a different language does not necessarily provide a worthwhile performance enhancement. Different languages do not always store data in memory in the same format, and this can add overhead to function calls across languages.
|
||
\section{Rcpp} | ||
|
||
<<>>= | ||
citation(package = "Rcpp") | ||
@ | ||
|
||
Nowadays, thanks to package \pkgname{Rcpp}, using \langname{C++} mixed with the R language is fairly simple \autocite{Eddelbuettel2013}. This package provides not only R code, but also a \langname{C++} header file with macro definitions that reduces the writing of the necessary ``glue'' code to the use of a simple macro in the \langname{C++} code. Although this mechanism is most frequently used as a component of packages, it is also possible to define a function written in \langname{C++} at the R console, or in a simple user's script. Of course, for this to work all the tools needed to build R packages from source must be available, including a suitable compiler and linker.
|
||
An example taken from the \pkgname{Rcpp} documentation follows. This is an example of how one would define a function during an interactive session at the R console, or in a simple script. When writing a package, one would write a separate source file for the function, include the \code{Rcpp.h} header and use the \langname{C++} macros to build the R code side. Using \langname{C++} inline requires package \pkgname{inline} to be loaded in addition to \pkgname{Rcpp}.
|
||
First we save the source code for the function written in \langname{C++}, taking advantage of types and templates defined in the \code{Rcpp.h} header file.
|
||
<<rcpp-01>>= | ||
src <- ' | ||
Rcpp::NumericVector xa(a); | ||
Rcpp::NumericVector xb(b); | ||
int n_xa = xa.size(), n_xb = xb.size(); | ||
|
||
Rcpp::NumericVector xab(n_xa + n_xb - 1); | ||
for (int i = 0; i < n_xa; i++) | ||
for (int j = 0; j < n_xb; j++) | ||
xab[i + j] += xa[i] * xb[j]; | ||
return xab; | ||
' | ||
@ | ||
|
||
The second step is to compile and load the function, in such a way that it can be called from R code and is indistinguishable from a function defined in R itself.
|
||
<<rcpp-02>>= | ||
fun <- cxxfunction(signature(a = "numeric", b = "numeric"), src, plugin = "Rcpp") | ||
@ | ||
|
||
We can now use it as any other R function.
|
||
<<rcpp-03>>= | ||
fun(1:3, 1:4) | ||
@ | ||
|
||
As we will see below, this is not so when calling \langname{Java} and \langname{Python}: although the integration is relatively tight, special syntax is needed when calling the ``foreign'' functions. The advantage of \pkgname{Rcpp} in this respect is very significant, as we can define functions that have exactly the same argument signature, use the same syntax and behave in the same way, using either the R or \langname{C++} language. This means that at any point during development of a package a function defined in R can be replaced by an equivalent function defined in \langname{C++}, or vice versa, with absolutely no impact on users' code, except possibly for the faster execution of the \langname{C++} version.
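
To illustrate this interchangeability, the following is a minimal sketch of a function written in pure R with the same argument signature as the \langname{C++} function defined above. The name \code{convR} is made up for this illustration and is not part of \pkgname{Rcpp}; only the index arithmetic changes, because R vectors are indexed starting from one.

<<rcpp-r-equivalent-sketch>>=
# Pure R counterpart of the C++ example: same arguments, same result
convR <- function(a, b) {
  n.a <- length(a)
  n.b <- length(b)
  ab <- numeric(n.a + n.b - 1)
  for (i in seq_len(n.a)) {
    for (j in seq_len(n.b)) {
      # R indexes from 1, so position i + j in C++ becomes i + j - 1
      ab[i + j - 1] <- ab[i + j - 1] + a[i] * b[j]
    }
  }
  ab
}

convR(1:3, 1:4)
@

Code calling \code{convR(1:3, 1:4)} would not need to change if the body were later replaced by a compiled version.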
|
||
\section{FORTRAN and C} | ||
|
||
In the case of \langname{FORTRAN} and \langname{C}, the process is less automated, in that the R code needed to call the compiled functions has to be written explicitly (see \emph{Writing R Extensions} in the R documentation for up-to-date details). Once written, the building and installation of the package is automatic. This is how many existing libraries are called from within R and R packages.
|
||
\section{Python} | ||
|
||
Package \pkgname{rPython} allows calling \langname{Python} functions and methods from R code. Currently this package is not available under MS-Windows. | ||
|
||
Example taken from the package description (not run). | ||
|
||
<<rpython-01,eval=FALSE>>= | ||
python.call( "len", 1:3 ) | ||
a <- 1:4 | ||
b <- 5:8 | ||
python.exec( "def concat(a,b): return a+b" ) | ||
python.call( "concat", a, b) | ||
@ | ||
|
||
It is also possible to call R functions from \langname{Python}. However, this is outside the scope of this book. | ||
|
||
\section{Java} | ||
|
||
Although native-code \langname{Java} compilers exist, most frequently Java programs are compiled into intermediate byte code which is then interpreted, usually by an interpreter that includes a JIT compiler. For calling \langname{Java} functions or accessing Java objects from R code, the solution is to use package \pkgname{rJava}. One important point to remember is that the Java Development Kit (JDK) must be installed for this package to work. The usually installed runtime is not enough.
|
||
We need first to start the Java Virtual Machine (the byte-code interpreter). | ||
|
||
<<rjava-01>>= | ||
.jinit() | ||
@ | ||
|
||
The code that follows is not that clear, and merits some explanation. | ||
|
||
We first create a \langname{Java} array from inside R. | ||
|
||
<<rjava-02>>= | ||
a <- .jarray( list( | ||
.jnew( "java/awt/Point", 10L, 10L ), | ||
.jnew( "java/awt/Point", 30L, 30L ) | ||
) ) | ||
print(a) | ||
mode(a) | ||
class(a) | ||
str(a) | ||
@ | ||
|
||
Then we use base R's function \Rfunction{sapply()} to apply a user-defined R function to the elements of the Java array, obtaining as returned value an R vector.
|
||
<<rjava-03>>= | ||
b <- sapply(a, | ||
function(point){ | ||
with(point, { | ||
(x + y )^2 | ||
} ) | ||
}) | ||
print(b) | ||
mode(b) | ||
class(b) | ||
str(b) | ||
@ | ||
|
||
Although this is more cumbersome than in the case of \pkgname{Rcpp}, one can manually write wrapper code to hide the special syntax and object types from users.
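
A minimal sketch of such a wrapper is shown below. The function \code{pointSqSum} is a made-up example, and the code assumes that a working Java installation is available; it reads the public integer fields of a \code{java.awt.Point} object with \Rfunction{.jfield()} so that the caller only sees ordinary R values.

<<rjava-wrapper-sketch>>=
library(rJava)
.jinit()

# Hypothetical wrapper hiding the rJava calls from the user
pointSqSum <- function(point) {
  x <- .jfield(point, "I", "x") # "I" is the JNI signature for int
  y <- .jfield(point, "I", "y")
  (x + y)^2
}

p <- .jnew("java/awt/Point", 10L, 10L)
pointSqSum(p)
@
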
|
||
It is also possible to call R functions from within a \langname{Java} program. This is outside the scope of this book. | ||
|
||
\section{sh, bash} | ||
|
||
The\index{command shell}\index{sh}\index{bash} operating system shell can be accessed from within R and the output from programs and shell scripts returned to the R session. This is useful, for example, for pre-processing raw data files with tools like \langname{AWK} or \langname{Perl} scripts. The problem with this approach is that when it is used, the R script cannot run portably across operating systems, or in the absence of the tools or sh or bash scripts. Except for code that will never be reused (i.e.\ it is used once and discarded) it is preferable to use R's built-in commands whenever possible or, if shell scripts are used, to make the shell script the master script from within which the R scripts are called, rather than the other way around. The reason for this is mainly to make the developer's intention clear: that the code as a whole will be run in a given operating system using a certain set of tools, rather than hiding shell calls inside the R script. In other words, keep the least portable bits in full view.
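
As a minimal sketch of what such a call looks like, the code below uses base R's \Rfunction{system2()} to run the \code{wc} program on a temporary file and capture its output. The choice of \code{wc} is only an assumption made for the illustration; as this tool is not available on every system, the code falls back to plain R when it is missing, which also shows why such calls reduce portability.

<<shell-sketch>>=
# Write a small temporary file to work on
tmp <- tempfile(fileext = ".txt")
writeLines(c("a", "b", "c"), tmp)

if (nzchar(Sys.which("wc"))) {
  # Run the external program and capture its standard output
  out <- system2("wc", c("-l", shQuote(tmp)), stdout = TRUE)
  print(out)
} else {
  # Portable fallback using only R
  message("'wc' not found, counting lines in R instead")
  print(length(readLines(tmp)))
}
@
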
|
||
\section{Web pages, and interactive interfaces} | ||
|
||
There is a lot to write on this aspect, and intense development efforts are going on. One example is the \pkgname{shiny} package and Shiny server \url{https://shiny.rstudio.com/}. This package allows the creation of interactive displays to be viewed through any web browser.
|
||
There are other packages for generating both static and interactive graphics in formats suitable for on-line display, as well as package \pkgname{knitr}, used for writing this book \url{https://yihui.name/knitr/}, which when using R Markdown for markup (with package \pkgname{rmarkdown} \url{http://rmarkdown.rstudio.com} or \pkgname{bookdown} \url{https://bookdown.org/}) can output self-contained HTML files in addition to RTF and PDF formats.