Last tweaks before upload of a new revision to Leanpub.

aphalo · May 14, 2017 · 980be8d · 980be8d
1 parent ed4825c
commit 980be8d

Showing 7 changed files with 1,130 additions and 867 deletions.
diff --git a/R.intro.Rnw b/R.intro.Rnw
@@ -185,6 +185,6 @@ Currently, for the development of packages, I use \pgrmname{RStudio} exclusively
 
 When I started using R, nearly two decades ago, I was using other editors, using the operating system shell a lot more, and struggling with debugging as no IDE was available. The only reasonably good integration with an editor was for Emacs, which was widely available only under Unix-like systems. Given this past experience, I encourage you to use an IDE for R. \pgrmname{RStudio} is nowadays very popular, but if you do not like it, need a different set of features, such as integration with \pgrmname{ImageJ}, or are already familiar with the \pgrmname{Eclipse} IDE, you may like to try the \pgrmname{Bio7} IDE, available from \url{http://bio7.org}.
 
-All data sets and files needed to run the examples in the book can be obtained by installing different R packages. One of them, \pkgname{learnrbook}, available through CRAN, contains datasets and files not previously available in R packages. The \pkgname{learnrbook} package also contains installation instructions and saved names of all other packages used in the book. Instructions on installing R, git, RStudio, and also compilers and other tools in those cases where they are needed, are available on-line. In many cases the IT staff at your employer or school will know how to install them, or they may even be included in the default setup. In addition we give step by step instructions in the Appendix (currently missing).
+All data sets and files needed to run the examples in the book can be obtained by installing different R packages. One of them, \pkgname{learnrbook}, available through CRAN, contains datasets and files not previously available in R packages. The \pkgname{learnrbook} package also contains installation instructions and saved names of all other packages used in the book. Instructions on installing R, git, RStudio, and also compilers and other tools in those cases where they are needed, are available on-line. In many cases the IT staff at your employer or school will know how to install them, or they may even be included in the default setup. In addition we give step by step instructions in the on-line Appendix (currently missing).
 
 
diff --git a/appendixes.prj b/appendixes.prj
@@ -6,35 +6,35 @@
 using-r-main.Rnw
 66
 13
-8
+1
 
 references.bib
 BibTeX
 1049586 0 104 7 193 1 0 0 820 242 0 1 41 160 -1 -1 0 0 23 0 0 23 1 0 1 193  0 -1 0
 using-r-main.Rnw
 TeX:RNW:UTF-8
-152055803 0 -1 15252 -1 15252 208 208 1244 731 0 1 41 224 -1 -1 0 0 198 -1 -1 198 2 0 15252 -1 1 5602 -1  0 -1 0
+420491259 0 -1 20661 -1 20677 208 208 1244 731 0 1 225 2832 -1 -1 0 0 198 -1 -1 198 2 0 20677 -1 1 5602 -1  0 -1 0
 usingr.sty
 TeX:STY
 1060850 2 56 20 56 14 234 234 1270 724 0 0 129 208 -1 -1 0 0 25 0 0 25 1 0 14 56  0 0 0
 R.data.Rnw
 TeX:RNW
-17838075 0 2035 8 -1 86683 26 26 977 443 1 1 89 528 -1 -1 0 0 31 -1 -1 31 2 0 86683 -1 1 73117 -1  0 -1 0
+17838075 0 2035 8 -1 86683 26 26 977 443 1 1 89 400 -1 -1 0 0 31 -1 -1 31 2 0 86683 -1 1 73117 -1  0 -1 0
 R.more.plotting.Rnw
 TeX:RNW
-17838075 0 -1 41551 -1 54872 26 26 924 603 1 1 97 448 -1 -1 0 0 30 -1 -1 30 1 0 54872 -1  0 -1 0
+17838075 0 -1 41551 -1 54872 26 26 924 603 1 1 97 400 -1 -1 0 0 30 -1 -1 30 1 0 54872 -1  0 -1 0
 R.friends.Rnw
 TeX:RNW
 17838075 0 -1 14181 -1 14181 104 104 853 490 0 1 41 288 -1 -1 0 0 31 -1 -1 31 1 0 14181 -1  0 -1 0
 R.plotting.Rnw
 TeX:RNW
-17838075 2 -1 6842 -1 6836 130 130 1166 559 1 1 225 192 -1 -1 0 0 31 -1 -1 31 4 0 6836 -1 1 43325 -1 2 161837 -1 3 161837 -1  0 -1 0
+17838075 2 -1 6842 -1 6836 130 130 1166 559 1 1 225 192 -1 -1 0 0 31 -1 -1 31 4 0 6836 -1 1 43325 -1 2 161802 -1 3 161802 -1  0 -1 0
 R.maps.Rnw
 TeX:RNW
 17838075 1 -1 9 -1 33 64 64 974 522 1 1 353 0 -1 -1 0 0 57 -1 -1 57 1 0 33 -1  0 -1 0
 R.intro.Rnw
 TeX:RNW
-286273531 4 -1 34669 -1 34670 182 182 1218 705 1 1 641 412 -1 -1 0 0 47 -1 -1 47 1 0 34670 -1  0 -1 0
+17838075 0 -1 31770 -1 34649 182 182 1218 705 1 1 473 2720 -1 -1 0 0 47 -1 -1 47 1 0 34649 -1  0 -1 0
 rbooks.bib
 BibTeX:UNIX
 1147890 0 161 29 162 13 52 52 872 313 1 1 185 320 -1 -1 0 0 21 0 0 21 1 0 13 162  0 -1 0
@@ -49,7 +49,7 @@ TeX:RNW
 17838075 4 -1 11545 -1 11546 130 130 1166 653 0 1 177 400 -1 -1 0 0 190 -1 -1 190 1 0 11546 -1  0 -1 0
 using-r-main.tex
 TeX
-269496315 0 -1 137052 -1 137055 96 96 1082 496 0 1 41 352 -1 -1 0 0 73 -1 -1 73 1 0 137055 -1  0 -1 0
+269496315 0 -1 137052 -1 137055 96 96 1082 496 0 1 81 352 -1 -1 0 0 73 -1 -1 73 1 0 137055 -1  0 -1 0
 :\aphalo\Documents\RPackages\learnr-pkg\inst\extdata\areatable.dat
 DATA
 273678578 0 0 1 0 1 96 96 1246 475 1 0 86 0 -1 -1 0 0 301 0 0 301 1 0 1 0  0 0 0

diff --git a/backups/R.friends.Rnw.sav b/backups/R.friends.Rnw.sav
@@ -0,0 +1,170 @@
+% !Rnw root = appendix.main.Rnw
+
+<<echo=FALSE, include=FALSE>>=
+opts_chunk$set(opts_fig_wide)
+opts_knit$set(concordance=TRUE)
+@
+
+\chapter{If and when R needs help}\label{chap:R:performance}
+
+\dictum[Patrick J. Burns (1998) S Poetry. \url{http://www.burns-stat.com/documents/books/s-poetry/}]{Improving the efficiency of your S functions can be well worth some effort.\ \ldots But remember that large efficiency gains can be made by using a better algorithm, not just by coding the same algorithm better.}
+
+\section{Aims of this chapter}
+
+In this final chapter I highlight what in my opinion are limitations and advantages of using \langname{R} as a scripting language in data analysis, briefly describing alternative approaches that can help overcome performance bottlenecks in R code.
+
+\section{Packages used in this chapter}
+
+<<eval=FALSE>>=
+install.packages(learnrbook::pkgs_ch_performance)
+@
+
+To execute the examples listed in this chapter, you first need to load the following packages from the library:
+
+<<message=FALSE>>=
+library(Rcpp)
+library(inline)
+# library(rPython)
+library(rJava)
+@
+
+\section{R's limitations and strengths}
+
+\subsection{Optimizing R code}
+
+Some\index{performance}\index{code!optimization} constructs like \code{for} and \code{while} loops execute slowly in \langname{R}, as they are interpreted. Byte compiling and Just-In-Time (JIT) compiling of loops (enabled by default in R >= 3.4.0) should decrease this burden in the future. However, base R as well as some packages define several \emph{apply} functions. As these are compiled functions, written in \langname{C} or \langname{C++}, using them instead of explicit loops can provide a major improvement in performance while keeping the user's code fully written in R. Pre-allocating memory, rather than growing a vector or array at each iteration, can also help. One little-known problem is related to consistency tests when `growing' data frames. If we add variables one by one to a large data frame, the overhead is in many cases huge. This can usually be avoided by assembling the object as a list and, once assembled, converting it into a data frame.
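The three ideas in the paragraph above (pre-allocation, an \emph{apply} function, and assembling a list before converting it into a data frame) can be sketched in base R as follows. This is a minimal illustration, not code from the book: the object names (`grown`, `prealloc`, `applied`, `cols`, `df`) and the vector length are assumptions chosen for the example.

```r
n <- 1e4  # illustrative size, chosen arbitrarily

# Growing a vector at each iteration: the vector is copied again and again.
grown <- numeric(0)
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Pre-allocating the result once and filling it in place.
prealloc <- numeric(n)
for (i in seq_len(n)) {
  prealloc[i] <- i^2
}

# Using an apply function instead of an explicit loop.
applied <- vapply(seq_len(n), function(i) i^2, numeric(1))

# Assembling the columns as a list and converting to a data frame only once,
# instead of adding variables one by one to an existing data frame.
cols <- list()
for (name in c("a", "b", "c")) {
  cols[[name]] <- runif(n)
}
df <- as.data.frame(cols)
```

All three vectors hold the same values; only the way they are built differs, so any difference in runtime can be attributed to the construction strategy.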
+
+You may ask: how can I know where in the code the performance bottleneck\index{code!performance} is? During the early years of R, this was quite a difficult task. Nowadays, there are good code profiling\index{code!profiling} and code benchmarking\index{code!benchmarking} tools, which are, in its most recent version, integrated into the \pgrmname{RStudio} IDE. Profiling consists in measuring how much of the total runtime of a test is spent in different functions, or even lines of code. Benchmarking consists in timing the execution of alternative versions of some piece of code, to decide which one should be preferred.
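As a rough sketch of benchmarking as described above, base R's `system.time()` can time two alternative versions of the same computation. The two functions below (`slow_sum_sq()` and `fast_sum_sq()`) are hypothetical examples written for this illustration; dedicated benchmarking packages give more reliable timings than a single run.

```r
# Two alternative implementations of the same computation (illustrative only).
slow_sum_sq <- function(x) {
  total <- 0
  for (xi in x) {
    total <- total + xi^2
  }
  total
}

fast_sum_sq <- function(x) {
  sum(x^2)
}

x <- runif(1e6)

# Check first that both versions agree, then compare their timings.
same_result <- isTRUE(all.equal(slow_sum_sq(x), fast_sum_sq(x)))

t_slow <- system.time(slow_sum_sq(x))
t_fast <- system.time(fast_sum_sq(x))

# Elapsed (wall-clock) time in seconds for each version.
timings <- c(loop = t_slow[["elapsed"]], vectorized = t_fast[["elapsed"]])
print(timings)
```

Checking that the results agree before comparing timings keeps the faster version from being preferred when it is in fact computing something different.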
+
+There are some rules of style\index{code!writing style}, and common sense, that should always be applied to develop good quality program code. However, as in most cases high performance comes at the cost of a more complex program or algorithm, optimizations should be applied only to the parts of the code that limit overall performance. Even when the requirement of high performance is known in advance, it is in most cases best to start with a simple implementation of a simple algorithm. Get this first solution working reliably, and use it as a reference both for performance and for accuracy of returned results while attempting optimization.
+
+The book \citetitle{Matloff2011} \autocite{Matloff2011} is very good at presenting the use of the R language and how to profit from its peculiar features to write concise and efficient code.
+Studying the book \citetitle{Wickham2014advanced} \autocite{Wickham2014advanced} will give you a deep understanding of the R language, its limitations, and good and bad approaches to its use. If you aim at writing R packages, then \citetitle{Wickham2015} \autocite{Wickham2015} will guide you on how to write your own packages using modern tools. Finally, any piece of software benefits from thorough and consistent testing, and R packages and scripts are no exception. Building a set of test cases enormously simplifies code maintenance, as they help detect unintended changes in program behaviour \autocite{Wickham2015,Cotton2016}.
+
+\subsection{Using the best tool for each job}
+
+In many cases optimizing \langname{R} code for performance can yield more than an order of magnitude decrease in runtime. This is often enough, and the most cost-effective solution. There are both packages and functions in base R that, if properly used, can make a huge difference in performance. In addition, efforts in recent years to optimize the overall performance of R itself have been successful. Some of the packages with enhanced performance have been described in earlier chapters, as they are easy enough to use and also have an easy-to-learn user interface. Other packages like \pkgname{data.table}, although achieving very fast execution, incur the cost of a user interface and a behaviour alien to the ``normal way of working'' with R.
+
+Sometimes, the best available tools for a certain job have not been implemented in R but are available in other languages. Alternatively, the algorithms or the size of the data are such that performance is poor when implemented in the R language, and can be better using a compiled language.
+
+\subsection{R is great, but not always best}
+
+One extremely important feature leading to the success of \langname{R} is extensibility\index{R!extensibility}, not only by writing packages in R itself, but also by allowing the development of packages containing functions written in other computer languages. The beauty of the package loading mechanism is that even if \pgrmname{R} itself is written in \langname{C}, and compiled into an executable, packages containing interpreted \langname{R} code, and also compiled \langname{C}, \langname{C++}, \langname{FORTRAN}, or other languages, or calling libraries written in \langname{Java}, \langname{Python}, etc.\ can be loaded and unloaded at runtime.
+
+The most common reason for using compiled code is the availability of libraries written in \langname{FORTRAN}, \langname{C} and \langname{C++} that are well tested and optimized for performance. This is frequently the case for numerical calculations and time-consuming data manipulations like image analysis. In such cases the R code in packages is just a wrapper (or ``glue'') to allow the functions in the library to be called from R.
+
+In other cases we diagnose a performance bottleneck and decide to write a few functions in a compiled language like \langname{C++}, within a package otherwise written in R. In such cases it is a good idea to use benchmarking, as the use of a different language does not necessarily provide a worthwhile performance enhancement. Different languages do not always store data in memory in the same format, and this can add overhead to function calls across languages.
+
+\section{Rcpp}
+
+<<>>=
+citation(package = "Rcpp")
+@
+
+Nowadays, thanks to package \pkgname{Rcpp}, using \langname{C++} mixed with the R language is fairly simple \autocite{Eddelbuettel2013}. This package provides not only R code, but also a \langname{C++} header file with macro definitions that reduces the writing of the necessary ``glue'' code to the use of a simple macro in the \langname{C++} code. Although this mechanism is most frequently used as a component of packages, it is also possible to define a function written in \langname{C++} at the R console, or in a simple user's script. Of course, for this to work all the tools needed to build R packages from source are required, including a suitable compiler and linker.
+
+An example taken from the \pkgname{Rcpp} documentation follows. This is an example of how one would define a function during an interactive session at the R console, or in a simple script. When writing a package, one would write a separate source file for the function, include the \code{Rcpp.h} header and use the \langname{C++} macros to build the R code side. Using \langname{C++} inline requires package \pkgname{inline} to be loaded in addition to \pkgname{Rcpp}.
+
+First we save the source code for the function written in \langname{C++}, taking advantage of types and templates defined in the \code{Rcpp.h} header file.
+
+<<rcpp-01>>=
+src <- '
+ Rcpp::NumericVector xa(a);
+ Rcpp::NumericVector xb(b);
+ int n_xa = xa.size(), n_xb = xb.size();
+
+ Rcpp::NumericVector xab(n_xa + n_xb - 1);
+ for (int i = 0; i < n_xa; i++)
+ for (int j = 0; j < n_xb; j++)
+ xab[i + j] += xa[i] * xb[j];
+ return xab;
+'
+@
+
+The second step is to compile and load the function, in such a way that it can be called from R code and is indistinguishable from a function defined in R itself.
+
+<<rcpp-02>>=
+fun <- cxxfunction(signature(a = "numeric", b = "numeric"), src, plugin = "Rcpp")
+@
+
+We can now use it as any other R function.
+
+<<rcpp-03>>=
+fun(1:3, 1:4)
+@
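Following the earlier advice to keep a simple implementation as a reference, the same polynomial multiplication (convolution) can be written in plain R and used to check the compiled version. The function name `conv_r()` is an assumption introduced for this sketch and does not come from the \pkgname{Rcpp} documentation. Note the index shift: R vectors are 1-based, whereas the \langname{C++} code above is 0-based.

```r
# Plain R reference implementation of the convolution computed by the
# C++ code above. The name conv_r is hypothetical, chosen for this sketch.
conv_r <- function(a, b) {
  ab <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      # i + j - 1 in 1-based R corresponds to i + j in 0-based C++
      ab[i + j - 1] <- ab[i + j - 1] + a[i] * b[j]
    }
  }
  ab
}

reference <- conv_r(1:3, 1:4)
print(reference)

# If the compiled function `fun` from the chunk above is available,
# the two versions can be compared with:
#   all.equal(fun(1:3, 1:4), reference)
```

Because both functions take the same arguments and return an R numeric vector, either one can replace the other without any change to the calling code.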
+
+As we will see below, this is not so when calling Java and Python, where although the integration is relatively tight, special syntax is used when calling the ``foreign'' functions. The advantage of Rcpp in this respect is very significant, as we can define functions that have exactly the same argument signature, use the same syntax and behave in the same way, using either the R or \langname{C++} language. This means that at any point during development of a package a function defined in R can be replaced by an equivalent function defined in \langname{C++}, or vice versa, with absolutely no impact on the user's code, except possibly faster execution of the \langname{C++} version.
+
+\section{FORTRAN and C}
+
+In the case of \langname{FORTRAN} and \langname{C}, the process is less automated, in that the R code needed to call the compiled functions must be written explicitly (see \emph{Writing R Extensions} in the R documentation for up-to-date details). Once written, the building and installation of the package is automatic. This is how many existing libraries are called from within R and R packages.
+
+\section{Python}
+
+Package \pkgname{rPython} allows calling \langname{Python} functions and methods from R code. Currently this package is not available under MS-Windows.
+
+Example taken from the package description (not run).
+
+<<rpython-01,eval=FALSE>>=
+python.call( "len", 1:3 )
+a <- 1:4
+b <- 5:8
+python.exec( "def concat(a,b): return a+b" )
+python.call( "concat", a, b)
+@
+
+It is also possible to call R functions from \langname{Python}. However, this is outside the scope of this book.
+
+\section{Java}
+
+Although \langname{Java} compilers exist, most frequently Java programs are compiled into intermediate byte code and this is interpreted, and usually the interpreter includes a JIT compiler. For calling \langname{Java} functions or accessing Java objects from R code, the solution is to use package \pkgname{rJava}. One important point to remember is that the Java Development Environment must be installed for this package to work. The usually installed runtime is not enough.
+
+We need first to start the Java Virtual Machine (the byte-code interpreter).
+
+<<rjava-01>>=
+.jinit()
+@
+
+The code that follows is not that clear, and merits some explanation.
+
+We first create a \langname{Java} array from inside R.
+
+<<rjava-02>>=
+a <- .jarray( list(
+                   .jnew( "java/awt/Point", 10L, 10L ),
+                   .jnew( "java/awt/Point", 30L, 30L )
+                   ) )
+print(a)
+mode(a)
+class(a)
+str(a)
+@
+
+Then we use base R's function \Rfunction{sapply()} to apply a user-defined R function to the elements of the Java array, obtaining as returned value an R vector.
+
+<<rjava-03>>=
+b <- sapply(a,
+            function(point){
+              with(point, {
+                (x + y )^2
+              } )
+            })
+print(b)
+mode(b)
+class(b)
+str(b)
+@
+
+Although it is more cumbersome than in the case of \pkgname{Rcpp}, one can manually write wrapper code to hide the special syntax and object types from users.
+
+It is also possible to call R functions from within a \langname{Java} program. This is outside the scope of this book.
+
+\section{sh, bash}
+
+The\index{command shell}\index{sh}\index{bash} operating system shell can be accessed from within R and the output from programs and shell scripts returned to the R session. This is useful, for example, for pre-processing raw data files with tools like \langname{AWK} or \langname{Perl} scripts. The problem with this approach is that when it is used, the R script cannot run portably across operating systems, or in the absence of the tools or sh or bash scripts. Except for code that will never be reused (i.e.\ it is used once and discarded), it is preferable to use R's built-in commands whenever possible or, if shell scripts are used, to make the shell script the master script from within which the R scripts are called, rather than the other way around. The reason for this is mainly to make the developer's intention clear: that the code as a whole will be run in a given operating system using a certain set of tools, rather than hiding shell calls inside the R script. In other words, keep the least portable bits in full view.
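When a call to an external tool from within R cannot be avoided, one way to keep the non-portable part in full view is to check for the tool explicitly before calling it. The sketch below uses base R's `Sys.which()` and `system2()`; the wrapper name `run_tool()` is hypothetical, and the commented-out `echo` call assumes a Unix-like system where `echo` exists as an executable.

```r
# Hypothetical wrapper: fail early and clearly if the external tool is missing,
# otherwise run it and return its standard output as a character vector.
run_tool <- function(cmd, args = character()) {
  if (!nzchar(Sys.which(cmd))) {
    stop("Required external tool not found: ", cmd)
  }
  system2(cmd, args, stdout = TRUE)
}

# On a Unix-like system one could call, for example:
#   run_tool("echo", "hello")

# A missing tool gives an informative error instead of an obscure failure.
result <- try(run_tool("no-such-tool-xyz"), silent = TRUE)
print(inherits(result, "try-error"))
```

The explicit check documents the dependency on the external tool at the point where it is used, which is the same intention the paragraph above expresses for shell scripts.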
+
+\section{Web pages, and interactive interfaces}
+
+There is a lot to write on this aspect, and intense development efforts going on. One example is the \pkgname{Shiny} package and Shiny server \url{https://shiny.rstudio.com/}. This package allows the creation of interactive displays to be viewed through any web browser.
+
+There are other packages for generating both static and interactive graphics in formats suitable for on-line display, as well as package \pkgname{knitr}, used for writing this book (\url{https://yihui.name/knitr/}), which when using R Markdown for markup (with package \pkgname{rmarkdown} \url{http://rmarkdown.rstudio.com} or \pkgname{Bookdown} \url{https://bookdown.org/}) can output self-contained HTML files in addition to RTF and PDF formats.