Add section on reading and writing worksheets.
aphalo committed Jan 25, 2017
1 parent e73b7fa commit 587eabd
Showing 15 changed files with 181,718 additions and 16,304 deletions.
109 changes: 105 additions & 4 deletions R.data.Rnw
@@ -15,6 +15,7 @@ For executing the examples listed in this chapter you need first to load the following packages:
library(tibble)
library(readr)
library(readxl)
library(xlsx)
@

\section{Introduction}
@@ -134,7 +135,9 @@ head(list.dirs("."))

Function \code{file.path} can be used to construct a file path from its components in a way that is portable across OSs.
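As a minimal sketch (the folder and file names here are arbitrary), \code{file.path} joins its arguments using the separator appropriate for the current OS:

<<file-path-01>>=
file.path("data", "my-file1.csv")
@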

\subsection{Text files}
\subsection{Text files}\label{sec:files:txt}

\subsubsection[Base R and `utils']{Base R and \pkgname{utils}}

Text files come in many different sizes and formats, but they can be divided into two broad groups: those with fixed-format fields and those with delimited fields. Fixed-format fields were especially common in the early days of FORTRAN and COBOL, when computers had very limited resources. They are usually capable of encoding information using fewer characters than delimited fields. The best way of understanding the differences is with examples. We first discuss base R functions, and starting from page \pageref{sec:files:readr} we discuss the functions defined in package \pkgname{readr}.

@@ -180,7 +183,7 @@ If we had written the file using default settings, reading the file so as to rec
<<file-io-txt-02b>>=
my_read1.df <- read.csv(file = "my-file1.csv")
my_read1.df
all.equal(my.df, my_read1.df, check.attributes = FALSE)
@

\begin{playground}
@@ -248,17 +251,21 @@ all.equal(my.lines, my_read.lines, check.attributes = FALSE)
@

\begin{warningbox}
There are a couple of things to take into account when reading data from text files using base R function \code{read.table()} and its relatives: by default, columns containing character strings are converted into factors, and column names are sanitised (spaces and other ``inconvenient'' characters are replaced with dots).
\end{warningbox}
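Both defaults mentioned in the warning above can be overridden through parameters; as a sketch, reusing the file written earlier in this section:

<<file-io-txt-02c>>=
my_read2.df <- read.csv(file = "my-file1.csv",
                        stringsAsFactors = FALSE,
                        check.names = FALSE)
sapply(my_read2.df, class)
@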

\subsection[readr]{\pkgname{readr}}
\subsubsection[readr]{\pkgname{readr}}

<<>>=
citation(package = "readr")
@

Package \pkgname{readr} is part of the \pkgname{tidyverse} suite. It defines functions that allow much faster input and output and have different default behaviour. Contrary to base R functions, they are optimized for speed, but may sometimes wrongly decode their input, occasionally doing so silently even for CSV files that the base functions decode correctly. Base R functions are ``dumb'': the file format or delimiters must be supplied as arguments. The \pkgname{readr} functions use ``magic'' to guess the format; in most cases they succeed, which is very handy, but occasionally the power of the magic is not strong enough. The ``magic'' can be overridden by passing arguments. Another important advantage is that these functions read character strings formatted as dates or times directly into columns of class \code{datetime}.

All \code{write} functions defined in this package have an \code{append} parameter, which can be used to change the default behaviour of overwriting an existing file with the same name, to appending the output at its end.
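As a sketch of the use of \code{append} (reusing \code{my.df} from earlier in the chapter; the file name is arbitrary), the second call below adds records at the end of the existing file instead of overwriting it:

<<readr-append>>=
write_csv(my.df, path = "my-file8.csv")
write_csv(my.df, path = "my-file8.csv", append = TRUE)
length(read_lines("my-file8.csv"))
@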

Although in this section we exemplify the use of these functions by passing a file name as argument, URLs and open file connections are also accepted. Furthermore, if the file name ends in a suffix recognizable as indicating a compressed file format, the file will be decompressed on the fly.
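On-the-fly decompression can be demonstrated by first writing a compressed file through a base R \code{gzfile} connection (a sketch: the file name is arbitrary, and \code{format\_csv()} is used only to build the text to be written):

<<readr-gz>>=
con <- gzfile("my-file8.csv.gz", open = "wt")
cat(format_csv(my.df), file = con)
close(con)
read_csv("my-file8.csv.gz")
@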

\begin{warningbox}
The names of functions ``equivalent'' to those described in the previous section are formed by replacing the dot with an underscore, e.g.\ \code{read\_csv()} $\approx$ \code{read.csv()}. The similarity refers to the format of the files read, but not to the order, names or roles of their formal parameters. Function \code{read\_table()} behaves differently from \code{read.table()}: although they both read fields separated by white space, \code{read\_table()} expects the fields in successive records (usually lines) to be vertically aligned, while \code{read.table()} tolerates vertical misalignment. Other aspects of the default behaviour are also different; for example, the \pkgname{readr} functions do not convert columns of character strings into factors, and row names are not set in the returned data frame (truly a \code{tibble}, which inherits from \code{data.frame}).
\end{warningbox}
@@ -302,8 +309,102 @@ file.show("my-file6.csv", pager = "console")
cat(readLines('my-file6.csv'), sep = '\n')
@

<<readr-06>>=
write_lines(my.lines, path = "my-file7.txt")
file.show("my-file7.txt", pager = "console")
@

<<readr-06a, comment='', echo=FALSE>>=
cat(read_lines("my-file7.txt"), sep = '\n')
@

<<readr-06b>>=
my_read.lines <- read_lines("my-file7.txt")
my_read.lines
all.equal(my.lines, my_read.lines, check.attributes = FALSE)
@

Additional write and read functions, not exemplified above, are also provided: \code{write\_csv()}, \code{write\_delim()}, \code{write\_file()}, and \code{read\_fwf()}.
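As a sketch of fixed-width input with \code{read\_fwf()} (the field widths, column names and file content are invented for this example):

<<readr-fwf>>=
write_file("12abc\n34def\n", path = "my-file9.txt")
read_fwf("my-file9.txt",
         col_positions = fwf_widths(c(2, 3), c("num", "txt")))
@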

\subsection{Worksheets}

Microsoft Office, Open Office and Libre Office are the most frequently used suites containing programs based on the worksheet paradigm. A standardized file format for the exchange of worksheet data is available, but it does not support all the features present in native file formats. We will start by considering MS-Excel. The file format used by Excel has changed significantly over the years; old formats tend to be less well supported by available R packages and may require the file to be updated to a more modern format with Excel itself before import into R. The current format is based on XML and is relatively simple to decode, while older binary formats are more difficult. Consequently, for the format currently in use there are several alternatives.

\subsubsection{Exporting CSV files}

If you have access to the original software used, then exporting a worksheet to a text file in CSV format and importing it into R using the functions described in section \ref{sec:files:txt} starting on page \pageref{sec:files:txt} is a possible solution. It is not ideal from the perspective of storing the same data set repeatedly, which can lead to the versions diverging when updated. A better approach is, when feasible, to import the data directly from the workbook or worksheets into R.

\subsubsection['readxl']{\pkgname{readxl}}

<<readxl-00>>=
citation(package = "readxl")
@

This package exports only two functions for reading Excel workbooks in xlsx format. The interface is simple, and the package is easy to install. We will import a file that in Excel looks as in the screen capture below.

\begin{center}
\includegraphics[width=0.75\textwidth]{data/Book1-xlsx.png}
\end{center}

We first list the sheets contained in the workbook file.
<<readxl-01>>=
sheets <- excel_sheets("data/Book1.xlsx")
sheets
@

In this case the argument passed to \code{sheet} is redundant, as there is only a single worksheet in the file. It is possible to use either the name of the sheet or a positional index (in this case \code{1} would be equivalent to \code{"my data"}).
<<readxl-02>>=
Book1.df <- read_excel("data/Book1.xlsx", sheet = "my data")
Book1.df
@

Of the remaining arguments, \code{skip} is useful when we need to skip one or more rows at the top of a worksheet.
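For example, if the worksheet had a title in its top row, we could discard it as follows (illustrative only, as the example file has no such extra row and here \code{skip = 1} would discard the column headings):

<<readxl-03>>=
read_excel("data/Book1.xlsx", sheet = 1, skip = 1)
@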

\subsubsection['xlsx']{\pkgname{xlsx}}

Package \pkgname{xlsx} can be more difficult to install as it uses Java functions to do the actual work. However, it is more comprehensive, with functions both for reading and writing Excel worksheets and workbooks in different formats. It also allows selecting regions of a worksheet to be imported.
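For example, a rectangular region of a worksheet can be selected with parameters \code{rowIndex} and \code{colIndex} (a sketch based on the same example file):

<<xlsx-region>>=
read.xlsx("data/Book1.xlsx", sheetIndex = 1,
          rowIndex = 1:4, colIndex = 1:2)
@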

<<xlsx-00>>=
citation(package = "xlsx")
@

<<xlsx-01>>=
Book1_xlsx.df <- read.xlsx("data/Book1.xlsx", sheetName = "my data")
Book1_xlsx.df
@

<<xlsx-02>>=
Book1_xlsx2.df <- read.xlsx2("data/Book1.xlsx", sheetIndex = 1)
Book1_xlsx2.df
@

With the three different functions we get either a data frame or a tibble, which is compatible with data frames.
<<xlsx-03>>=
class(Book1.df)
class(Book1_xlsx.df)
class(Book1_xlsx2.df)
@

However, the columns are imported differently. \code{Book1.df} and \code{Book1\_xlsx.df} differ only in whether the second column, a character variable, has been converted into a factor. This is to be expected, as packages in the \pkgname{tidyverse} suite default to preserving character variables as is, while base R functions convert them to factors. The third function, \code{read.xlsx2()}, did not decode the numeric values correctly, and converted all columns into factors. This function is reported as being much faster than \code{read.xlsx()}.
<<xlsx-04>>=
sapply(Book1.df, class)
sapply(Book1_xlsx.df, class)
sapply(Book1_xlsx2.df, class)
@

We can also write data frames out to Excel worksheets and even append new worksheets to an existing workbook.
<<xlsx-05>>=
set.seed(456321)
my.data <- data.frame(x = 1:10, y = 1:10 + rnorm(10))
write.xlsx(my.data, file = "data/my-data.xlsx", sheetName = "first copy")
write.xlsx(my.data, file = "data/my-data.xlsx", sheetName = "second copy", append = TRUE)
@

When opened in Excel we get a workbook, containing two worksheets, named using the arguments we passed through \code{sheetName} in the code chunk above.
\begin{center}
\includegraphics[width=0.75\textwidth]{data/my-data-xlsx.png}
\end{center}

\subsection{Statistical software}

\subsection{Databases}
102 changes: 102 additions & 0 deletions R.data.tex
@@ -0,0 +1,102 @@
% !Rnw root = appendix.main.Rnw
10 changes: 3 additions & 7 deletions R.plotting.Rnw
@@ -1756,17 +1756,13 @@ ggplot(fake2.data, aes(z, y)) +
Play with the values of the arguments to \code{annotate} to vary the position, size, color, font family, font face, rotation angle and justification of the annotation.
\end{playground}

We can add lines to mark the origin more precisely and effectively. (With \ggplot 2.2.1 we need to add a redundant \code{x} or \code{y} argument to get any effect of the annotations. Issue raised in Github 2017-01-22.)
We can add lines to mark the origin more precisely and effectively. With \ggplot 2.2.1 we cannot use \code{annotate} with \code{geom = "vline"} or \code{geom = "hline"}, but we can achieve the same effect by directly adding the \emph{geometry} to the plot.

<<annotate-02>>=
ggplot(fake2.data, aes(z, y)) +
geom_point() +
annotate(geom = "hline",
yintercept = 0, y = 0,
color = "blue") +
annotate(geom = "vline",
xintercept = 0, x = 0,
color = "blue")
geom_hline(yintercept = 0, color = "blue") +
geom_vline(xintercept = 0, color = "blue")
@

\begin{playground}
15 changes: 6 additions & 9 deletions appendixes.prj
@@ -4,16 +4,16 @@
1
1
using-r-main.Rnw
42
41
15
1
0

using-r-main.Rnw
TeX:RNW:UTF-8
152055803 0 -1 16517 -1 16521 208 208 1568 731 0 1 497 128 -1 -1 0 0 198 -1 -1 198 2 0 16521 -1 1 5399 -1 0 -1 0
420491259 0 -1 5410 -1 5414 208 208 1568 731 0 1 169 96 -1 -1 0 0 198 -1 -1 198 2 0 5414 -1 1 5399 -1 0 -1 0
R.data.Rnw
TeX:RNW
269496315 0 -1 19057 -1 18878 26 26 977 443 0 1 353 160 -1 -1 0 0 31 -1 -1 31 1 0 18878 -1 0 -1 0
1060859 1 -1 24087 -1 24326 26 26 977 443 0 1 41 128 -1 -1 0 0 31 -1 -1 31 1 0 24326 -1 0 -1 0
usingr.sty
TeX:STY
1060850 1 67 21 67 30 234 234 1598 724 0 0 257 131 -1 -1 0 0 25 0 0 25 1 0 30 67 0 0 0
@@ -22,7 +22,7 @@ TeX:RNW
17838075 1 -1 222 -1 651 26 26 924 603 1 1 89 304 -1 -1 0 0 30 -1 -1 30 1 0 651 -1 0 -1 0
R.plotting.Rnw
TeX:RNW
17838075 0 -1 727 -1 1004 130 130 1516 559 1 1 241 -336 -1 -1 0 0 31 -1 -1 31 4 0 1004 -1 1 20784 -1 2 74107 -1 3 78085 -1 0 -1 0
17838075 0 -1 87202 -1 87410 130 130 1516 559 1 1 649 256 -1 -1 0 0 31 -1 -1 31 4 0 87410 -1 1 20784 -1 2 74107 -1 3 78085 -1 0 -1 0
C:\Program Files\MiKTeX 2.9\tex\latex\biblatex\biblatex.sty
TeX:STY:UNIX
1159154 0 0 1 0 1 64 64 977 575 0 0 25 0 -1 -1 0 0 42 0 0 42 1 0 1 0 0 0 0
@@ -55,10 +55,7 @@ TeX:RNW
1060859 0 -1 452 -1 970 104 104 853 490 0 1 433 272 -1 -1 0 0 31 -1 -1 31 1 0 970 -1 0 -1 0
using-r-main.tex
TeX
269496315 7 -1 40303 -1 40302 0 0 1009 511 0 1 41 160 -1 -1 0 0 73 -1 -1 73 1 0 40302 -1 0 -1 0
using-r-main.toc
TeX:AUX
269496306 0 135 1 75 1 64 64 1390 511 0 0 25 160 -1 -1 0 0 103 0 0 103 1 0 1 75 0 0 0
1060859 7 -1 54402 -1 54400 0 0 1009 511 0 1 233 160 -1 -1 0 0 73 -1 -1 73 1 0 54400 -1 0 -1 0
:\aphalo\Documents\Own_manuscripts\Books\r4photobiology\plots.Rnw
TeX:RNW
269496315 1 -1 81514 -1 81534 78 78 926 482 0 1 187 306 -1 -1 0 0 26 -1 -1 26 1 0 81534 -1 0 -1 0
Binary file added data/Book1-xlsx.png
Binary file added data/Book1.xlsx
Binary file not shown.
Binary file added data/my-data-xlsx.png
Binary file added data/my-data.xlsx
Binary file not shown.
3 changes: 3 additions & 0 deletions my-file7.txt
@@ -0,0 +1,3 @@
abcd
hello world
123.45
4 changes: 2 additions & 2 deletions using-r-main.Rnw
@@ -155,7 +155,7 @@ options(warnPartialMatchAttr = FALSE,
@

<<own-set-up, echo=FALSE, include=FALSE, cache=FALSE>>=
incl_all <- FALSE
incl_all <- TRUE
eval_diag <- FALSE
@
@@ -212,7 +212,7 @@ I have been using \R since around 1998 or 1999, but I am still constantly learning

\begin{infobox}
\noindent
\textbf{Status as of 2017-01-22.} Wrote section on ggplot2 themes, and on using system- and Google fonts in ggpplots with the help of package \pkgname{showtext}. Expanded section on \ggplot's \code{annotation}, and revised some sections in the ``R scripts and Programming'' chapter.
\textbf{Status as of 2017-01-24.} Wrote section on ggplot2 themes, and on using system- and Google fonts in ggplots with the help of package \pkgname{showtext}. Expanded section on \ggplot's \code{annotation}, and revised some sections in the ``R scripts and Programming'' chapter. Started writing the data chapter. Wrote a draft on writing and reading text files.

\textbf{Status as of 2017-01-17.} Added ``playground'' exercises to the chapter describing \ggplot, and converted some of the examples earlier part of the main text into these playground items. Added icons to help readers quickly distinguish playground sections (\textcolor{blue}{\noticestd{"0055}}), information sections (\textcolor{blue}{\modpicts{"003D}}), warnings about things one needs to be specially aware of (\colorbox{yellow}{\typicons{"E136}}) and boxes with more advanced content that may require longer time/more effort to grasp (\typicons{"E04E}). Added to the sections \code{scales} and examples in the \ggplot chapter details about the use of colors in R and \ggplot2. Removed some redundant examples, and updated the section on \code{plotmath}. Added terms to the alphabetical index. Increased line-spacing to avoid uneven spacing with inline code bits.
