Revise text files section and add first part of readr section to data chapter.
aphalo committed Jan 24, 2017
1 parent 9bd5728 commit e73b7fa
Showing 10 changed files with 14,266 additions and 9,296 deletions.
127 changes: 116 additions & 11 deletions R.data.Rnw
@@ -7,6 +7,16 @@ opts_knit$set(concordance=TRUE)

\chapter{Storing and manipulating data with R}\label{chap:R:data}

\section{Packages used in this chapter}

To run the examples in this chapter, you first need to load the following packages from the library:

<<message=FALSE>>=
library(tibble)
library(readr)
library(readxl)
@

\section{Introduction}

By reading previous chapters, you have already become familiar with base R's classes, methods, functions and operators for storing and manipulating data. Several recently developed packages provide somewhat different, and in my view easier, ways of working with data in R without compromising performance to a level that would matter outside the realm of `big data'. Some other recent packages emphasize computation speed, at some cost with respect to simplicity of use, and in particular intuitiveness. Of course, as with any user interface, much depends on one's own preferences and attitudes to data analysis. However, a package designed for maximum efficiency like \pkgname{data.table} requires the user to have a good understanding of computers to be able to understand the compromises and the unusual behavior compared to the rest of R. I will base this chapter on what I mostly use myself for everyday data analysis and scripting, and exclude the complexities of R programming and package development.
@@ -125,9 +135,10 @@ head(list.dirs("."))
Function \code{file.path} can be used to construct a file path from its components in a way that is portable across OSs.
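
As a minimal sketch of this, the chunk below builds a path from two components. The folder and file names are hypothetical, chosen only for illustration, and the file does not need to exist.

<<sketch-file-path>>=
# "data" and "my-file1.csv" are illustrative names, not files used elsewhere.
sketch.path <- file.path("data", "my-file1.csv")
sketch.path
# The components can be recovered from the constructed path.
dirname(sketch.path)
basename(sketch.path)
@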

\subsection{Text files}
Text files come in many different sizes and formats, but can be divided into two broad groups: those with fixed-format fields and those with delimited fields. Fixed-format fields were especially common in the early days of FORTRAN and COBOL, when computers had very limited resources. They can usually encode information using fewer characters than delimited fields. The best way of understanding the differences is with examples. We first discuss base R functions, and starting on page \pageref{sec:files:readr} we discuss the functions defined in package \pkgname{readr}.

In a format with delimited fields, a delimiter, in this case ``,'', is used to separate the values to be read. In this example, the values are aligned by inserting ``white space''. This is what is called comma-separated-values format (CSV). Functions \code{write.csv()} and \code{read.csv()} can be used to write and read these files using the conventions used in this example.
\begin{verbatim}
 1.0, 24.5, 346, ABC
23.4, 45.6,  78, ZXY
@@ -139,52 +150,85 @@ When reading a CSV file, white space is ignored and fields recognized based on s
23.4,45.6,78,ZXY
\end{verbatim}

With a fixed format for fields no delimiters are needed, but a description of the format is required. Decoding is based solely on the position of the characters in the line or record, so a file like this cannot be interpreted without a description of the format used for saving the data. Files containing data stored in fixed-width fields can be read with base R function \code{read.fwf()}. Records can be stored in multiple lines, each line with fields of different but fixed widths.
\begin{verbatim}
 10245346ABC
234456 78ZXY
\end{verbatim}
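
As a minimal sketch of how such a file can be decoded, the chunk below writes two records like those shown above to a temporary file, padded so that every field is three characters wide, and reads them back with \code{read.fwf()}. The field widths and the object names (\code{fwf.file}, \code{fwf.df}) are assumptions made for this illustration.

<<sketch-fwf>>=
# Write two fixed-width records to a temporary file (illustrative data).
fwf.file <- tempfile(fileext = ".txt")
writeLines(c(" 10245346ABC", "234456 78ZXY"), fwf.file)
# The description of the format: four fields, each three characters wide.
fwf.df <- read.fwf(fwf.file, widths = c(3, 3, 3, 3))
fwf.df
@

Function \code{read.fwf()} knows nothing about an implied decimal marker, so in this sketch the first two columns are read as whole numbers.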

Function \code{read.fortran()} is a wrapper on \code{read.fwf()} that accepts format definitions similar to those used in FORTRAN, but not completely compatible with them. One particularity of FORTRAN \emph{formatted data transfer} is that the decimal marker can be omitted in the saved file and its position specified as part of the format definition. This is an additional trick used to make text files (or stacks of punch cards) smaller.
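
The implied decimal marker can be sketched as follows, assuming records of four fields, each three characters wide, in which the first two fields have one implied decimal place. The data, the format strings and the object names are assumptions made for this illustration.

<<sketch-fortran>>=
# Write two fixed-width records to a temporary file (illustrative data).
fortran.file <- tempfile(fileext = ".txt")
writeLines(c(" 10245346ABC", "234456 78ZXY"), fortran.file)
# "F3.1": numeric, 3 characters, 1 implied decimal place;
# "I3": integer, 3 characters; "A3": text, 3 characters.
fortran.df <- read.fortran(fortran.file,
                           format = c("F3.1", "F3.1", "I3", "A3"))
fortran.df
@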

R functions \code{write.table()} and \code{read.table()} default to separating fields with white space. Functions \code{write.csv()} and \code{read.csv()} have defaults for their arguments suitable for writing and reading CSV files in English-language locales. Functions \code{write.csv2()} and \code{read.csv2()} are similar, but have defaults for delimiters and decimal markers suitable for CSV files in locales with languages like Spanish, French, or Finnish that use comma (,) as decimal marker and semi-colon (;) as field delimiter. Another frequently used field delimiter is the ``tab'' or tabulator character, and sometimes any white space character (tab, space). In most cases the records (observations) are delimited by new lines, but this is not the only possible approach, as the user can pass the delimiters to be used as arguments in the function call.
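
As a minimal sketch of passing a delimiter explicitly, the chunk below writes and reads a tab-delimited file. The data frame and the temporary file are made up for this illustration.

<<sketch-tab>>=
tab.df <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))
tab.file <- tempfile(fileext = ".txt")
# Use the tab character as field delimiter when writing...
write.table(tab.df, file = tab.file, sep = "\t", row.names = FALSE)
# ...and pass the same delimiter when reading the file back.
tab_read.df <- read.table(tab.file, header = TRUE, sep = "\t")
all.equal(tab.df, tab_read.df)
@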

We give examples of the use of the functions described in the paragraphs above, first writing data to a file and then reading this file back into the workspace. The \code{write} functions take as argument a data frame or an object that can be coerced into a data frame. In contrast to \code{save()}, these functions can only write data that are in a tabular or matrix-like arrangement.

<<file-io-txt-01>>=
my.df <- data.frame(x = 1:5, y = 5:1 / 10)
@

We write a CSV file suitable for an English-language locale, and then display its contents. In most cases, setting \code{row.names = FALSE} when writing a CSV file makes it easier to read the file back. Of course, if the row names contain important information, such as gene tags, you cannot skip writing them to the file unless you first copy these data into a column of the data frame. (Row names are stored separately, as an attribute, in \code{data.frame} objects.)
<<file-io-txt-02>>=
write.csv(my.df, file = "my-file1.csv", row.names = FALSE)
file.show("my-file1.csv", pager = "console")
@

<<file-io-txt-02a, comment='', echo=FALSE>>=
cat(readLines('my-file1.csv'), sep = '\n')
@

If we had written the file using default settings, reading it back so as to recover the original object would have required overriding the default argument for parameter \code{row.names}.
<<file-io-txt-02b>>=
my_read1.df <- read.csv(file = "my-file1.csv")
my_read1.df
all.equal(my.df, my_read1.df, check.attributes = FALSE)
@
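
The case with default settings can be sketched as follows, using a small data frame and a temporary file made up for this illustration. When row names are written, they have to be mapped back explicitly on reading, here by passing \code{row.names = 1}.

<<sketch-rownames>>=
rn.df <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))
rn.file <- tempfile(fileext = ".csv")
# Default settings: row names are written as the first column.
write.csv(rn.df, file = rn.file)
# Read with defaults: the row names come back as an extra column.
ncol(read.csv(rn.file))
# Tell read.csv() that the first column contains the row names.
rn_read.df <- read.csv(rn.file, row.names = 1)
all.equal(rn.df, rn_read.df, check.attributes = FALSE)
@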

\begin{playground}
Read the file with function \code{read.csv2()} instead of \code{read.csv()}. Although this may look like a waste of time, the point of the exercise is for you to get familiar with R's behaviour in case of such a mistake. This will help you recognize similar errors when they happen accidentally.
\end{playground}

We write a CSV file suitable for a Spanish, Finnish or similar locale, and then display its contents. It can be seen that the same data frame is saved using different delimiters.
<<file-io-txt-03>>=
write.csv2(my.df, file = "my-file2.csv", row.names = FALSE)
file.show("my-file2.csv", pager = "console")
@

<<file-io-txt-03a, comment='', echo=FALSE>>=
cat(readLines('my-file2.csv'), sep = '\n')
@

As with \code{read.csv()}, had we written row names to the file, we would have needed to override the default behaviour.
<<file-io-txt-03b>>=
my_read2.df <- read.csv2(file = "my-file2.csv")
my_read2.df
all.equal(my.df, my_read2.df, check.attributes = FALSE)
@

\begin{playground}
Read the file with function \code{read.csv()} instead of \code{read.csv2()}. This may look like an even more futile exercise than the previous one, but it is not, as the behaviour of R is different. Consider \emph{how} values are erroneously decoded in both exercises. If the \emph{structure} of the data frames read is not clear to you, do use function \code{str()} to look at them.
\end{playground}

We write a file with the fields separated by white space with function \code{write.table()}.
<<file-io-txt-04>>=
write.table(my.df, file = "my-file3.txt", row.names = FALSE)
file.show("my-file3.txt", pager = "console")
@

<<file-io-txt-04a, comment='', echo=FALSE>>=
cat(readLines('my-file3.txt'), sep = '\n')
@

In the case of \code{read.table()} there is no need to override the default, regardless of whether row names are written to the file or not. The reason is related to the default behaviour of the \code{write} functions: whether or not they write a column name (\code{""}, an empty character string) for the first column, the one containing the row names.
<<file-io-txt-04b>>=
my_read3.df <- read.table(file = "my-file3.txt", header = TRUE)
my_read3.df
all.equal(my.df, my_read3.df, check.attributes = FALSE)
@

\begin{playground}
If you are still unclear about why the files were decoded in the way they were, now try to read them with \code{read.table()}. Do the three examples now make sense to you?
\end{playground}

Function \code{cat()} takes R objects and writes them, after conversion to character strings, to a file, inserting one or more characters as separators, by default a space. This separator can be set through parameter \code{sep}. In our example we set \code{sep} to a new line (entered as the escape sequence \code{"\\n"}).

<<file-io-txt-05>>=
@@ -197,6 +241,67 @@ file.show("my-file4.txt", pager = "console")
cat(readLines('my-file4.txt'), sep = '\n')
@

<<file-io-txt-05b>>=
my_read.lines <- readLines('my-file4.txt')
my_read.lines
all.equal(my.lines, my_read.lines, check.attributes = FALSE)
@
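
The round trip with \code{cat()} and \code{readLines()} can also be sketched in a self-contained way. The character vector and the temporary file below are made up for this illustration and are independent of the objects used above.

<<sketch-cat>>=
sketch.lines <- c("first line", "second line", "third line")
sketch.file <- tempfile(fileext = ".txt")
# Write one element per line by using a new line as separator.
cat(sketch.lines, file = sketch.file, sep = "\n")
# warn = FALSE silences any warning about a missing final new line.
sketch_read.lines <- readLines(sketch.file, warn = FALSE)
identical(sketch.lines, sketch_read.lines)
@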

\begin{warningbox}
There are a couple of things to take into account when reading data from text files using base R function \code{read.table()} and its relatives: by default, columns containing character strings are converted into factors (in R versions earlier than 4.0.0), and column names are sanitised (spaces and other ``inconvenient'' characters are replaced with dots).
\end{warningbox}
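
Both defaults can be overridden, as in the following minimal sketch. The CSV text is made up for this illustration and is passed through parameter \code{text} instead of being read from a file.

<<sketch-read-defaults>>=
csv.text <- "my col,group\n1,a\n2,b"
# Default behaviour: the column name containing a space is sanitised.
names(read.csv(text = csv.text))
# Keep names as written and character strings as character vectors.
asis.df <- read.csv(text = csv.text, check.names = FALSE,
                    stringsAsFactors = FALSE)
names(asis.df)
class(asis.df$group)
@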

\subsection[readr]{\pkgname{readr}}\label{sec:files:readr}

<<>>=
citation(package = "readr")
@

Package \pkgname{readr} is part of the \pkgname{tidyverse} suite. It defines functions that allow much faster input and output, and that have different default behaviour. In contrast to base R functions, they are optimized for speed, but they may sometimes decode their input wrongly, occasionally doing so silently even for CSV files that are correctly decoded by the base functions. Base R functions are dumb: the file format or delimiters must be supplied as arguments. The \pkgname{readr} functions use ``magic'' to guess the format. In most cases they succeed, which is very handy, but occasionally the power of the magic is not strong enough. The ``magic'' can be overridden by passing arguments. Another important advantage is that these functions read character strings formatted as dates or times directly into columns of the corresponding date or time classes, such as \code{Date} or \code{POSIXct}.
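
As a minimal sketch of both the guessing and its overriding, the chunk below reads a small file containing a column of dates, first relying on the guessed column types and then forcing that column to be read as character strings. The file contents and the object names are assumptions made for this illustration.

<<sketch-readr-dates>>=
library(readr)
dates.file <- tempfile(fileext = ".csv")
writeLines(c("when,value", "2017-01-24,1.5", "2017-01-25,2.5"),
           dates.file)
# Guessed column types: the dates are decoded into class Date.
guessed.tb <- read_csv(dates.file)
class(guessed.tb$when)
# Overriding the guess: read the same column as character strings.
forced.tb <- read_csv(dates.file,
                      col_types = cols(when = col_character()))
class(forced.tb$when)
@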

\begin{warningbox}
The names of the functions ``equivalent'' to those described in the previous section are formed by replacing the dot with an underscore, e.g.\ \code{read\_csv()} $\approx$ \code{read.csv()}. The similarity refers to the format of the files read, but not to the order, names or roles of their formal parameters. Function \code{read\_table()} behaves differently from \code{read.table()}: although both read fields separated by white space, \code{read\_table()} expects the fields in successive records (usually lines) to be vertically aligned, while \code{read.table()} tolerates vertical misalignment. Other aspects of the default behaviour are also different; for example, these functions do not convert columns of character strings into factors, and row names are not set in the returned data frame (in fact a \code{tibble}, which inherits from \code{data.frame}).
\end{warningbox}
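
Two of these differences can be checked with a minimal sketch like the one below, which reads a small file made up for this illustration and inspects the class of the returned object and of a column of character strings.

<<sketch-readr-tibble>>=
library(readr)
tibble.file <- tempfile(fileext = ".csv")
writeLines(c("x,group", "1,a", "2,b"), tibble.file)
readr.tb <- read_csv(tibble.file)
# The returned object is a tibble, which inherits from data.frame.
class(readr.tb)
# Character strings are kept as character vectors, not factors.
class(readr.tb$group)
@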

<<readr-01>>=
read_csv(file = "my-file1.csv")
@

<<readr-02>>=
read_csv2(file = "my-file2.csv")
@

Because of the vertically misaligned fields, we need to use \code{read\_delim()} instead of \code{read\_table()}.
<<readr-03>>=
read_delim(file = "my-file3.txt", delim = " ")
@

We demonstrate here the use of \code{write\_tsv()} to produce a text file with tab-separated fields.
<<readr-04>>=
write_tsv(my.df, "my-file5.tsv")
file.show("my-file5.tsv", pager = "console")
@

<<readr-04a, comment='', echo=FALSE>>=
cat(readLines('my-file5.tsv'), sep = '\n')
@

<<readr-04b>>=
my_read4.df <- read_tsv(file = "my-file5.tsv")
my_read4.df
all.equal(my.df, my_read4.df, check.attributes = FALSE)
@

We demonstrate here the use of \code{write\_excel\_csv()} to produce a text file with comma-separated fields suitable for reading with Excel.
<<readr-05>>=
write_excel_csv(my.df, "my-file6.csv")
file.show("my-file6.csv", pager = "console")
@

<<readr-05a, comment='', echo=FALSE>>=
cat(readLines('my-file6.csv'), sep = '\n')
@

\subsection{Worksheets}

\subsection{Statistical software}
32 changes: 16 additions & 16 deletions appendixes.prj
@@ -10,52 +10,52 @@ using-r-main.Rnw

using-r-main.Rnw
TeX:RNW:UTF-8
152055803 0 -1 16517 -1 16521 208 208 1568 731 0 1 614 140 -1 -1 0 0 198 -1 -1 198 2 0 16521 -1 1 5399 -1 0 -1 0
152055803 0 -1 16517 -1 16521 208 208 1568 731 0 1 497 128 -1 -1 0 0 198 -1 -1 198 2 0 16521 -1 1 5399 -1 0 -1 0
R.data.Rnw
TeX:RNW
269496315 0 -1 12418 -1 12290 26 26 977 443 0 1 214 200 -1 -1 0 0 31 -1 -1 31 1 0 12290 -1 0 -1 0
269496315 0 -1 19057 -1 18878 26 26 977 443 0 1 353 160 -1 -1 0 0 31 -1 -1 31 1 0 18878 -1 0 -1 0
usingr.sty
TeX:STY
1060850 1 67 21 67 30 234 234 1598 724 0 0 316 41 -1 -1 0 0 25 0 0 25 1 0 30 67 0 0 0
1060850 1 67 21 67 30 234 234 1598 724 0 0 257 131 -1 -1 0 0 25 0 0 25 1 0 30 67 0 0 0
R.more.plotting.Rnw
TeX:RNW
17838075 0 -1 7010 -1 6919 26 26 924 603 1 1 534 140 -1 -1 0 0 30 -1 -1 30 1 0 6919 -1 0 -1 0
17838075 1 -1 222 -1 651 26 26 924 603 1 1 89 304 -1 -1 0 0 30 -1 -1 30 1 0 651 -1 0 -1 0
R.plotting.Rnw
TeX:RNW
17838075 2 -1 116924 -1 116940 130 130 1516 559 1 1 574 320 -1 -1 0 0 31 -1 -1 31 4 0 116940 -1 1 20784 -1 2 74107 -1 3 78085 -1 0 -1 0
17838075 0 -1 727 -1 1004 130 130 1516 559 1 1 241 -336 -1 -1 0 0 31 -1 -1 31 4 0 1004 -1 1 20784 -1 2 74107 -1 3 78085 -1 0 -1 0
C:\Program Files\MiKTeX 2.9\tex\latex\biblatex\biblatex.sty
TeX:STY:UNIX
1159154 0 0 1 0 1 64 64 977 575 0 0 26 0 -1 -1 0 0 42 0 0 42 1 0 1 0 0 0 0
1159154 0 0 1 0 1 64 64 977 575 0 0 25 0 -1 -1 0 0 42 0 0 42 1 0 1 0 0 0 0
.git\gitHeadInfo.gin
DATA:UNIX
5341426 0 0 1 17 1 78 78 1029 495 1 0 86 340 -1 -1 0 0 14 0 0 14 1 0 1 17 0 0 0
5341426 0 0 1 17 1 78 78 1029 495 1 0 73 272 -1 -1 0 0 14 0 0 14 1 0 1 17 0 0 0
R.maps.Rnw
TeX:RNW
1060859 0 -1 2188 -1 2191 64 64 974 522 0 1 54 120 -1 -1 0 0 42 -1 -1 42 1 0 2191 -1 0 -1 0
1060859 0 -1 2188 -1 2191 64 64 974 522 0 1 49 112 -1 -1 0 0 42 -1 -1 42 1 0 2191 -1 0 -1 0
R.intro.Rnw
TeX:RNW
17838075 0 -1 102 -1 622 182 182 1542 705 0 1 44 240 -1 -1 0 0 22 -1 -1 22 1 0 622 -1 0 -1 0
17838075 0 -1 102 -1 622 182 182 1542 705 0 1 41 244 -1 -1 0 0 22 -1 -1 22 1 0 622 -1 0 -1 0
rbooks.bib
BibTeX:UNIX
1147890 0 758 7 758 7 52 52 872 313 0 1 104 300 -1 -1 0 0 21 0 0 21 1 0 7 758 0 -1 0
1147890 0 758 7 758 7 52 52 872 313 0 1 89 288 -1 -1 0 0 21 0 0 21 1 0 7 758 0 -1 0
R.as.calculator.Rnw
TeX:RNW
1060859 4 -1 33164 -1 33165 26 26 1386 549 0 1 284 60 -1 -1 0 0 31 -1 -1 31 1 0 33165 -1 0 -1 0
1060859 4 -1 33164 -1 33165 26 26 1386 549 0 1 233 52 -1 -1 0 0 31 -1 -1 31 1 0 33165 -1 0 -1 0
R.scripts.Rnw
TeX:RNW
17838075 4 -1 10955 -1 10956 78 78 1438 601 1 1 424 160 -1 -1 0 0 31 -1 -1 31 1 0 10956 -1 0 -1 0
17838075 4 -1 10955 -1 10956 78 78 1438 601 1 1 345 160 -1 -1 0 0 31 -1 -1 31 1 0 10956 -1 0 -1 0
references.bib
BibTeX
1049586 1 49 7 49 14 0 0 820 242 0 1 174 180 -1 -1 0 0 23 0 0 23 1 0 14 49 0 -1 0
1049586 1 49 7 49 14 0 0 820 242 0 1 145 176 -1 -1 0 0 23 0 0 23 1 0 14 49 0 -1 0
R.functions.Rnw
TeX:RNW
17838075 0 -1 6964 -1 6590 130 130 1490 653 0 1 314 200 -1 -1 0 0 30 -1 -1 30 1 0 6590 -1 0 -1 0
17838075 0 -1 6964 -1 6590 130 130 1490 653 0 1 257 212 -1 -1 0 0 30 -1 -1 30 1 0 6590 -1 0 -1 0
R.friends.Rnw
TeX:RNW
1060859 0 -1 452 -1 970 104 104 853 490 0 1 534 280 -1 -1 0 0 31 -1 -1 31 1 0 970 -1 0 -1 0
1060859 0 -1 452 -1 970 104 104 853 490 0 1 433 272 -1 -1 0 0 31 -1 -1 31 1 0 970 -1 0 -1 0
using-r-main.tex
TeX
269496315 7 -1 34436 -1 34440 0 0 1009 511 0 1 244 200 -1 -1 0 0 73 -1 -1 73 1 0 34440 -1 0 -1 0
269496315 7 -1 40303 -1 40302 0 0 1009 511 0 1 41 160 -1 -1 0 0 73 -1 -1 73 1 0 40302 -1 0 -1 0
using-r-main.toc
TeX:AUX
269496306 0 135 1 75 1 64 64 1390 511 0 0 25 160 -1 -1 0 0 103 0 0 103 1 0 1 75 0 0 0
11 changes: 11 additions & 0 deletions my-file5.tsv
@@ -0,0 +1,11 @@
x y
1 10
2 9
3 8
4 7
5 6
6 5
7 4
8 3
9 2
10 1
11 changes: 11 additions & 0 deletions my-file5.txt
@@ -0,0 +1,11 @@
x y
1 10
2 9
3 8
4 7
5 6
6 5
7 4
8 3
9 2
10 1
11 changes: 11 additions & 0 deletions my-file6.csv
@@ -0,0 +1,11 @@
x,y
1,10
2,9
3,8
4,7
5,6
6,5
7,4
8,3
9,2
10,1
10 changes: 10 additions & 0 deletions using-r-main.idx
@@ -44,3 +44,13 @@
\indexentry{packages!data.table@\textsf {data.table}}{1}
\indexentry{MS-Windows@\textsf {MS-Windows}}{4}
\indexentry{programmes!MS-Windows@\textsf {MS-Windows}}{4}
\indexentry{readr@\textsf {readr}}{6}
\indexentry{packages!readr@\textsf {readr}}{6}
\indexentry{readr@\textsf {readr}}{11}
\indexentry{packages!readr@\textsf {readr}}{11}
\indexentry{readr@\textsf {readr}}{12}
\indexentry{packages!readr@\textsf {readr}}{12}
\indexentry{tidyverse@\textsf {tidyverse}}{12}
\indexentry{packages!tidyverse@\textsf {tidyverse}}{12}
\indexentry{readr@\textsf {readr}}{12}
\indexentry{packages!readr@\textsf {readr}}{12}