Proof fixes ch-0 to 7
aphalo committed Feb 14, 2024
1 parent b842ee0 commit 612b530
Showing 35 changed files with 253,924 additions and 253,710 deletions.
2 binary files changed, contents not shown.
102 changes: 51 additions & 51 deletions R.as.calculator.Rnw


82 changes: 41 additions & 41 deletions R.data.containers.Rnw


20 changes: 10 additions & 10 deletions R.data.io.Rnw
@@ -209,7 +209,7 @@ When \Rpgrm is used in batch mode, the ``files'' \code{stdin}, \code{stdout} and
\index{importing data!text files|(}\index{file formats!plain text}
In general, text files are the most portable approach to data storage but usually also the least efficient with respect to file size. Text files are composed of encoded characters, which makes them easy to edit with text editors and easy to read from programs written in most programming languages. On the other hand, the arrangement of the data encoded as characters can follow two different approaches: positional, or based on a specific character used as a separator.

The positional approach is more concise but almost unreadable to humans, as the values run into each other. Reading data stored using the positional approach requires access to a format definition; this approach was common in FORTRAN and COBOL at the time when punch cards were used to store data. In the case of separator-based formats, different separators are in common use. Comma-separated values (CSV) encodings use either a comma or a semicolon to separate the fields or columns. Tab-separated values (TSV) encodings use the tab character as the column separator. Sometimes white space is used as a separator, most commonly when all values are to be converted to \code{numeric}.
The positional approach is more concise but almost unreadable to humans, as the values run into each other. Reading data stored using the positional approach requires access to a format definition; this approach was common in FORTRAN and COBOL at the time when punch cards were used to store data. In the case of separator-based formats, different separators are in common use. Comma-separated values (CSV) encodings use either a comma or a semicolon to separate the fields or columns. Tab-separated values (TSV) encodings use the tab character as the column separator. Sometimes whitespace is used as a separator, most commonly when all values are to be converted to \code{numeric}.

\begin{explainbox}
\textbf{Not all text files are born equal.}\index{importing data!R names} When reading text files, and \emph{foreign} binary files which may contain embedded text strings, there is potential for their misinterpretation during the import operation. One common source of problems is that column headers are read as \Rlang names. As discussed earlier, there are strict rules, such as avoiding spaces or special characters, if the names are to be used with the normal \Rlang syntax. On import, some functions will attempt to sanitise the names, but others will not. Most such names remain accessible in \Rlang statements, but a special syntax is needed to protect them from triggering syntax errors through their interpretation as something other than variable or function names---in \Rlang jargon we say that they need to be quoted.
@@ -220,7 +220,7 @@ Some of the things we need to be on the watch for are:
3) Wrongly guessed column classes---a typing mistake affecting a single value in a column, e.g., the wrong kind of decimal marker, can prevent the column from being recognised as numeric.
4) Mismatched decimal marker in \code{CSV} files---the marker depends on the locale (language and country settings).

If you encounter problems after import, such as failure to extract data frame columns by name, use function \code{names()} to get the names printed to the console as a character vector. This is useful because character vectors are always printed with each string delimited by quotation marks, making leading and trailing spaces clearly visible. The same applies to the use of \code{levels()} with factors created from data that might have contained mistakes or white space.
If you encounter problems after import, such as failure to extract data frame columns by name, use function \code{names()} to get the names printed to the console as a character vector. This is useful because character vectors are always printed with each string delimited by quotation marks, making leading and trailing spaces clearly visible. The same applies to the use of \code{levels()} with factors created from data that might have contained mistakes or whitespace.

To demonstrate some of these problems, I create a data frame with name sanitation disabled, and then, in a second statement, with sanitation enabled. The first statement is equivalent to the default behaviour of functions in package \pkgname{readr}, and the second is equivalent to the behaviour of base \Rlang functions. \pkgname{readr} prioritises the integrity of the original data, while \Rlang prioritises compatibility with \Rlang's naming rules.
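
The book's own code chunk is elided from this diff; the following is only a minimal sketch of the comparison described above, with a made-up chunk label, column name, and values.

<<file-io-names-sketch>>=
# sanitation disabled: the name is kept verbatim, as readr functions do by default
df_raw <- data.frame("col 1" = 1:3, check.names = FALSE)
names(df_raw)
# sanitation enabled: the default for base R functions, which calls make.names()
df_clean <- data.frame("col 1" = 1:3, check.names = TRUE)
names(df_clean)
# a non-syntactic name must be quoted with backticks to be used directly
df_raw$`col 1`
@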

@@ -245,7 +245,7 @@ The hardest part of all these problems is to diagnose their origin, as function
Text files containing data in columns can be divided into two broad groups: those with fixed-width fields and those with delimited fields. Fixed-width fields were especially common in the early days of \langname{FORTRAN} and \langname{COBOL}, when data storage capacity was very limited. These formats are frequently capable of encoding information using fewer characters than when delimited fields are used. The best way of understanding the differences is with examples. Although in this section we exemplify the use of functions by passing a file name as an argument, URLs and open file descriptors are also accepted (see section \ref{sec:io:connections} on page \pageref{sec:io:connections}).
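
As an aside, base \Rlang provides \Rfunction{read.fwf()} for reading files with fixed-width fields. The sketch below uses a made-up chunk label, file name, and field widths, and is not evaluated here.

<<file-io-fwf-sketch, eval=FALSE>>=
# three fields occupying 8, 4 and 6 characters of each record, no header row
read.fwf("my-data-fw.txt", widths = c(8, 4, 6), header = FALSE)
@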

\begin{warningbox}
Whether columns containing character strings that cannot be converted into numbers are converted into factors or remain as character strings in the returned data frame depends on the value passed to parameter \code{stringsAsFactors}. The default changed in \Rlang version 4.0.0 from \code{TRUE} to \code{FALSE}. If code is to work consistently in old and new versions of \Rlang, \code{stringsAsFactors = FALSE} has to be passed explicitly in calls to \Rfunction{read.csv} (the approach used in the book).
Whether columns containing character strings that cannot be converted into numbers are converted into factors or remain as character strings in the returned data frame depends on the value passed to parameter \code{stringsAsFactors}. The default changed in \Rlang version 4.0.0 from \code{TRUE} to \code{FALSE}. If code is to work consistently in old and new versions of \Rlang, \code{stringsAsFactors = FALSE} has to be passed explicitly in calls to \Rfunction{read.csv()} (the approach used in the book).
\end{warningbox}
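
A minimal sketch of such an explicit call, with a made-up chunk label and file name, not evaluated here:

<<strings-as-factors-sketch, eval=FALSE>>=
# explicit argument: behaves the same in R < 4.0.0 and in R >= 4.0.0
my.df <- read.csv("my-data.csv", stringsAsFactors = FALSE)
@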

In\index{text files!CSV files}\index{text files!TSV files} the first example, a file with fields delimited solely by ``,'' is read. This is the so-called comma-separated values (CSV) format, which can be read and written with \Rfunction{read.csv()} and \Rfunction{write.csv()}, respectively.
@@ -273,7 +273,7 @@ sapply(from_csv_a.df, class)
Read the file \code{not-aligned-ASCII-UK.csv} with function \Rfunction{read.csv2()} instead of \Rfunction{read.csv()}. Although this may look like a waste of time, the point of the exercise is for you to get familiar with \Rlang behaviour in case of such a mistake. This will help you recognise similar errors when they happen accidentally, which is quite common when files are shared.
\end{playground}

Example file \code{aligned-ASCII-UK.csv} contains comma-separated values with added white space to align the columns, making the file easier for humans to read.
Example file \code{aligned-ASCII-UK.csv} contains comma-separated values with added whitespace to align the columns, making the file easier for humans to read.

The contents of file \code{aligned-ASCII-UK.csv} are shown below.

@@ -295,7 +295,7 @@ from_csv_b.df[["col4"]]
sapply(from_csv_b.df, class)
@

By default, column names are sanitised but white space in character strings is kept. Passing an additional argument changes this default so that leading and trailing white space is discarded. Most likely this default has been chosen to maintain the integrity of the data.
By default, column names are sanitised but whitespace in character strings is kept. Passing an additional argument changes this default so that leading and trailing whitespace is discarded. Most likely this default has been chosen to maintain the integrity of the data.

<<file-io-csv-05>>=
from_csv_c.df <-
@@ -310,7 +310,7 @@ sapply(from_csv_c.df, class)
@
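
The additional argument alluded to above is presumably \code{strip.white} (the call itself is elided from this diff); a minimal sketch with a made-up chunk label and file name, not evaluated here:

<<strip-white-sketch, eval=FALSE>>=
# strip.white = TRUE discards leading and trailing spaces from unquoted
# character fields; it is honoured only when a separator is specified
read.csv("my-data.csv", strip.white = TRUE)
@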

\begin{explainbox}
When\index{importing data!character to factor conversion} character strings are converted into factors, leading and trailing white space is retained in the labels of the factor levels. Leading and trailing white space is difficult to see when data frames are printed, as shown below. This example shows the kind of problems that were frequently encountered in earlier versions of \Rlang and that can still occur when factors are created. The recommended approach is to use the default \code{stringsAsFactors = FALSE} and do the conversion into factors in a separate step.
When\index{importing data!character to factor conversion} character strings are converted into factors, leading and trailing whitespace is retained in the labels of the factor levels. Leading and trailing whitespace is difficult to see when data frames are printed, as shown below. This example shows the kind of problems that were frequently encountered in earlier versions of \Rlang and that can still occur when factors are created. The recommended approach is to use the default \code{stringsAsFactors = FALSE} and do the conversion into factors in a separate step.

<<file-io-csv-03b>>=
from_csv_b.df <-
@@ -330,7 +330,7 @@ Decimal\index{importing data!decimal marker} points and exponential notation are

This is handled by using functions \Rfunction{read.csv2()} and \Rfunction{write.csv2()}. Furthermore, parameters \code{dec} and \code{sep} allow setting the decimal marker and field separator to arbitrary character strings.
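
A minimal sketch of both approaches, with a made-up chunk label and file name, not evaluated here:

<<dec-sep-sketch, eval=FALSE>>=
# read.csv2() assumes dec = "," and sep = ";"
read.csv2("my-data-comma-dec.csv")
# a roughly equivalent call through read.table(), setting the markers explicitly
read.table("my-data-comma-dec.csv", header = TRUE, dec = ",", sep = ";")
@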

Function \Rfunction{read.table()} does the actual work, and functions like \Rfunction{read.csv()} differ only in the default arguments for the different parameters. By default, \Rfunction{read.table()} expects fields to be separated by white space (one or more spaces, tabs, newlines, or carriage returns).
Function \Rfunction{read.table()} does the actual work, and functions like \Rfunction{read.csv()} differ only in the default arguments for the different parameters. By default, \Rfunction{read.table()} expects fields to be separated by whitespace (one or more spaces, tabs, newlines, or carriage returns).

The contents of file \code{aligned-ASCII.txt} are shown below.

@@ -339,7 +339,7 @@ cat(readLines("extdata/aligned-ASCII.txt"), sep = "\n")
@

The file is read, and the returned value is stored in a variable named \code{from\_txt\_b.df} and printed.
Leading and trailing white space is removed because it is recognised as part of the separators. For character strings containing embedded spaces to be decoded as a single value, they need to be quoted in the file, as in \code{aligned-ASCII.txt} above.
Leading and trailing whitespace is removed because it is recognised as part of the separators. For character strings containing embedded spaces to be decoded as a single value, they need to be quoted in the file, as in \code{aligned-ASCII.txt} above.

<<file-io-txt-01>>=
from_txt_b.df <-
@@ -443,7 +443,7 @@ Package \pkgname{readr} is part of the \pkgname{tidyverse} suite. It defines fun
Although in this section we exemplify the use of these functions by passing a file name as an argument, as is the case with \Rlang native functions, URLs and open file descriptors are also accepted (see section \ref{sec:io:connections} on page \pageref{sec:io:connections}). Furthermore,\index{file formats!compressed} if the file name ends in an extension recognisable as indicating a compressed file format, e.g., \code{.gz} or \code{.zip}, the file will be uncompressed on the fly.
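
A minimal sketch of the on-the-fly decompression, with a made-up chunk label and file name, not evaluated here:

<<readr-compressed-sketch, eval=FALSE>>=
# the .gz file is decompressed transparently while being read
read_csv("extdata/my-data.csv.gz", show_col_types = FALSE)
@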

\begin{warningbox}
Functions ``equivalent'' to the native \Rlang functions described in the previous section have names formed by replacing the dot with an underscore, e.g., \Rfunction{read\_csv()} $\approx$ \Rfunction{read.csv()}. The similarity refers to the format of the files read, but not to the order, names, or roles of their formal parameters. For example, function \code{read\_table()} behaves slightly differently from \Rfunction{read.table()}, although they both read fields separated by white space. Row names are not set in the returned \Rclass{tibble}, which inherits from \Rclass{data.frame} but is not fully compatible with it (see section \ref{sec:data:tibble} on page \pageref{sec:data:tibble}).
Functions ``equivalent'' to the native \Rlang functions described in the previous section have names formed by replacing the dot with an underscore, e.g., \Rfunction{read\_csv()} $\approx$ \Rfunction{read.csv()}. The similarity refers to the format of the files read, but not to the order, names, or roles of their formal parameters. For example, function \code{read\_table()} behaves slightly differently from \Rfunction{read.table()}, although they both read fields separated by whitespace. Row names are not set in the returned \Rclass{tibble}, which inherits from \Rclass{data.frame} but is not fully compatible with it (see section \ref{sec:data:tibble} on page \pageref{sec:data:tibble}).
\end{warningbox}
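
As a sketch of this difference (made-up chunk label, not evaluated here), the class of the returned object shows that it is a tibble while still inheriting from \Rclass{data.frame}:

<<readr-class-sketch, eval=FALSE>>=
# read_csv() returns a tibble, which inherits from data.frame
class(read_csv("extdata/aligned-ASCII-UK.csv", show_col_types = FALSE))
@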

\begin{warningbox}
@@ -460,7 +460,7 @@ read_csv(file = "extdata/aligned-ASCII-UK.csv", show_col_types = FALSE)
read_csv(file = "extdata/not-aligned-ASCII-UK.csv", show_col_types = FALSE)
@

Package \pkgname{readr} is under active development, and different major versions are not fully compatible with each other. Because of the misaligned fields in file \code{"not-aligned-ASCII.txt"}, in the past we needed to use \Rfunction{read\_table2()}, which allowed misalignment of fields, similarly to \Rfunction{read.table()}. This function has been renamed \Rfunction{read\_table()} and \Rfunction{read\_table2()} has been deprecated. However, parsing of both files fails if they are read with \Rfunction{read\_table()}: quoted strings containing white space are no longer recognised. See the example above using \Rfunction{read.table()}. The examples below are not run, but are kept as they may work again in the future.
Package \pkgname{readr} is under active development, and different major versions are not fully compatible with each other. Because of the misaligned fields in file \code{"not-aligned-ASCII.txt"}, in the past we needed to use \Rfunction{read\_table2()}, which allowed misalignment of fields, similarly to \Rfunction{read.table()}. This function has been renamed \Rfunction{read\_table()} and \Rfunction{read\_table2()} has been deprecated. However, parsing of both files fails if they are read with \Rfunction{read\_table()}: quoted strings containing whitespace are no longer recognised. See the example above using \Rfunction{read.table()}. The examples below are not run, but are kept as they may work again in the future.

<<readr-03, eval=FALSE>>=
read_table(file = "extdata/aligned-ASCII.txt")