Revise NetCDF and on-line data sections
Add section on 'tidync' and shorten 'ncdf4'.
aphalo committed Jul 28, 2019
1 parent c5a1750 commit eca6348
Showing 15 changed files with 9,929 additions and 9,400 deletions.
119 changes: 56 additions & 63 deletions R.data.io.Rnw
Original file line number Diff line number Diff line change
@@ -56,14 +56,14 @@ library(dplyr)
library(tidyr)
library(readr)
library(readxl)
library(readODS)
library(xlsx)
library(readODS)
library(pdftools)
library(foreign)
library(haven)
library(xml2)
library(RNetCDF)
library(ncdf4)
library(tidync)
library(lubridate)
library(jsonlite)
@
@@ -627,22 +627,24 @@ If you use or have used in the past other statistical software or a general purp

\section{NetCDF files}

In some fields including geophysics and meteorology NetCDF is a very common format for the exchange of data. It is also used in other contexts in which data is referenced to an array of locations, like with data read from Affymetrix micro arrays used to study gene expression. The NetCDF format allows the storage of metadata together with the data itself in a well organized and standardized format, which is ideal for exchange of moderately large data sets.
In some fields including geophysics and meteorology \pgrmname{NetCDF} is a very common format for the exchange of data. It is also used in other contexts in which data are referenced to a grid of locations, as with data read from Affymetrix microarrays used to study gene expression. \pgrmname{NetCDF} files are binary but use a format that allows the storage of metadata describing each variable together with the data itself, in a well organized and standardized format, which is ideal for the exchange of moderately large data sets measured on a spatial or spatio-temporal grid.

Officially described as
\begin{quote}
NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
\pgrmname{NetCDF} is a set of software libraries [from Unidata] and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
\end{quote}

As sometimes NetCDF files are large, it is good that it is possible to selectively read the data from individual variables with functions in packages \pkgname{ncdf4} or \pkgname{RNetCDF}. On the other hand, this implies that contrary to other data file reading operations, reading a NetCDF file is done in two or more steps.
As \pgrmname{NetCDF} files are sometimes large, it is good that it is possible to selectively read the data from individual variables with functions in packages \pkgname{ncdf4} or \pkgname{RNetCDF}. On the other hand, this implies that, contrary to other data file reading operations, reading a \pgrmname{NetCDF} file is done in two or more steps: opening the file, reading metadata describing the variables and the spatial grid, and finally reading the variables of interest.

\subsection[ncdf4]{\pkgname{ncdf4}}

<<ncdf4-00>>=
citation(package = "ncdf4")
@

We first need to read an index into the file contents, and in additional steps we read a subset of the data. With \Rfunction{print()} we can find out the names and characteristics of the variables and attributes. In this example we use long term averages for potential evapotranspiration (PET).
Package \pkgname{ncdf4} supports reading of files using \pgrmname{NetCDF} version 4 or earlier formats. Functions in \pkgname{ncdf4} not only allow reading and writing of these files, but also their modification.
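These capabilities can be sketched in a self-contained example that first writes a tiny file and then reads it back. All names, dimensions and values below are invented for illustration and are not from the book's data files.

```r
library(ncdf4)

# Create a tiny NetCDF file so the sketch does not depend on external data.
tmp <- tempfile(fileext = ".nc")
lon <- ncdim_def("lon", units = "degrees_east",  vals = c(0, 90))
lat <- ncdim_def("lat", units = "degrees_north", vals = c(-45, 45))
pet_def <- ncvar_def("pet", units = "mm/day", dim = list(lon, lat))
nc <- nc_create(tmp, pet_def)
ncvar_put(nc, pet_def, matrix(1:4, nrow = 2))
nc_close(nc)

# Reading is done in steps: open, inspect metadata, read a variable, close.
nc <- nc_open(tmp)
names(nc$var)                 # variables stored in the file
pet <- ncvar_get(nc, "pet")   # read the whole variable as an array
nc_close(nc)
dim(pet)                      # the 2 x 2 grid we wrote
```
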

We first read metadata to obtain an index into the file contents and in additional steps read a subset of the data. With \Rfunction{print()} we can find out the names and characteristics of the variables and attributes. In this example we use long term averages for potential evapotranspiration (PET).

We first open a connection to the file with function \Rfunction{nc\_open()}.

@@ -653,7 +655,7 @@ meteo_data.nc <- nc_open("extdata/pevpr.sfc.mon.ltm.nc")
@

\begin{playground}
Uncomment the \Rfunction{print()} statement above and study the metadata available for the data set as a whole, and for each variable.
Uncomment the \Rfunction{print()} statement above and study the metadata available for the data set as a whole, and for each variable. This will show a detailed definition for each variable and dimension in the file.
\end{playground}
The dimensions of the array data are described with metadata, mapping indexes, in our example, to a grid of latitudes and longitudes and to a time vector as a third dimension. We get here the variables one at a time with function \Rfunction{ncvar\_get()}.

@@ -666,86 +668,74 @@ latitude <- ncvar_get(meteo_data.nc, "lat")
head(latitude)
@

The \code{time} vector is rather odd, as it contains only month data as these are long-term averages. From the metadata we can infer that they correspond to the months of the year, and we directly generate these, instead of attempting a conversion.
The \code{time} vector is rather odd: as these are long-term averages, it contains only one value per month, expressed as days since 1800-01-01 and corresponding to the first day of each month of year 1. We use package \pkgname{lubridate} for the conversion.
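The decoding can be checked in isolation; the day offsets below are invented for illustration, not read from the file.

```r
library(lubridate)

time_days <- c(0, 31, 59)                  # invented offsets in days since 1800-01-01
# 1800 is not a leap year, so 0, 31 and 59 days land on the first days of
# January, February and March respectively.
month(ymd("1800-01-01") + days(time_days)) # 1 2 3
```
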

We construct a \Rclass{tibble} object with PET values for one grid point, taking advantage of the \emph{recycling} of short vectors.

<<ncdf4-03>>=
pet.tb <-
tibble(moth = month.abb[1:12],
tibble(time = ncvar_get(meteo_data.nc, "time"),
month = month(ymd("1800-01-01") + days(time)),
lon = longitude[6],
lat = latitude[2],
pet = ncvar_get(meteo_data.nc, "pevpr")[6, 2, ]
)
pet.tb
@

If we want to read in several grid points, we can use several different approaches. In this example we take all latitudes along one longitude. Here we avoid using loops altogether when creating a \emph{tidy} \Rclass{tibble} object. However, because of how the data is stored, we needed to transpose the intermediate array before conversion into a vector.
If we want to read in several grid points, we can use several different approaches. However, the order of nesting of dimensions can make adding the dimensions as columns error-prone. It is much simpler to use package \pkgnameNI{tidync}, described next.
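A toy array (values invented, not from the data file) illustrates why the dimension order trips up manual assembly:

```r
# An array with dimensions ordered [lon, lat, time], as in the PET file.
a <- array(1:24, dim = c(2, 3, 4))  # 2 longitudes x 3 latitudes x 4 times
slice <- a[1, , ]                   # fix one longitude: a 3 x 4 (lat x time) matrix
# as.vector() walks a matrix column by column, so latitude varies fastest here...
head(as.vector(slice), 6)           # 1  3  5  7  9 11
# ...while transposing first makes time vary fastest, matching a tibble in
# which months repeat within each latitude.
head(as.vector(t(slice)), 6)        # 1  7 13 19  3  9
```

Getting the transpose wrong silently pairs values with the wrong latitudes, which is the kind of mistake \pkgnameNI{tidync} avoids by tracking dimensions for us.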

<<ncdf4-04>>=
pet2.tb <-
tibble(moth = rep(month.abb[1:12], length(latitude)),
lon = longitude[6],
lat = rep(latitude, each = 12),
pet = as.vector(t(ncvar_get(meteo_data.nc, "pevpr")[6, , ]))
)
pet2.tb
subset(pet2.tb, lat == latitude[2])
\subsection[tidync]{\pkgname{tidync}}

<<tidync-00>>=
citation(package = "tidync")
@

\begin{playground}
Play with \code{as.vector(t(ncvar\_get(meteo\_data.nc, "pevpr")[6, , ]))} until you understand what is the effect of each of the nested function calls, starting from \code{ncvar\_get(meteo\_data.nc, "pevpr")}. You will also want to use \Rfunction{str()} to see the structure of the objects returned at each stage.
\end{playground}
Package \pkgname{tidync} provides functions that make it easier to extract subsets of the data from a \pgrmname{NetCDF} file. We start by doing the same operations as in the examples for \pkgnameNI{ncdf4}.

\begin{playground}
Instead of extracting data for one longitude across latitudes, extract data across longitudes for one latitude near the Equator.
\end{playground}
We open the file, creating an object and simultaneously activating the first grid.

\subsection[RNetCDF]{\pkgname{RNetCDF}}
<<tidync-01>>=
meteo_data.tnc <- tidync("extdata/pevpr.sfc.mon.ltm.nc")
meteo_data.tnc
@

\begin{warningbox}
Package RNetCDF supports NetCDF3 files, but not those saved using the current NetCDF4 format.
\end{warningbox}
<<tidync-01a>>=
hyper_dims(meteo_data.tnc)
@

<<netcdf-00>>=
citation(package = "RNetCDF")
<<tidync-01b>>=
hyper_vars(meteo_data.tnc)
@

We first need to read an index into the file contents, and in additional steps we read a subset of the data. With \Rfunction{print.nc()} we can find out the names and characteristics of the variables and attributes. We open the connection with function \Rfunction{open.nc()}.
We extract a subset of the data into a tibble in long (or tidy) format, and add
the months using a pipe operator from \pkgname{wrapr} and methods from \pkgname{dplyr}.

<<netcdf-01>>=
meteo_data.nc <- open.nc("extdata/meteo-data.nc")
str(meteo_data.nc)
# very long output
# print.nc(meteo_data.nc)
<<tidync-02>>=
hyper_tibble(meteo_data.tnc,
lon = signif(lon, 1) == 9,
lat = signif(lat, 2) == 87) %.>%
mutate(., month = month(ymd("1800-01-01") + days(time))) %.>%
select(., -time)
@

The dimensions of the array data are described with metadata, mapping indexes to in our examples a grid of latitudes and longitudes and a time vector as a third dimension. The dates are returned as character strings. We get variables, one at a time, with function \Rfunction{var.get.nc()}.
In this second example, we extract data for all grid points along latitudes. To achieve this we need only omit the test for \code{lat} from the chunk above. The tibble is assembled automatically, with columns for the active dimensions added. The decoding of the months remains unchanged.

<<netcdf-02>>=
time.vec <- var.get.nc(meteo_data.nc, "time")
head(time.vec)
longitude <- var.get.nc(meteo_data.nc, "lon")
head(longitude)
latitude <- var.get.nc(meteo_data.nc, "lat")
head(latitude)
<<tidync-03>>=
hyper_tibble(meteo_data.tnc,
lon = signif(lon, 1) == 9) %.>%
mutate(., month = month(ymd("1800-01-01") + days(time))) %.>%
select(., -time)
@

We construct a \Rclass{tibble} object with values for midday UV Index for 26 days. For convenience, we convert the strings into \Rlang datetime objects.
\begin{playground}
Instead of extracting data for one longitude across latitudes, extract data across longitudes for one latitude near the Equator.
\end{playground}

<<netcdf-03>>=
uvi.tb <-
tibble(date = ymd(time.vec, tz="EET"),
lon = longitude[6],
lat = latitude[2],
uvi = var.get.nc(meteo_data.nc, "UVindex")[6,2,]
)
uvi.tb
@

\section{Remotely located data}\label{sec:files:remote}

Many of the functions described above accept am URL address in place of file name. Consequently files can be read remotely, without a separate step. This can be useful, especially when file names are generated within a script. However, one should avoid, especially in the case of servers open to public access, not to generate unnecessary load on server and/or network traffic by repeatedly downloading the same file. Because of this, our first example reads a small file from my own web site. See section \ref{sec:files:txt} on page \pageref{sec:files:txt} for details of the use of these and other functions for reading text files.
Many of the functions described above accept a URL in place of a file name. Consequently files can be read remotely, without a separate step. This can be useful, especially when file names are generated within a script. However, especially in the case of servers open to public access, one should avoid generating unnecessary load on the server and/or network traffic by repeatedly downloading the same file. Because of this, our first example reads a small file from my own web site. See section \ref{sec:files:txt} on page \pageref{sec:files:txt} for details of the use of these and other functions for reading text files.

<<url-01, eval=eval_online_data>>=
logger.df <-
@@ -764,15 +754,20 @@ sapply(logger.tb, class)
sapply(logger.tb, mode)
@

While functions in package \pkgname{readr} support the use of URLs, those in packages \pkgname{readxl} and \pkgname{xlsx} do not. Consequently we need to first download the file writing a file locally, that we can read as described in section \ref{sec:files:excel} on page \pageref{sec:files:excel}.
While functions in package \pkgname{readr} support the use of URLs, those in packages \pkgname{readxl} and \pkgname{xlsx} do not. Consequently we need to first download the file, saving it locally, so that we can read it as described in section \ref{sec:files:excel} on page \pageref{sec:files:excel}. Function \Rfunction{download.file()} in \Rlang's \pkgname{utils} package can be used to download files using URLs. It supports different modes, such as binary or text and write or append, and different methods, such as \code{"internal"}, \code{"wget"} and \code{"libcurl"}.

\begin{warningbox}
For portability \pgrmname{MS-Excel} files should be downloaded in binary mode, setting \code{mode = "wb"}, which is required under \osname{MS-Windows}.
\end{warningbox}


<<url-11, eval=eval_online_data>>=
download.file("http://r4photobiology.info/learnr/my-data.xlsx",
"data/my-data-dwn.xlsx",
mode = "wb")
@
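As a further sketch, \Rfunction{download.file()} also accepts \code{file://} URLs, which makes its text mode easy to try without network access. All paths below are temporary files created for the example, not the book's data.

```r
# Write a tiny CSV file and "download" it through a file:// URL.
src <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), src)
dst <- tempfile(fileext = ".csv")
status <- download.file(paste0("file://", normalizePath(src, winslash = "/")),
                        dst,
                        mode = "w",      # text mode; use "wb" for binary files
                        quiet = TRUE)
status          # zero indicates success
readLines(dst)  # the same two lines we wrote to 'src'
```
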

Functions in package \pkgname{foreign}, as well as those in package \pkgname{haven} support URLs. See section \ref{sec:files:stat} on page \pageref{sec:files:stat} for more information about importing this kind of data into R.
Functions in package \pkgname{foreign}, as well as those in package \pkgname{haven}, support URLs. See section \ref{sec:files:stat} on page \pageref{sec:files:stat} for more information about importing this kind of data into \Rlang.

<<url-03, eval=eval_online_data>>=
remote_thiamin.df <-
@@ -787,8 +782,6 @@ remote_my_spss.tb <-
remote_my_spss.tb
@

Function \Rfunction{download.file()} in \Rlang default \pkgname{utils} package can be used to download files using URLs. It supports differemt modes such as binary or text, and write or append, and different methods such as internal, wget and libcurl.

In this example we use a downloaded \pgrmname{NetCDF} file of long-term means for potential evapotranspiration from NOAA, the same used above in the \pkgname{ncdf4} example. This is a moderately large file at 444~KB. In this case we cannot directly open a connection to the \pgrmname{NetCDF} file; we first download it (code commented out, as we have a local copy) and then open the local file.

<<url-05, eval=eval_online_data>>=
@@ -802,7 +795,8 @@ pet_ltm.nc <- nc_open("extdata/pevpr.sfc.mon.ltm.nc")
@

\begin{warningbox}
For portability NetCDF files should be downloaded in binary mode, setting \code{mode = "wb"}, which is required at least under MS-Windows.
For portability \pgrmname{NetCDF} files should be downloaded in binary mode, setting \code{mode = "wb"}, which is required under \osname{MS-Windows}.
\end{warningbox}

\section{Data acquisition from physical devices}\label{sec:data:acquisition}
@@ -1147,12 +1141,11 @@ unlink("./extdata", recursive = TRUE)
<<echo=FALSE>>=
try(detach(package:jsonlite))
try(detach(package:lubridate))
try(detach(package:tidync))
try(detach(package:ncdf4))
try(detach(package:RNetCDF))
try(detach(package:xml2))
try(detach(package:haven))
try(detach(package:foreign))
try(detach(package:pdftools))
try(detach(package:xlsx))
try(detach(package:readxl))
try(detach(package:readr))
6 changes: 3 additions & 3 deletions appendixes.prj
@@ -25,10 +25,10 @@ TeX:RNW
17838075 2 -1 62302 -1 62304 78 78 1114 601 1 1 125 504 -1 -1 0 0 31 -1 -1 31 1 0 62304 -1 0 -1 0
usingr.sty
TeX:STY
1060850 1 50 13 50 22 234 234 1270 724 0 0 237 1050 -1 -1 0 0 25 0 0 25 1 0 22 50 0 0 0
1060850 1 50 13 50 22 234 234 1270 724 0 0 237 168 -1 -1 0 0 25 0 0 25 1 0 22 50 0 0 0
R.data.io.Rnw
TeX:RNW
286273531 0 -1 41978 -1 42152 0 0 1498 379 1 1 105 -210 -1 -1 0 0 31 -1 -1 31 1 0 42152 -1 0 -1 0
286273531 4 -1 52388 -1 52389 0 0 1498 379 1 1 435 -42 -1 -1 0 0 31 -1 -1 31 1 0 52389 -1 0 -1 0
frontmatter\preface.tex
TeX
1060859 2 -1 20 -1 0 130 130 1573 499 0 1 45 0 -1 -1 0 0 18 -1 -1 18 1 0 0 -1 0 -1 0
@@ -49,7 +49,7 @@ BibTeX
1049586 0 265 1 283 1 0 0 820 242 0 1 45 147 -1 -1 0 0 23 0 0 23 1 0 1 283 0 -1 0
using-r-main-crc.tex
TeX
269496315 0 -1 69190 -1 69190 0 0 977 411 1 1 695 126 -1 -1 0 0 62 -1 -1 62 1 0 69190 -1 0 -1 0
269496315 0 -1 69190 -1 69190 0 0 977 411 1 1 705 126 -1 -1 0 0 62 -1 -1 62 1 0 69190 -1 0 -1 0
cut-from-plots.tex
TeX
269496315 0 -1 5047 -1 12688 234 234 1710 723 1 1 265 357 -1 -1 0 0 18 -1 -1 18 1 0 12688 -1 0 -1 0
Binary file removed extdata/meteo-data.nc
Binary file not shown.
10 changes: 2 additions & 8 deletions rcatsidx.idx
@@ -89,13 +89,7 @@
\indexentry{functions and methods!print()@\texttt {print()}}{27}
\indexentry{functions and methods!nc\_open()@\texttt {nc\_open()}}{27}
\indexentry{functions and methods!print()@\texttt {print()}}{27}
\indexentry{functions and methods!ncvar\_get()@\texttt {ncvar\_get()}}{27}
\indexentry{functions and methods!ncvar\_get()@\texttt {ncvar\_get()}}{28}
\indexentry{classes and modes!tibble@\texttt {tibble}}{28}
\indexentry{classes and modes!tibble@\texttt {tibble}}{28}
\indexentry{functions and methods!str()@\texttt {str()}}{29}
\indexentry{functions and methods!print.nc()@\texttt {print.nc()}}{30}
\indexentry{functions and methods!open.nc()@\texttt {open.nc()}}{30}
\indexentry{functions and methods!var.get.nc()@\texttt {var.get.nc()}}{30}
\indexentry{classes and modes!tibble@\texttt {tibble}}{30}
\indexentry{functions and methods!download.file()@\texttt {download.file()}}{32}
\indexentry{functions and methods!download.file()@\texttt {download.file()}}{31}
\indexentry{functions and methods!fromJSON()@\texttt {fromJSON()}}{33}
6 changes: 3 additions & 3 deletions rcatsidx.ilg
@@ -1,6 +1,6 @@
This is makeindex, version 2.15 [MiKTeX 2.9.7050 64-bit] (kpathsea + Thai support).
Scanning input file rcatsidx.idx....done (101 entries accepted, 0 rejected).
Sorting entries....done (680 comparisons).
Generating output file rcatsidx.ind....done (76 lines written, 0 warnings).
Scanning input file rcatsidx.idx....done (95 entries accepted, 0 rejected).
Sorting entries....done (655 comparisons).
Generating output file rcatsidx.ind....done (73 lines written, 0 warnings).
Output written in rcatsidx.ind.
Transcript written in rcatsidx.ilg.
11 changes: 4 additions & 7 deletions rcatsidx.ind
@@ -2,7 +2,7 @@

\item classes and modes
\subitem \texttt {data.frame}, 13
\subitem \texttt {tibble}, 13, 25, 28, 30
\subitem \texttt {tibble}, 13, 25, 28

\indexspace

@@ -15,7 +15,7 @@
\subitem \texttt {dimnames()}, 26
\subitem \texttt {dir()}, 5
\subitem \texttt {dirname()}, 4
\subitem \texttt {download.file()}, 32
\subitem \texttt {download.file()}, 31
\subitem \texttt {excel\_sheets()}, 20
\subitem \texttt {file.path()}, 6
\subitem \texttt {fromJSON()}, 33
@@ -26,11 +26,9 @@
\subitem \texttt {names()}, 26
\subitem \texttt {nc\_open()}, 27
\subitem \texttt {ncol()}, 26
\subitem \texttt {ncvar\_get()}, 27
\subitem \texttt {ncvar\_get()}, 28
\subitem \texttt {nrow()}, 26
\subitem \texttt {open.nc()}, 30
\subitem \texttt {print()}, 27
\subitem \texttt {print.nc()}, 30
\subitem \texttt {read.csv()}, 8, 10--13
\subitem \texttt {read.csv2()}, 8, 10, 11
\subitem \texttt {read.fortran()}, 10, 15
@@ -53,10 +51,9 @@
\subitem \texttt {read\_tsv()}, 15
\subitem \texttt {setwd()}, 4
\subitem \texttt {shell()}, 3
\subitem \texttt {str()}, 26, 29
\subitem \texttt {str()}, 26
\subitem \texttt {system()}, 3
\subitem \texttt {tools:::showNonASCIIfile()}, 8
\subitem \texttt {var.get.nc()}, 30
\subitem \texttt {write.csv()}, 8, 11
\subitem \texttt {write.csv2}, 10
\subitem \texttt {write.csv2()}, 12
10 changes: 2 additions & 8 deletions rindex.idx
@@ -89,13 +89,7 @@
\indexentry{print()@\texttt {print()}}{27}
\indexentry{nc\_open()@\texttt {nc\_open()}}{27}
\indexentry{print()@\texttt {print()}}{27}
\indexentry{ncvar\_get()@\texttt {ncvar\_get()}}{27}
\indexentry{ncvar\_get()@\texttt {ncvar\_get()}}{28}
\indexentry{tibble@\texttt {tibble}}{28}
\indexentry{tibble@\texttt {tibble}}{28}
\indexentry{str()@\texttt {str()}}{29}
\indexentry{print.nc()@\texttt {print.nc()}}{30}
\indexentry{open.nc()@\texttt {open.nc()}}{30}
\indexentry{var.get.nc()@\texttt {var.get.nc()}}{30}
\indexentry{tibble@\texttt {tibble}}{30}
\indexentry{download.file()@\texttt {download.file()}}{32}
\indexentry{download.file()@\texttt {download.file()}}{31}
\indexentry{fromJSON()@\texttt {fromJSON()}}{33}
