updated docs in light of suggestions

bucketteOfIvy committed Sep 19, 2024
1 parent 808fbfa commit 23e6943
Showing 7 changed files with 132 additions and 94 deletions.
14 changes: 14 additions & 0 deletions 01-getting-started.Rmd
@@ -1,5 +1,7 @@
# Getting started

## Installation

Installing `oepsData` is easy. Just run the following command to grab the newest release from GitHub.

```{r how to install, results=FALSE, message=FALSE, warning=FALSE}
@@ -15,3 +17,15 @@ library(oepsData)
```

Efforts are currently under way to list the package on CRAN.

## Cacheing

`oepsData` pulls its data from online repositories, primarily GitHub. This can lead to issues for users operating on slow internet, for whom load times can be long for larger datasets, or for users who anticipate needing the package when entirely offline.

To help minimize these issues, `oepsData` caches, or saves a local copy of, data loaded by `load_oeps` on its first load. Additionally, `oepsData` offers a few commands that can help maintain caches:

* `cache_geometries` and `cache_oeps_tables` cache all tables and geometries, overwriting prior ones in the process.
* `clear_cache` deletes all cached data.
* `cache_dir` returns the directory of the oepsData cache.

Users who want to avoid using cached data and instead download data fresh every time can set `cache=FALSE` when calling `load_oeps`.
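A cache-maintenance session might look like the following sketch (it assumes the package is installed and attached, and is not evaluated here):

```{r cache maintenance sketch, eval=FALSE}
library(oepsData)

# Pre-populate the cache for offline work
cache_oeps_tables()
cache_geometries()

# Check where cached files live on disk
cache_dir()

# Force a fresh download, bypassing the cache
states <- load_oeps(scale = "state", year = 1990, cache = FALSE)

# Remove all cached data when it's no longer needed
clear_cache()
```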

64 changes: 0 additions & 64 deletions 02-basic-usage.Rmd

This file was deleted.

65 changes: 65 additions & 0 deletions 02-usage.Rmd
@@ -0,0 +1,65 @@
# Usage

```{r, include = FALSE, eval=TRUE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
eval = TRUE
)
```

`oepsData` is centered around two functions: `load_oeps_dictionary`, which loads a basic data dictionary; and `load_oeps`, which directly loads OEPS data. We expect that most users will start by calling `load_oeps_dictionary` to look at what data is available at their desired analysis scale, followed by calling `load_oeps` to actually load the data.

## load_oeps_dictionary

`load_oeps_dictionary` itself takes one argument --- `scale` --- that can be any of "tract", "zcta", "county", or "state". It returns the data dictionary (stored as a data.frame), so we recommend browsing it through the `View` command:

```{r preview data}
# See what data is available at the state level
data_dictionary <- load_oeps_dictionary(scale="state")
# if working in RStudio, we recommend:
# View(data_dictionary)
# as we're in a bookdown, we just preview it directly:
data_dictionary
```

## load_oeps

We might find that we're interested in the 1990 state data. We can load that data and its geometries using `load_oeps`, which accepts the following arguments:

* `scale` The scale of analysis. One of "tract", "zcta", "county", or "state".
* `year` The release year for the data. One of 1980, 1990, 2000, 2010, or 2018.
* `themes` The theme to pull data for. One of "Geography", "Social", "Environment", "Economic", "Policy", "Composite", or "All". Defaults to `All`.
* `states` A string or vector of strings specifying which states to pull data for, either as FIPS codes or names. Ignored when `scale` is "zcta". Defaults to `None`.
* `counties` A string or vector of strings specifying which counties to pull data for, either as FIPS codes or names. Ignored when `scale` is "zcta", and must be specified alongside `states`. Defaults to `None`.
* `tidy` Boolean specifying whether to return data in tidy format. Defaults to `FALSE`.
* `geometry` Boolean specifying whether to pull geometries for the dataset. Defaults to `FALSE`.
* `cache` Boolean specifying whether to use cached data. See the section on [cacheing](https://oepsdata.healthyregions.org/getting-started#cacheing) for more. Defaults to `TRUE`.

```{r basic data loading}
states_1990 <- load_oeps(scale="state",
year=1990,
geometry=TRUE)
head(data.frame(states_1990))
```

This lets us operate on the data as we desire. For instance, we can make a simple map:

```{r}
library(tmap)
library(sf)
# reproject to a better display CRS
states_1990 <- st_transform(states_1990, "ESRI:102004")
tm_shape(states_1990) +
tm_fill("NoHsP", style="jenks") +
tm_borders(alpha=0.05) +
tm_layout(main.title = "Population over 25 without a high school degree")
```


13 changes: 7 additions & 6 deletions 03-larger-examples.Rmd
@@ -1,4 +1,4 @@
# Example uses
# Examples


## Data subsetting
@@ -43,7 +43,7 @@ cook_county_2010 <- load_oeps(
head(data.frame(cook_county_2010))
```

We can then immediately map our data. We opt to use `tmap` in this example, but `ggplot2` also has mapping functionality for users more familiar with the library.
We can then immediately map our data. We opt to use [`tmap`](https://cran.r-project.org/web/packages/tmap/vignettes/tmap-getstarted.html#hello-world) in this example, but [`ggplot2`](https://ggplot2.tidyverse.org/) also has mapping functionality for users more familiar with the library.

```{r}
library(tmap)
@@ -87,7 +87,7 @@ tm_shape(chicago_metro_2010) +

## Longitudinal analysis

Although not directly supported in the package, `oepsData` also enables easier longitudinal analysis. Continuing our above example, we might be interested in the change in percent poverty over time throughout Chicagoland. To check, let's compare 2000 data to 2010 data.
The `oepsData` package also enables easier longitudinal analysis. Continuing our above example, we might be interested in the change in percent poverty over time throughout Chicagoland. To check, let's compare 2000 data to 2010 data.

We start by grabbing data from 2000:
```{r}
@@ -99,15 +99,16 @@ chicago_metro_2000 <- load_oeps(
counties=c('17031', '18089'),
geometry=F)
```
Note that `chicago_metro_2000` and `chicago_metro_2010` share column names, so we need to do some data wrangling for the merge. A wide variety of approaches exist, but we opt to use `dplyr` to select and rename our columns of interest.

Because variables keep the same column names from year to year, `chicago_metro_2000` and `chicago_metro_2010` have columns with identical names. To fix this, we need to select and rename some of our columns; there is a wide variety of possible approaches to this problem, but we opt to use [`dplyr`](https://dplyr.tidyverse.org/) to select and rename our columns of interest.

```{r}
# rename data columns
chicago_metro_2000 <- dplyr::select(chicago_metro_2000, "HEROP_ID", "PovP2000"="PovP")
chicago_metro_2010 <- dplyr::select(chicago_metro_2010, "HEROP_ID", "PovP2010"="PovP")
```

We can then merge the dataframes. Provided with oepsData are a HEROP specific merge-key -- `HEROP_ID` -- and a more common `GEOID`. We recommend merging on `HEROP_ID` when merging HEROP data, and reserving `GEOID` for compatability with outside datasets.
We can then merge the dataframes. Two merge keys are provided with `oepsData`: a HEROP-specific merge key called `HEROP_ID` and the more common [`GEOID`](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html). We recommend merging on `HEROP_ID` when merging HEROP data, and reserving `GEOID` for compatibility with outside datasets.
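As a minimal sketch of such a merge, using the renamed columns from the `dplyr::select` calls above (base R's `merge` works just as well as a dplyr join here):

```{r merge sketch, eval=FALSE}
# Join the two years on the shared HEROP_ID key
poverty_change <- merge(chicago_metro_2000, chicago_metro_2010, by = "HEROP_ID")

# Change in percent poverty, 2000 to 2010
poverty_change$PovDelta <- poverty_change$PovP2010 - poverty_change$PovP2000
```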

```{r}
# we changed the name of our merge keys earlier
@@ -277,4 +278,4 @@ tm_shape(filter(zcta, More_Telehealth)) +
tm_shape(zcta) + tm_borders(lwd=.2)
```

Evelyn has thus identified six ZCTAs she thinks may benefit most from increased transit options, and one that she thinks will benefit more from increased telehealth accessibility. From here, she can and probably should continue to validate her results through multiple approaches. Computationally, she may want to check the percent of households with internet in each ZCTA -- which she can do using `load_oeps` -- or collect increased responses to ensure her calculated percentages are accurate. Qualitatively, she may also want to validate her results against interview data from patients, to ensure that her proposed interventions agree with the needs expressed by the communities they will effect.
Evelyn has thus identified six ZCTAs she thinks may benefit most from increased transit options, and one that she thinks will benefit more from increased telehealth accessibility. From here, she can and probably should continue to validate her results through multiple approaches. Computationally, she may want to check the percent of households without internet in each ZCTA -- contained in the `NoIntP` variable she can get through another use of `load_oeps` -- or collect additional responses to ensure her calculated percentages are accurate. Qualitatively, she may also want to validate her results against interview data from patients, to ensure that her proposed interventions agree with the needs expressed by the communities they will affect.
58 changes: 36 additions & 22 deletions 04-bigquery-access.Rmd
@@ -1,6 +1,6 @@
# Getting OEPS Data from BigQuery

Opioid Environment Policy Scan data is also available on Google BigQuery. In this notebook, we'll go over how to interact with the data using `bigrquery`. We go over two of the `bigrquery` APIs -- one for readers familiar with SQL, and one for readers who want to avoid SQL. Lastly, readers who are already familiar with Google BigQuery will likely want to skip to [Make a Query](#querying).
Opioid Environment Policy Scan data is also available on Google BigQuery. In this notebook, we'll go over how to interact with the data using [`bigrquery`](https://bigrquery.r-dbi.org/). We go over two of the `bigrquery` APIs -- one for readers familiar with SQL, and one for readers who want to avoid SQL. Lastly, readers who are already familiar with Google BigQuery will likely want to skip to [Make a Query](#querying).

```{r, include = FALSE, eval=TRUE}
knitr::opts_chunk$set(
@@ -9,8 +9,7 @@ knitr::opts_chunk$set(
eval = TRUE
)
```

## Setting up BigQuery
## Overview

When making queries against a BigQuery dataset, we do not directly query the dataset. Instead, we connect to a BigQuery profile and submit a job, which tells the profile to make the query in our stead and return the data. You can think of this like connecting to another computer to middleman the exchange.

@@ -20,7 +19,20 @@ knitr::include_graphics('./images/gc-structure.png')

The setup allows users to work with multiple BigQuery datasets within a single profile, and also allows for billing to be separated so that data providers only pay to store the data instead of also paying for all usage of their data.

To enable BigQuery, sign into a Google account on your browser of choice before navigating to [this link](https://console.cloud.google.com/marketplace/product/google/bigquery.googleapis.com?hl=en&returnUrl=%252Fbigquery), where you will be prompted to "Enable BigQuery." Do so to enable your account to access BigQuery and data through BigQuery.
The OEPS data warehouse itself is named `oeps-391119` on BigQuery, and is divided into two datasets: `tabular` and `spatial`. The `tabular` dataset consists of 16 tables of attribute data at the state, county, tract, and ZCTA scales from 1980 to 2020. The `spatial` dataset contains the 2010 TIGER/Line geometries for each of these scales. The primary key for both datasets is `HEROP_ID`. A full dataset schema can be found on the OEPS BigQuery reference [linked here](https://github.com/healthyregions/oeps/blob/23_update_explorer/docs/BQ-Reference.md).

```{r, echo=FALSE, fig.align='center', out.width=400, out.height=400, eval=TRUE}
knitr::include_graphics('./images/oeps-structure.png')
```

## Setting up BigQuery

You can set up BigQuery for usage in R in three broad steps:

1. Enabling BigQuery on your Google Account
2. Grabbing the name of your BigQuery resource
3. Connecting everything to `bigrquery`

Let's start at step one. Sign into a Google account on your browser of choice before navigating to [this link](https://console.cloud.google.com/marketplace/product/google/bigquery.googleapis.com?hl=en&returnUrl=%252Fbigquery), where you will be prompted to "Enable BigQuery." Do so to enable your account to access BigQuery and data through BigQuery.

```{r, echo=FALSE, fig.align='center', out.width=700, out.height=246}
#knitr::include_graphics('images/bigquery enable button.png')
@@ -36,6 +48,18 @@ Whichever route you take, we need to store the name of your BigQuery project in
billing <- "oeps-tutorial" # replace this with your project name!
```

:::: {.box data-latex=""}

**"Will I be charged money for using BigQuery?"**

It's unlikely outside of unreasonably intense usage of BigQuery.

As of September 2024, the free tier for BigQuery allows for 1 TiB of data querying. The entire OEPS dataset is less than 1 GiB in size, so you would need to pull the entire dataset over 1,000 times in a month to leave the free tier. Unless you have an automated pipeline pulling from OEPS or are pulling other datasets on BigQuery, this is a hard limit to reach!

For more on BigQuery billing, see the [Google BigQuery pricing page](https://cloud.google.com/bigquery/pricing).

::::

Lastly, we need to establish that we actually have permission to create jobs on the account we created. To do that, we can use `bigrquery::bq_auth()`, and then grant the Tidyverse API a few permissions on our Google Account. Note that this command will prompt you to open a new window in your browser.

```{r, eval=FALSE}
@@ -48,19 +72,9 @@ bigrquery::bq_auth(path=Sys.getenv("BIGQUERY_OEPS_TUTORIAL_KEY"))
billing <- "oeps-391119"
```

## Making Queries {#query}

Now that we've enabled BigQuery on our account, we can use it to query the OEPS data on BigQuery.

First, lets back up and look at the OEPS project at a broader level. Currently, the OEPS data warehouse on BigQuery is named `oeps-391119`, and is divided into two datasets: `tabular` and `spatial`. The `tabular` dataset consists of 16 tables of attribute data at the state, county, tract, and ZCTA scales from 1980 to 2020. The `spatial` dataset contains the 2010 TIGER/Line geometries for each of these scales. The primary key for the datasets are `HEROP_ID`. A full dataset schema can be found on the OEPS BigQuery reference [linked here](https://github.com/healthyregions/oeps/blob/23_update_explorer/docs/BQ-Reference.md).

```{r, echo=FALSE, fig.align='center', out.width=400, out.height=400, eval=TRUE}
knitr::include_graphics('./images/oeps-structure.png')
```

bigrquery offers three interfaces for interacting with BigQuery, but we introduce two here: the low-level API that uses SQL, and a higher level method using dplyr.
Now that `bigrquery` is set up, we can explore using two of its interfaces to interact with BigQuery: the low-level API that uses SQL, and a higher level API using `dplyr`.

### The low-level API {#low-level}
## The low-level API {#low-level}

The low-level API offers a series of methods that can be used to interact with [BigQuery's REST API](https://cloud.google.com/bigquery/docs/reference/rest). While bigrquery offers quite a few commands, it's usually sufficient to use two: `bq_project_query` and `bq_table_download`.
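As a sketch of that two-step pattern (the table name `S_2010` and its columns are illustrative assumptions; consult the BigQuery reference linked earlier for the actual schema):

```{r low level sketch, eval=FALSE}
library(bigrquery)

# Submit the query as a job under our billing project...
tb <- bq_project_query(
  billing,
  "SELECT HEROP_ID, PovP FROM `oeps-391119.tabular.S_2010` LIMIT 10"
)

# ...then download the job's results as a data.frame
results <- bq_table_download(tb)
head(results)
```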

@@ -118,13 +132,13 @@ head(results)
And this works:

```{r}
# This works
# good
results <- bq_table_download(tb) |>
st_as_sf(wkt='geom', crs='EPSG:4326') # convert geom to sf
head(results)
```

#### A full low-level pipeline:
### A full low-level pipeline:

Putting this all together, we can create a quick map of how county level poverty changed from 1990 to 2000:

@@ -157,17 +171,17 @@
```

### The dplyr API {#dplyr}
## The dplyr API {#dplyr}

For users with less SQL familiarity, it's also possible to use dplyr to interact with BigQuery. We'll need the help of DBI, a library for interacting with databases in R.
For users with less SQL familiarity, it's also possible to use dplyr to interact with BigQuery. We'll need the help of [`DBI`](https://dbi.r-dbi.org/), a library for interacting with databases in R.

```{r, warning=FALSE, message=FALSE}
library(dplyr)
library(DBI)
library(bigrquery)
```

For this pipeline, we use DBI to connect to a given dataset (e.g. `tabular`), before picking a table within the dataset to interact with and then manipulate that table using dplyr.
For this pipeline, we use `DBI` to connect to a given dataset (e.g. `tabular`), before picking a table within the dataset to interact with and then manipulate that table using dplyr.
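Establishing that connection might look like the following sketch; `oeps-391119` is the OEPS project named earlier, and `billing` is your own project name from the setup section:

```{r dplyr connect sketch, eval=FALSE}
library(DBI)
library(bigrquery)

# Connect to the tabular dataset in the OEPS warehouse,
# billing any query usage to our own project
tabular_conn <- dbConnect(
  bigquery(),
  project = "oeps-391119",
  dataset = "tabular",
  billing = billing
)
```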

```{r}
# Connect to the tabular dataset
@@ -247,7 +261,7 @@ counties2010 <- tbl(spatial_conn, 'counties2010') |>
head(counties2010)
```

#### A full dplyr pipeline:
### A full dplyr pipeline:

Putting all the pieces together, we can make our poverty map with the following code:

4 changes: 2 additions & 2 deletions index.Rmd
@@ -13,8 +13,8 @@ description: "This is a short bookdown explaining how to access OEPS data throug

# Introduction

The Opioid Environment Policy Scan (OEPS) is an open-source data warehouse created by the Healthy Regions & Policies Lab to support researchers in studying and modeling the opioid risk environment. This website is intended as a starting place for researchers interested in using the OEPS data.

The Opioid Environment Policy Scan (OEPS) is an open-source data warehouse created by the [Healthy Regions & Policies Lab](https://healthyregions.org/) to support researchers in studying and modeling the opioid risk environment. This website is intended as a starting place for researchers interested in using the OEPS data, especially through R.

On this site, we have tutorials demonstrating two methods of accessing the Policy Scan's data sets: first through an R package called `oepsData`, and second through `bigrquery`. We additionally provide two example analyses using the package as a starting place for spatial research.

To hear more on the project, check out the OEPS [website](https://oeps.healthyregions.org/). For more on the data itself, check out the in-depth [data documentation](https://oeps.healthyregions.org/docs) or the [online explorer](https://oeps.healthyregions.org/map).
8 changes: 8 additions & 0 deletions style.css
@@ -12,3 +12,11 @@ pre {
pre code {
white-space: inherit;
}

.box {
padding: 1em;
background: #429E90;
color: white;
border: 2px solid #2C5D56;
border-radius: 10px;
}
