-
Notifications
You must be signed in to change notification settings - Fork 44
/
Copy pathpset1.qmd
512 lines (386 loc) · 20.8 KB
/
pset1.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
---
format: html
figure:
align: center
---
# Problem Set 1 {.unnumbered}
Reminders:
* Add a title to all your graphs.
* Add a label to the x and y axes when not obvious what they are showing.
* Think about transformations that convey the message in clearer fashion.
## Measles
(@) Load the __dslabs__ package and figure out what is in the `us_contagious_diseases` dataset. Create a data frame, call it `avg`, that has a column for `year`, and a `rate` column containing the cases of Measles per 10,000 people per year in the US. Because we start in 1928, exclude Alaska and Hawaii. Make sure to take into account the number of weeks reporting each year. If a week was not report, it should not be included in the calculation of the rate.
```{r}
#| message: false
library(tidyverse)
library(dslabs)
avg <- us_contagious_diseases |>
filter(!state %in% c("Hawaii","Alaska") & disease == "Measles" & weeks_reporting > 0) |>
group_by(year) |>
summarize(rate = sum(count*52/weeks_reporting, na.rm = TRUE)/sum(population) * 10000)
```
Criteria (5 points):
* Filtered correctly: removed HI and AK, kept only Measels, removed rows with no weeks reporting = 2 points.
* Grouped by year = 1 point.
* Adjusted for week reportin = 1 point.
* Summarized correctly: sum of counts divided by sum of popuation and multiploed by 10^5 = 1 point.
(@) Use the data frame `avg` to make a trend plot showing the cases rate for Measles per year. Add a vertical line showing the year the Measles vaccines was introduced. Write a short paragraph describing the graph to someone you are urging to take the Measles vaccines.
```{r}
avg |>
ggplot(aes(year, rate)) +
geom_line() +
labs(x = "Year", y = "Cases per 10,000", title = "Measles cases per year in the US") +
geom_vline(xintercept = 1963, col = "blue")
```
Criteria (5 points):
* Plotted rate versus year = 3 points
* Vertical line added correctly = 1 point.
* Labels and title = 1 point.
(@) Is the pattern observed above the same for each state? Add a grey trend line for each state to the plot above. Use a transformation that keeps the high rates from dominating the figure.
```{r}
us_contagious_diseases |>
filter(!state %in% c("Hawaii","Alaska") & disease == "Measles" & weeks_reporting > 0) |>
filter(weeks_reporting > 0) |>
mutate(rate = count/population*10000*52/weeks_reporting) |>
ggplot(aes(year, rate)) +
geom_line(aes(group = state), color = "grey") +
geom_line(data = avg) +
scale_y_continuous(trans = "sqrt") +
labs(x = "Year", y = "Cases per 10,000", title = "Measles cases per year in the US") +
geom_vline(xintercept = 1963, col = "blue")
```
Criteria (5 points):
* Includes curves for each state with no lengend = 2 points.
* Average curve can be distinguished = 1 point.
* Transformation was used = 1 point.
* Labels and title = 1 point.
(@) In the plot above we can't tell which state is which curve. Using color would be challenging as it is hard if not impossible to find 48 colors we humans can distinguish. To make a plot where you can compare states knowing which is which, use one of the axis for state and the other for year, and then use hue or intensity as a visual cue for rates. Use a `sqrt` transformation to avoid the higher rates taking up all the color scale. Use grey to denote missing data. Order the states based on their highest peak.
```{r}
reds <- RColorBrewer::brewer.pal(9, "Reds")
us_contagious_diseases |>
filter(!state %in% c("Hawaii","Alaska") & disease == "Measles") |>
mutate(rate = count/population*10000*52/weeks_reporting) |>
mutate(state = reorder(state, rate, max, na.rm = TRUE)) |>
ggplot(aes(year, state, fill = rate)) +
geom_tile(color = "grey") +
scale_x_continuous(expand = c(0,0)) + ## to remove extra space on sides
scale_fill_gradientn(colors = reds, trans = "sqrt") +
geom_vline(xintercept = 1963, col = "blue") +
theme_minimal() +
theme(panel.grid = element_blank(),
legend.position = "bottom",
text = element_text(size = 8)) +
labs(title = "Measles cases per year in the US", x = "", y = "")
```
Criteria (8 points):
* Tile plot used = 2 points.
* Rate computed correctly = 1 point.
* Sequential palette used = 1 point.
* NAs shown in grey = 1 point.
* Transformation used = 1 point.
* States ordered by peak, title and labels included = 2 point.
## COVID-19
(@) The csv file shared [here](https://raw.githubusercontent.com/datasciencelabs/2023/main/data/covid19-data.csv) includes weekly data on SARS-CoV-2 reported cases, tests, COVID-19 hospitalizations and deaths, and vaccination rates by state.
* Import the file into R without making a copy on your computer.
* Examine the dataset.
* Write a sentence describing each variable in the dataset.
```{r}
url <- "https://raw.githubusercontent.com/datasciencelabs/2023/main/data/covid19-data.csv"
dat <- read_csv(url)
```
* `state` - State abbreviation.
* `state_name` - State name.
* `population` - State population in 2020 census.
* `region` - Department of Health and Human Services regions.
* `mmwr_year` - Morbidity and Mortality Weekly Report year.
* `mmwr_week` - Morbidity and Mortality Weekly Report week.
* `cases` - SARS-CoV-2 cases reported to CDC that week.
* `tests` - Accumulated tests reported according to JHU dashboard.
* `hosp` - COVID-19 Hospitalization reported to CDC.
* `deaths_underlying_cause` - Deaths reported to CDC with COVID-19 as underlying cause of death.
* `deaths_multiple_causes` - Deaths reported to CDC with COVID-19 as one of the causes of death.
* `prov_death` - Deaths provisionally reported to CDC as COVID-19 related.
* `series_complete` - number of people in state with first vaccine series complete by the end of the week.
* `booster` - number of people in state with booster by the end of the week.
* `bivalent` - number of people in state with bivalent vaccine by the end of the week.
Criteria (5 points)
* Data read correctly = 2 points.
* Variables described correctly = 3 points.
(@) One of these columns could benefit from being a factor. Identify the column and convert it to factor.
```{r}
dat <- mutate(dat, region = factor(region))
```
Criteria (2 points)
* Region identified as variable = 1 point.
* Correctly changed to a factor = 1 point.
(@) Rather than providing a date, the dataset provides MMWR year and week. Read [this document](https://ndc.services.cdc.gov/wp-content/uploads/MMWR_Week_overview.pdf) and write a function to convert these to the start of the MMWR week in ISO-8601.
```{r}
library(lubridate)
mmwr_to_date <- function(mmwr_year, mmwr_week) {
first_day <- floor_date(make_date(mmwr_year, 1, 4) , unit = "week")
date <- first_day + weeks(mmwr_week - 1)
return(date)
}
```
Criteria (5 points):
* A function was defined with two variables = 2 points
* First day identified correctly = 1 point.
* Number of weeks added correctly = 2 points.
(@) Add a columns `start_date` and `end_date` with the start and end of the MMWR week. Confirm that it worked by computing the MMWR week and year for both start and end date and comparing it to the MMWR week and year provided.
```{r}
dat <- dat |> mutate(start_date = mmwr_to_date(mmwr_year, mmwr_week),
end_date = start_date + days(6))
## check : these should all be TRUE
dat |> summarize(w1 = all(epiweek(start_date) == mmwr_week),
y1 = all(epiyear(start_date) == mmwr_year),
w2 = all(epiweek(end_date) == mmwr_week),
y2 = all(epiyear(end_date) == mmwr_year))
```
Criteria (3 points)
* Two columns added correctly = 1 point.
* Check performed correctly = 2 point.
(@) Make a trend plot similar to the one we made for Measles:
* Include a trend line for the US cases rate. Use per 100,000 person per week as the unit.
* Add a trend for each state, with color representing region.
* Use the end of the week date for your x-axis.
* Add a vertical dashed line on the day COVID-19 vaccination started being delivered in the US.
Write a paragraph describing the COVID-19 pandemic by describing the plot.
```{r}
dat |>
filter(!is.na(cases)) |>
group_by(end_date) |>
mutate(avg = sum(cases)/sum(population)*10^5) |>
ungroup() |>
ggplot(aes(end_date, cases/population*10^5)) +
geom_line(aes(group = state, color = region), alpha = 0.5) +
geom_line(aes(y = avg)) +
scale_y_continuous(trans = "sqrt") +
geom_vline(xintercept = make_date(2020, 12, 14), lty = 2) +
labs(title = "SARS-Cov-2 cases per 100,000 in the US", x = "Date", y = "Cases per 100,000",
caption = "Colors represent HHS regions")
```
Criteria (5 points):
* Includes curves for each state with colors and lengend showing region= 2 points.
* Average curve can be distinguished = 1 point.
* Vertical line in correct location = 1 point.
* Labels and title = 1 point.
(@) The number of cases depends on testing capacity. Note that during the start of the pandemic, when we know many people died, there are not that many cases reported. Also notice somewhat large variability across states that might not be consistent with actual prevalence. The `tests` columns provides the cumulative number of tests performed by the data represented by the week. This data is not official CDC data. It was provided by Johns Hopkins Coronavirus Resource Center. Before using the data, explore the data for each state carefully looking for potential problems.
For each state compute and plot the number of tests perforemd each week. Look at the plot for each state and look for possible problems. No need to make this plot pretty since we are just using it for data exploration. Report any inconsistencies if any.
```{r}
tmp <- dat |>
group_by(state) |>
arrange(end_date) |>
mutate(tests = diff(c(0, tests))) |>
ungroup() |>
filter(!is.na(tests))
tmp |>
ggplot(aes(end_date, tests)) +
geom_line() +
facet_wrap(~state, scales = "free_y") +
labs(x = "Date", y = "Number of tests")
states <- tmp |> filter(tests < 0) |> pull(state_name) |> unique()
```
For `r cat(states, sep = ", ")` we see negative numbers which is must be an error as the number of tests must be non negative.
Criteria (5 points):
* Difference computed correctly after group by = 1 point.
* Plot using facet wrap = 1 point.
* Free y-scale = 1 point.
* Correct states identified = 2 points.
(@) To see if the inconsistencies seen in the previous plot are a problem if we are only going to look at the total number of tests at the end, plot the cumulative tests for each of the states with inconsistencies and see if the results are sensible. Explain your answer in 1-2 sentences.
```{r}
dat |> filter(state_name %in% states) |>
filter(!is.na(tests)) |>
ggplot(aes(end_date, tests/10^6)) +
geom_line() +
facet_wrap(~state, scales = "free_y") +
labs(x = "Date", y = "Cumulative number of tests (in millions)")
```
Criteria (5 points):
* Only correct states shown and cumulative values shown = 1
* Plot using facet wrap with free y-scale = 1 point.
* Explanation includes observation of dips that are likely corrections, and that it does not affect final values = 3 points.
(@) JHU stopped reporting some time in 2021. What was that date? Show the day in the format September 18, 2022.
```{r}
last_day <- dat |> filter(!is.na(tests)) |> pull(end_date) |> max()
format(last_day, "%B %d, %Y")
```
Criteria (3 points):
* Correct date identied = 1 point.
* Correct format used = 2 point.
(@) Compute the number of tests per capita for the last day JHU reported these statistics. Make a boxplot of these values for each region and include the state level data with the state abbreviation as a label. Write a sentences describing these differences you see and how this could affect our interpretation of differences in cases rates across states.
```{r}
dat |> filter(end_date == last_day) |>
mutate(region = factor(region)) |>
ggplot(aes(region, tests/population)) +
geom_boxplot() +
geom_text(aes(label = state)) +
labs(x = "Region", y = "Test per capita", title = "SARS-COV2 tests per person")
```
Criteria (5 points):
* Correct boplot shown = 1 points.
* Labels and titles = 1 point.
* Description of state to state variation provided: 3 points.
(@) Although JHU stopped collecting testing data from the states, the CDC collected data from a few laboratories. We provide these date in [this url](https://raw.githubusercontent.com/datasciencelabs/2023/main/data/covid19-tests.txt).
* Import the data into R without downloading the file.
* Make sure that you create a data frame with a column with dates in `Dates` format and tests as numbers.
```{r}
url <- "https://raw.githubusercontent.com/datasciencelabs/2023/main/data/covid19-tests.txt"
tests <- read_delim(url, delim = " ")
tests <- tests |> mutate(date = mdy(date), tests = parse_number(tests))
## alternatively we can re-read the file using the appropriate column types
tests <- read_delim(url, delim = " ",
col_types = cols(date = col_date(format = "%m-%d-%y"),
tests = col_number()))
```
Criteria 3 points):
* Dates is date format = 2 point
* Tests are numeric = 1 point.
(@) Plot the tests per week to see the trend in time. Write a sentence of the pattern you see.
```{r}
tests |> ggplot(aes(date, tests)) + geom_line() + labs(title = "Tests per week", x = "Date", y = "Number of tests")
```
At the start of the pandemic there were few tests conducted and now we are also seeing a drop.
Criteria (3 points):
* Correct trend plot made = 1 points.
* Observation that the number of tests are much lower now included = 2 points.
(@) The analysis on tests points to cases as not being a good measure of the state of the pandemic. Remake the trend plot but using death rates instead of cases rates. Write a sentence on how this plot better shows about the start of the pandemic that the cases plot did not.
```{r}
dat |>
filter(!is.na(deaths_prov)) |>
group_by(end_date) |>
mutate(avg = sum(deaths_prov)/sum(population)*10^5) |>
ungroup() |>
ggplot(aes(end_date, deaths_prov/population*10^5)) +
geom_line(aes(group = state, color = region), alpha = 0.5) +
geom_line(aes(y = avg)) +
scale_y_continuous(trans = "sqrt") +
geom_vline(xintercept = make_date(2020, 12, 14), lty = 2) +
labs(title = "COVID-19 deaths per 100,000 in the US", x = "Date", y = "Deaths per 100,000",
caption = "Colors represent HHS regions")
```
Criteria (3 points):
* Same code as before but using `deaths_prov` = 1 points.
* The fact that we now see that the highest death rates were actually at the start of the pandemic and that is was particularly bad in the northeast is noted = 2 points.
(@) We want to examine the percent of the population that completed the first series of vaccines, received the booster, and received the bivalent booster, respectively. First run this line of code and examine what it does.
```{r}
tmp <- dat |>
pivot_longer(c(series_complete, booster, bivalent), names_to = "series", values_to = "percent") |>
select(state, region, population, end_date, series, percent) |>
filter(!is.na(percent)) |>
mutate(percent = percent/population,
series = factor(series, c("series_complete", "booster", "bivalent")))
```
Then make a plot showing the percent of population vaccination for each state. Use color to represent region.
* Show the dates on the x axis with the month abbreviation and year.
* Place the three plots vertically, on top of each other.
* Show percentages on the y axis. Hint: use `scales::percent`.
```{r}
#| out-width: 50%
#| fig-width: 4
#| fig-height: 8
#|
levels(tmp$series) <- c("First vaccine", "Booster", "Bivalent")
tmp |>
rename(Region = region) |>
ggplot(aes(end_date, percent, group = state, color = Region)) +
geom_line() +
scale_y_continuous(labels = scales::percent) +
scale_x_date(date_labels = "%b %y", date_breaks = "6 month") +
facet_wrap(~series, nrow = 3) +
labs(title = "Percent of population vaccinated by state", x = "Date", y = "Percent vaccinated")
```
Criteria (5 points):
* Cummulative values shown = 1 point.
* Regions shown with color and legend included = 1 point.
* Facet wrap used correctly with 3 rows = 2 points.
* Clear labels and titles = 1 point.
(@) For each period used to make the trend plot for the three series, make a boxplot showing the maximum percentage reached by every state stratified by region. Let the range of the y axis adapt to the range of each group.
```{r}
tmp |>
group_by(state, region, series) |>
summarize(percent = max(percent), .groups = "drop") |>
ggplot(aes(region, percent)) +
scale_y_continuous(labels = scales::percent) +
geom_boxplot() +
geom_text(aes(label = state)) +
facet_wrap(~series, scales = "free_y", nrow = 3) +
labs(title = "Percent vaccinated by region", x = "Region", y = "Percent")
```
Criteria (5 points)
* Maximum computed correctly = 2 points
* Boxplot showing points by region = 2 points
* Faceting and labels = 1 point.
(@) Use the plot above to define four periods: No vaccine, First vaccine, Booster, and Bivalent.
Get the dates when each vaccine series starts (these don't have to be exact). Create a version of your data frame with a column called period that keeps the period associated with the week contained in each row.
```{r}
period_dates <- c(make_date(2020, 1,1), make_date(2021, 1, 1), make_date(2021, 9, 1), make_date(2022, 9, 1), today())
labels <- c("No vaccine", "First vaccine", "Booster", "Bivalent")
tmp <- dat |> mutate(period = cut(end_date, period_dates, labels = labels))
```
Criteria (3 points):
* Approriate choice of dates selecteed: 2 points
* used `cut` correctly: 1 point
(@) Define a new variable that stores the maximum vaccination percentage reached during each period. But for the first vaccine period use `series_complete`, for the booster period use the `booster` column, and for the bivalent period use the bivalent percentages. Remove data from the no vaccine period. The make a plot comparing the COVID-19 death rate to the percent vaccinated. Use color to denote region. You should produce three plots, each with it's own axes range. Put the three plots in three entries of a 2x2 layout. Comment on what you observe.
```{r}
tmp |> filter(!period %in% c("No vaccine")) |>
mutate(period = droplevels(period)) |>
group_by(state) |>
mutate(vax = case_when(
period == "First vaccine" ~ max(series_complete, na.rm = TRUE),
period == "Booster" ~ max(booster, na.rm = TRUE),
period == "Bivalent" ~ max(bivalent, na.rm = TRUE))) |>
ungroup() |>
group_by(state, region, period) |>
summarize(deaths = mean(deaths_prov, na.rm = TRUE) / population[1] * 100000,
vax = vax[1]/population[1], .groups = "drop") |>
rename(Region = region) |>
ggplot(aes(vax, deaths, label = state, color = Region)) +
geom_point(size = 0) +
geom_text(show.legend = FALSE) +
scale_x_continuous(labels = scales::percent) +
guides(colour = guide_legend(override.aes = list(size = 1))) +
facet_wrap(~period, scales = "free", ncol = 2) +
labs("Death rate versus vaccination rate for states", x = "% population vaccinated", y = "Per 100,000 per week death rate")
```
Criteria (8 points)
* Correctly defines the three periods = 3 points
* Correctly computes the death rates and vaccination percentages in each period = 3 points
* Uses `facet_wrap` to show three plots = 2 point
* Clear labels and titles = 1 point.
(@) Does population density have an effect on infections? Use the `state.area` predefined variable to add compute population density. Make a histogram and density plot of state densities. Note that you will have to add the geographical area for Puerto Rico and DC as it is not included in `state.area`
```{r}
my.state.abb <- c(state.abb, "PR", "DC")
my.state.area <- c(state.area, 5325, 69)
popdens <- dat |> filter(end_date == min(end_date)) |>
select(state, population) |>
mutate(area = my.state.area[match(state, my.state.abb)]) |>
mutate(popdens = population / area)
popdens |> ggplot(aes(popdens)) +
geom_histogram(aes(y = after_stat(density)), bins = 25, color = "black") +
geom_density() +
scale_x_log10() +
labs(title = "Distribution of poplation density across states", x = "Population density", y = "Density")
```
Criteria (5 points):
* Adds PR and DC = 1 point
* Add population density column = 1 point
* Histogram shown = 1 point
* Density shown = 1 point
* Title, labels and smoothing parameter and bin size choice = 1 point
(@) Plot death rates versus density for the four periods defined above.
```{r}
tmp |> filter(!is.na(deaths_prov)) |>
mutate(area = my.state.area[match(state, my.state.abb)]) |>
mutate(popdens = population / area) |>
group_by(state, period) |>
summarize(rate = mean(deaths_prov) / population[1]*100000, popdens = popdens[1], .groups = "drop") |>
ggplot(aes(popdens, rate, label = state)) +
geom_text() +
scale_x_log10() +
facet_wrap(~period) +
labs(title = "Death rate versus density", x = "Population density", y = "Deaths per 100,000 per week")
```
Criteria (5 points)
* Computes rate and population density correctly = 2 points
* Four plots shown made with `facet_wrap` = 2 points
* Labels and titles = 1 point