-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathUS_Storm_Impact.Rmd
383 lines (311 loc) · 19.2 KB
/
US_Storm_Impact.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
---
title: "Impact of US Weather Events from 2002 to 2011"
author: "Paul Clark"
date: "January 9, 2017"
output:
html_document:
keep_md: yes
number_sections: yes
toc: yes
toc_depth: 5
toc_float: yes
html_notebook: default
---
***
# Synopsis
Weather events have public health and economic impacts. We analyze the U.S. National Oceanic and Atmospheric Administration's (NOAA) [National Storm Database][3] to determine which types of events are most harmful. Over the most recent 10-year period for which data is available, we determine that `TORNADO` events caused the most human casualties (about 40% of all weather-related fatalities and injuries, or 1470 per year), while `FLOOD` events caused the most economic damage (40% of property and crop damage, or $13.7 B per year). `HURRICANE (TYPHOON)` events also caused very significant damage (about 20% of total, or $7.5 B per year).
***
# Data Processing
## Load required packages
```{r message=FALSE}
mir <- "https://cloud.r-project.org"
if (!require(data.table)) {install.packages("data.table", repos = mir); require(data.table)} # fread
if (!require(R.utils)) {install.packages("R.utils", repos = mir); require(R.utils)} # unzip bz2
if (!require(dplyr)) {install.packages("dplyr", repos = mir); require(dplyr)} # process data
if (!require(ggplot2)) {install.packages("ggplot2", repos = mir); require(ggplot2)} # for charts
if (!require(formattable)) {install.packages("formattable", repos = mir); require(scales)} # tables
if (!require(tidyr)) {install.packages("tidyr", repos = mir); require(tidyr)} # prep data for chrts
rm(mir)
```
## Read data
```{r include=FALSE}
data_path <- "C:/users/clarkpa.AUTH/OneDrive - Hewlett-Packard/Coursera/Rep_Research/PA2_data" # dir for unzipped data
```
```{r eval = TRUE, message=FALSE, results='hide'}
# Download bzip2 from url into working dir if needed. Then unzip into data_path dir and read data.
# Skip process if data already resident in working memory.
data_name <- "stdata" # Name of data frame and data file (before extensions)
# 'data_path': This string contains the path to the directory in which you want to store the large,
# unzipped *.csv file. It is omitted for privacy reasons, but needs to be specified for the below
# function to work.
data_ext <- ".csv"
zip_ext <- ".bz2"
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" # available as of
# 1/1/2017
prep_read_bz2_file <- function(data_name, data_ext, zip_ext, data_path, url) {
# If source data in environment, end fn, otherwise unzip if necessary and read data
if (!data_name %in% ls(parent.frame(1))) {
zipname <- paste0(data_name, data_ext, zip_ext)
data_file_path <- file.path(data_path, paste0(data_name, data_ext))
# If neither data file nor zip archive exists, download & unzip
if (!file.exists(data_file_path) && !file.exists(zipname)) {
# Download zip archive into working dir
download.file(url, zipname)
cat("Please wait while unzipping data file...\n")
bunzip2(zipname, data_file_path, remove = FALSE, skip = TRUE)
# Else must have csv or bz2 or both: unzip if lacking csv
} else if (!file.exists(data_file_path)) {
cat("Please wait while unzipping data file...\n")
# Prepare data dir at path location, unzip archive into it
if (!file.exists(data_path)) dir.create(data_path, recursive = TRUE)
bunzip2(zipname, data_file_path, remove = FALSE, skip = TRUE)
}
# Read data
assign(data_name,fread(data_file_path,data.table=FALSE,na.strings=""),
envir = parent.frame(1))
}
}
prep_read_bz2_file(data_name, data_ext, zip_ext, data_path, url)
rm(list = c("prep_read_bz2_file", "data_name", "data_ext", "zip_ext", "data_path", "url"))
```
## Inspect data
We inspected the data to understand available variables and relationships among them. We do not show the results here.
```{r eval = FALSE}
str(stdata)
set.seed(1234)
stdata[sample(nrow(stdata), 10), c("BGN_DATE", "EVTYPE", "REMARKS")]
```
## Subset data
As described in the [NOAA Storm Database Details][3], recorded events have changed over time:
1. **1950-54:** Tornado
2. **1955-92:** Tornado, Thunderstorm Wind, Hail (keyed from paper publications into digital data)
3. **1993-95:** Tornado, Thunderstorm Wind, Hail (extracted from unformatted text files)
4. **1996-11:** All event types (48)
To assess *all* event types, we ignore data before 1996.
Before subsetting, we examine date information to understand the format:
```{r include = TRUE}
# examine the date information
stdata$BGN_DATE[sample(nrow(stdata), 10)]
```
We then transform and subset `BGN_DATE` to match the fourth period of reporting.
```{r}
# Transform date information so we can subset by date
stdata$BGN_DATE <- as.Date(stdata$BGN_DATE, format = "%m/%d/%Y")
cat("The data extends from", format(min(stdata$BGN_DATE), "%m/%d/%Y"),
"to", format(max(stdata$BGN_DATE), "%m/%d/%Y"), ".\n")
# subset the database for the fourth period
stdata <- stdata[stdata$BGN_DATE >= as.Date("1996-01-01") & stdata$BGN_DATE <=
as.Date("2011-12-31"),
c("BGN_DATE", "BGN_TIME", "TIME_ZONE", "STATE", "COUNTY", "EVTYPE", "FATALITIES",
"INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "REMARKS", "REFNUM")]
```
We examine period (4) data to see if trends remain.
```{r annual_events_chart}
(ggplot(stdata, aes(BGN_DATE)) + stat_bin(binwidth = 365.25, center = 365.2/2,
closed = "right", alpha = 0.4, na.rm = FALSE) +
geom_vline(aes(xintercept = as.numeric(as.Date("2002-01-01"))),color="red",linetype="dashed") +
scale_x_date(date_minor_breaks="1 year",date_breaks = "2 years", date_labels="%Y") +
scale_y_continuous(labels = function(x) {paste0(x/1000,"K")}) +
labs(x = "Years", y = "Number of Events per Year (K = 1000s)", title =
"Number of Weather Events Reported Annually from 1996 through 2011"))
```
There is an upward trend: approximate doubling in events per year over a 15-year period. We suspect period (4) reporting may have been immature in its early years, therefore we focus analysis on the last 10 years. This is the period to the right of the dotted line in the figure.
```{r}
yrs <- 10
db_end_date <- max(stdata$BGN_DATE)
stdata_sub <- filter(stdata, BGN_DATE >= db_end_date - yrs*365.25)
cat("The beginning of the selected data range is ", format(min(stdata_sub$BGN_DATE), "%m/%d/%Y"),
".\n", sep = "")
```
We also examine the contents of `EVTYPE`, to see how consistent are the classifications. There are `r length(unique(stdata_sub$EVTYPE))` unique values in `EVTYPE` over the specified time period. We may have to re-classify and group some of these to match the NOAA standard classification scheme (see below).
```{r results='hide'}
unique(stdata_sub$EVTYPE) # for brevity, values are not shown
```
## Cleaning and processing the data
### Investigating NAs and Non-standard values
We first assess and treat any **NAs** and **non-standard values** in the variables used to calculate impact. Variables 1, 2, 3, and 5 are numeric, and 4 and 6 are 'exponents' -- symbols defining factors that multiply the numeric variables to quantify damage impacts.
```{r}
names(stdata_sub)[7:12]
```
We find no NAs in any of the numeric fields (variables 1, 2, 3, and 5 above). I.e., the number of **NAs** across all these numbers is `0`:
```{r}
sum(is.na(stdata_sub$FATALITIES)) + # no NAs in fatalities
sum(is.na(stdata_sub$INJURIES)) + # no NAs in injuries
sum(is.na(stdata_sub$PROPDMG)) + # no NAs in property damage values
sum(is.na(stdata_sub$CROPDMG)) # no NAs in crop damage values
```
Now we need to understand the `EXP` values (variables 4 and 6 above):
```{r}
unique(stdata_sub$PROPDMGEXP)
unique(stdata_sub$CROPDMGEXP)
```
The meanings of `"K"`, `"M"`, and `"B"` are clear, but we must investigate how to interpret `NA`s and `"0"`s. It turns out that the `DMG` numbers are `0` whenever the `EXP` values are `NA` or `"0"` (i.e., the number of times when this is not the case is `0`):
```{r}
with(stdata_sub, sum(is.na(PROPDMGEXP) & PROPDMG != 0)) + # if exp==NA, PROPDMG == 0
with(stdata_sub, sum(is.na(CROPDMGEXP) & CROPDMG != 0)) + # if exp==NA, CROPDMG == 0
with(stdata_sub, sum(PROPDMGEXP == "0" & PROPDMG != 0)) # if exp=="0", PROPDMG == 0
```
Therefore, when computing economic impacts, we interpret damage as `0` whenever `EXP` is **NA** or `"0"`.
### Calculating total health and economic impact
For this analysis, we define the following impacts.
1. Total Health Impact = `FATALITIES` + `INJURIES`
2. Total Economic Impact = `PROPDMG * exp$fact[PROPDMGEXP]`+`CROPDMG * exp$fact[CROPDMGEXP]`
The sum in (2) is pseudo-code for the following: the appropriate factor `fact` is provided by lookup in the data frame `exp` of the value of `PROPDMGEXP` or `CROPDMGEXP`. The executable code is shown in the code chunk just below, which calculates the crop and property components of *Total Economic Impact*:
```{r}
# EXP values of NA and "0" will both be ignored, with associated damage set to 0
stdata_sub$PROPDMGEXP[is.na(stdata_sub$PROPDMGEXP)] <- "0"
stdata_sub$CROPDMGEXP[is.na(stdata_sub$CROPDMGEXP)] <- "0"
# We create a dataframe of factors for each EXP value
exp <- data.frame(symb = c("0", "K", "M", "B"), fact = c(0, 1e3, 1e6, 1e9))
# We compute indices in the `exp` dataframe associating the appropriate factor for each damage value
# The indices are then used to retrieve and multiply by the factors
indices <- match(stdata_sub$PROPDMGEXP, exp$symb)
stdata_sub$`prop$` <- stdata_sub$PROPDMG*exp$fact[indices]
indices <- match(stdata_sub$CROPDMGEXP, exp$symb)
stdata_sub$`crop$` <- stdata_sub$CROPDMG*exp$fact[indices]
```
We now add columns for total economic damage and casualties, and remove unneeded columns:
```{r}
stdata_sub$dollars <- stdata_sub$`crop$` + stdata_sub$`prop$`
stdata_sub$casualties <- stdata_sub$INJURIES + stdata_sub$FATALITIES
stdata_sub <- select(stdata_sub, -PROPDMG,-PROPDMGEXP, -CROPDMG, -CROPDMGEXP)
```
***
# Analysis
We now group by `EVTYPE` and examine which values account for significant impact over time.
```{r}
# We look at the values causing the top 99% of impact
# To do so, we first construct a cumulative sum of casualty percentages by `EVTYPE`
cumcutoff <- 0.99
by_type_cas <- group_by(stdata_sub, EVTYPE) %>% summarize_each(funs(sum), casualties, dollars) %>%
arrange(desc(casualties)) %>% mutate(pct_casualties = casualties/sum(casualties),
pct_dollars = dollars/sum(dollars)) %>% mutate(pctcumsum_cas = cumsum(pct_casualties))
```
## Initial Assessment
### Health Impact
Percentage of 10-year casualties by *raw* `EVTYPE` value:
```{r}
transmute(by_type_cas, EVTYPE, pct_casualties = formattable::percent(pct_casualties,1))
```
### Economic Impact
Percentage of 10-yr economic damage by *raw* `EVTYPE` value:
```{r}
# We start with the `by_type_cas` data, then re-arrange and add a cumulative percentage of damage
by_type_tot <- arrange(by_type_cas, desc(dollars)) %>% mutate(pctcumsum_doll = cumsum(pct_dollars))
transmute(by_type_tot, EVTYPE, pct_dollars = formattable::percent(pct_dollars,1))
```
## Refinement of Assessment
To refine the assessment, we take the database event type values that account for `r 100*cumcutoff`% of the impact, then group and standardize them to conform to the 48 event types defined for official NOAA reporting. This improves the assessment by grouping values that describe the same type of weather event.
Event type values that account for the top `r 100*cumcutoff`% of harm:
```{r}
topPct <- filter(by_type_tot, pctcumsum_cas <= cumcutoff | pctcumsum_doll <= cumcutoff)
(arrange(topPct, EVTYPE) %>% select(EVTYPE))$EVTYPE
```
Over the last 10 years of data, only `r length(unique(topPct$EVTYPE))` event type values (out of the total `r length(unique(stdata_sub$EVTYPE))`) explain `r 100*cumcutoff`% of the impact. However, some values do not match any of the official 48 listed in the [Storm Data Documentation][1]. We group and standardize the above types to match those in the documentation, page 6, _**Storm Data Event Table**_. We have copy/pasted the contents of the table into a *.csv file, and read and list them here:
```{r}
(event_types <- read.csv("event_types.csv", stringsAsFactors = FALSE))$Event.Name
```
Standardization/grouping of database values to match the official 48:
```{r}
stdata_sub$stdEVTYPE <- stdata_sub$EVTYPE
stdata_sub <- within(stdata_sub, {
stdEVTYPE[stdEVTYPE == "FOG"] <- "DENSE FOG"
stdEVTYPE[stdEVTYPE == "HEAVY SURF/HIGH SURF"] <- "HIGH SURF"
stdEVTYPE[stdEVTYPE == "HURRICANE"] <- "HURRICANE (TYPHOON)"
stdEVTYPE[stdEVTYPE == "HURRICANE/TYPHOON"] <- "HURRICANE (TYPHOON)"
stdEVTYPE[stdEVTYPE == "RIP CURRENTS"] <- "RIP CURRENT"
stdEVTYPE[stdEVTYPE == "STORM SURGE"] <- "STORM SURGE/TIDE"
stdEVTYPE[stdEVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
stdEVTYPE[stdEVTYPE == "WILD/FOREST FIRE"] <- "WILDFIRE"
stdEVTYPE[stdEVTYPE == "WINTER WEATHER/MIX"] <- "WINTER WEATHER"
})
```
***
# Results
For the final assessment, we focus only on the top 5 event types in health and economic damage.
```{r}
rankcutoff <- 5
by_type_cas_tbl <- group_by(stdata_sub, stdEVTYPE) %>% summarize_each(funs(sum),
FATALITIES, INJURIES, `prop$`, `crop$`, casualties, dollars) %>%
arrange(desc(casualties)) %>% mutate(pct_casualties = casualties/sum(casualties),
pct_dollars = dollars/sum(dollars)) %>% mutate(rank_cas = rank(-pct_casualties))
by_type_tot_tbl <- arrange(by_type_cas_tbl, desc(dollars)) %>%
mutate(rank_doll = rank(-pct_dollars))
topPct_cas_tbl <- filter(by_type_tot_tbl, rank_cas <= rankcutoff) %>%
arrange(desc(casualties))
topPct_doll_tbl <- filter(by_type_tot_tbl, rank_doll <= rankcutoff)
```
## Health Impact
The event type causing the most casualties (nationally) is `TORNADO`, followed at a distant second by `EXCESSIVE HEAT`:
```{r}
# Display table of health impact
transmute(topPct_cas_tbl, `Standard Event Type` = stdEVTYPE,
`% Casualties over 10 yrs` = formattable::percent(pct_casualties, 1),
`Avg casualties per yr` = paste0(digits(casualties/yrs, 0), " / yr")) %>% as.data.frame %>%
formattable(caption = "Table 1: Casualties by Event Type for Top 5 Causes of Casualty from 2002 to 2011")
```
Grouping and standardizing `EVTYPE` values has reversed the casualty ranking of `THUNDERSTORM WIND` and `LIGHTNING` events, moving `THUNDERSTORM WIND` above `LIGHTNING`. This is because we added the casualties from the original database value `TSTM WIND` (around 4%) to those captured in the official event type `THUNDERSTORM WIND` (also around 4%).
```{r avg_annual_casualties}
# Create data frame for chart by gathering FATALITIES and INJURIES into a single column
top_cas_chrt <- select(topPct_cas_tbl, stdEVTYPE:INJURIES, rank_cas) %>%
gather(Impact, Casualties, FATALITIES:INJURIES) %>%
rename(`Weather Event` = stdEVTYPE) %>%
mutate(`Average Casualties per Year` = Casualties/yrs)
# Set order of factor levels w/i `Weather Event` so that event types are displayed in descending
# order
top_cas_chrt <- within(top_cas_chrt, `Weather Event` <-
factor(`Weather Event`, levels =
`Weather Event`[Impact ==
"INJURIES"][order(-rank_cas[Impact == "INJURIES"])]))
(ggplot(top_cas_chrt, aes(x = `Weather Event`, y = `Average Casualties per Year`, fill = Impact)) +
geom_col() + coord_flip() + scale_y_continuous(breaks = seq(0,1600,200),
minor_breaks = seq(0,1600,100)) + labs(title =
"Average Annual Casualties from 2002 to 2011\nfrom Top 5 Weather-Related Causes") +
theme(plot.title = element_text(hjust = 0.5)))
```
One can see here that the vast majority of casualties are injuries. Also, `TORNADO` events cause the most `FATALITIES` as well as total casualties.
The magnitude of casualties by event type (hundreds per year) seems small, intuitively. As a next step outside the scope of this analysis, we would validate the magnitudes by comparison with other data/analyses.
## Economic Impact
The event types causing the most economic damage (nationally) are `FLOOD` and `HURRICANE (TYPHOON)`, which cost many billions of dollars per year, on average:
```{r}
transmute(topPct_doll_tbl, `Standard Event Type` = stdEVTYPE, `% Cost over 10 yrs` =
formattable::percent(pct_dollars,1), `Avg cost per yr` = paste0("$ ",
digits(dollars/yrs/1e9, 1), " Billion / yr")) %>% as.data.frame %>%
formattable(caption = "Table 2: Cost by Event Type for Top 5 Causes of Damage from 2002 to 2011")
```
As compared to the initial assessment, `STORM SURGE/TIDE` moves up by 1.4% due to inclusion of costs from the unofficial value `STORM SURGE`. `HURRICANE (TYPHOON)` moves up by 1.1% for a similar reason.
```{r avg_annual_damage}
# Create data frame for chart by gathering property damage and crop damage into a single column
top_doll_chrt <- select(topPct_doll_tbl, stdEVTYPE, `prop$`, `crop$`, rank_doll) %>%
rename(`Weather Event` = stdEVTYPE,
`Property Damage` = `prop$`, `Crop Damage` = `crop$`) %>%
gather(Impact, Cost, `Property Damage`:`Crop Damage`) %>%
mutate(`Average Cost per Year ($B)` = Cost/yrs/1e9)
# Set order of factor levels w/i `Weather Event` so that event types are displayed in descending
# order
top_doll_chrt <- within(top_doll_chrt, `Weather Event` <-
factor(`Weather Event`, levels = `Weather Event`[Impact ==
"Property Damage"][order(-rank_doll[Impact == "Property Damage"])]))
(ggplot(top_doll_chrt, aes(x = `Weather Event`, y = `Average Cost per Year ($B)`, fill = Impact)) +
geom_col() + coord_flip() + scale_y_continuous(breaks = seq(0,16,2)) +
labs(title =
"Average Annual Economic Damage from 2002 to 2011\nfor Top 5 Weather-Related Causes") +
theme(plot.title = element_text(hjust = 0.5)))
```
The above chart shows average annual economic damage for the top 5 weather related causes. The cost of crop damage is dwarfed by property damage, even for flood-related events.
The magnitude of costs (many billions per year) seems huge. As a next step outside the scope of this analysis, we would validate the magnitudes by comparison with other data/analyses.
***
# Key References
* Coursera as of 1/9/17
- National Weather Service [Storm Data Documentation][1]
- National Climatic Data Center Storm Events [FAQ][2]
* NOAA as of 1/9/17
- NOAA [National Storm Database Details][3]
***
# Appendix (sessionInfo)
```{r}
sessionInfo()
```
[1]:https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
[2]:https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
[3]:https://www.ncdc.noaa.gov/stormevents/details.jsp