---
title: 'Data management protocol'
author: "N** B**"
date: "date of submission"
output:
  html_document: default
  pdf_document: default
  html_notebook:
    toc: yes
    toc_float: yes
    number_sections: no
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# more internal settings can go here
# Consider help pages like:
# https://rmarkdown.rstudio.com/lesson-1.html
# https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
```
### Loaded packages
```{r libraries, results=FALSE, message=FALSE}
# Load packages
require(RPostgreSQL) # for database handling
require(getPass) # for the getPass() function
library(tidyverse) # for dplyr, ggplot, ...
#library(sf) # to handle shapefiles and CRS
library(lubridate) # to parse and convert time objects
library(zoo) # to extend the apply family of functions (rolling window)
library(microbenchmark) # to measure function performance
#library(wesanderson) # for discrete colors in plots
```
### Preprocessing / R functions
```{r preprocessing, message=FALSE, warning=FALSE}
# self-made helper functions used throughout this document:
flagR <- function(x) { # flags values in quality check 2 (rate of change)
  ifelse(x >= 1, 1, 0) # sets flag to 0 when passed, to 1 when failed
}
flagP <- function(x) { # flags values in quality check 3 (variability)
  ifelse(length(unique(x)) == 1, 1, 0) # no variability -> only 1 unique value
} # must be applied using a rolling window
pw <- function() { # asks for the password when connecting to the database
  if (Sys.getenv('POSTGRES_PASSWORD') == "") {
    return(getPass('Provide the password: '))
  } else {
    return(Sys.getenv('POSTGRES_PASSWORD'))
  }
}
```
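As a quick illustration of the rolling-window usage (a toy vector, not the HOBO data), `flagP` returns `1` exactly for windows without any variability:
```{r flagP demo, eval=FALSE}
# six identical values followed by a change:
zoo::rollapply(c(5, 5, 5, 5, 5, 5, 6), width = 6, FUN = flagP)
# yields 1 for the first (constant) window and 0 for the second
```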
# 1. HOBO meta information
The HOBO was situated in an upside-down planter, shielded from direct sunlight, and placed in a hedge without direct influence from nearby buildings or similar structures.
![image of my Hobo](./10350049.jpg){width=250px}
# 2. Consistent HOBO data file
The raw data was read in from GitHub using the base R function `read.csv()`:
```{r raw_data, eval=T}
# Load raw data from GitHub
my_HOBO <- read.csv("https://raw.githubusercontent.com/data-hydenv/data/master/hobo/2023/raw/10350049.csv", skip = 1)
# Load calibration file from GitHub
t_cal <- read.csv("https://raw.githubusercontent.com/data-hydenv/data/master/hobo/2023/calibration.csv")
```
The date/time column must be parsed properly; this was done using `lubridate::mdy_hms()`:
```{r parse datetime}
my_HOBO$Datum.Zeit..GMT.01.00 <- mdy_hms(my_HOBO$Datum.Zeit..GMT.01.00)
```
Tidy up and sensibly rename the columns:
```{r tidy_cols}
my_HOBO <- as_tibble(my_HOBO) %>%
select(., 1:4) %>%
rename(id = 1, dttm = 2, temp = 3, lux = 4)
head(my_HOBO)
```
## 2.1. Calibration
The calibration data from GitHub is a table holding the time at which the calibration measurement was performed and the target temperature value. Both items were extracted and displayed with 3 decimal places. This is purely cosmetic: R calculates with full precision regardless of the decimals displayed. The calculated calibration offset was then subtracted from the raw temperature measurements.
```{r calibration}
caltime <- t_cal$This.is.the.calibration.value[3] # time of calibration
calib_line <- which(my_HOBO$dttm == caltime) # according line in raw data
meas_temp <- num(my_HOBO[calib_line,]$temp, digits = 3) # what the HOBO measured
meas_temp <- meas_temp[1] # take the scalar value
tru_temp <- num(as.numeric(t_cal$to.calibrate.your.Hobo.measurements.in..C.[3]),
digits = 3) # what the target value is
tru_temp <- tru_temp[1] # target value as a scalar
cal_offset <- meas_temp-tru_temp # difference of the two
cal_offset <- as.numeric(cal_offset[1])
# so -0.267 is now my calibration offset
my_HOBO$temp <- my_HOBO$temp-cal_offset # subtract offset from raw temperature
```
## 2.2. Create HOBO data file
Truncate the data to the given time frame. First, the target start and end times
were declared using `lubridate::ymd_hms()`:
```{r truncate raw data}
start_time <- ymd_hms("2022-12-01 00:00:00") # start point
end_time <- ymd_hms("2023-01-07 23:50:00") # end point
time_range <- interval(start_time, end_time) # may be useful later
```
Filter the raw data to clip it to the start and end times and overwrite the
id column with new values starting from `1`. `format()` was used to fix the
decimals of the temperature and light intensity vectors and to change the
timestamp format.
```{r format hobo file}
my_HOBO <- my_HOBO %>%
filter(between(dttm, start_time, end_time)) # filter
my_HOBO <- my_HOBO %>%
  mutate(id = seq_len(n())) # and fix the ID column
my_HOBO$temp <- format(my_HOBO$temp, digits=3, nsmall=3) # fix decimals in temp
my_HOBO$lux <- format(my_HOBO$lux, digits=3, nsmall=3) # and in light intensity
my_HOBO$dttm <- format(my_HOBO$dttm, "%Y-%m-%d %H:%M") # timestamp format
```
The tibble `my_HOBO` was then written to disk and manually uploaded to GitHub.
```{r write hobo file}
# write to file
write_csv(my_HOBO, file = "10350049.csv")
```
## 2.3. Verify the file
### Benchmark import function performance
Two classic import functions were used for this performance test: `base::read.csv()` and `readr::read_csv()`. The following chunk might take a little while to finish:
```{r benchmark import, results=FALSE, message=FALSE}
perf_test <- microbenchmark(
"readr" = {read_test <- read_csv("10350049.csv")},
"read.csv" = {read_test <- read.csv("10350049.csv")}
)
```
Benchmark results:
```{r benchmark result}
perf_test
```
The reason why `base::read.csv()` performed much faster in this test is that `readr::read_csv()` tries to guess the column data types and throws a warning on the datetime column (muted in the chunk output). Passing the column specification explicitly to the function call would likely improve its performance dramatically. Nevertheless, I decided to stick with `read.csv()` because I prefer parsing such data with `lubridate` anyway.
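For illustration, a minimal sketch of such a call (not evaluated here), assuming the four columns written above (`id`, `dttm`, `temp`, `lux`) and the timestamp format used in section 2.2:
```{r explicit col_types, eval=FALSE}
# hypothetical explicit column specification for readr::read_csv()
read_test <- read_csv("10350049.csv",
                      col_types = cols(id = col_integer(),
                                       dttm = col_datetime(format = "%Y-%m-%d %H:%M"),
                                       temp = col_double(),
                                       lux = col_double()))
```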
### Verification of import
Inspection of the file that was imported using `readr::read_csv()`:
```{r verify import}
head(read_test)
```
# 3. Quality control
Reading the data back in from GitHub to maintain order:
```{r formated_data, message=F}
# Load the formatted 10-minute data from GitHub
HOBO_qc <- read.csv("https://raw.githubusercontent.com/data-hydenv/data/master/hobo/2023/10_minutes/10350049.csv")
```
## 3.1. Quality control procedures (QCPs)
### 3.1.1. Measurement range (Plausible values)
Checking for missing values and the value ranges of both the light intensity and the temperature data shows that all data points are present and well within the measurement range of the device. Quality check 1 passed.
```{r qcp1, eval=T}
anyNA(HOBO_qc$temp) # check for missing values first
anyNA(HOBO_qc$lux)
range(HOBO_qc$temp)
range(HOBO_qc$lux)
```
### 3.1.2. Plausible rate of change
No temperature data point should differ from the preceding 10-minute step by more than 1 K. A lagged copy of the temperature vector (i.e. shifted forward in time by 10 minutes) was created and subtracted from the temperature to yield a delta value that can then be inspected:
```{r qcp2}
# lag one column, then mutate difference to new column
HOBO_wip <- HOBO_qc %>%
mutate(., lagged=lag(temp)) %>%
mutate(., delta=temp-lagged)
range(na.omit(HOBO_wip$delta)) # delta values out of legal range?
```
The range of delta values shows that there are in fact data points that violate this quality check.
```{r flag qcp2}
qc_R <- sapply(HOBO_wip$delta,flagR) # apply flagging function
HOBO_wip <- cbind(HOBO_wip,qc_R) # collect results in new data frame
length(which(qc_R==1)) # count occurrences of bad data points
```
Since only 24 data points failed the check, it is feasible and worthwhile to inspect them manually:
```{r inspect qcp2, warning=F}
bad_temps <- HOBO_wip[which(qc_R==1),2:3] %>% # extract suspicious data
arrange(., desc(hm(dttm))) # order by hour and minute
bad_temps # display
```
This clearly shows that temperature changes of more than 1 K per 10 minutes mostly happened around noon and sunrise, with some occurrences right after sunset, when rapid cooling is to be expected. These data points might not be faulty per se.
### 3.1.3. Minimum variability (Persistence)
`zoo::rollapply()` was used to run the persistence check over a moving window of 6 time steps (i.e. 60 minutes). The function flags every window that contains only a single unique value; here it is applied to the delta series, so stretches without any variation in the temperature change are flagged.
```{r, flag qcp3}
qc2 <- rollapply(HOBO_wip$delta, width=6,FUN=flagP) # width parameter is window
```
Compiling the results of the preceding quality checks in a tibble:
```{r qc3 results}
qc_P <- c(1:length(HOBO_wip$delta)) # vector to store the persistence flags
qc_P[1:5] <- NA # the first 5 positions have no complete window
qc_P[6:length(qc_P)] <- qc2 # append the rolling-window results
HOBO_wip <- cbind(HOBO_wip,qc_P) # collect in the working data frame
```
Looking at the failing data points of the persistence check in detail:
```{r persistence analysis,warning=F}
bad_pers <- HOBO_wip[which(qc_P==1),2:3] %>% # extract suspicious data
arrange(., desc(hm(dttm))) # order by hour and minute
hist(bad_pers$temp, breaks = 50, main = "Temp. data points failing persistence check",
xlab = "Temperature in Deg. Celsius")
```
The histogram shows that the majority of the failing data points occurred at temperatures between -5°C and +5°C. Such values are not implausible, especially considering that the HOBO was well shielded from sunlight (as the next quality check will show).
### 3.1.4. Light intensity
The HOBO was situated in a planter at all times, so only very limited disturbance of the light measurements should have occurred.
```{r qcp4}
summary(HOBO_wip$lux)
```
We can gather from the summary that the range of light intensity values is as expected and that the HOBO never saw bright, direct sunlight during the time of interest.
To extract more systematic information about the quality of the light intensity data, a set of sky illuminance classes (SICs) was assigned to the data:
```{r assign SIC}
HOBO_wip <- HOBO_wip %>%
mutate(SIC = case_when(lux <= 10 ~ '0_night',
lux <= 500 ~ '1_rise_Set',
lux <= 2000 ~ '2_overcast_full',
lux <= 15000 ~ '3_overcast_light',
lux <= 20000 ~ '4_clear',
lux < 50000 ~ '5_sunshine',
lux >= 50000 ~ '6_brightshine',
))
unique(HOBO_wip$SIC) # only the first four SICs present
```
This check corroborates what was stated above: The hedge and planter effectively shielded the light intensity sensor from direct sunlight. The brightest SIC the HOBO was exposed to was "Overcast (light)".
```{r light check summary}
summary(as_factor(HOBO_wip$SIC))
```
This is the reason why the light intensity check never resulted in a positive flag. In the vast majority of cases the HOBO detected SICs associated with dusk, dawn and night; "Overcast (light)" is the brightest SIC that occurred, and it did so in only 10 cases.
### 3.2 Summary of quality checks
#### QC Aggregation
The following chunk compiles the results of the quality checks, computes the percentage of failed data points and overwrites the temperature values of all hours containing at least two failed data points with `NA`.
```{r flag summary, warning=F,message=F}
# analyse occurence of flags and aggregate qc fails
HOBO_wip <- HOBO_wip %>%
mutate(qc_tot = qc_P + qc_R)
HOBO_wip$qc_tot[1:5] <- 0
HOBO_wip <- HOBO_wip %>%
mutate(qc_all = case_when(qc_tot == 0 ~ 0,
qc_tot != 0 ~ 1))
# tabulate pass/fail counts (percentage computed below):
qc_result <- table(HOBO_wip$qc_all)
qc_df <- as_tibble(HOBO_wip) # write to df
# numbering all hours
hour_ct <- rep(c(1:912), each = 6)
# stick it to the df
qc_df <- cbind(qc_df, h_ct = hour_ct)
bh <- qc_df %>%
group_by(., h_flag = h_ct) %>%
summarize(., sum_flags=sum(qc_all)) %>%
filter(sum_flags>=2)
# this yields a vector of all hours that must be set to NA
bad_hours <- bh$h_flag
# copy the df before overwriting
qc_df2 <- qc_df
# overwrite the temperatures of all flagged hours with NA
bad_temps <- which(qc_df2$h_ct %in% bad_hours)
qc_df2$temp[bad_temps] <- NA
```
#### QC Analysis
Display percentage of data points failing the quality checks in total:
```{r qc percentage}
perc <- num(((qc_result[2]/length(HOBO_wip$qc_all))*100),digits=4)
perc[1]
```
Of all data points in the 10-minute time series, 3.7% failed the quality checks and were replaced by NAs.
For the hourly data set this means that every hour containing failed data points must be set to NA, which yields a higher share of NAs in the hourly time series:
```{r NA share}
# Calculate the share of NAs in your hourly time-series in %
hrly_dat <- qc_df2 %>%
group_by(., h_ct) %>%
summarise(., mean_temp=mean(temp))
# this yields hourly means with NAs
# share of NA in hourly time series
table(is.na(hrly_dat$mean_temp))
# count the NAs and convert to a percentage
num((sum(is.na(hrly_dat$mean_temp))/length(hrly_dat$mean_temp))*100, digits = 4)
```
50 hours suffered from bad data points and had their hourly mean set to NA, resulting in an NA share of 5.5%. The following table shows the failed quality checks in absolute numbers:
```{r summarize}
qc_res_tab <- tibble('impl. r.o.c.'=length(which(qc_df2$qc_R==1)),
'impl. persistence'=length(which(qc_df2$qc_P==1)),
'impl. light intensity'=0,
'o.o. meas. range'=0)
qc_res_tab
```
This barplot shows that the overwhelming majority of positive flags were found in the persistence check with some positive flags in the rate of change plausibility check:
```{r qc barplot}
barplot(as.matrix(qc_res_tab), cex.names = 0.5, main = "No. of failed data points per quality check", cex.main=0.8)
```
#### Write quality checked file
The quality checked file is written to the directory of this document
as `10350049_Th.csv`, after using `format()` to fix the decimals and the date/time format:
```{r write qc file}
dttm <- seq(start_time, end_time, by= "hours")
hobo_hourly <- hrly_dat %>%
select(th=mean_temp) %>%
mutate(date_time=dttm) %>%
mutate(origin=rep("H"))
hobo_hourly$date_time <- format(hobo_hourly$date_time, "%Y-%m-%d %H:%M:%S")
hobo_hourly$th <- format(hobo_hourly$th, digits=3, nsmall=3)
write_csv(hobo_hourly, file = "10350049_Th.csv" )
```
# 4. Fill-up with reference station
The lines of the quality checked data that were previously taken out and replaced with NAs must now be substituted by values predicted by a linear regression against data from a reference station.
## 4.1. Reference Station
The data from the reference stations was manually downloaded from the respective sources and imported. The following chunk takes care of this and also reads the qc data from the preceding step back into the workspace. A missing line in the data of the `Freiburg Garten` station, an oddly formatted timestamp and commas as decimal separators were the pitfalls encountered here; they were handled with `add_row()`, the `lubridate` parsers and `scan()`.
```{r ref station, warning=F,message=F}
# Weather stations import:
# WBI Station
# data import from hdd
WBI_raw <- read.csv("Stunde_096.csv", sep=";")
# convert to proper POSIXct / lubridate objects
WBI_raw$Tag <- dmy(WBI_raw$Tag)
# class(WBI_raw$Tag)
# parsing works only semi-well with this timestamp format...
WBI_raw$Stunde <- hm(WBI_raw$Stunde)
# class(WBI_raw$Stunde)
# ...so a clean timestamp sequence is built below instead
# declare time frame
start_time <- ymd_hms("2022-12-01 00:00:00")
end_time <- ymd_hms("2023-01-07 23:50:00")
time_range <- interval(start_time, end_time)
# make my own dttm and scrap the ones that came with the WBI data
dttm <- seq(start_time, end_time, by= "hours")
# clip WBI data, add dttm and rename cols
Wdat_WBI <- WBI_raw %>%
filter(.,between(Tag, date(start_time), date(end_time))) %>%
select(.,temp=AVG_TA200) %>%
mutate(dttm=dttm)
# FREIBURG Garten station
GAR_raw <- read.csv("Freiburg_Garten_2022-11-30_2023-01-08.csv")
# correct missing line in GAR_raw
GAR_raw <- GAR_raw %>%
add_row(UTC = "2022-12-16 10:00:00", Lokalzeit = "2022-12-16 11:00:00", Lufttemperatur...C. = NA, .before = 395)
# GAR_raw[390:396,]
# that's better, now the whole converting and clipping procedure again
# convert to POSIXct
GAR_raw$Lokalzeit <- ymd_hms(GAR_raw$Lokalzeit)
GAR_raw$UTC <- ymd_hms(GAR_raw$UTC)
# class(GAR_raw$UTC)
Wdat_GAR <- GAR_raw %>%
filter(., between(Lokalzeit, start_time, end_time))
# this took way too long...
# Freiburg Stadtklimastation
URB_raw <- read.csv("produkt_air_temperature_13667_akt.txt", sep = ";")
# convert POSX
URB_raw$MESS_DATUM <- ymd_h(URB_raw$MESS_DATUM)
# and clip
Wdat_URB <- URB_raw %>%
filter(.,between(MESS_DATUM, start_time, end_time)) %>%
select(MESS_DATUM, LUFTTEMPERATUR) %>%
mutate(dttm = dttm)
# correct number of observations this time
# DWD Station 1443
DWD_raw <- read.csv("produkt_tu_stunde_20210718_20230118_01443.txt", sep = ";")
# POSIXct conversion
DWD_raw$MESS_DATUM <- ymd_h(DWD_raw$MESS_DATUM)
Wdat_DWD <- DWD_raw %>%
filter(., between(MESS_DATUM, start_time, end_time)) %>%
select(., MESS_DATUM, TT_TU) %>%
mutate(dttm = dttm)
# stations df
tempDF <- Wdat_DWD %>%
select(., time = dttm, tempDWD = TT_TU) %>%
mutate(tempURB = Wdat_URB$LUFTTEMPERATUR) %>%
mutate(tempGar = Wdat_GAR$Lufttemperatur...C.) %>%
mutate(tempWBI = Wdat_WBI$temp)
# tempWBI still has the comma decimals
tempDF$tempWBI <- scan(text=tempDF$tempWBI, dec = ",", sep = ".")
# my HOBO df
myHOBO <- read.csv("10350049_Th.csv")
# append the HOBO temperatures and ensure numeric type
tempDF <- cbind(tempDF, HOBOtemp = as.numeric(myHOBO$th))
```
### Visual identification of best suited reference station
Before any statistical analysis was done to find the reference station that best matches the data, a visual analysis was performed, showing where the gaps in the temperature data are and which reference station's data might be able to fill them:
```{r reference plot, warning=F,echo=F}
date1 <- tempDF$time[210] # zoom in to time with many NAs
date2 <- tempDF$time[245]
# overview plot
#plot1 <- ggplot(tempDF,aes(time))+
# geom_line(tempDF,mapping=aes(y=tempURB),color=wescols[1])+
# geom_line(tempDF,mapping=aes(y=tempGar),color=wescols[2])+
# geom_line(tempDF,mapping=aes(y=tempWBI),color=wescols[3])+
# geom_line(tempDF,mapping=aes(y=tempDWD),color=wescols[4])+
# geom_line(tempDF,mapping=aes(y=HOBOtemp),color=wescols[5])+
# ylim(-20,25)
# tempWBI is the closest fit
# manually zoomed-in comparison graph
plot2 <- ggplot(tempDF,aes(x=time))+
geom_line(tempDF,mapping=aes(y=tempURB,color='URB'))+
geom_line(tempDF,mapping=aes(y=tempGar,color='GAR'))+
geom_line(tempDF,mapping=aes(y=tempWBI,color='WBI'))+
geom_line(tempDF,mapping=aes(y=tempDWD,color='DWD'))+
geom_line(tempDF,mapping=aes(y=HOBOtemp,color='HOB'))+
ylim(-3,1)+
xlim(c(date1,date2))+
labs(x="\n\nTime",y="\nTemperature in deg. C.")+
ggtitle("HOBO and ref. station temperature zoomed in")+
scale_color_manual(name="Station",breaks=c('URB','GAR','WBI','DWD','HOB'),
values = c("URB"="#E2D200", "GAR"="#5B1A18",
"WBI"="#F21A00", "DWD"="#E58601",
"HOB"="#5BBCD6"))
```
The overview plot is not included here because it is hard to make out which station fits best. Limiting the time frame to a period with many NAs gives a much clearer answer:
```{r stations plot,warning=F}
plot2
```
Apparently the WBI station matches the HOBO data fairly well, especially around the NA gaps. In the next step a linear model was created for every station and the R<sup>2</sup> values were extracted to make a final decision on which station to use:
### Statistical identification of best suited reference station
```{r linear model, echo=FALSE, message=FALSE, results='hide',warning=F}
# MODEL 1: WBI
modWBI <- lm(HOBOtemp ~ tempWBI, data = tempDF)
#summary(modWBI)
# MODEL 2: DWD
modDWD <- lm(HOBOtemp ~ tempDWD, data = tempDF)
#summary(modDWD)
# MODEL 3: URB
modURB <- lm(HOBOtemp ~ tempURB, data = tempDF)
#summary(modURB)
# MODEL 4: GAR
modGAR <- lm(HOBOtemp ~ tempGar, data = tempDF)
#summary(modGAR)
# table of model intercepts & R^2 (coefficients[1,1] is the intercept estimate)
Intercepts <- c(summary(modDWD)$coefficients[1,1],
                summary(modURB)$coefficients[1,1],
                summary(modGAR)$coefficients[1,1],
                summary(modWBI)$coefficients[1,1])
R_squ <- c(summary(modDWD)$r.squared,
           summary(modURB)$r.squared,
           summary(modGAR)$r.squared,
           summary(modWBI)$r.squared)
stations <- c("DWD","URB","GAR","WBI")
MOD_RES <- tibble(R_squ, stations, Intercepts)
```
The following tibble displays the R<sup>2</sup> values and intercepts of the four models. The p-values were not considered, since the choice of reference station is based on goodness of fit alone.
```{r R2 table}
MOD_RES %>%
arrange(., desc(R_squ))
```
The DWD urban station (URB) shows the best fit according to R<sup>2</sup>, so it will be used to fill the gaps with predicted values from its regression against the HOBO data.
## 4.2. Fill-up
Predictions were made using the `predict()` function and labeled "R", while original HOBO measurements were labeled "H".
```{r fill regression}
# prediction from the URB model over the full time series
preds <- predict(modURB, tempDF)
# compile for comparing
comparison <- tibble(HOBOorig=tempDF$HOBOtemp,
tempURB=tempDF$tempURB,
predicted=preds)
# identify the NAs and fill them with the predictions
tempDF <- tempDF %>%
  mutate(HOBOcorr = case_when(is.na(HOBOtemp) ~ preds, TRUE ~ HOBOtemp))
# fill origin col with H for HOBO
hobo_hr_corr <- tempDF %>%
select(dttm=time, th=HOBOcorr) %>%
mutate(origin="H")
# and with R for regression
hobo_hr_corr$origin[which(is.na(tempDF$HOBOtemp))] <- "R"
# format timestamp and values
hobo_hr_corr$dttm <- format(hobo_hr_corr$dttm, "%Y-%m-%d %H:%M:%S")
hobo_hr_corr$th <- format(hobo_hr_corr$th, digits=3, nsmall=3)
# write to disk, overwriting the NA ridden file
write_csv(hobo_hr_corr, file = "10350049_Th.csv" )
```
A plot showing the corrected data:
```{r corrected HOBO plot}
tempDF <- cbind(tempDF, HobCor=hobo_hr_corr$th)
tempDF$HobCor <- as.numeric(tempDF$HobCor)
plot3 <- ggplot(tempDF,aes(x=time))+
geom_line(tempDF,mapping=aes(y=tempURB,color='URB'))+
#geom_line(tempDF,mapping=aes(y=tempGar,color='GAR'))+
#geom_line(tempDF,mapping=aes(y=tempWBI,color='WBI'))+
#geom_line(tempDF,mapping=aes(y=tempDWD,color='DWD'))+
geom_line(tempDF,mapping=aes(y=HobCor,color='HOB'))+
labs(x="\n\nTime",y="\nTemperature in deg. C.")+
ggtitle("Corrected HOBO and ref. station temperature")+
scale_fill_manual(name="Station", breaks=c('URB','HOB'),
values = c("URB"="#E2D200","HOB"="#5BBCD6"))+
guides(color = guide_legend(title = "Station"))
```
```{r}
plot3
```
Looking at the graph (showing the full time series) makes it clear why the URB station was best suited for the regression: trends and peaks are very similar, with only a small offset (the URB temperature is a bit higher most of the time).
# 5. Calculate indices
The indices were calculated using PostgreSQL from within this R Markdown document.
The following chunk establishes the connection to the database:
```{r setup_sql}
# establish the connection
drv <- dbDriver('PostgreSQL')
con <- dbConnect(drv, host='hydenv.hydrocode.de', port=5432, user='hydenv',
password=pw(), dbname='hydenv')
```
## 5.1. Generate metadata overview
A human-readable table of the HOBO locations and descriptions was pulled from the online database and saved to the R workspace for convenience.
```{r import meta table,warning=F}
# get meta table
HOBO_MTab <- dbReadTable(con, 'metadata')
```
The meta table was clipped to this year's term by specifying the term id (gathered from a table overview not included here); the location attribute was split into coordinates using the PostGIS functions `st_x()` and `st_y()`.
```{sql connection=con, output.var="HOBO_MTab" }
SELECT m.id, device_id as "HOBO id", st_x(location) as lon, st_y(location) as lat, term_id as term, description
FROM metadata m
WHERE term_id ='15'
```
## 5.2 Calculate Indices
As per the assignment, a number of indices were derived from both the raw and the quality checked data, for my individual HOBO as well as for all of this term's HOBOs.
### Indices of own HOBO raw data
My own HOBO's raw measurements were pulled from the database using the corresponding meta id from the meta table (output limited for readability). To include only temperature measurements in the `VIEW`, the `variable_id` was specified; the necessary variable key was gathered by querying the `variables` table (not included in this document).
```{sql connection=con}
create temporary VIEW temperature_index as
SELECT * FROM raw_data
WHERE meta_id = 216 and variable_id = 1
```
### Mean Temperature
Using `AVG()` to calculate the mean temperature:
```{sql connection=con}
select AVG(value) as mean from temperature_index
```
### Mean Night Temperature
The same process was repeated for the mean night-time temperature by extending the `WHERE` clause of the query:
```{sql connection=con}
create temporary VIEW temperature_index_night as
SELECT * FROM raw_data
WHERE meta_id = 216 and variable_id = 1 and (date_part('hour', tstamp) <= 8 or date_part('hour', tstamp) >= 18)
```
Aggregation:
```{sql connection=con}
select AVG(value) from temperature_index_night
```
### Mean Day Temperature
This is the same process as above only with the `WHERE` clause inverted:
```{sql connection=con}
create temporary VIEW temperature_index_day as
SELECT * FROM raw_data
WHERE meta_id = 216 and variable_id = 1 and date_part('hour', tstamp) <= 18 and date_part('hour', tstamp) >=8
```
Aggregation of daytime temperature:
```{sql connection=con}
select AVG(value) from temperature_index_day
```
## 5.3 Create indices table/view/query
Similar indices were computed for all of this term's HOBOs by constructing a series of slightly more complex queries.
### 5.3.1 Winter term 2023 all HOBOs raw data
Daily and nightly means as well as the full-day mean and the coefficient of variation were calculated from the non-quality checked 10-minute (raw) data of all available HOBOs of the 2023 term.
```{sql connection=con, output.var = "idx_raw"}
with c as (
select m.id, tstamp, value from raw_data r
join metadata m on m.id=r.meta_id
join terms t on t.id=m.term_id
where t.id = 15 and variable_id = 1
),
mean as (
select id, avg(value) as mean_full from c
group by id
),
coevar as (
select id, stddev(value)/avg(value) as co_var from c
group by id
),
night as (
select id, avg(value) as night_temp from c
where date_part('hour', tstamp) < 8 or date_part('hour', tstamp) >= 20
group by id
),
day as (
select id, avg(value) as day_temp from c
where date_part('hour', tstamp) >= 8 and date_part('hour', tstamp) < 20
group by id
),
idx as (
select m.id, mean.mean_full, night.night_temp, day.day_temp, coevar.co_var
from metadata m
natural join mean
natural join night
natural join day
natural join coevar
)
select * from idx order by idx.id
```
### Winter term 2023 all HOBOs quality checked hourly data
Apart from `select`ing the data from a different table (the quality checked hourly `data` table as opposed to `raw_data`), the only difference to the preceding query is that filtering for the correct variable is not necessary, because this table contains only temperature measurements and no light intensity.
```{sql connection=con, output.var = "idx_qc"}
-- c: quality checked temperature values of all term-15 HOBOs with their meta id
with c as (
select m.id, tstamp, value from data d
join metadata m on m.id=d.meta_id
join terms t on t.id=m.term_id
where t.id = 15
),
-- mean: full-day mean temperature per HOBO
mean as (
select id, avg(value) as mean_full from c
group by id
),
-- night: mean temperature between 20:00 and 08:00 per HOBO
night as (
select id, avg(value) as night_temp from c
where date_part('hour', tstamp) < 8 or date_part('hour', tstamp) >= 20
group by id
),
-- day: mean temperature between 08:00 and 20:00 per HOBO
day as (
select id, avg(value) as day_temp from c
where date_part('hour', tstamp) >= 8 and date_part('hour', tstamp) < 20
group by id
),
-- coevar: coefficient of variation per HOBO
coevar as (
select id, stddev(value)/avg(value) as co_var from c
group by id
),
idx as (
select m.id, mean.mean_full, night.night_temp, day.day_temp, coevar.co_var
from metadata m
natural join mean
natural join night
natural join day
natural join coevar
)
select * from idx order by idx.id
```
### Differences of raw data and qc data indices
Apart from the length (the qc data has fewer entries), the main differences can be gathered from the table constructed in the following chunk, a minimal sketch that joins the two index tables and computes the differences:
```{r}
# minimal sketch: join raw and qc indices by HOBO id and compute the differences
idx_diff <- merge(x = idx_raw, y = idx_qc, by = "id", suffixes = c("_raw", "_qc")) %>%
  mutate(diff_mean = mean_full_raw - mean_full_qc,
         diff_night = night_temp_raw - night_temp_qc,
         diff_day = day_temp_raw - day_temp_qc) %>%
  select(id, starts_with("diff_"))
idx_diff
```
## 5.4 Combine the tables with R
### Max daily temperature change of one HOBO
The mean of the maximum daily temperature change (per hour) was calculated for a single HOBO.
Data import from the database:
```{sql connection=con, output.var="my_temp"}
-- Selecting only the first HOBO
SELECT tstamp, value from data where meta_id=217 and variable_id = 1
```
And calculation:
```{r}
idx_my_temp <- my_temp %>% # one HOBO's temperatures and timestamps
  mutate(lagged = lag(value) - value) %>% # delta per hourly time step
  group_by(day = date(tstamp)) %>% # group by day
  mutate(max_delta = max(abs(lagged), na.rm = TRUE)) %>% # max delta of the day
  ungroup() # fold out
```
### Max daily temperature change of all of this term's HOBOs
Get the data from the database:
```{sql connection=con, output.var="all_temp_23"}
-- Selecting all HOBOs of this term
SELECT meta_id, tstamp, value from data where meta_id between 217 and 242
and variable_id = 1
```
And calculate for all of this term's HOBOs:
```{r mean max daily change 23, warning=F, message=F}
idx_all_temp <- all_temp_23 %>% # all HOBO temperatures and timestamps
  group_by(meta_id) %>% # lag within each HOBO only
  mutate(lagged = lag(value)) %>% # lag by one hourly time step
  mutate(delt_T = abs(lagged - value)) %>%
  ungroup()
# delt_T holds the absolute hourly temperature change; the daily max comes next
library(data.table) # used to avoid dplyr/sf/plyr namespace problems
idx_all_temp <- idx_all_temp %>%
mutate(day_=date(tstamp))
idx2 <- setDT(idx_all_temp)
maxDtemp <- idx2[ , MX:=max(delt_T), by = list(day_, meta_id)]
# this finally yields the max delta of temperature PER DAY AND META ID
maxDtemp <- na.omit(maxDtemp) # strip NAs
meanDtemp <- maxDtemp[, AVG:=mean(MX), by = meta_id]
# and this yields the mean of each HOBOs max delta temp
HOBOs23 <- meanDtemp %>%
group_by(meta_id) %>%
summarise(avgMxDelta=max(AVG))
# detach(package:data.table,unload = TRUE)
library(dplyr)
```
### Combine in one table
```{r join tables}
# prep meta table
fin_tab <- HOBO_MTab %>%
filter(between(id, 217, 242)) %>%
arrange(id)
# join idx_qc (contains indices computed on the DB) and
# HOBO_MTab (contains meta info, location) and
# HOBOs23 (contains avg Max Temp Delta)
full_hobos <- merge(x=idx_qc, y=fin_tab, by="id", sort = TRUE)
full_hobos <- merge(x=full_hobos, y=HOBOs23, by.y = "meta_id", by.x = "id")
```
A bit of the finished table:
```{r}
head(full_hobos)
```
# 6. Spatial analysis
## 6.1 Find spatial data
## 6.2 Filter by location
<div class="alert alert-info">
Filter the database for only the city districts that contain a reference station and present them either in a human-readable table or on a map.
</div>
First, create a sub-query/view/with statement containing all reference stations, then extend it with the needed filter. A minimal sketch of that approach follows below.
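The sketch is not evaluated here; the `districts` table, its `geom` and `name` columns, and the way reference stations are distinguished in `metadata` are assumptions that would need to be adapted to the real schema:
```{r district filter sketch, eval=FALSE}
# hypothetical sketch of the reference-station filter
ref_districts <- dbGetQuery(con, "
  with refs as (
    -- assumption: reference stations can be told apart from HOBOs in metadata
    select id, location from metadata where device_id is null
  )
  select distinct d.name
  from districts d
  join refs r on st_contains(d.geom, r.location)")
ref_districts
```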
## 6.3 Aggregate by location
<div class="alert alert-info">
Query all city districts of Freiburg that contain at least three HOBOs and aggregate the indices calculated in *Create indices table/view/query* for each of these districts and present them as a table.
Repeat the procedure for the HOBO locations of WT21 or WT22 (or both). Are there differences? Describe.
</div>
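A possible starting point, again sketched against the hypothetical `districts` table assumed above (not evaluated):
```{r district aggregation sketch, eval=FALSE}
# hypothetical sketch: map each term-15 HOBO to its district, then aggregate in R
hobo_districts <- dbGetQuery(con, "
  select m.id, d.name as district
  from metadata m
  join districts d on st_contains(d.geom, m.location)
  where m.term_id = 15")
idx_districts <- idx_qc %>%
  inner_join(hobo_districts, by = "id") %>%
  group_by(district) %>%
  filter(n() >= 3) %>% # keep only districts with at least three HOBOs
  summarise(across(c(mean_full, night_temp, day_temp, co_var), mean))
idx_districts
```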
## 6.4 Creating a map
<div class="alert alert-info">
Use the queries constructed in the last task to create a map of Freiburg, that illustrates differences in one (or more) of the temperature indices between the city districts of Freiburg. You can create this map either in R or in QGis.
</div>
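One way to build such a map in R, sketched under the same assumptions as above (hypothetical `districts` table; `sf` is commented out in the package list and would need to be loaded; `idx_districts` comes from the sketch in 6.3):
```{r district map sketch, eval=FALSE}
library(sf) # commented out in the package list above
# hypothetical sketch: district polygons joined with the aggregated indices
districts_sf <- st_read(con, query = "select name, geom from districts")
districts_sf %>%
  left_join(idx_districts, by = c("name" = "district")) %>%
  ggplot() +
  geom_sf(aes(fill = mean_full)) +
  labs(title = "Mean temperature per city district", fill = "Temp. in deg. C")
```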
# Cleanup
Finally, the database connection is closed:
```{r}
dbDisconnect(con)
```