-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathsupplementary_examples.Rmd
882 lines (688 loc) · 36.1 KB
/
supplementary_examples.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
---
title: A practical guide to fitting GAMs to medical monitoring data with *mgcv* and R
author:
- |
| **Johannes Enevoldsen**,
| Department of Clinical Medicine, Aarhus University and Department of Anaesthesiology
| & Intensive Care, Aarhus University Hospital,
| Aarhus, Denmark
| Email: [email protected]
- |
| **Gavin L Simpson**,
| Department of Animal Science, Aarhus University,
| Tjele, Denmark
- |
| **Simon T Vistisen**,
| Department of Clinical Medicine, Aarhus University and Department of Anaesthesiology
| & Intensive Care, Aarhus University Hospital,
| Aarhus, Denmark
abstract: |
Supplement to the article: *Using Generalized Additive Models to Decompose Time Series and Waveforms, and Dissect Heart-lung Interaction Physiology*.
The document examplifies how to fit and work with generalized additive models (GAMs) in [R](https://www.r-project.org/) using *[mgcv](https://CRAN.R-project.org/package=mgcv)*.
output:
bookdown::pdf_document2:
extra_dependencies: ["underscore", "float"]
toc: yes
dev: cairo_pdf
latex_engine: xelatex
toc-title: "Contents"
include-before:
- '`\pagebreak{}`{=latex}'
fontsize: 11pt
linkcolor: NavyBlue
monofont: Source Code Pro
monofontoptions: 'Scale=0.7'
mainfontoptions: 'Linestretch=4'
---
```{r hidden_setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
#dpi = 150,
fig.width = 7,
fig.height = 5,
fig.retina = 2,
out.width = "70%",
fig.align = "center",
fig.pos = "H", out.extra = ""
)
knitr::opts_template$set(wide = list(
fig.height=2, fig.width=10, out.width = "100%"
),
nooutput = list(fig.height=1, fig.width=1, out.width = "100%") # keep cache with default opt change
)
```
# Data and source code
Sample data and the source code for this document is available at <https://github.com/JohannesNE/gam-medical-signals-supplementary>.
# Packages
Packages used in these examples can be installed from CRAN using `install.packages("package name")`.
```{r packages, message=FALSE}
library(mgcv) # fit GAMs
library(gratia) # visualise and work with GAMs
library(dplyr) # work with data frames
library(ggplot2) # plots
library(patchwork) # combine plots
# Packages only used for heart beat detection (dependencies of `find_abp_beats()`)
# install.packages("RcppRoll")
# install.packages("purrr")
theme_set(theme_minimal()) # Change the default plotting theme
# Data preparation functions. Available in code repository.
source("functions.R")
```
# Pulse pressure
The first example corresponds to the model shown in Fig. 2 in the paper. The aim is to model variation in pulse pressure using information about the respiratory cycle. The model is then used to estimate pulse pressure variation (PPV).
## Load data
We use a 30 second recording of arterial blood pressure (ABP). The sample data is structured as a list. It includes ABP `sample_pp$abp` and the start time of each inspiration `sample_pp$insp_start`.
```{r}
sample_pp <- readRDS("sample_PP.RDS")
```
The ABP is stored as a data frame with a time column and a pressure column. The time is in seconds (sample rate 125 Hz) and the pressure unit is mmHg.
```{r}
head(sample_pp$abp)
```
The timing of inspirations is a data frame with a time column that indicates when a new inspiration starts. These time points are recorded from a Draeger ventilator, but could be determined from any recorded airway pressure waveform.
```{r}
head(sample_pp$insp_start)
```
We can visualise the two data frames together.
```{r opts.label='wide' }
#| fig.cap = "Arterial blood pressure (ABP). Red lines indicate inspiration start."
abp_plot <- ggplot(sample_pp$abp, aes(time, ABP)) +
geom_line() +
geom_vline(aes(xintercept = time), color = "red",
data = sample_pp$insp_start)
abp_plot
```
## Detect heart beats
We now detect individual heart beats from the ABP waveform. A function, `find_abp_beats()`, is shared in the file `functions.R`. It takes an ABP waveform and returns a data frame of individual beats. The data frame contains the following columns (plus a few more that we do not use in this example):
- `time`: timing of the diastole (negative peak) [seconds]. This marks the beginning of a beat
- `dia`: diastolic pressure [mmHg]
- `time_systole`: timing of the systole (positive peak) [seconds]
- `sys`: the following systolic pressure [mmHg]
- `PP`: pulse pressure (`sys` - `dia`) [mmHg]
- `beat_len`: length of the beat (`time` - `lead(time)`) [seconds], which roughly corresponds to an RR interval
```{r}
beats <- find_abp_beats(sample_pp$abp)
head(beats)
```
We can add this information to the previous plot.
```{r opts.label='wide' }
#| fig.cap = "Beats are detected using `find_abp_beats()`."
abp_plot +
geom_point(aes(x = time,
y = dia,
colour = "diastole"),
data = beats) +
geom_point(aes(x = time_systole,
y = sys,
colour = "systole"),
data = beats) +
# Place systole above diastole in legend
scale_color_discrete(limits = c("systole", "diastole"))
```
For this model, we are only interested in the pulse pressure (PP = systolic pressure - diastolic pressure) of each heart beat.
```{r opts.label='wide' }
#| fig.cap = "Pulse pressure (PP) of each heart beat."
pp_plot <- ggplot(beats, aes(time, PP)) +
geom_line() +
geom_point() +
geom_vline(aes(xintercept = time), color = "red",
data = sample_pp$insp_start)
pp_plot
```
Before we can fit the model, we need to calculate the position of each beat in the respiratory cycle. `functions.R` contain `add_time_since_event()`, that takes a data frame with a time column (here `beats`) and a vector of times corresponding to some event (here the timing of each inspiration start: `sample_pp$insp_start$time`) and returns the data frame with new columns indicating the timing of each observation (beat) relative to the most recent event (inspiration start). The new columns are:
- `insp_index`: time since the latest inspiration start [seconds]
- `insp_n`: respiratory cycle number
- `insp_cycle_len`: length of the respiratory cycle [seconds]
- `insp_rel_index`: the relative position of the beat in the respiratory cycle (`insp_index` / `insp_cycle_len`)
```{r}
beats_indexed <- add_time_since_event(beats,
time_events = sample_pp$insp_start$time,
prefix = "insp") %>%
# the first beats are earlier than the first inspiration and
# therefore have `insp_rel_index = NA`. We remove these.
na.omit()
# Relocate time and the four newly added columns.
head(beats_indexed %>% relocate(time, starts_with("insp_")))
```
This lets us show each beat by its position in the respiratory cycle.
```{r fig.height=3, fig.width=10, out.width = '100%'}
#| fig.cap = "Pulse pressure indexed to the respiratory cycle."
pp_plot_color <- ggplot(beats_indexed, aes(time, PP)) +
geom_line() +
# insp_n is a unique (consecutive) number for each respiratory cycle
geom_point(aes(color = as.factor(insp_n)), show.legend = FALSE) +
geom_vline(aes(xintercept = time), color = "red",
data = sample_pp$insp_start) +
labs(title = "Pulse pressure",
subtitle = "by time. Color indicate respiratory cycle (`insp_n`)")
pp_insp_plot <- ggplot(beats_indexed,
aes(
insp_rel_index,
PP,
group = as.factor(insp_n),
color = as.factor(insp_n)
)
) +
geom_line(alpha = 0.3, show.legend = FALSE) +
geom_point(show.legend = FALSE) +
labs(subtitle = "by position in the respiratory cycle")
pp_plot_color + pp_insp_plot + plot_layout(widths = c(2,1))
```
We now have the variables we need to fit the model. For clarity, we select only the variables we need for the model.
```{r}
PP_data <- select(beats_indexed, PP, time, insp_rel_index)
head(PP_data)
```
## Fit and visualise the GAM
To fit the model, we use the `gam()` function from `mgcv`.
```{r}
PP_gam <- gam(
# The first parameter to the gam() function is the model specification,
# supplied using formula notation:
# Left of the tilde (~) is our dependent variable PP
PP ~
# Right of the tilde is our independent variables.
# Define a smooth function of insp_rel_index.
s(insp_rel_index,
k = 15, # 15 knots.
bs = "cc" # The basis is a cyclic cubic spline
) +
# Define a smooth function of time
s(time,
bs = "cr" # The basis is a natural cubic spline.
# default k is 10. This will be fine here.
),
# We can specify the positions of the knots for each smooth.
# If only two knots are specified for a cyclic spline, these will
# set the positions of the limiting knot(s). The remaining knots will
# be positioned automatically (for cubic splines, the default position
# is at quantiles of the independent variable).
knots = list(insp_rel_index = c(0,1)),
# We use restricted maximum likelihood (REML) to fit the optimal smoothing parameter.
# This is often the best choice, but not the default.
method = "REML",
data = PP_data
)
```
Now, plot the model using `gratia::draw()` (or `plot()`). Adding partial residuals is a simple way to visualise how well the model fits observed data and to identify systematic errors. Partial residuals are the model residuals + the specific smooth effect in the plot.
```{r fig.height=3.5, fig.width=7}
#| fig.cap = "Smooth effects of the pulse pressure GAM."
draw(PP_gam,
residuals = TRUE)
```
In addition to the two smooth effects, the model also has a constant/intercept, which is not visualised in the plots above. In this model, the intercept is the mean PP.
```{r}
coef(PP_gam)[1]
```
Predictions (or fit) can be calculated as
$$
\hat{PP} = \alpha + s(\text{insp_rel_index}) + s(\text{time}),
$$
where $\alpha$ is the model intercept.
We can use `predict()` to calculate model predictions.[^predict] If we only pass the GAM, it will make predictions using the observed independent variables we used to fit the model. We can add these predictions as a new column to our original dataset.
[^predict]: When `predict()` is called on a GAM (an object of class "gam"), the specific method `predict.gam()` is used. See `?predict.gam()` for help.
```{r opts.label='wide' }
#| fig.cap = "Observed and predicted pulse pressure."
PP_pred <- mutate(PP_data, pred = predict(PP_gam))
ggplot(PP_pred, aes(x=time)) +
geom_line(aes(y=PP, color = "Observed")) +
geom_point(aes(y=PP, color = "Observed")) +
geom_point(aes(y=pred, color = "Predicted"))
```
We can use the `newdata` parameter in `predict()` to interpolate predictions between our observations (it rarely makes sense to extrapolate a spline fit).
```{r opts.label='wide'}
#| fig.cap = "Observed and predicted pulse pressure, including predictions between observations."
PP_newdata <- tibble(
# create 200 points from 0 to 30 to make the prediction visually smooth
time = seq(0, 30, length.out = 200)) %>%
# index each new time to our existing vector of inspiration times
add_time_since_event(sample_pp$insp_start$time, prefix = "insp") %>%
na.omit()
PP_interpolate <- bind_cols(
PP_newdata,
predict(PP_gam,
newdata = PP_newdata,
# in addition to the predictions (fit) we can also return the standard error
# (se.fit) for each prediction. This makes predict return a named list,
# that we can simply bind as columns to our data frame.
se.fit = TRUE)
)
ggplot(PP_interpolate, aes(x = time)) +
geom_ribbon(aes(ymin = fit - 1.96*se.fit,
ymax = fit + 1.96*se.fit,
fill = "Predicted (95% CI)")) +
geom_line(aes(y = fit)) +
geom_point(aes(y=PP, color = "Observed"), data = PP_data) +
scale_fill_manual(values = "skyblue") +
labs(x = "PP")
```
## Calculate pulse pressure variation
We can calculate pulse pressure variation (PPV) from this model using the formula
$$
PPV = \frac{maximum(s(\text{insp_rel_index})) - minimum(s(\text{insp_rel_index}))}{\alpha},
$$
where $\alpha$ is the model constant (intercept). To find the extrema of the smooth, we can generate a grid of predictions using only the one smooth term `s(insp_rel_index)`. We could use `predict(type="terms")` but `gratia::smooth_estimates()` conveniently returns the values of the smooth over the original range of the independent variable (here `insp_rel_index`).
```{r}
insp_rel_index_smooth <- smooth_estimates(PP_gam,
smooth = "s(insp_rel_index)",
n=100)
min_PP <- min(insp_rel_index_smooth$est)
max_PP <- max(insp_rel_index_smooth$est)
intercept_PP <- coef(PP_gam)[1]
PPV_est <- (max_PP - min_PP) / intercept_PP
sprintf("PPV is %.1f%%", PPV_est*100)
```
This PPV is an estimate from the model. We should also report the uncertainty. `mgcv` lets us sample parameters from the posterior distribution of the model (similarly to posterior sampling from a Bayesian model; see [Gavin Simpson's answer on StackOverflow](https://stats.stackexchange.com/questions/190348/can-i-use-bootstrapping-to-estimate-the-uncertainty-in-a-maximum-value-of-a-gam) for an in-depth example of how this can be done). Here we need posterior samples of a smooth, and, conveniently, `gratia::smooth_samples()` does just that. For each sampled smooth, we can calculate PPV, and use the many samples of PPVs to calculate a confidence interval for PPV (this approach neglects that the model intercept is also an estimate, but since the standard error for the intercept is *very* small, this has negligible effect on the width of the confidence interval).
```{r}
#| fig.cap = "Plot of 50 of the 5000 sampled smooths from the posterior distribution of the respiratory cycle smooth `s(insp_rel_index)`."
set.seed(1)
# Sample 5000 smooths
insp_smooth_samples <- smooth_samples(PP_gam, term = "s(insp_rel_index)", n = 5000)
# Draw the first 50 samples.
ggplot(insp_smooth_samples %>% filter(draw <= 50), aes(.x1, value, group = draw)) +
geom_line(alpha = 0.1) +
labs(x="insp_rel_index")
```
```{r}
#| fig.cap = "Histogram of PPVs calculated from each of the 5000 sampled smooths from the posterior distribution of the respiratory cycle smooth `s(insp_rel_index)`."
# A function that returns PPV given a smooth and an intercept
calc_PPV <- function(smooth, intercept) {
min_PP <- min(smooth)
max_PP <- max(smooth)
(max_PP - min_PP) / unname(intercept)
}
PPV_samples <- insp_smooth_samples$value %>%
split(insp_smooth_samples$draw) %>%
sapply(calc_PPV, intercept = coef(PP_gam)[1])
PPV_95 <- quantile(PPV_samples, probs = c(0.025, 0.975))
ggplot(data.frame(PPV = PPV_samples), aes(x=PPV)) +
geom_histogram() +
geom_vline(aes(xintercept = PPV_est, color = "Estimate"),
data = data.frame()) +
geom_vline(aes(xintercept = PPV_95, color = "95% confidence interval"),
data = data.frame()) +
scale_x_continuous(labels = scales::label_percent())
```
# Central venous pressure
These examples correspond to the models shown in Fig. 5 and 6 in the paper.
## Load data
We use a section of a recording of central venous pressure (CVP) `sample_cvp$cvp`. The sample data is structured as a list. In addition to CVP, it also includes the start time of each inspiration `sample_cvp$insp_start` and of each QRS-complex `sample_cvp$qrs` and the interval in which 250 ml fluid is administered (`sample_cvp$fluid_start` to `sample_cvp$fluid_end`).
```{r}
sample_cvp <- readRDS("sample_CVP.RDS")
```
The CVP is stored as a data frame with a time column and a pressure column. The time is in seconds, the sample rate is 125 Hz and the pressure unit is mmHg.
```{r}
head(sample_cvp$cvp)
```
```{r cvp-overview, opts.label='wide', fig.height=4}
#| fig.cap = "Sample data for CVP model."
plot_cvp_full <- ggplot(sample_cvp$cvp, aes(time, CVP)) +
annotate("rect", xmin = sample_cvp$fluid_start, xmax = sample_cvp$fluid_end,
ymin = -Inf, ymax = Inf,
fill = alpha("blue", 0.4)) +
annotate("rect", xmin = sample_cvp$fluid_start-30, xmax = sample_cvp$fluid_start,
ymin = -Inf, ymax = Inf,
fill = alpha("green", 0.4)) +
geom_line() +
labs(title = "Full sample",
subtitle = "Blue area: administration of 250 ml fluid
Green area: section used to fit GAM in the first example")
plot_cvp_short <- ggplot(sample_cvp$cvp, aes(time, CVP)) +
geom_line() +
coord_cartesian(xlim = c(sample_cvp$fluid_start-30, sample_cvp$fluid_start)) +
geom_vline(aes(xintercept = time), color = "red",
data = sample_cvp$insp_start) +
geom_vline(aes(xintercept = time), color = "blue",
data = sample_cvp$qrs) +
labs(title = "Green area",
subtitle = "Blue lines: QRS-complexes
Red lines: inspiration start")
plot_cvp_full/plot_cvp_short
```
We want to fit the model:
$$
CVP = \alpha + f(pos_{cardiac}) + f(pos_{ventilation}) + f(pos_{cardiac},\ pos_{ventilation}) + f(t_{total}) + \epsilon.
$$
First, we need to calculate each CVP sample's position in both the cardiac and respiratory cycles. For the respiratory cycle, we will fit a cyclic spline based on the relative position (as in the pulse pressure example above). For the cardiac cycle, we cannot simply use a cyclic spline, as the cycles vary in length. We could use a cyclic spline based on the relative position in the cardiac cycle of a CVP sample, but that would assume that the CVP waveform of a long cardiac cycle is simply a stretched version of a short cycle. Instead we assume that the cardiac cycle effect depends on the time since the P wave (initiation of atrial contraction), without constraining it to be cyclic. Instead of trying to detect P waves from the ECG, we assume that they appear a constant interval before the QRS complex.
In the sample data, QRS complexes have already been detected from `sample_cvp$ecg`. This was done with [`rsleep::detect_rpeaks()`](https://cran.r-project.org/package=rsleep).
To find the PQ interval, we align 30 seconds of ECG recording by the detected QRS complexes.
```{r}
#| fig.cap = "ECG recording (30 seconds) aligned by QRS complexes."
sample_cvp$ecg %>%
add_time_since_event(sample_cvp$qrs$time-0.3, prefix = "before_qrs") %>%
na.omit() %>%
filter(time < 30) %>%
ggplot(aes(before_qrs_index-0.3, ECG_II, group = before_qrs_n)) +
geom_line(alpha = 0.3) +
scale_x_continuous(breaks = seq(-0.2, 0.4, by = 0.05)) +
labs(x = "Time - QRS-time [seconds]")
```
We can see that the P wave starts ~150 ms before the QRS complex.
In this example we will fit a GAM to the last 30 seconds before fluid administration starts (the green area in Figure \@ref(fig:cvp-overview)). We add each sample's position in both the cardiac cycle (starting at the P wave) and the respiratory cycle, and filter the data to the relevant section.
```{r}
PQ_interval <- 0.150 #seconds
cvp_df <- sample_cvp$cvp %>%
# QRS time - PQ interval = P wave time.
add_time_since_event(sample_cvp$qrs$time - PQ_interval, prefix = "P_wave") %>%
add_time_since_event(sample_cvp$insp_start$time, prefix = "insp")
# Select a 30 second sample before fluid administration.
cvp_df_30 <- filter(cvp_df,
between(time, sample_cvp$fluid_start-30, sample_cvp$fluid_start))
head(cvp_df_30)
```
In the above table, `P_wave_index` is the position in the cardiac cycle (seconds since latest P wave) and `insp_rel_index` is the position in the respiratory cycle (time since inspiration start relative to length of the respiratory cycle).
Before fitting the model, we can visualise (individually) the three effects we subsequently want to model (collectively): time, respiratory cycle and heart cycle.
```{r}
#| fig.cap = "Different visualizations of the data used for the CVP model."
cvp_time <- ggplot(cvp_df_30, aes(time, CVP)) +
geom_line() +
labs(title="Time")
cvp_insp <- ggplot(cvp_df_30, aes(insp_rel_index, CVP, group = insp_n)) +
geom_line() +
labs(title="Respiratory cycle (relative)")
cvp_p <- ggplot(cvp_df_30, aes(P_wave_index, CVP, group = P_wave_n)) +
geom_line() +
labs(title="Cardiac cycle (seconds since P wave)")
cvp_time / (cvp_insp + cvp_p)
```
## Fit and visualise the GAM
Now, we are ready to fit the model. This time we use `bam()` which is like `gam()`, but optimised for large datasets. First, we fit the model without correcting for autocorrelation of the residuals. A later section will demonstrate how to correct for autocorrelation.
```{r, opts.label='nooutput', cache=TRUE}
gam_cvp <- bam(
CVP ~
s(P_wave_index, bs = "cr", k = 40) +
s(insp_rel_index, bs = "cc", k = 30) +
# We create the interaction smooth with ti() rather than te() because the main
# effects, s(P_wave_index) and s(insp_rel_index), are also present in the model.
# Separating the interaction from the main effects, allows mgcv to fit it with
# separate smoothing parameters.
# 40 x 30 knots makes a highly flexible plane, that will often overfit the data
# (especially with a sample of just 30 seconds).
# We can reduce overfitting either by fixing the smoothing parameters, for the
# interaction smooth (with the `sp` parameter), or by using `bam`s `gamma`
# parameter to increase smoothness for entire model.
ti(
P_wave_index,
insp_rel_index,
bs = c("cr", "cc"),
k = c(40, 30)
) +
s(time, bs = "cr"), # If the detrending smooth captures some of the
# other effects (respiratory or cardiac) it may be necessary to set a high
# fixed smoothing parameter.
knots = list(insp_rel_index = c(0, 1)),
gamma = 5, # for a 125 Hz CVP waveform, `gamma = 5` seems to work well.
data = cvp_df_30
)
```
Visualise the smooth effects of the model.
```{r gam-cvp-draw}
#| fig.cap = "Smooth effects of the CVP GAM."
gratia::draw(gam_cvp, residuals = TRUE, rug = FALSE)
```
The model intercept (the mean CVP) is not visualised in the plots above.
```{r}
coef(gam_cvp)[1]
```
We can visualise how well our model fits the observed CVP.
```{r, opts.label='wide', fig.height=3}
#| fig.cap = "Observed, predicted and residual CVP"
cvp_df_pred <- mutate(cvp_df_30,
pred = predict(gam_cvp),
resid = resid(gam_cvp))
cvp_df_pred %>%
ggplot(aes(time, CVP)) +
geom_line() +
geom_line(aes(y = pred), color = "red") +
geom_line(aes(y = resid), color = "orange")
```
There is a small but clear pattern in the residuals one third into each respiratory cycle (most visible in the `s(insp_rel_index)` smooth in Figure \@ref(fig:gam-cvp-draw)). This corresponds to the closing of the ventilator solenoid valve at end-inspiration. The sudden drop in pressure makes the ventilator tubing move and disturb the adjacent CVP line. In the section below, we extend the respiratory cycle smooth to fit this residual signal.
## Adaptive smoothness
The respiratory cycle effect does not have a constant smoothness. It has a sharp drop at start-expiration. In addition, this sample recording has a quick disturbance at end-inspiration, as described in the section above.
There are different ways to allow a change in smoothness across a spline. One is transformation of the independent variable (here `insp_rel_index`) to "stretch" the section with high wiggliness. This is computationally efficient, but requires the change in smoothness to be known *a priori*.
The method demonstrated here uses a spline with adaptive smoothness. This is an extension of a smoothing spline, where the smoothing parameter (wiggiliness penalty) is not constant but allowed to vary over the independent variable. The smoothing parameter is, itself, a spline estimated from data.
```{r, opts.label='nooutput', cache=TRUE}
gam_cvp_ad <- bam(
CVP ~
s(P_wave_index, bs = "cr", k = 40) +
# Adaptive smooth `bs = "ad"`, `m` is equivalent to `k`, but sets the number
# knots for the spline that defines the adaptive penalty.
# The `xt` parameter is used to specify the smoothing spline
# (here a cyclic cubic spline).
# We increase k to 60 to enable the spline to fit the extra wiggly section.
s(insp_rel_index, bs = "ad", k = 60, m = 5, xt = list(bs = "cc")) +
# The interaction smooth does not have adaptive smoothness.
ti(
P_wave_index,
insp_rel_index,
bs = c("cr", "cc"),
k = c(40, 30)
) +
s(time, bs = "cr"),
knots = list(insp_rel_index = c(0,1)),
data = cvp_df_30,
gamma = 5
)
```
```{r}
#| fig.cap = "Smooth effects of the CVP GAM with adaptive smoothness."
gratia::draw(gam_cvp_ad, residuals = TRUE, rug = FALSE)
```
The disturbance at end-inspiration is now fitted in the respiratory cycle smooth.
## Autocorrelated residuals
One assumption of a GAM is that the residuals are independent. This will rarely be true when modelling high resolution waveforms. We can investigate the serial correlation of model residuals using `acf()`.
```{r}
#| fig.cap = "Serial correlation of residuals from the GAM above (gam_cvp_ad). The plot shows the correlation of residuals with the residual 0 to 35 points later. Zero points later correspond to the residual itself, and will always be 1."
auto_corr_gam_cvp_ad <- acf(residuals(gam_cvp_ad))
```
`bam()` allows modelling the residual errors as an AR(1) model (each residual ($\epsilon_t$) is some proportion (rho) of the previous residual + random error: $\epsilon_t = rho \times \epsilon_{t-1} + w_t, w_t \sim normal(0, \sigma)$)). `bam()` allows us to specify a fixed AR(1) correlation coefficient with the `rho` parameter, but we have to tune it manually. A good guess is the first-order autocorrelation from a model not accounting for autocorrelation as suggested by van Rij et al, 2019 (https://doi.org/10.1177/2331216519832483).
```{r}
(ar1 <- auto_corr_gam_cvp_ad$acf[2]) # $acf[1] is the unlagged correlation (= 1)
```
The AR(1) model implies that the correlation between residuals drops exponentially with the distance between them. Our CVP waveform has a sample rate of 125 Hz, so a correlation coefficient of `r round(ar1, 2)` per sample corresponds to the following relation between correlation and time:
```{r}
time_ms <- 0:500
correlation <- ar1^((time_ms/1000)*125)
plot(time_ms, correlation, type = "l",
main = sprintf("Expected autocorrelation given the AR(1) model
with rho = %0.2f at 125 Hz", ar1))
```
With this model, we expect very little correlation between residuals more than 200 ms apart.
We can now fit a GAM that expects autocorrelated residuals.
```{r, opts.label='nooutput', cache=TRUE}
# The model is identical to gam_cvp_ad except for the rho parameter.
gam_cvp_ad_AR <- bam(
CVP ~
s(P_wave_index, bs = "cr", k = 40) +
s(insp_rel_index, bs = "ad", k = 60, m = 5, xt = list(bs = "cc")) +
ti(
P_wave_index,
insp_rel_index,
bs = c("cr", "cc"),
k = c(40, 30)
) +
s(time, bs = "cr"),
knots = list(insp_rel_index = c(0,1)),
rho = 0.88, # correlation coefficient for AR(1) model of the residuals
gamma = 5,
data = cvp_df_30
)
```
```{r}
gratia::draw(gam_cvp_ad_AR, residual = TRUE, rug = FALSE)
```
The autocorrelation-corrected residuals are stored in `gam_cvp_ad_AR$std.rsd`. Again, we can use `acf()` to visualise the residual autocorrelation.
```{r}
acf(gam_cvp_ad_AR$std.rsd)
```
We can see that this has markedly reduced the autocorrelation in the residuals. The `rho` chosen with this approach is not guaranteed to be optimal. It is possible to search for the value of rho minimizing the REML-score, but this is computationally expensive as it requires refitting the model for each rho. For detail about estimating autocorrelation of residuals, see Simpson, 2018 (https://doi.org/10.3389/fevo.2018.00149).
## The full model with two sections of data: before and after fluid
To fit the model in the paper's Fig. 6, we first create a data frame that contains both 1-minute sections (before and after fluid).
```{r}
# Section before fluid
cvp_pre <- cvp_df %>%
filter(between(time, sample_cvp$fluid_start - 60, sample_cvp$fluid_start)) %>%
mutate(time_s = time - time[1])
# Section after fluid
cvp_post <- cvp_df %>%
# Select segment starting 60 seconds after fluid administration ends,
# to get more steady data
filter(between(time, sample_cvp$fluid_end + 60, sample_cvp$fluid_end + 120)) %>%
mutate(time_s = time - time[1])
# Combine sections
cvp_pre_post <- bind_rows("pre fluid" = cvp_pre,
"post fluid" = cvp_post,
.id = "section") %>%
mutate(section_f = factor(section, levels = c("pre fluid", "post fluid"))) %>%
group_by(section_f) %>%
# Create a variable that marks the start of a new section
# (used to indicate that residuals are not expected to be correlated between the end of one
# section and the beginning of the next).
mutate(section_start = row_number() == 1) %>%
ungroup()
```
With the dataset ready, we can fit the model:
```{r, cache=TRUE}
gam_cvp_fluid <- bam(
CVP ~
# In addition to the intercept, we fit an additional constant for
# all but the first section (here this is section_f = "post fluid").
section_f +
# With the `by` parameter, a separate smooth is estimated for each level of
# the given factor `section_f`.
s(P_wave_index, bs = 'cr', k = 40, by = section_f) +
s(insp_rel_index, bs = "ad", k = 60, m = 5,
xt = list(bs = "cc"), by = section_f) +
ti(
P_wave_index,
insp_rel_index,
bs = c('cr', 'cc'),
# We reduce the complexity of the model by reducing the
# dimensions of the interaction term.
k = c(30, 30),
by = section_f
) +
s(time_s, by = section_f),
knots = list(insp_rel_index = c(0,1)),
nthreads = 4,
rho = 0.94,
# The AR.start parameter marks breaks in the autocorrelation structure.
AR.start = section_start,
# In this example we use the gamma parameter to enforce extra smoothing
# of the model.
gamma = 5,
data = cvp_pre_post,
)
```
```{r gam-cvp-fluid-draw, fig.width=10, fig.height=8, out.width = "100%"}
#| fig.cap = "Smooth effects of the GAM of CVP before and after fluid."
gratia::draw(gam_cvp_fluid, residuals = TRUE, rug = FALSE)
```
In the paper's Fig. 6 the fit is visualised as two contour plots that sum all the effects except detrending (`s(time_s)`). Here is a similar visualisation:
```{r, fig.width=10, fig.height=5, out.width = "100%"}
#| fig.cap = "Combined visualization of the GAM of CVP before and after fluid, excluding the detrending smooth."
#|
# Create a new dataset containing a grid of
# `P_wave_index` and `insp_rel_index` to generate predictions over.
# Use a grid with a resolution of 100x100.
cvp_contour_data <- expand.grid(
section_f = factor(c("pre fluid", "post fluid"), levels = c("pre fluid", "post fluid")),
P_wave_index = seq(0, 0.7, length.out = 100),
insp_rel_index = seq(0, 1, length.out = 100),
# We will not use `time` in our prediction, but predict.gam() expects the
# variable to be present.
time_s = 9999)
cvp_contour_data$CVP_pred <- predict(gam_cvp_fluid,
newdata = cvp_contour_data,
exclude = c("s(time_s):section_fpost fluid",
"s(time_s):section_fpre fluid"))
ggplot(cvp_contour_data, aes(x = P_wave_index, y=insp_rel_index, z = CVP_pred)) +
geom_raster(aes(fill = CVP_pred)) +
geom_contour(binwidth = 1, color = "black") +
scale_fill_viridis_c() +
guides(fill = guide_colorbar(barheight = 15, title = "CVP")) +
facet_wrap(vars(section_f))
```
## Other common challenges
There is more noise in the CVP waveform in the 30 seconds before the green area in Figure \@ref(fig:cvp-overview). We will use that section of data to demonstrate a few common challenges.
```{r}
#| fig.cap = "Different visualizations of the noisy data used for the second CVP model."
cvp_df_noise <- sample_cvp$cvp %>%
filter(between(time, sample_cvp$fluid_start-60, sample_cvp$fluid_start-30)) %>%
# QRS time - PQ interval = P wave time.
add_time_since_event(sample_cvp$qrs$time - PQ_interval, prefix = "P_wave") %>%
add_time_since_event(sample_cvp$insp_start$time, prefix = "insp")
cvp_time <- ggplot(cvp_df_noise, aes(time, CVP)) +
geom_line()
cvp_insp <- ggplot(cvp_df_noise, aes(insp_rel_index, CVP, group = insp_n)) +
geom_line()
cvp_p <- ggplot(cvp_df_noise, aes(P_wave_index, CVP, group = P_wave_n)) +
geom_line()
cvp_time / (cvp_insp + cvp_p)
```
In the first iteration, we will use fewer knots in the cardiac smooth.
```{r, opts.label='nooutput', cache=TRUE}
gam_cvp_noise1 <- bam(
CVP ~
s(P_wave_index, bs = "cr", k = 15) + # only 15 knots
s(insp_rel_index, bs = "ad", k = 60, m = 5,
xt = list(bs = "cc")) +
ti(
P_wave_index,
insp_rel_index,
bs = c("cr", "cc"),
k = c(15, 30) # only 15 knots in the cardiac dimension
) +
s(time, bs = "cr"),
knots = list(insp_rel_index = c(0, 1)),
data = cvp_df_noise
)
```
```{r}
#| fig.cap = "Smooth effects of the noisy CVP GAM."
gratia::draw(gam_cvp_noise1, residuals = TRUE, rug = FALSE)
```
Here, the interaction smooth clearly overfits the data, and it looks like the cardiac smooth is not flexible enough to match the sharp x' peak (the most negative peak in this sample).
Optimally, the shape of a spline should be limited by the wiggliness penalty (smoothing parameter) and not the number of knots. We can use `k.check` the see if any splines are limited by their number of knots.
```{r}
k.check(gam_cvp_noise1)
```
In this table, `k'` is the maximum possible degrees of freedom (d.f.) of the smooth. This is one less than the number knots, as one d.f. is used to constrain the spline to have a sum of zero (see `?mgcv::identifiability`). For cyclic splines it is two less than `k`, as the two limiting knots are effectively a single knot. `edf` is the effective degrees of freedom after the smoothing penalty. A `k-index` < 1 indicates there is some residual pattern that is not contained in the smooth. If `k-index` is low and `edf` is close to `k'`, the smooth probably has too few knots. On the other hand, excessive knots is not a problem, (except for the higher computational cost).
We increase `k` to 40 for the cardiac smooth and the cardiac dimension in the interaction. To reduce overfitting from the interaction term, we increase `gamma` to make the model smoother.
```{r, opts.label='nooutput', cache=TRUE}
gam_cvp_noise2 <- bam(
CVP ~
s(P_wave_index, bs = "cr", k = 40) + # k has been increased
s(insp_rel_index, bs = "ad", k = 60, m = 5,
xt = list(bs = "cc")) +
ti(
P_wave_index,
insp_rel_index,
bs = c("cr", "cc"),
k = c(40, 30) # k has been increased for the cardiac dimension
) +
s(time, bs = "cr"),
knots = list(insp_rel_index = c(0,1)),
data = cvp_df_noise,
gamma = 5
)
```
```{r}
#| fig.cap = "Smooth effects of a revised version the noisy CVP GAM. This time with more knots in the cardiac smooth (s(P_wave_index)) and fixed smoothing parameters of the interaction smooth."
gratia::draw(gam_cvp_noise2, residuals = TRUE, rug = FALSE)
```
```{r, opts.label='wide', fig.height=3}
cvp_df_noise_pred2 <- mutate(cvp_df_noise,
pred = predict(gam_cvp_noise2),
resid = resid(gam_cvp_noise2))
ggplot(cvp_df_noise_pred2, aes(time, CVP)) +
geom_line() +
geom_line(aes(y = pred), color = "red")+
geom_line(aes(y = resid), color = "orange")
```
# Session info
*This section shows information about the system used to generate this pdf file*
Rendered: `r Sys.time()`
```{r session-info, include=TRUE, echo=TRUE}
sessionInfo()
```