---
title: 'Tree-based ML for insurance pricing'
author: 'Roel Henckaerts, Marie-Pier Côté, Katrien Antonio and Roel Verbelen'
output:
  html_notebook:
    toc: true
    toc_float: true
    code_folding: none
    theme: cosmo
---
```{css, echo = FALSE}
pre code, pre, code {
white-space: pre !important;
overflow-x: scroll !important;
word-break: keep-all !important;
word-wrap: initial !important;
}
```
<style>
body { text-align: justify}
</style>
This R Markdown Notebook illustrates the techniques used in the paper "Boosting insights in insurance tariff plans with tree-based machine learning methods" via code examples. The full paper is available on [arXiv](https://arxiv.org/abs/1904.10890) and is published in the [North American Actuarial Journal](https://www.tandfonline.com/toc/uaaj20/current).
```{r setup, include = FALSE}
library(tidyverse)
ggplot2::theme_set(theme_bw())
ggplot2::theme_update(text = element_text(size = 20))
```
## MTPL portfolio
We study a motor third party liability (MTPL) portfolio. Details on the data are available in Section 4.1 of the paper.
```{r data_read}
mtpl_data <- readRDS('mtpl_data.rds')
str(mtpl_data)
```
The following variables are used as predictor features; see Table 7 in Appendix A of the paper for a description.
```{r data_vars}
features <- c('coverage', 'fuel', 'sex', 'use', 'fleet',
'ageph', 'power', 'agec', 'bm',
'long', 'lat')
```
## Train and test split
We follow the same data partition approach as discussed in the "Cross-validation" paragraph of Section 3.3 in the paper. It partitions the data into six subsets and allows us to replicate the summary statistics of Table 2 as follows:
```{r data_fold}
# Add the fold indicator
mtpl_data <- mtpl_data %>% dplyr::arrange(nclaims, amount, expo) %>%
dplyr::mutate(fold = paste0('data', rep(seq_len(6), length = nrow(mtpl_data))))
# Calculate summary statistics
mtpl_data %>% dplyr::group_by(fold) %>%
dplyr::summarise(sum(nclaims) / sum(expo),
sum(amount) / sum(nclaims))
```
In this Notebook we assign $\mathcal{D}_3$ as the test set and the collection $\{ \mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_4, \mathcal{D}_5, \mathcal{D}_6 \}$ as the training set. This matches the setup of "Data fold 3" in Figure 3 of the paper with the training data in blue and the test data in red.
```{r data_split}
# Subset the training data
mtpl_trn <- mtpl_data %>% dplyr::filter(fold != 'data3')
# Subset the test data
mtpl_tst <- mtpl_data %>% dplyr::filter(fold == 'data3')
```
This results in `r nrow(mtpl_trn)`/`r nrow(mtpl_tst)` observations in the train/test data respectively. In the following modeling sections, we use the hyper-parameter settings from Table 1 and the optimal tuning parameter settings from Table 3 for data fold 3.
## Frequency modeling
We start by modeling claim frequency, i.e., the number of claims a policyholder is expected to file with the insurer. It is important to account for the period of exposure-to-risk, i.e., the fraction of the year that a policyholder was covered by the policy and therefore exposed to the risk of filing a claim. The number of claims and exposure period are available in `mtpl_data` as the variables `nclaims` and `expo` respectively:
```{r freq_data, fig.width=8, fig.height=4}
gridExtra::grid.arrange(
ggplot(mtpl_data, aes(x = nclaims)) +
geom_bar(aes(y = (..count..)/sum(..count..)), col = 'black') +
labs(y = 'Proportion') + scale_y_continuous(labels = scales::percent),
ggplot(mtpl_data, aes(x = expo)) +
geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = 0.05, col = 'black') +
labs(y = 'Proportion') + scale_y_continuous(labels = scales::percent),
ncol = 2
)
```
### Regression tree
We fit a Poisson regression tree for claim frequency with the `rpart()` function and `method = 'poisson'` from the `distRforest` package (available from Roel Henckaerts's [GitHub](https://github.com/henckr/distRforest)):
```{r freq_tree}
# devtools::install_github('henckr/distRforest')
library(distRforest)
tree_freq <- distRforest::rpart(
formula = as.formula(paste('cbind(expo, nclaims) ~', paste(features, collapse = ' + '))),
data = mtpl_trn,
method = 'poisson',
parms = list(shrink = 0.125), # gamma in Table 3
control = rpart.control(cp = 1.1e-4, # cp in Table 3
minbucket = 0.01 * nrow(mtpl_trn), # kappa in Table 1
xval = 0,
maxcompete = 0,
maxsurrogate = 0)
)
```
*Note: [`distRforest`](https://henckr.github.io/distRforest/) is an extension of `rpart` and allows for severity distributions and random forest generation.*
### Random forest
We fit a Poisson random forest for claim frequency with the `rforest()` function and `method = 'poisson'` from the `distRforest` package (available from Roel Henckaerts's [GitHub](https://github.com/henckr/distRforest)):
```{r freq_rf}
set.seed(54321)
rf_freq <- distRforest::rforest(
formula = as.formula(paste('cbind(expo, nclaims) ~', paste(features, collapse = ' + '))),
data = mtpl_trn,
method = 'poisson',
ntrees = 400, # T in Table 3
ncand = 8, # m in Table 3
subsample = 0.75, # delta in Table 1
parms = list(shrink = 0.25), # gamma in Table 1
control = rpart.control(cp = 0, # cp in Table 1
minbucket = 0.01 * 0.75 * nrow(mtpl_trn), # kappa * delta in Table 1
xval = 0,
maxcompete = 0,
maxsurrogate = 0),
red_mem = TRUE # reduces the memory footprint of individual rpart trees
)
```
### Gradient boosting machine
We fit a Poisson GBM for claim frequency with the `gbm()` function and `distribution = 'poisson'` from the `gbm` package (available from Harry Southworth's [GitHub](https://github.com/harrysouthworth/gbm)):
```{r freq_gbm, message = FALSE}
# devtools::install_github('harrysouthworth/gbm')
library(gbm)
set.seed(54321)
gbm_freq <- gbm(
formula = as.formula(paste('nclaims ~ offset(log(expo)) +', paste(features, collapse = ' + '))),
data = mtpl_trn,
distribution = 'poisson',
n.trees = 1400, # T in Table 3
interaction.depth = 5, # d in Table 3
shrinkage = 0.01, # lambda in Table 1
bag.fraction = 0.75, # delta in Table 1
n.minobsinnode = 0.01 * 0.75 * nrow(mtpl_trn), # kappa * delta in Table 1
verbose = FALSE
)
```
*Note: Unlike the CRAN version of `gbm`, Harry Southworth's implementation allows for a gamma distribution for claim severity.*
## Severity modeling
Next, we also model claim severity, i.e., the expected loss amount for a claim if it gets filed. The variable `amount` in `mtpl_data` contains the total loss amount over all claims filed by a policyholder. In order to get an estimate of the individual claim severity, the variable `average` contains the total loss amount divided by the number of claims. Only the observations containing a claim can be used to model claim severity:
```{r sev_data, fig.width=8, warning = FALSE, fig.height=4}
# Only retain the claims
mtpl_trn_claims <- mtpl_trn %>% dplyr::filter(nclaims > 0)
# Plot the density of all observations and those below 10 000 Euro
gridExtra::grid.arrange(
ggplot(mtpl_trn_claims, aes(x = average)) +
geom_density(adjust = 3, col = 'black', fill = 'gray') +
labs(y = 'Density'),
ggplot(mtpl_trn_claims, aes(x = average)) +
geom_density(adjust = 3, col = 'black', fill = 'gray') +
labs(y = 'Density') + xlim(0, 1e4),
ncol = 2
)
```
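As a quick sanity check of the `average` variable described above, we can verify that it equals `amount` divided by `nclaims` for the retained claim records (a minimal check, assuming `average` was constructed exactly this way):
```{r sev_data_check}
# Should be (close to) zero if average equals amount divided by nclaims
mtpl_trn_claims %>%
  dplyr::summarise(max_abs_diff = max(abs(average - amount / nclaims)))
```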
### Regression tree
We fit a gamma regression tree for claim severity with the `rpart()` function and `method = 'gamma'` from the `distRforest` package (available from Roel Henckaerts's [GitHub](https://github.com/henckr/distRforest)):
```{r sev_tree}
tree_sev <- distRforest::rpart(
formula = as.formula(paste('average ~', paste(features, collapse = ' + '))),
data = mtpl_trn_claims,
weights = nclaims,
method = 'gamma',
control = rpart.control(cp = 3.7e-3, # cp in Table 3
minbucket = 0.01 * nrow(mtpl_trn_claims), # kappa in Table 1
xval = 0,
maxcompete = 0,
maxsurrogate = 0)
)
```
### Random forest
We fit a gamma random forest for claim severity with the `rforest()` function and `method = 'gamma'` from the `distRforest` package (available from Roel Henckaerts's [GitHub](https://github.com/henckr/distRforest)):
```{r sev_rf}
set.seed(54321)
rf_sev <- distRforest::rforest(
formula = as.formula(paste('average ~', paste(features, collapse = ' + '))),
data = mtpl_trn_claims,
weights = nclaims,
method = 'gamma',
ntrees = 600, # T in Table 3
ncand = 1, # m in Table 3
subsample = 0.75, # delta in Table 1
control = rpart.control(cp = 0, # cp in Table 1
minbucket = 0.01 * 0.75 * nrow(mtpl_trn_claims), # kappa * delta in Table 1
xval = 0,
maxcompete = 0,
maxsurrogate = 0),
red_mem = TRUE # reduces the memory footprint of individual rpart trees
)
```
### Gradient boosting machine
We fit a gamma GBM for claim severity with the `gbm()` function and `distribution = 'gamma'` from the `gbm` package (available from Harry Southworth's [GitHub](https://github.com/harrysouthworth/gbm)):
```{r sev_gbm, message = FALSE}
set.seed(54321)
gbm_sev <- gbm(
formula = as.formula(paste('average ~', paste(features, collapse = ' + '))),
data = mtpl_trn_claims,
weights = nclaims,
distribution = 'gamma',
n.trees = 500, # T in Table 3
interaction.depth = 1, # d in Table 3
shrinkage = 0.01, # lambda in Table 1
bag.fraction = 0.75, # delta in Table 1
n.minobsinnode = 0.01 * 0.75 * nrow(mtpl_trn_claims), # kappa * delta in Table 1
verbose = FALSE
)
```
## Prediction functions
Now that all ML models are fit, we can start analyzing them. To streamline this process, we define a prediction function `predict_model` that can be applied to the different models in a uniform way:
```{r pred_fun}
# Generic prediction function
predict_model <- function(object, newdata) UseMethod('predict_model')
# Prediction function for a regression tree
predict_model.rpart <- function(object, newdata) {
predict(object, newdata, type = 'vector')
}
# Prediction function for a random forest
predict_model.rforest <- function(object, newdata) {
predict(object, newdata)
}
# Prediction function for a GBM
predict_model.gbm <- function(object, newdata) {
predict(object, newdata, n.trees = object$n.trees, type = 'response')
}
```
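As a minimal illustration (using the fitted models and test data defined earlier in this Notebook), the same call now works for any of the models; the generic dispatches on the class of the fitted object:
```{r pred_fun_example}
# Expected claim frequency for the first few test observations, per model
tree_freq %>% predict_model(newdata = head(mtpl_tst))
gbm_freq %>% predict_model(newdata = head(mtpl_tst))
```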
## Model interpretation
We now focus on extracting insights from the ML models with variable importance scores, partial dependence plots (PDPs) and individual conditional expectations (ICEs). Details are available in Sections 3.4 and 4.3 of the paper.
### Variable importance
A first interesting insight is which features are important according to the model. This information can be obtained in the following way for the different models considered:
+ regression tree: access importance scores of an `rpart` object via `object$variable.importance`
+ random forest: apply the function `distRforest::importance_rforest` to the `rforest` object
+ gradient boosting machine: apply the function `summary` (with optional arguments) to the `gbm` object
These three options are coded in the function `var_imp` below to generate a uniform output. The function `plot_var_imp` takes the calculated importance scores and shows the results in a `ggplot` bar chart.
```{r var_imp_fun}
var_imp <- function(object) {
# Calculate non-normalized importance scores based on the model class
switch(class(object)[1],
'rpart' = data.frame(variable = names(object$variable.importance),
importance = object$variable.importance),
'rforest' = object %>% distRforest::importance_rforest() %>%
dplyr::select(variable, importance),
'gbm' = object %>% summary(plotit = FALSE, normalize = FALSE) %>%
setNames(c('variable', 'importance'))
) %>%
# Normalize the scores to sum to one
dplyr::mutate(scale_sum = round(importance / sum(importance), digits = 4))
}
plot_var_imp <- function(data) {
data %>% ggplot(aes(x = reorder(variable, scale_sum), y = scale_sum)) +
geom_bar(stat = 'identity') + coord_flip() +
labs(x = '', y = 'importance')
}
```
The variable importance scores for the three frequency ML models are shown below. These are the green bars for data fold 3 in the left panels of Figure 5 in the paper.
```{r var_imp_freq, fig.width = 8, fig.height=4}
gridExtra::grid.arrange(tree_freq %>% var_imp %>% plot_var_imp,
rf_freq %>% var_imp %>% plot_var_imp,
gbm_freq %>% var_imp %>% plot_var_imp,
ncol = 3)
```
The variable importance scores for the three severity ML models are shown below. These are the green bars for data fold 3 in the right panels of Figure 5 in the paper.
```{r var_imp_sev, fig.width = 8, fig.height=4}
gridExtra::grid.arrange(tree_sev %>% var_imp %>% plot_var_imp,
rf_sev %>% var_imp %>% plot_var_imp,
gbm_sev %>% var_imp %>% plot_var_imp,
ncol = 3)
```
### PDPs
We use partial dependence plots (PDPs) to gain insight into the relation between a feature and the target. The function `par_dep` performs the essential steps to generate such a PD effect. The following steps are performed for each value in a predefined grid of the variable of interest:
+ use the original training data (or a subset to speed up calculations)
+ change the value of the variable of interest to the current value in the grid for all observations
+ predict the model on this altered data set
+ calculate the mean of all these predictions to get the PD effect for the current grid value
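Written out (in generic notation, not necessarily that of the paper), for a feature of interest $x_\ell$ and grid value $x$, the code below computes the partial dependence as the average prediction

$$\bar{f}_{\ell}(x) = \frac{1}{n} \sum_{i=1}^{n} f\big(x, \boldsymbol{x}_{i, -\ell}\big),$$

where $f$ is the fitted model and $\boldsymbol{x}_{i,-\ell}$ denotes the observed values of all other features for observation $i$.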
```{r pdp_fun}
par_dep <- function(object, data, grid) {
# Initialize a vector to save the effect
pd_effect <- rep(0, nrow(grid))
# Iterate over the grid values to calculate the effect
for (i in seq_len(length(pd_effect))) {
pd_effect[i] <-
data %>%
dplyr::mutate(!! names(grid) := grid[i, ]) %>%
predict_model(object, newdata = .) %>%
mean()
}
return(pd_effect)
}
```
We will now use this function to generate the PD effect for the age of the policyholder in the frequency models:
```{r pdp_data, warning = FALSE}
# Use a random sample of the training observations
set.seed(54321)
mtpl_trn_sample <- mtpl_trn[sample(seq_len(nrow(mtpl_trn)), size = 10000), ]
# Define the grid for the ages
grid_ageph <- data.frame('ageph' = 18:90)
# Calculate the PD effect for each ML model
grid_ageph <- grid_ageph %>%
dplyr::mutate(tree = tree_freq %>% par_dep(data = mtpl_trn_sample,
grid = grid_ageph),
rf = rf_freq %>% par_dep(data = mtpl_trn_sample,
grid = grid_ageph),
gbm = gbm_freq %>% par_dep(data = mtpl_trn_sample,
grid = grid_ageph))
```
After some reshaping, we can plot these effects on top of each other. The effects for the tree and the GBM correspond to the green lines in the bottom panels of Figure 6.
```{r pdp_plot, fig.width = 8, fig.height=4}
grid_ageph %>% reshape2::melt(id.vars = 'ageph',
value.name = 'pd',
variable.name = 'method') %>%
ggplot(aes(x = ageph, y = pd)) +
geom_line(aes(group = method, colour = method))
```
### ICEs
An individual conditional expectation (ICE) curve is generated in a way very similar to a PDP. The same steps as listed above are followed, except for the last one: an ICE curve shows the individual predictions instead of averaging them (as in a PDP). The function `ice` is therefore very similar to `par_dep`:
```{r ice_fun}
ice <- function(object, data, grid) {
# Initialize a matrix to save the effect
ice_effect <- matrix(0, nrow = nrow(grid), ncol = nrow(data))
# Iterate over the grid values to calculate the effect
for (i in seq_len(nrow(ice_effect))) {
ice_effect[i, ] <-
data %>%
dplyr::mutate(!! names(grid) := grid[i, ]) %>%
predict_model(object, newdata = .)
}
return(cbind(grid, ice_effect))
}
```
We will now use this function to generate the ICE effect for the bonus-malus level in the frequency models:
```{r ice_data, warning = FALSE}
# Use a random sample of the training observations
set.seed(54321)
mtpl_trn_sample <- mtpl_trn[sample(seq_len(nrow(mtpl_trn)), size = 1000), ]
# Define the grid for the bonus-malus levels
grid_bm <- data.frame('bm' = 0:22)
# Calculate the ICE effect
ice_tree <- tree_freq %>% ice(data = mtpl_trn_sample,
grid = grid_bm)
ice_gbm <- gbm_freq %>% ice(data = mtpl_trn_sample,
grid = grid_bm)
```
After some reshaping we plot these ICE curves with the PD effect on top, as in Figure 8 of the paper:
```{r ice_plot, fig.width = 8, fig.height=4}
gridExtra::grid.arrange(
ice_tree %>% reshape2::melt(id.vars = 'bm',
value.name = 'ice',
variable.name = 'observation') %>%
dplyr::group_by(bm) %>%
dplyr::mutate(pd = mean(ice)) %>%
ggplot(aes(x = bm)) +
geom_line(aes(y = ice, group = observation), color = 'grey', alpha = 0.1) +
geom_line(aes(y = pd), size = 1, color = 'navy'),
ice_gbm %>% reshape2::melt(id.vars = 'bm',
value.name = 'ice',
variable.name = 'observation') %>%
dplyr::group_by(bm) %>%
dplyr::mutate(pd = mean(ice)) %>%
ggplot(aes(x = bm)) +
geom_line(aes(y = ice, group = observation), color = 'grey', alpha = 0.1) +
geom_line(aes(y = pd), size = 1, color = 'navy'),
ncol = 2
)
```
## Interaction effects
Tree-based models are often praised for their ability to detect interaction effects between variables. Friedman's *H*-statistic estimates the interaction strength by measuring how much of the prediction variance originates from the interaction; see Section 4.4 in the paper for the details. The function `interact.gbm` calculates the *H*-statistic for a `gbm` object. (*Note: the function `interact.gbm` is not exported in Harry Southworth's version, so we include it in the Rmd source of this Notebook.*)
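For two features $\ell$ and $m$, the squared *H*-statistic can be written (in generic notation, using centred partial dependence functions $\bar{f}$) as

$$H^2_{\ell m} = \frac{\sum_{i=1}^{n} \Big[ \bar{f}_{\ell m}(x_{i\ell}, x_{im}) - \bar{f}_{\ell}(x_{i\ell}) - \bar{f}_{m}(x_{im}) \Big]^2}{\sum_{i=1}^{n} \bar{f}_{\ell m}(x_{i\ell}, x_{im})^2},$$

and `interact.gbm` returns $H = \sqrt{H^2_{\ell m}}$. The implementation below additionally weights the sums by the number of observations sharing each unique feature combination.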
```{r interact_fun, include=FALSE}
interact.gbm <- function(x, data, i.var = 1, n.trees = x$n.trees){
###############################################################
# Do sanity checks on the call
if (x$interaction.depth < length(i.var)){
stop("interaction.depth too low in model call")
}
if (all(is.character(i.var))){
i <- match(i.var, x$var.names)
if (any(is.na(i))) {
stop("Variables given are not used in gbm model fit: ", i.var[is.na(i)])
}
else {
i.var <- i
}
}
if ((min(i.var) < 1) || (max(i.var) > length(x$var.names))) {
warning("i.var must be between 1 and ", length(x$var.names))
}
if (n.trees > x$n.trees) {
warning(paste("n.trees exceeds the number of trees in the model, ",
x$n.trees,". Using ", x$n.trees, " trees.", sep = ""))
n.trees <- x$n.trees
}
# End of sanity checks
###############################################################
unique.tab <- function(z,i.var) {
a <- unique(z[,i.var,drop=FALSE])
a$n <- table(factor(apply(z[,i.var,drop=FALSE],1,paste,collapse="\r"),
levels=apply(a,1,paste,collapse="\r")))
return(a)
}
# convert factors
for(j in i.var) {
if(is.factor(data[,x$var.names[j]]))
data[,x$var.names[j]] <-
as.numeric(data[,x$var.names[j]])-1
}
# generate a list with all combinations of variables
a <- apply(expand.grid(rep(list(c(FALSE,TRUE)), length(i.var)))[-1,],1,
function(x) as.numeric(which(x)))
FF <- vector("list",length(a))
for(j in 1:length(a)) {
FF[[j]]$Z <- data.frame(unique.tab(data, x$var.names[i.var[a[[j]]]]))
FF[[j]]$n <- as.numeric(FF[[j]]$Z$n)
FF[[j]]$Z$n <- NULL
FF[[j]]$f <- .Call("gbm_plot",
X = as.double(data.matrix(FF[[j]]$Z)),
cRows = as.integer(nrow(FF[[j]]$Z)),
cCols = as.integer(ncol(FF[[j]]$Z)),
n.class = as.integer(x$num.classes),
i.var = as.integer(i.var[a[[j]]] - 1),
n.trees = as.integer(n.trees),
initF = as.double(x$initF),
trees = x$trees,
c.splits = x$c.splits,
var.type = as.integer(x$var.type),
PACKAGE = "gbm")
# FF[[jj]]$Z is the data, f is the predictions, n is the number of levels for factors
# Need to restructure f to deal with multinomial case
FF[[j]]$f <- matrix(FF[[j]]$f, ncol=x$num.classes, byrow=FALSE)
# center the values
FF[[j]]$f <- apply(FF[[j]]$f, 2, function(x, w){
x - weighted.mean(x, w, na.rm=TRUE)
}, w=FF[[j]]$n)
# precompute the sign of these terms to appear in H
FF[[j]]$sign <- ifelse(length(a[[j]]) %% 2 == length(i.var) %% 2, 1, -1)
}
H <- FF[[length(a)]]$f
for(j in 1:(length(a)-1)){
i1 <- apply(FF[[length(a)]]$Z[,a[[j]], drop=FALSE], 1, paste, collapse="\r")
i2 <- apply(FF[[j]]$Z,1,paste,collapse="\r")
i <- match(i1, i2)
H <- H + with(FF[[j]], sign*f[i,])
}
# Compute H
w <- matrix(FF[[length(a)]]$n, ncol=1)
f <- matrix(FF[[length(a)]]$f^2, ncol=x$num.classes, byrow=FALSE)
top <- apply(H^2, 2, weighted.mean, w = w, na.rm = TRUE)
btm <- apply(f, 2, weighted.mean, w = w, na.rm = TRUE)
H <- top / btm
if (x$distribution$name=="multinomial"){
names(H) <- x$classes
}
# If H > 1, rounding and tiny main effects have messed things up
H[H > 1] <- NaN
return(sqrt(H))
}
```
We now calculate two-way interaction strengths between variables and verify some values in Table 4 of the paper:
```{r interact_gbm}
gbm_freq %>% interact.gbm(data = mtpl_trn,
i.var = c('fuel', 'power')) %>% round(4)
gbm_freq %>% interact.gbm(data = mtpl_trn,
i.var = c('ageph', 'sex')) %>% round(4)
gbm_freq %>% interact.gbm(data = mtpl_trn,
i.var = c('agec', 'coverage')) %>% round(4)
gbm_freq %>% interact.gbm(data = mtpl_trn,
i.var = c('ageph', 'power')) %>% round(4)
```
We use the partial dependence effects of a variable, grouped by another variable, to gain insight into the interaction behavior between those two variables. The function `par_dep_by` allows us to generate such grouped PD effects:
```{r pdp_by_fun}
par_dep_by <- function(object, data, grid, by_var, ngroups = NULL) {
# Initialize a matrix to save the effect
ice_effect <- matrix(0, nrow = nrow(data), ncol = nrow(grid))
# Iterate over the grid values to calculate the effect
for (i in seq_len(ncol(ice_effect))) {
ice_effect[, i] <-
data %>%
dplyr::mutate(!! names(grid) := grid[i, ]) %>%
predict_model(object, newdata = .)
}
# Add the grouping variable to the effect
pd_gr <- data %>% dplyr::select(!! by_var) %>%
cbind(ice_effect)
# Bin the grouping variable in groups
if (!is.null(ngroups)) {
bins <- data %>% dplyr::pull(!! by_var) %>%
cut(breaks = unique(quantile(., probs = seq(0, 1, 1/ngroups))),
include.lowest = TRUE, dig.lab = 4)
pd_gr <- pd_gr %>% dplyr::mutate(!! by_var := bins)
}
# Calculate the PD effect for each group
pd_gr <- pd_gr %>% dplyr::group_by_at(by_var) %>%
dplyr::summarise_all(mean) %>%
dplyr::rename(setNames(as.character(seq_len(nrow(grid))), unlist(grid)))
# Center the PD effects to start from zero
pd_gr[2:ncol(pd_gr)] <- pd_gr[2:ncol(pd_gr)] - pd_gr[[2]]
return(pd_gr)
}
```
We calculate the PD effect for `power`, grouped by `ageph` and `fuel`. Note that we use `ngroups = 5` for the continuous variable `ageph` but `ngroups = NULL` for the factor variable `fuel` (one group for each factor level).
```{r pdp_by_data, warning=FALSE}
pd_power_ageph <- gbm_freq %>% par_dep_by(data = mtpl_trn_sample,
grid = data.frame('power' = 30:150),
by_var = 'ageph',
ngroups = 5)
pd_power_fuel <- gbm_freq %>% par_dep_by(data = mtpl_trn_sample,
grid = data.frame('power' = 30:150),
by_var = 'fuel',
ngroups = NULL)
```
We visualize the interaction behavior between `power` and both `ageph` and `fuel`. More graphs are available in Figure 9 in the paper. (*Note: the bin labels for `ageph` differ because a smaller sample size is used in this Notebook.*)
```{r pdp_by_plot, fig.width=8, fig.height=4}
gridExtra::grid.arrange(
pd_power_ageph %>% reshape2::melt(id.var = 'ageph',
value.name = 'pd_group',
variable.name = 'power') %>%
dplyr::mutate(power = as.numeric(as.character(power))) %>%
ggplot(aes(x = power, y = pd_group)) +
geom_line(aes(group = ageph, colour = ageph)),
pd_power_fuel %>% reshape2::melt(id.var = 'fuel',
value.name = 'pd_group',
variable.name = 'power') %>%
dplyr::mutate(power = as.numeric(as.character(power))) %>%
ggplot(aes(x = power, y = pd_group)) +
geom_line(aes(group = fuel, colour = fuel)),
ncol = 2
)
```
## Statistical performance
After gaining some insights from the different ML models, we now focus on comparing their out-of-sample performance on the test data `mtpl_tst`. We generate predictions from each ML model on the test data:
```{r pred_data, warning=FALSE}
oos_pred <- tibble::tibble(
tree_freq = tree_freq %>% predict_model(newdata = mtpl_tst),
rf_freq = rf_freq %>% predict_model(newdata = mtpl_tst),
gbm_freq = gbm_freq %>% predict_model(newdata = mtpl_tst),
tree_sev = tree_sev %>% predict_model(newdata = mtpl_tst),
rf_sev = rf_sev %>% predict_model(newdata = mtpl_tst),
gbm_sev = gbm_sev %>% predict_model(newdata = mtpl_tst)
)
```
These predictions are compared to the observed values in `mtpl_tst` with the Poisson/gamma deviance for frequency/severity models respectively:
```{r dev_fun}
# Poisson deviance
dev_poiss <- function(ytrue, yhat) {
-2 * mean(dpois(ytrue, yhat, log = TRUE) - dpois(ytrue, ytrue, log = TRUE), na.rm = TRUE)
}
# Gamma deviance
dev_gamma <- function(ytrue, yhat, wcase) {
-2 * mean(wcase * (log(ytrue/yhat) - (ytrue - yhat)/yhat), na.rm = TRUE)
}
```
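For reference, these functions compute (averaged over the $n$ observations)

$$D^{\text{Poisson}} = \frac{2}{n} \sum_{i=1}^{n} \left[ y_i \log\frac{y_i}{\hat{y}_i} - (y_i - \hat{y}_i) \right] \qquad \text{and} \qquad D^{\text{gamma}} = \frac{2}{n} \sum_{i=1}^{n} w_i \left[ \frac{y_i - \hat{y}_i}{\hat{y}_i} - \log\frac{y_i}{\hat{y}_i} \right],$$

with $y_i$ the observed value, $\hat{y}_i$ the model prediction and $w_i$ the case weight (the number of claims for severity). The first term of the Poisson deviance is taken to be zero when $y_i = 0$.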
The out-of-sample deviances are calculated below. These are the values for data fold 3 in Figure 10 of the paper.
```{r oos_comp}
# Calculate the Poisson deviance for the frequency models
oos_pred %>% dplyr::select(ends_with('_freq')) %>%
purrr::map(~ dev_poiss(mtpl_tst$nclaims, .x * mtpl_tst$expo))
# Calculate the gamma deviance for the severity models
oos_pred %>% dplyr::select(ends_with('_sev')) %>%
purrr::map(~ dev_gamma(mtpl_tst$average, .x, mtpl_tst$nclaims))
```
## Economic lift
After comparing the ML models for frequency and severity, we now turn to a comparison at the premium level. We calculate the predicted premiums for the test data `mtpl_tst` by multiplying the frequency and severity:
```{r prem_data}
oos_pred <- oos_pred %>% dplyr::mutate(
tree_prem = tree_freq * tree_sev,
rf_prem = rf_freq * rf_sev,
gbm_prem = gbm_freq * gbm_sev
)
```
The predicted premium totals for each model (adjusted for exposure) are calculated below. These correspond to the values in Table 5 of the paper for data fold 3. (*Note: the values for the random forest are slightly different due to an implementation update to the `distRforest` package regarding subsampling.*)
```{r prem_total}
oos_pred %>% dplyr::select(ends_with('_prem')) %>%
dplyr::summarise_all(~ sum(.x * mtpl_tst$expo))
```
We now focus on some model lift measures, which are introduced and analyzed in Sections 5.1 and 5.2 of the paper. To streamline the coding we add the observed target values from the test data `mtpl_tst` to the predictions data:
```{r lift_data}
oos_pred <- oos_pred %>% dplyr::mutate(
nclaims = mtpl_tst$nclaims,
expo = mtpl_tst$expo,
amount = mtpl_tst$amount
)
```
### Loss ratio lift
The loss ratio lift is assessed by applying the following steps:
+ sort policies from smallest to largest relativity
+ bin the policies in groups of equal exposure
+ calculate the loss ratio in each bin using the benchmark premium
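In formula form (using generic notation), the loss ratio in bin $b$ under the benchmark tariff is

$$\text{LR}_b = \frac{\sum_{i \in b} \text{amount}_i}{\sum_{i \in b} \pi^{\text{bench}}_i},$$

which is exactly what the function below computes per bin.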
```{r lrl_fun}
loss_ratio_lift <- function(data, bench, comp, ngroups) {
# Calculate relativity and sort from small to large
data %>% dplyr::mutate(r = get(paste0(comp, '_prem')) / get(paste0(bench, '_prem'))) %>%
dplyr::arrange(r) %>%
# Bin in groups of equal exposure
dplyr::mutate(bin = cut(cumsum(expo),
breaks = sum(expo) * (0:ngroups) / ngroups,
labels = FALSE)) %>%
dplyr::group_by(bin) %>%
dplyr::mutate(r_lab = paste0('[', round(min(r), 2), ',', round(max(r), 2), ']')) %>%
# Calculate loss ratio per bin
dplyr::summarise(r_lab = r_lab[1],
loss_ratio = sum(amount) / sum(get(paste0(bench, '_prem'))),
sum_expo = sum(expo))
}
```
We calculate the loss ratio lifts for the tree as benchmark and GBM as competitor and vice versa:
```{r lrl_data}
lrl_gbm_tree <- oos_pred %>% loss_ratio_lift(bench = 'tree',
comp = 'gbm',
ngroups = 5)
lrl_gbm_tree
lrl_tree_gbm <- oos_pred %>% loss_ratio_lift(bench = 'gbm',
comp = 'tree',
ngroups = 5)
lrl_tree_gbm
```
Plotting these results next to each other clearly shows that the GBM aligns premiums with the underlying risk better than the tree:
```{r lrl_plot, fig.width=8, fig.height=4}
gridExtra::grid.arrange(
lrl_gbm_tree %>% ggplot(aes(x = r_lab, y = loss_ratio)) +
geom_bar(stat = 'identity') +
ggtitle('comp: gbm / bench: tree'),
lrl_tree_gbm %>% ggplot(aes(x = r_lab, y = loss_ratio)) +
geom_bar(stat = 'identity') +
ggtitle('comp: tree / bench: gbm'),
ncol = 2
)
```
### Double lift
The double lift is assessed by applying the following steps:
+ sort policies from smallest to largest relativity
+ bin the policies in groups of equal exposure
+ calculate the average loss amount and average premiums (comp & bench) in each bin
+ calculate the percentage error of premium (comp & bench) to loss in each bin
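Explicitly, the percentage error of a tariff $\pi$ (competitor or benchmark) in bin $b$ is

$$e_b = \frac{\tfrac{1}{n_b}\sum_{i \in b} \pi_i}{\tfrac{1}{n_b}\sum_{i \in b} \text{amount}_i} - 1,$$

with $n_b$ the number of policies in the bin; the function below computes this for both the competing and the benchmark premiums.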
```{r dbl_fun}
double_lift <- function(data, bench, comp, ngroups) {
# Calculate relativity and sort from small to large
data %>% dplyr::mutate(r = get(paste0(comp, '_prem')) / get(paste0(bench, '_prem'))) %>%
dplyr::arrange(r) %>%
# Bin in groups of equal exposure
dplyr::mutate(bin = cut(cumsum(expo),
breaks = sum(expo) * (0:ngroups) / ngroups,
labels = FALSE)) %>%
dplyr::group_by(bin) %>%
dplyr::mutate(r_lab = paste0('[', round(min(r), 2), ',', round(max(r), 2), ']')) %>%
# Calculate percentage errors for both tariffs
dplyr::summarise(r_lab = r_lab[1],
error_comp = mean(get(paste0(comp, '_prem'))) / mean(amount) - 1,
error_bench = mean(get(paste0(bench, '_prem'))) / mean(amount) - 1,
sum_expo = sum(expo))
}
```
We calculate the double lift for the tree as benchmark and GBM as competitor:
```{r dbl_data}
dbl_gbm_tree <- oos_pred %>% double_lift(bench = 'tree',
comp = 'gbm',
ngroups = 5)
dbl_gbm_tree
```
Plotting these results clearly shows that the GBM aligns premiums with the underlying risk better than the tree:
```{r dbl_plot, fig.width=8, fig.height=4}
dbl_gbm_tree %>% reshape2::melt(id.vars = c('r_lab', 'bin' , 'sum_expo'),
value.name = 'perc_err',
variable.name = 'tariff') %>%
ggplot(aes(x = r_lab, y = perc_err)) +
geom_line(aes(group = tariff, colour = tariff))
```
### Gini index
The last measure for economic lift that we analyze is the Gini index obtained from an ordered Lorenz curve. The function `gini()` from the `cplm` package allows us to calculate Gini indices for competing models. The mini-max strategy selects the GBM as the model that is least vulnerable to alternative models:
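Roughly speaking, for a given benchmark tariff, policies are ordered by the relativity of a competing tariff to the benchmark; the ordered Lorenz curve then plots the cumulative share of benchmark premiums against the cumulative share of observed losses, and the Gini index is twice the area between this curve and the 45-degree line. This is only a sketch; see the paper for the precise definitions, as the exact conventions of `cplm::gini()` may differ in the details.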
```{r gini_fun, message=FALSE}
library(cplm)
gini(loss = 'amount',
score = paste0(c('tree', 'rf', 'gbm'), '_prem'),
data = as.data.frame(oos_pred))
```
We can program the mini-max strategy explicitly as follows, which gives the ranking `gbm > rf > tree`:
```{r gini_data}
gini(loss = 'amount',
score = paste0(c('tree', 'rf', 'gbm'), '_prem'),
data = as.data.frame(oos_pred)) %>%
slot('gini') %>%
as.data.frame() %>%
dplyr::mutate(max_gini = pmax(tree_prem, rf_prem, gbm_prem)) %>%
dplyr::mutate(bench = c('tree', 'rf', 'gbm')) %>%
dplyr::arrange(max_gini)
```
*Note: the values differ from those in Table 6 of the paper because the analysis in this Notebook uses only the out-of-sample observations from $\mathcal{D}_3$, while the results in the paper use the out-of-sample data from all six folds $\mathcal{D}_1$ up to $\mathcal{D}_6$. The conclusions remain the same however.*
## Conclusions
This Notebook replicates most of the results from our paper on "Boosting insights in insurance tariff plans with tree-based machine learning methods". Hopefully this helps you to jumpstart your tree-based ML analysis for insurance pricing or related applications. Happy coding!