-
Notifications
You must be signed in to change notification settings - Fork 58
/
Copy pathGLM2.rmd
1222 lines (917 loc) · 47.1 KB
/
GLM2.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Generalized Linear Models 2"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output:
tufte::tufte_html:
toc: true
number_sections: true
---
```{r, include=FALSE, echo=FALSE}
library(tufte)
```
# Setup
You can download the `R`markdown file for this content here
[https://jwiley.github.io/MonashHonoursStatistics/GLM2.rmd](https://jwiley.github.io/MonashHonoursStatistics/GLM2.rmd).
We will be using a few packages for this content and also the baseline
data only from the data collection exercise.
```{r setup}
options(digits = 3)
library(haven)
library(data.table)
library(JWileymisc)
library(ggplot2)
library(ggpubr)
library(visreg)
library(survey)
## read in data
db <- as.data.table(read_sav("B 19032020.sav")) # baseline
## create some binary variables
db[, StressHigh := as.integer(stress >= 31)]
db[, SelfesteemHigh := as.integer(selfesteem >= 15)]
## read in some example data from the internet
## (requires internet connection to run)
dcount <- fread("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
```
# Generalized Linear Models
We previously learned about linear regression and how it fit into the
Generalized Linear Models (GLM) framework (if you are not clear on
this, see:
[http://joshuawiley.com/MonashHonoursStatistics/GLM1.html](http://joshuawiley.com/MonashHonoursStatistics/GLM1.html)).
Now we are going to learn about two other special cases of GLMs:
binary logistic regression and poisson regression.
# Poisson Regression
Poisson regression is used when the outcome is a count variable. Count
variables are discrete and must be whole, 0 or positive numbers (e.g.,
0, 1, 2, 3, etc.). The poisson distribution is used most often when
the counts are relatively rare. Some examples of research use cases
where a poisson may be appropriate are:
- examining risk factors for the number of accidents someone gets into
over a 12 month period
- analysing the number of children people have
- predicting how many friends people have
- evaluating whether an intervention reduced the number of times
someone missed their medication in the last month
- testing whether the total number of health care appointments over
six months can be lowered by treating mental health
For count variables, it is common and not necessarily a problem if
most people score a 0, with relatively few people having counts of 1,
2, 3, etc.
In the simulated `dcount` data we downloaded from the internet, there
is a variable, `num_awards` which is the number of awards a student
got. We can use `testDistribution()` to examine this and get a default
comparison against a normal distribution as shown in the following
figure. The density plot is "lumpy" showing the discrete nature of the
data. No transformation would make this data approximately normally
distributed.
```{r, fig.width = 6, fig.height = 4}
plot(testDistribution(dcount$num_awards))
```
Rather than transform the data, we can change the distribution. We
again use `testDistribution()` but this time specify the distribution
as "poisson". The poinsson distribution is a discrete distribution so
a continuous density curve is not plotted. Instead, we get discrete
bars where the solid dark grey bars show the observed density of
different values and the blue bars show the density under a poisson
distribution. The deviaties plot at the bottom again shows deviations
from the assumed distribution, in this case, the poisson. Although the
densities are not a perfect match, they are relatively close and we
can see from the log likelihood values (labelled `LL`) printed in the graphs that
the poisson has a **much** higher log likelihood for the data than
does the normal distribution we tried earlier.
```{r, fig.width = 6, fig.height = 4}
plot(testDistribution(dcount$num_awards, distr = "poisson"))
```
Like the normal distribution, the poisson distribution is a family of
distributions and the specific one you will get is controlled by a
single parameter, $\lambda$, which is both the mean and the variance
of the poisson distribution. You can find out more about the poisson
distribution
[here](https://en.wikipedia.org/wiki/Poisson_distribution)
including the relevant equations for it. The figure that follows shows
the density for different counts under a poisson distribution with
varying $\lambda$ values.
```{r, echo=FALSE, fig.width = 6, fig.height = 4}
plotdat <- data.table(x = 0:20)
plotdat[, Lambda1 := dpois(x, lambda = 1)]
plotdat[, Lambda3 := dpois(x, lambda = 3)]
plotdat[, Lambda5 := dpois(x, lambda = 5)]
plotdat[, Lambda10 := dpois(x, lambda = 10)]
ggplot(melt(plotdat, id.vars = "x"), aes(x, value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
theme_pubr() +
xlab("x") +
ylab("Density of the poisson distribution") +
scale_fill_discrete("")
```
Linear regression rarely works well for count outcomes for two reasons:
- First, when an outcome can only be 0 or positive whole numbers, a
straight line often is a bad fit and certainly at extremes (e.g.,
negative values) is impossible.
- Second, unless the average number of counts is fairly high, a count
variable and its residuals will not follow anything approximating a
normal distribution: the normality assumption would be violated.
The GLM solves each of these problems separately
- Link functions transform the linear predicted value $\eta$ so that
it never goes below 0
- Rather than assume a normal distribution, the GLM can assume a
poisson distribution
- Whereas the Normal distribution had two parameters, mean and
standard deviation, the poisson distribution only has one: the
average which is the same as the variance, called $\lambda$
For poisson regression, the link function is defined as:
$$ \eta = g(\lambda) = ln(\lambda) $$
where $ln()$ is the **n**atural **l**ogarithm (or log with base e,
euler's number).
`r margin_note("Although lambda can be exactly 0,
Here $\lambda$ is the average number of counts / events. Although
counts must be at least 0 and strictly whole numbers, the average number of
counts, $\lambda$ can be any real, positive number number > 0.
Graphing $\lambda$ against $\lambda$ we get
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1a. Average counts against average counts"}
tmp <- data.frame(lambda = seq(from = .05, to = 10, by = .05))
ggplot(tmp, aes(lambda, lambda)) +
geom_point() +
theme_pubr()
```
The link function, the natural logarithm, unbounds $\lambda$ on the
left side because the log of 0 is negative infinity
The followinggraph shows the raw $\lambda$ values against the
natural log transformed values.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1b. Average counts against log transformed average counts"}
ggplot(tmp, aes(lambda, log(lambda))) +
geom_point() +
theme_pubr()
```
In this way, the link function for poisson regression takes the
average number of counts, which can only fall > 0 and transforms it to
fall between negative infinity and positive infinity
Now we have a continuous and unbounded outcome we can apply
a linear model to!
To go from predictions on the linear model back to the original
count scale, we use the inverse link function:
- $\lambda = g^{-1}(\eta) = e^{\eta}$
That is, suppose we had the linear predicted values, $\eta$ shown
below graphically
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 2a. Linear predictor against linear predictor"}
tmp2 <- data.frame(eta = seq(from = -5, to = 5, by = .1))
ggplot(tmp2, aes(eta, eta)) +
geom_point() +
theme_pubr()
```
Using the inverse link function, we transform them to again fall
> 0.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 2b. Linear predictor (log scale) against average counts"}
ggplot(tmp2, aes(eta, exp(eta))) +
geom_point() +
theme_pubr()
```
## Poisson Regression Assumptions
Much like a normal distribution was assumed for normal linear
regression, for poisson regression a poisson distribution is
assumed.
$$ y \sim Pois(\lambda) $$
In poisson regression although for relatively less common outcomes,
outliers on the left hand side may not be a concern, large positive
outliers (e.g., someone with an extreme number of awards, or who has
multiple health appointments daily, etc.) may be a concern.
Outliers on predictors may still be a concern.
Poisson regression still requires that the errors be independent
(i.e., individuals are independent, not dependent or repeated
measures).
Poisson regression also assumes that on the link scale (natural log scale)
there is a linear relationship.
Finally, poisson regression generally requires a large sample size.
There are no degrees of freedom so it requires a large enough sample
that **parameters** are distributed about normally relying on the
central limit theorem.
## Poisson Regression in `R`
To estimate a poisson regression in `R`, we can use the `glm()`
function which allows us to fit a variety of different GLMs, including
poisson regression. `glm()` works almost identically to `lm()`, with a
few caveats. The first caveat is that for anything other than a linear
model, we need to specify the `family = ` argument, which controls
which specific type of GLM is estimated. In this case we want a
poisson regression, so we use `family = poisson()`.
The second caveat is that although some functions we are used to from
working with `lm()` will carry over to working with `glm()` models,
many functions will not work or will not work the same way. Because
linear regression is the most common, there tend to be the most tools
and convenience features implemented for it. In addition, some things,
like $R^2$ variance explained that makes sense with linear regression
does not really make sense for poisson regression.
The following code fits a poisson regression to the number of awards
`num_awards` predicted by math scores, `math`. There is not a single
agreed upon way to evaluate the residuals of these models for whether
they meet the assumptions, so we may just evaluate the distribution of
the outcome to see if it closely fits a poisson distribution.
The figure suggests its fairly close, so we might proceed with a
summary of the model from `summary()`.
```{r, fig.width = 6, fig.height = 4}
m <- glm(num_awards ~ math, data = dcount, family = poisson())
plot(testDistribution(dcount$num_awards, distr = "poisson"))
summary(m)
```
The output from `summary(m)` is fairly similar to what we would see
for a linear regression in `R`.
- Deviance Residuals: these still represent model residuals, but they
are no longer raw residuals defined as $y - \hat{y}$. Deviance
residuals are how much each data point contributed to the residual
deviance, a sort of measure of overall model fit. They are analogous
to residuals in linear regression, in that when you sum and square
these residuals, you get the sums of squares used to assess model
fit. They differ in that they are not calculated as the raw
difference between each data point and the predicted value.
- Coefficients: the coefficients table is interpretted nearly
identically as in linear regression. The estimates are the
regression coefficients, the standard errors are given next. Instead
of t values, z values are presented because in general for GLMs we
cannot calculate degrees of freedom so we rely on the central limit
theory, and so instead of a t-test, we use a z-test (recall from
early stats that with an infinite sample size, the t-distribution
becomes identical to the normal distribution and thus the t-test and
z-test are the same, with an infinite sample and they are
functionally the same with a relatively large sample). P values are
in the final column, now based of the z-test instead of the t-tests
used in linear regression.
- There is no model $R^2$ as that only works clearly for linear
regression. Instead, we get null and residual deviance values and an
AIC, which stands for the Akaike Information Criterion, a relative
measure of model fit taking into account model complexity. We'll
learn more about this much later in the semester, for now you don't
need to worry about them.
- The number of Fisher scoring iterations. This is included because
unlike linear regression where we can directly calculate the best
parameter estimates. For other GLMs, the computer tries different
values and finds the best values, in a process known as
optimization. At each step of the optimization it evaluates how well
the specific parameter estimates work and based on some criteria
whether to keep trying for better estimates or stop trying. Each one
of these steps is called an "iteration" and the number of iterations
can give a rough sense of how hard it was for the computer to find
the values. Again you can mostly ignore these for now, although it
will become more important as we talk about mixed effects models.
### Poisson Regression Interpretation
At one level, we can interpret the coefficients for poisson regression
nearly identically to those for linear regression, as long as we use
the log scale. An example follows.
A poisson regression predicting the number of awards each student
received from their math scores showed that students who had a 0 math
score were expected to have
`r sprintf("%0.2f", coef(m)[["(Intercept)"]])` log awards,
`r sprintf("%s", formatPval(coef(summary(m))["(Intercept)", "Pr(>|z|)"], includeP = TRUE))`.
Each one unit higher math score was associated with
`r sprintf("%0.2f", coef(m)[["math"]])` higher log awards,
`r sprintf("%s", formatPval(coef(summary(m))["math", "Pr(>|z|)"], includeP = TRUE))`.
A graph of these results follows.
```{r, fig.width = 6, fig.height = 4}
visreg(m, xvar = "math", partial = FALSE, rug = FALSE, gg = TRUE) +
ylab("predicted log number of awards") +
theme_pubr()
```
By specifying that we are working on a log scale, we can interpret the
results as we did before and graph the results almost the same as
before, other than needing to mention the log. Although these
interpretations are technically accurate, for most people, it is hard
to understand results on a log scale.
If we exponentiate the regression coefficients, specifically the
slopes, we get what are called **Incident Rate Ratios** or IRRs.
IRRs indicate how many more times the outcome will be for a one unit
change in the predictor. For example, if the $IRR = 2$ and the base
rate was 1, then one unit higher would be $1 * 2 = 2$. If the base
rate was 2, then a one unit higher would be $2 * 2 = 4$.
IRRs are interpretted as for each one unit higher predictor score,
there are IRR times as many events of the outcome. Because IRRs are
multiplicative, an IRR of 1 means that a one unit change in the
predictor is associated with 1 times as many events on the outcome,
that is no change. Thus on the IRR scale, a coefficient of 1 is
equivalent to a coefficient of 0 on the link (log) scale.
Note that although we can exponentiate coefficients, confidence
intervals, or predictions, we should *not* exponentiate standard
errors, z values or p values. To get the IRRs in `R`, we can extract
the coefficients from the model using the `coef()` function and then
exponentiate using the `exp()` function and we follow a nearly
identical process for the confidence intervals.
```{r}
exp(coef(m))
exp(confint(m))
```
We can interpret poisson regression using the IRRs as follows.
A poisson regression predicting the number of awards each student
received from their math scores showed that students who had a 0 math
score were expected to have
`r sprintf("%0.2f", exp(coef(m)[["(Intercept)"]]))` awards,
[95% CI `r paste(sprintf("%0.2f", exp(confint(m))["(Intercept)", ]),
collapse = " - ")`].
Each one unit higher math score was associated with having
`r sprintf("%0.2f", exp(coef(m)[["math"]]))` times the number of
awards,
[95% CI `r paste(sprintf("%0.2f", exp(confint(m))["math", ]),
collapse = " - ")`],
`r sprintf("%s", formatPval(coef(summary(m))["math", "Pr(>|z|)"], includeP = TRUE))`.
A graph of these results follows.
```{r, fig.width = 6, fig.height = 4}
visreg(m, xvar = "math", scale = "response",
partial = FALSE, rug = FALSE, gg = TRUE) +
ylab("predicted number of awards") +
theme_pubr()
```
Note that when calling `visreg()`, we added the new argument, `scale =
"response"`. This tells `R` that we do not want the results on the
linear link scale, we want them on the original response scale. This
argument for `visreg()` is helpful any time you are working with non
linear link functions. The figure also shows how although on the link
scale there was a perfectly linear model prediction, the linear model
on the original response scale is clearly not linear.
Poisson regression with multiple predictors follows the same
interpretation, but just as with linear regression you need to specify
that what is being evaluated is the unique / independent association
of each predictor with the outcome, independent of other variables in
the model.
### You Try It - Poisson
`r margin_note("There is no association between prog and num_awards so
do not be surprised if the graphs etc are not that interesting.")`
Fit a poisson regression predicting the number of awards, `num_awards`
from `prog` in the `dcount` data. Try to interpret the results using
IRRs and make a graph.
```{r tryitpoisson}
## fit poisson regression
## calculate IRRs and confidence intervals on the IRR scale
## graph and visualize the results
```
# Binary Logistic Regression
Binary logistic regression is regression when the outcome only takes
on two values: 0 or 1. Logistic regression is useful for many
questions, such as:
- What predicts whether someone will have major depression or not?
- Does one treatment have a higher probability of patients *remitting*
from major depression than another treatment?
- What is the probability that a patient will be re-admitted to the
hospital within 30 days of discharge?
- What predicts whether an individual will live or die before age 60?
- If a bank gives a loan to someone, what is their probability of not
being able to pay it back?
- Do older adults have a higher probability of using CAM than younger
adults?
Linear regression will not work for binary outcomes for two reasons:
- First, when an outcome can only be 0 or 1, a straight line is a very
bad fit
- straight lines could predict a value of 1.4 or 2 or -0.3 when
those values are not possible
- Second, there is no way that a binary variable or its residuals will
follow a normal distribution: the normality assumption would be
violated
The GLM solves each of these problems separately
- Link functions transform the linear predicted value $\eta$ so that
it never goes below 0 and never goes about 1
- Rather than assume a normal distribution, the GLM uses a new
distribution, the Bernoulli distribution
- Whereas the Normal distribution had two parameters, mean and
standard deviation, the Bernoulli distribution only has one: the
average probability that an event will occur, sometimes $p$ or $\mu$
For logistic regression, the link function is defined as:
$$ \eta = g(\mu) = ln\left(\frac{\mu}{1 - \mu}\right) $$
That is called the *logit* function.
Here $\mu$ is the probability that the outcome will
be 1. Probabilities range from 0 to 1: an outcome cannot happen less
than never (probability of 0) or more than always (probability of 1).
Graphing $\mu$ against $\mu$ we get
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1a. Probabilities against Probabilities"}
tmp <- data.frame(mu = seq(from = .01, to = .99, by = .01))
ggplot(tmp, aes(mu, mu)) +
geom_point() +
theme_pubr()
```
The first part of the *logit* function:
- $\frac{\mu}{1 - \mu}$
unbounds $\mu$ on the right side, so that as it goes to 1,
transformed, it goes to infinity, shown graphically below
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1b. Probabilities against right unbounded probabilities"}
ggplot(tmp, aes(mu, mu/(1 - mu))) +
geom_point() +
theme_pubr()
```
The next part, the natural logarithm, unbounds it on the left side
because the log of 0 is negative infinity
- $ln\left(\frac{\mu}{1 - \mu}\right)$
Combined, the graph below shows the raw $\mu$ values against the
transformed values.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 1c. Probabilities against unbounded probabilities (logit scale)"}
ggplot(tmp, aes(mu, log(mu/(1 - mu)))) +
geom_point() +
theme_pubr()
```
In this way, the link function for logistic regression takes the
probability of the event occuring, which can only fall between 0 and
1, and transforms it to fall between negative infinity and positive
infinity
Now we have a continuous and unbounded outcome we can apply
a linear model to!
To go from predictions on the linear model back to the probability
scale, we use the inverse link function:
$$ \mu = g^{-1}(\eta) = \frac{1}{1 + e^{-\eta}} $$
That is, suppose we had the linear predicted values, $\eta$ shown
below graphically
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Fig 2a. Linear predictor against linear predictor"}
tmp2 <- data.frame(eta = seq(from = -5, to = 5, by = .1))
ggplot(tmp2, aes(eta, eta)) +
geom_point() +
theme_pubr()
```
Using the inverse link function, we transform them to again fall
between 0 and 1, like probabilities.
```{r, echo = FALSE, fig.width = 6, fig.height = 4, fig.cap = "Figure 2b. Linear predictor (logit scale) against probabilities"}
ggplot(tmp2, aes(eta, 1/(1 + exp(-eta)))) +
geom_point() +
theme_pubr()
```
## Logistic Regression Assumptions
Much like a normal distribution was assumed for normal linear
regression, for logistic regression a Bernoulli distribution is
assumed.
$$ y \sim Bern(\mu) $$
The Bernoulli only has the one parameter, the probability of the event
occuring.
In logistic regression the outcome is either 0 or 1, so outliers on
the outcome are not a concern
Outliers on predictors may still be a concern
One problem that sometimes arises is known as separation, this occurs
when some predictor perfectly predicts the outcome or separates the
outcome, most often when the outcome is rare or the sample size is
small.
- In a sample of 60 people where only 10 have major depression, if 80%
of the sample also have PTSD, it may happen that all cases of major
depression occurred in people with PTSD, so the percentage of
depression in those without PTSD is exactly 0%, which is
separation.
- The usual solution is to remove predictors or collapse groups in
these cases
Logistic regression still requires that the errors be independent
(i.e., individuals are independent, not dependent or repeated
measures).
Logistic regression also assumes that on the linke scale (logit scale)
there is a linear relationship.
Finally, logistic regression generally requires a large sample size.
There are no degrees of freedom, requires large enough sample that
parameters are distributed about normally (central limit theorem).
## Logistic Regression in `R` 1
For logistic regression in `R`, we will use the `StressHigh`
categorical variable we created based off stress scores in the
baseline data collection exercise data.
First, let's take a look at some basic descriptives on a few variables
we will work with. The `egltable()` function calculates basic
descriptive estimates optionally by group levels in a table.
The first argument is a list of variable names and then we give it a
dataset. The argument `strict=FALSE` is used to indicate that if a
variable has few levels, like `StressHigh` which is only 0 and 1, even
though `R` stores it as a continuous variable, it should be treated as
categorical for the table, otherwise `egltable()` will report means
instead of frequencies unless the variable is a string or factor
variable.
```{r}
egltable(
c("StressHigh", "stress", "SelfesteemHigh", "selfesteem"),
data = db, strict=FALSE)
```
Now let's look at a cross tabulation of high stress and high self
esteem, which we do by specifying a grouping variable as
`StressHigh`. The columns of the table are now levels of self
esteem. We get frequencies and percentages for `SelfesteemHigh` and means
and standard deviations for `stress` and `selfesteem` along with a
chi-squared test and t-test and effect sizes. The chi-square test is
significant, p < .001 indicating that high/low self esteem and
high/low stress are not independent. We can also see that there is
some data in all cells of the high/low self esteem and stress, although
only three people reported both high self esteem and high stress.
```{r}
egltable(
c("stress", "SelfesteemHigh", "selfesteem"),
g = "StressHigh",
data = db, strict=FALSE)
```
As with poisson regression, we can fit a logistic regression using the
`glm()` function. We use the usual regression formula approach,
specify the dataset and set `family = binomial()`, which gives us a
logistic regression.
```{r}
mlog <- glm(StressHigh ~ SelfesteemHigh, data = db, family = binomial())
summary(mlog)
```
The output from `summary(mlog)` is fairly similar to what we would see
for a linear regression in `R`.
- Deviance Residuals: these still represent model residuals, but they
are no longer raw residuals defined as $y - \hat{y}$. Deviance
residuals are how much each data point contributed to the residual
deviance, a sort of measure of overall model fit. They are analogous
to residuals in linear regression, in that when you sum and square
these residuals, you get the sums of squares used to assess model
fit. They differ in that they are not calculated as the raw
difference between each data point and the predicted value.
- Coefficients: the coefficients table is interpretted nearly
identically as in linear regression. The estimates are the
regression coefficients, the standard errors are given next. Instead
of t values, z values are presented because in general for GLMs we
cannot calculate degrees of freedom so we rely on the central limit
theory, and so instead of a t-test, we use a z-test (recall from
early stats that with an infinite sample size, the t-distribution
becomes identical to the normal distribution and thus the t-test and
z-test are the same, with an infinite sample and they are
functionally the same with a relatively large sample). P values are
in the final column, now based of the z-test instead of the t-tests
used in linear regression.
- There is no model $R^2$ as that only works clearly for linear
regression. Instead, we get null and residual deviance values and an
AIC, which stands for the Akaike Information Criterion, a relative
measure of model fit taking into account model complexity. We'll
learn more about this much later in the semester, for now you don't
need to worry about them.
- The number of Fisher scoring iterations. This is included because
unlike linear regression where we can directly calculate the best
parameter estimates. For other GLMs, the computer tries different
values and finds the best values, in a process known as
optimization. At each step of the optimization it evaluates how well
the specific parameter estimates work and based on some criteria
whether to keep trying for better estimates or stop trying. Each one
of these steps is called an "iteration" and the number of iterations
can give a rough sense of how hard it was for the computer to find
the values. Again you can mostly ignore these for now, although it
will become more important as we talk about mixed effects models.
### Logistic Regression Interpretation
At one level, we can interpret the coefficients for logistic
regression nearly identically to those for linear regression, as long
as we use the logit scale. An example follows.
A logistic regression predicting high stress from selfesteem (high /
low) showed that students who had low selfesteem were expected to have
`r sprintf("%0.2f", coef(mlog)[["(Intercept)"]])` log odds of being
high stress,
`r sprintf("%s", formatPval(coef(summary(mlog))["(Intercept)", "Pr(>|z|)"], includeP = TRUE))`.
Being in the high versus low selfesteem group was associated with
`r sprintf("%0.2f", coef(mlog)[["SelfesteemHigh"]])` lower log odds
of being high stress,
`r sprintf("%s", formatPval(coef(summary(mlog))["SelfesteemHigh", "Pr(>|z|)"], includeP = TRUE))`.
A graph of these results follows.
```{r, fig.width = 6, fig.height = 4}
visreg(mlog, xvar = "SelfesteemHigh", partial = FALSE, rug = FALSE, gg = TRUE) +
ylab("predicted log odds of high stress") +
theme_pubr()
```
By specifying that we are working on the log odds (logit) scale, we
can interpret the results as we did before and graph the results
almost the same as before, other than needing to mention the log
odds. Although these interpretations are technically accurate, for
most people, it is hard to understand results on a log odds scale.
If we exponentiate the regression coefficients, specifically the
slopes, we get what are called **Odds Ratios** or ORs.
ORs indicate how many more times the odds of occuring the outcome will
be for a one unit change in the predictor. For example, if the $OR = 2$ and the base
odds was 0.5, then one unit higher would be $0.5 * 2 = 1$ odds. If the base
odds was 2, then a one unit higher would be $2 * 2 = 4$ odds.
ORs are interpretted as for each one unit higher predictor score,
there are OR times the odds of the outcome. Because ORs are
multiplicative, an OR of 1 means that a one unit change in the
predictor is associated with 1 times the odds of the outcome,
that is no change. Thus on the OR scale, a coefficient of 1 is
equivalent to a coefficient of 0 on the link (log odds) scale.
Note that although we can exponentiate coefficients, confidence
intervals, or predictions, we should *not* exponentiate standard
errors, z values or p values. To get the ORs in `R`, we can extract
the coefficients from the model using the `coef()` function and then
exponentiate using the `exp()` function and we follow a nearly
identical process for the confidence intervals.
```{r}
exp(coef(mlog))
exp(confint(mlog))
```
We can interpret logistic regression using the ORs as follows.
A logistic regression predicting high stress from selfesteem (high /
low) showed that students who had low selfesteem were expected to have
`r sprintf("%0.2f", exp(coef(mlog)[["(Intercept)"]]))` odds of being
high stress,
[95% CI `r paste(sprintf("%0.2f", exp(confint(mlog))["(Intercept)", ]),
collapse = " - ")`].
Being in the high vs low self esteem group was associated with
`r sprintf("%0.2f", exp(coef(mlog)[["SelfesteemHigh"]]))` times the
odds of being high stress,
[95% CI `r paste(sprintf("%0.2f", exp(confint(mlog))["SelfesteemHigh", ]),
collapse = " - ")`],
`r sprintf("%s", formatPval(coef(summary(mlog))["SelfesteemHigh", "Pr(>|z|)"], includeP = TRUE))`.
A graph of these results follows.
```{r, fig.width = 6, fig.height = 4}
visreg(mlog, xvar = "SelfesteemHigh", scale = "response",
partial = FALSE, rug = FALSE, gg = TRUE) +
ylab("predicted probability of being high stress") +
theme_pubr()
```
Note that when calling `visreg()`, we added the new argument, `scale =
"response"`. This tells `R` that we do not want the results on the
linear link scale, we want them on the original response scale. This
argument for `visreg()` is helpful any time you are working with non
linear link functions. The figure also shows how although on the link
scale there was a perfectly linear model prediction, the linear model
on the original response scale is clearly not linear.
Logistic regression with multiple predictors follows the same
interpretation, but just as with linear regression you need to specify
that what is being evaluated is the unique / independent association
of each predictor with the outcome, independent of other variables in
the model.
In addition to the odds ratio (OR) scale, with a simple model, we can
easily convert from the logit scale to the probability scale, which
may be even easier to interpret.
People with lower self esteem are the reference group, captured by the intercept
$$ \frac{1}{1 + e^{- (`r coef(mlog)[1]`)}} = `r as.vector(plogis(coef(mlog)[1]))` $$
People with lower self esteem have a
`r as.vector(plogis(coef(mlog)[1]))` probability of being high stress
The coefficient for self esteem, indicates that people with high self
esteem have `r coef(mlog)[2]` higher log odds (the logit scale) of
being high stress than do people with low self esteem.
To calculate the probability that people with high self esteem will be
high stress:
$$ \frac{1}{1 + e^{- (`r coef(mlog)[1]` + `r coef(mlog)[2]`)}} = `r as.vector(plogis(sum(coef(mlog))))` $$
People with high self esteem have a
`r as.vector(plogis(sum(coef(mlog))))` probability of being high
stress
Stated on the the odds scale.
- People with high self esteem have
$e^{`r coef(mlog)[2]`} = `r exp(coef(mlog)[2])`$
times the odds of being high stress as do people with low self
esteem
- This is called the odds ratio (OR)
An example summary of this analysis is:
"People with high self esteem were less likely to be high stress than
people with low self esteem, (OR = `r exp(coef(mlog)[2])`, p
< .001)."
Another way of saying the same thing:
"People with high self esteem had `r exp(coef(mlog)[2])` times the
odds of being high stress as people with low self eseteem (p < .001)."
## Logistic Regression in `R` 2
Next we examine a continuous predictor, continuous self esteem.
The analysis in `R` is basically the same as the previous example:
```{r}
mlog2 <- glm(StressHigh ~ selfesteem, data = db, family = binomial())
summary(mlog2)
```
- The coefficient for self esteem indicates that for each one unit higher
self esteem, people have `r coef(mlog2)[2]` lower log
odds of being high stress
- In odds ratios:
- For each additional unit of self esteem, participants have
`r exp(coef(mlog2)[2])` times the odds of being high stress (p <
.001).
- One challenge with the odds ratio metric is that it is multiplicative
- If the baseline odds are 2, then 1.2 times the odds would be
2.4 (an absolute difference of 0.4)
- If the baseline odds are 4, then 1.2 times the odds would be 4.8
(an absolute difference of 0.8)
- Important to keep this interpretation in mind, odds ratios do not
directly indicate the probability of the event occuring, just how
many times higher the odds
We can calculate the specific probability of being high stress as we
did for categorical self esteem
Probability of being high stress for participants with 0 self esteem
$$\frac{1}{1 + e^{- (`r coef(mlog2)[1]`)}} = `r as.vector(plogis(coef(mlog2)[1]))`$$
Participants with 0 self esteem have a
`r as.vector(plogis(coef(mlog2)[1]))` probability of being high stress.
To calculate the probability of being high stress for participants with a 20 score on
self esteem, we first calculate the log odds:
$$`r coef(mlog2)[1]` + `r coef(mlog2)[2]` * 20 = `r coef(mlog2)[1] + 20 * coef(mlog2)[2]`$$
Then we convert the log odds to probabilities:
$$\frac{1}{1 + e^{- (`r coef(mlog2)[1] + 20 * coef(mlog2)[2]`)}} = `r as.vector(plogis(coef(mlog2)[1] + 20 * coef(mlog2)[2]))`$$
Participants with a score of 20 on self esteem have a
`r as.vector(plogis(coef(mlog2)[1] + 20 * coef(mlog2)[2]))`
probability of being high stress. This matches what we can see
visually when we use `visreg()` to plot on the probability scale.
```{r, fig.width = 6, fig.height = 4, fig.cap = "Probability of being high stress by self esteem level with 95% confidence intervals"}
visreg(mlog2, xvar = "selfesteem", scale = "response",
partial = FALSE, rug = FALSE, gg = TRUE) +
ylab("predicted probability of being high stress") +
theme_pubr()
```
Although logistic regression models do not have $R^2$ values, there
are some fit indices. For example, we could calculate the predicted
probabilities for each person and if they are >= .50, classify them as
the model predicting high stress and if < .50 classify them as the
model predicting low stress. Then we can compare the model predictions
to the observed data to calculate the model accuracy.
```{r}
accdata <- data.table(
Observed = db$StressHigh,
Predicted = as.integer(predict(mlog2, type = "response") >= .50))
## quick view of the data
head(accdata)
## simple frequencies table
xtabs(~ Observed + Predicted, data = accdata)
## accuracy is the average number of times predictions match observed
mean(accdata$Observed == accdata$Predicted)
```
### Average Marginal Effects
Because the probability scale is easier to interpret than is even the
odds ratio scale, people often desire to convert results from logistic
regression into probabilities. A challenge on the probability scale is
that the results are not linear. For example if looking at the graph
of continuous self esteem predicting the probability of being high
stress, the probability changes much less when moving from a self
esteem score of 4 to 5 then from 10 to 11.
We can see this in practice by using the `predict()` function to
generate some predicted probabilities. We make a small new dataset
with some particular self esteem values (4, 5, 10, 11) and specify the
argument `type = "response"` to get probabilities rather than
predicted log odds.
```{r, results = "asis"}
newdata <- data.table(
selfesteem = c(4, 5, 10, 11))
newdata$Prob <- predict(mlog2,
newdata = newdata,
type = "response")
knitr::kable(newdata)
```
The probabilities show that the predicted probability of being high
stress changes very little when moving from 4 to 5 self esteem but
quite a bit from 10 to 11. This illustrates a challenge: if you want
to talk about how much the probability of being high stress changes
for a one unit change in self esteem, it matters where the self esteem
scores started.
One principled approach to solving this problem is called the **average
marginal effect**. There are two parts to this. A **marginal** effect
is the instantaneous effect of change at a particular
point. Technically, it is the derivative of the slope for those that
helps. Because each participant has their own actual self esteem, each
participant also will have their own marginal effect, their own
instaneous rate of change in the probability of being high stress for
a change in self esteem. The **average** part of average marginal
effect is that we can average the marginal effects across all
participants in the sample to get an average marginal effect, and this
number represents the average change in probability that you would
expect if you took a population that looked like our sample and could
increase everyone's self esteem by one unit.
Calculating the average marginal effect is fairly easy.
1. Using the original dataset, generate predicted probabilities for
everyone
2. Add a small quantity, $h$ to every self esteem score in the dataset