---
title: "Power and Sample Size Analysis"
author: "Clay Ford, Statistical Research Consultant, UVA Library"
output: html_notebook
editor_options:
  chunk_output_type: inline
---
## Quick Intro to R Notebooks and R Markdown
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. This file was created in RStudio by going to File...New File...R Notebook.
Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter* (Win/Linux) or *Cmd+Shift+Return* (Mac).
```{r}
plot(cars)
```
To hide the output, click the Expand/Collapse output button. To clear results (or an error), click the "x".
You can also press *Ctrl+Enter* (Win/Linux) or *Cmd+Return* (Mac) to run one line of code at a time (instead of the entire chunk).
Add a new R code chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Ctrl+Alt+I* (Win/Linux) or *Cmd+Option+I* (Mac).
## CODE ALONG 0
Insert a code chunk and run the code `sample(x = 1:6, size = 1)`
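If you're not sure what to expect, the inserted chunk should look something like this (the code randomly returns one number between 1 and 6, like rolling a die):
```{r}
sample(x = 1:6, size = 1)
```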
## Workshop agenda
- What is power and how does it relate to sample size?
- When and why to calculate power and sample size for a study
- How to do it for various statistical tests in R
- Intro to using simulation for estimating power and sample size
- Sample size estimation for multivariable prediction models
## Hello, my name is...
- I suspect people place sticky-back name tags on the left side of their chest about _75%_ of the time (probably because most people are right-handed). I create an experiment to test this _hypothesis_.
- Experiment: randomly sample _n_ people and calculate the proportion _p_ of people who place a name tag on the left.
- Analysis: conduct a _one-sample proportion test_ to see if the sample proportion is significantly greater than random chance (50%).
- Decision: reject the _null hypothesis_ of random chance if the p-value of the one-sample proportion test is below _0.05_.
Two possible questions to consider before starting the study:
1. What should our _sample size_ be if we want high probability of detecting the hypothesis assuming it's true?
2. Do we have high enough probability, or _power_, of detecting the hypothesis if our sample size is already determined?
## What is Power?
Power is a _probability_, so it lies between 0 and 1. Power is the probability of correctly rejecting a false null hypothesis.
Returning to our example, we have two possibilities, or _hypotheses_:
1. Percentage of people who place name tags on the left side of their chest is random (50%), the _null hypothesis_
2. Percentage of people who place name tags on the left side of their chest is greater than 50%, the _alternative hypothesis_
Let's assume the alternative hypothesis is _correct_.
Further, let's assume at least 75% of people place name tags on the left side of their chest. This means we estimate an _effect size_ of 0.75 - 0.50 = 0.25.
If our assumptions are true we would like our hypothesis test to have a _high probability_, or _high power_, of returning the correct result (ie, a p-value below 0.05). To ensure high power we need a sufficiently large sample size. High power is usually considered to be at least _0.80_, if not higher.
One approach is to sample as many people as possible. But that can be expensive and time-consuming. So we usually want to _find the smallest sample size that provides the power we desire_.
## Power and sample size analysis for our experiment
Let's go ahead and calculate the sample size for our simple experiment. For this we'll use the {pwr} package. The `pwr.p.test()` function can be used to calculate power or sample size for a one-sample proportion test.
Notice we first have to calculate a _transformed effect size_ using the `ES.h()` function. The `p1` argument is the alternative proportion (0.75), the `p2` argument is the null proportion (0.50). The `h` notation is due to Cohen (1988). It represents effect sizes for proportions.
The reason we have to transform the effect size for proportions is to distinguish between _absolute_ and _relative_ differences.
For example, the _absolute differences_ in these proportions are equivalent:
0.75 - 0.50 = 0.25
0.26 - 0.01 = 0.25
But notice the _relative differences_ are very different:
0.75/0.50 = 1.5 (ie, 0.75 is 1.5 times as large as 0.50)
0.26/0.01 = 26 (ie, 0.26 is 26 times as large as 0.01)
The `ES.h()` function transforms the absolute difference into an effect size that takes the relative difference into account.
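A quick illustration: the two pairs of proportions above have the same absolute difference but give quite different values of `h`. (`ES.h()` computes Cohen's h, the difference of arcsine-square-root transformed proportions.)
```{r}
ES.h(p1 = 0.75, p2 = 0.50)   # about 0.52
ES.h(p1 = 0.26, p2 = 0.01)   # about 0.87, a larger effect
# ES.h() is the difference of arcsine-square-root transformed proportions
2*asin(sqrt(0.75)) - 2*asin(sqrt(0.50))
```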
Since we want sample size, we _leave the `n` argument out of the function call_. We specify our desired power (0.80); our significance level (0.05), the threshold the p-value must fall below for us to "reject" the null; and the alternative ("greater"), which says we think the true proportion is _greater than the null value_.
```{r}
library(pwr) # do this once per session
h <- ES.h(p1 = 0.75, p2 = 0.50)
pwr.p.test(h = h, sig.level = 0.05, power = 0.80,
alternative = "greater")
```
Our answer is the line that begins with "n". The minimal sample size needed for 80% power of rejecting the null at the 0.05 level is 23 (always round up). So _if our alternative hypothesis is correct_, we should aim to sample _at least_ 23 people, if not more.
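If you want the rounded-up n as a number for later use, you can extract it from the object returned by `pwr.p.test()`; a small sketch:
```{r}
# the returned object stores the (non-integer) sample size in element n
n_est <- pwr.p.test(h = h, sig.level = 0.05, power = 0.80,
                    alternative = "greater")$n
ceiling(n_est)  # round up to the next whole person
```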
## IMPORTANT POINTS
- We haven't yet run the experiment. We're still in the planning stages.
- We don't know if our hypothesized effect size is correct (0.75 - 0.50 = 0.25). We're just pretending it is. Hopefully we have good reason for proposing it.
- A sample size of 23 does NOT guarantee we will "get a significant result"! It simply provides probability of 0.80 of rejecting the null based on our assumption that the effect size is 0.75 - 0.50 = 0.25.
- We have probability of 0.2 of incorrectly failing to reject the null (1 - 0.80), again based on our assumption that the effect size is 0.75 - 0.50 = 0.25.
- Accept and embrace the uncertainty in this power and sample size "analysis". It's a thought experiment.
## Analyzing the data
The analysis of the data is not the topic of the workshop, but I thought it might help to see an example analysis for this toy example.
Let's say we randomly sample 23 people and observe 18 put their name tag on the left side of their chest. Here's how we analyze the data using the `prop.test()` function.
```{r}
prop.test(x = 18, n = 23, p = 0.5, alternative = "greater")
```
## CODE ALONG 1
Add new R code chunks by pressing *Ctrl+Alt+I* (Win/Linux) or *Cmd+Option+I* (Mac).
(1) How large a sample do we need to achieve 0.90 power with a significance level of 0.01?
(2) What is the power of our experiment if we sample 30 people with a significance level of 0.01?
(3) What is the power of our experiment if we sample 50 people with a significance level of 0.01 and an assumed effect size of 65% - 50% = 15%?
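One possible way to answer these, reusing `pwr.p.test()` and the `h` object defined above (try the questions yourself first):
```{r}
# (1) sample size for 0.90 power at a 0.01 significance level
pwr.p.test(h = h, sig.level = 0.01, power = 0.90, alternative = "greater")

# (2) power with n = 30 at a 0.01 significance level
pwr.p.test(h = h, n = 30, sig.level = 0.01, alternative = "greater")

# (3) power with n = 50, sig.level 0.01, and a smaller effect (0.65 vs 0.50)
pwr.p.test(h = ES.h(p1 = 0.65, p2 = 0.50), n = 50, sig.level = 0.01,
           alternative = "greater")
```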
## Remember
- Detecting smaller effects requires larger sample sizes
- Using lower significance levels requires larger sample sizes
- Higher power requires larger sample sizes
## Type I and Type II errors
If we conclude people prefer to put name tags on the left side of their chest when they actually don't, we have made a _Type I error_. (Rejecting the null hypothesis in error.)
Our tolerance for a Type I error is the _significance level_. Usually 0.05, 0.01 or lower.
If we conclude people have no preference for which side of the chest they place their name tag when they really do prefer the left, we have made a _Type II error_. (Failing to reject the null hypothesis in error.)
Our tolerance for a Type II error is 1 - Power.
We rarely know if we have made these errors.
## Picking an effect size
Hypothesizing realistic and meaningful effect sizes is the most important part of power and sample size analysis. This requires subject matter expertise. A rule of thumb is to pick the _smallest effect you don't want to miss_. Or stated another way, pick the smallest effect that would be worth reporting or that would "make news".
Cohen defines "effect size" as "the degree to which the null hypothesis is false." Example: If our null is 0.50, and the alternative 0.75, the absolute effect size is 0.25.
## Two-sample proportion test
Let's say I randomly survey male and female UVA undergrad students and ask them if they consume alcohol at least once a week.
- null hypothesis: no difference in the proportion that answer yes.
- alternative hypothesis: there is a difference
This is a "two-sided" alternative: one sex has a higher proportion, but I don't know which.
I'd like to detect a difference as small as 5%. That's my effect size.
How many students do I need to sample in _each group_ if we want 0.90 power and a significance level of 0.05?
We can either use the `pwr.2p.test()` function from the {pwr} package or use the base R `power.prop.test()` function. Notice we have to specify proportions that differ by 5%, and that the required sample size _changes dramatically_ based on where these two proportions fall in the range from 0 to 1.
```{r}
pwr.2p.test(h = ES.h(p1 = 0.20, p2 = 0.25), sig.level = 0.05, power = 0.90)
```
Notice that n is the number in each group!
The base R function does not require us to use a separate effect size function.
```{r}
power.prop.test(p1 = 0.20, p2 = 0.25, sig.level = 0.05, power = 0.90)
```
When we consider proportions closer to 0 or 1, the sample size required decreases because the relative difference in proportions is bigger.
```{r}
pwr.2p.test(h = ES.h(p1 = 0.05, p2 = 0.10), sig.level = 0.05, power = 0.90)
```
```{r}
power.prop.test(p1 = 0.05, p2 = 0.10, sig.level = 0.05, power = 0.90)
```
## CODE ALONG 2
Add new R code chunks by pressing *Ctrl+Alt+I* (Win/Linux) or *Cmd+Option+I* (Mac).
What sample size do we need for each group if we assume the proportions are 0.10 and 0.15? Again assume we desire power of 0.90 and a significance level of 0.05.
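One possible solution sketch using either function (remember, n is per group):
```{r}
pwr.2p.test(h = ES.h(p1 = 0.10, p2 = 0.15), sig.level = 0.05, power = 0.90)
# or with base R
power.prop.test(p1 = 0.10, p2 = 0.15, sig.level = 0.05, power = 0.90)
```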
## Two-sample t-test
I'm interested in learning if there is a difference in the mean price of what male and female students pay at the library coffee shop. Let's say I randomly observe 30 male and 30 female students check out from the coffee shop and note their total purchase price. How powerful is this experiment if I want to detect a difference of 75 cents in either direction?
The base R `power.t.test()` function makes this fairly simple to answer. The effect size (75 cents) is specified as `delta`. We also need to specify a _hypothesized population standard deviation_ as `sd`. Imagine we have access to the population of all purchase prices ever observed and we calculate the standard deviation of those values. We'd like to do that, but can't. So we need to _estimate_ the standard deviation of that population. One rule of thumb to use if we have no subject matter expertise to guide us:
> take the difference between the maximum and minimum possible values and divide by 4 (or 6).
The rationale behind this rule of thumb is that the range (max - min) of a lot of data is usually equal to about 4SD or 6SD. So 1/4 or 1/6 of the range provides an approximation for the standard deviation. The 1/4 estimate is larger and therefore more conservative. This provides a little extra assurance that we’ll either estimate a big enough sample size or make a more modest estimate of power.
Let's say the maximum purchase price is `$10` and minimum price is `$1`. So our conservative guess at a standard deviation is:
(10 - 1)/4 = 2.25
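If you use this rule of thumb often, a small helper function (purely illustrative, not from any package) keeps the guess explicit:
```{r}
# rough SD guess from an assumed range; divisor = 4 is the more conservative choice
sd_guess <- function(min, max, divisor = 4) (max - min)/divisor
sd_guess(min = 1, max = 10)               # 2.25
sd_guess(min = 1, max = 10, divisor = 6)  # 1.5
```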
```{r}
power.t.test(delta = 0.75, sd = 2.25, sig.level = 0.05, n = 30)
```
This experiment is not very powerful. If 75 cents really is the difference in the means, there is only about a probability of 0.25 we'll reject the null hypothesis with a sample size of 30 per group.
How many should we sample for a power of 0.90?
```{r}
power.t.test(delta = 0.75, sd = 2.25, sig.level = 0.05, power = 0.90)
```
The {pwr} package provides the `pwr.t.test()` function. It's a little more difficult to use because we have to specify the effect size using the `d` argument. It's simply delta divided by the standard deviation.
```{r}
pwr.t.test(d = 0.75/2.25, sig.level = 0.05, power = 0.90)
```
## CODE ALONG 3
Add new R code chunks by pressing *Ctrl+Alt+I* (Win/Linux) or *Cmd+Option+I* (Mac).
How many students should we sample for a power of 0.99 and a significance level of 0.001?
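One possible solution sketch, keeping the same delta (0.75) and sd (2.25) assumptions:
```{r}
power.t.test(delta = 0.75, sd = 2.25, sig.level = 0.001, power = 0.99)
# or with {pwr}
pwr.t.test(d = 0.75/2.25, sig.level = 0.001, power = 0.99)
```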
## Paired t-test
24 high school boys are put on an ultraheavy rope-jumping program. Does this decrease their 40-yard dash time (ie, make them faster)? We'll measure their 40-yard dash time _before_ the program and _after_. We'll use a _paired t-test_ to see if the _difference_ in times is less than 0 (after - before). These are not independent groups, so we can't use a 2-sample t-test. This also means we need to make an assumption about the _standard deviation of the differences_. Let's say it's about 0.25 seconds based on our rule of thumb above: (1 - 0)/4 = 0.25, where 1 means a full second faster and 0 means no change. How powerful is the test to detect a difference of about -0.1 seconds with 0.05 significance?
First we demonstrate using `power.t.test()`. We need to specify `type = "paired"` and `alternative = "one.sided"`, and provide the _absolute value_ of the effect size. Notice the estimated standard deviation is on the differences, not the populations.
```{r}
power.t.test(n = 24, delta = 0.1, sd = 0.25, sig.level = 0.05,
type = "paired", alternative = "one.sided")
```
Using the `pwr.t.test()` function we have to specify `type = "paired"` and either `alternative = "greater"` or `alternative = "less"`, depending on how we state the effect size. Since we're interested in a _decrease_ in time, we can provide a negative `d` and specify `alternative = "less"`.
```{r}
pwr.t.test(n = 24, d = -0.1/0.25, sig.level = 0.05,
type = "paired", alternative = "less")
```
## CODE ALONG 4
Add new R code chunks by pressing *Ctrl+Alt+I* (Win/Linux) or *Cmd+Option+I* (Mac).
How many pairs do we need to sample to achieve at least 0.95 power with a significance level of 0.01, assuming our hypothesized effect size is correct?
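One possible solution sketch, keeping the hypothesized effect (-0.1 seconds) and standard deviation of differences (0.25):
```{r}
power.t.test(delta = 0.1, sd = 0.25, sig.level = 0.01, power = 0.95,
             type = "paired", alternative = "one.sided")
# or with {pwr}, using a negative d and alternative = "less"
pwr.t.test(d = -0.1/0.25, sig.level = 0.01, power = 0.95,
           type = "paired", alternative = "less")
```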
## One-way ANOVA
To test if more than two means are different we can use a one-way _ANOVA_. The null hypothesis is none of the means are different. The alternative is at least one is different. By itself the ANOVA is not a very interesting statistical test. Rejecting the null should be followed by comparisons to see where and how big the differences are.
I'm a web developer and I'm interested in 3 web site designs for a client. I'd like to know which design helps users find information fastest. I design an experiment where I have 3 groups of randomly selected people use one of the designs to find some piece of information and I record how long it takes. (All groups look for the same information.) How many people do I need in each group if I believe two of the designs will take 30 seconds and one will take 25 seconds? Assume population standard deviation is 5 seconds and that I desire power and significance levels of 0.90 and 0.05, respectively.
The base R `power.anova.test()` function requires you to specify...
- the number of `groups` you plan to compare
- the hypothesized between-group variance (`between.var`)
- the hypothesized within-group variance (`within.var`), which we assume is the same for all groups. (pretty strong assumption!)
Notice we need to specify _variance_, not standard deviation. We use the `var()` function to calculate the between-group variance and square 5 to get the within-group variance.
```{r}
power.anova.test(groups = 3, between.var = var(c(30,30,25)),
within.var = 5^2, sig.level = 0.05, power = 0.90)
```
The {pwr} package provides the `pwr.anova.test()` function, which requires you to once again specify an effect size, this time using the `f` argument. The effect size `f` is the _standard deviation of the group means divided by the common standard deviation of the populations_, where the standard deviation of the means uses a denominator of `k` instead of `k - 1`. Since R's `sd()` function uses a denominator of `k - 1`, to replicate the results of `power.anova.test()` we multiply the `sd()` of the group means by `sqrt((k-1)/k)`.
```{r}
sd_means <- sd(c(30,30,25))
sd_pop <- 5
pwr.anova.test(k = 3, f = sd_means*sqrt(2/3)/sd_pop,
sig.level = 0.05, power = 0.90)
```
While the `pwr.anova.test()` function is a little more difficult to use because of the `f` argument, we might want to use it for the {pwr} package's plot method. Simply save the result and use it with `plot()`.
```{r}
pout <- pwr.anova.test(k = 3, f = sd_means*sqrt(2/3)/sd_pop,
sig.level = 0.05, power = 0.90)
plot(pout)
```
We can see that once the group size is larger than 30, there is little reason to make the sample bigger. AGAIN, assuming our estimated effect sizes are true.
## CODE ALONG 5
Add new R code chunks by pressing *Ctrl+Alt+I* (Win/Linux) or *Cmd+Option+I* (Mac).
How many people do I need in each group if I believe two of the designs will take 60 seconds and one will take 45 seconds? Assume population standard deviation is 10 seconds and that I desire power and significance levels of 0.95 and 0.01, respectively.
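One possible solution sketch using `power.anova.test()`:
```{r}
power.anova.test(groups = 3, between.var = var(c(60, 60, 45)),
                 within.var = 10^2, sig.level = 0.01, power = 0.95)
```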
## Conventional effect sizes
Cohen proposed conventional effect sizes for each test called "small", "medium", and "large". _These are controversial because they're vague_. However, they may be worth calculating to give you a ballpark range of sample sizes.
The `cohen.ES` function returns an effect size for a given `test` and `size`.
- test: "p", "t", "r", "anov", "chisq", "f2"
- size: "small", "medium", "large"
For example, "small" effect size for a one-way ANOVA test.
```{r}
cohen.ES(test = "anov", size = "small")
```
Using the effect size with our previous example:
```{r}
pwr.anova.test(k = 3, f = 0.1,
sig.level = 0.05, power = 0.90)
```
To detect a "small" effect in a one-way ANOVA with 3 groups, 0.90 power, and 0.05 significance level, we need about 423 subjects per group. This is a ballpark figure. It's meant to give us an idea of the sample size we need.
## Using simulation to estimate power and sample size
Research questions often require analyses more sophisticated than t-tests and one-way ANOVAs, and these more complicated analyses usually don't have easy power and sample size formulas. In that case it can be useful to simulate data with the effects you want to detect and see how often you actually detect them in an analysis. Basically, we generate the data and run the analysis many, many times and calculate the proportion of times we reject the null hypothesis.
This requires some knowledge of programming as well as statistics. Here's a quick example for a two-sample t-test. Recall the previous code for estimating power for two groups of size 30:
```{r}
power.t.test(delta = 0.75, sd = 2.25, sig.level = 0.05, n = 30)
```
We can obtain a similar result using simulation. Let's simulate one data set and one analysis. The delta is the difference in means (ie, 5.75 - 5 = 0.75). We can use any means, as long as they differ by 0.75. Notice also the constant standard deviation for the two populations.
```{r}
x1 <- rnorm(n = 30, mean = 5.00, sd = 2.25)
x2 <- rnorm(n = 30, mean = 5.75, sd = 2.25)
t.test(x1, x2, var.equal = TRUE)
# is it less than significance level?
t.test(x1, x2, var.equal = TRUE)$p.value < 0.05
```
Now we repeat this 1000 times and see what proportion of times we get a p-value below our significance level. We can do this with the `replicate()` function in base R. Notice we simply wrap our previous code in curly braces `{}` to create an _expression_. The result is a vector of TRUE/FALSE values. The mean of a TRUE/FALSE vector is the proportion of TRUEs.
```{r}
result <- replicate(n = 1000, expr = {
x1 <- rnorm(n = 30, mean = 5.00, sd = 2.25)
x2 <- rnorm(n = 30, mean = 5.75, sd = 2.25)
t.test(x1, x2, var.equal = TRUE)$p.value < 0.05})
mean(result)
```
Your result will differ slightly from mine due to sampling variation, but the results will be similar to what we calculated with the `power.t.test()` function.
The power is quite low. We may want to try different sample sizes. With a little extra work we can program that as well. One way is to create a function out of the previous chunk of code and allow the sample size to be an argument.
```{r}
# n = sample size
sim_ttest <- function(n){
result <- replicate(n = 1000, expr =
{x1 <- rnorm(n = n, mean = 5.00, sd = 2.25)
x2 <- rnorm(n = n, mean = 5.75, sd = 2.25)
t.test(x1, x2, var.equal = TRUE)$p.value < 0.05})
mean(result)
}
```
Test the function with n = 50
```{r}
sim_ttest(n = 50)
```
Now "apply" the function to multiple sample sizes. The "s" in "sapply" means to "simplify" the resulting data structure (if possible). In this case it simplifies it to a vector.
```{r}
sapply(c(50, 75, 100, 150), sim_ttest)
```
We could modify our `sim_ttest()` function to take different standard deviation values, different p-value cutoffs, and different means.
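For example, a more general version might look something like this (a sketch; the argument names `delta`, `sd`, `alpha`, and `reps` are just illustrative choices):
```{r}
sim_ttest2 <- function(n, delta = 0.75, sd = 2.25, alpha = 0.05, reps = 1000){
  result <- replicate(n = reps, expr = {
    # two groups whose means differ by delta, with a common standard deviation
    x1 <- rnorm(n = n, mean = 5, sd = sd)
    x2 <- rnorm(n = n, mean = 5 + delta, sd = sd)
    t.test(x1, x2, var.equal = TRUE)$p.value < alpha})
  mean(result)
}
# roughly reproduces the earlier result for n = 30 per group
sim_ttest2(n = 30)
```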
## Simulate a 2-way ANOVA
Let's say we want to investigate how different _weights_ of paper and different _designs_ of paper airplane affect the distance flown in inches. Perhaps we have two designs and two weights of paper. That's 2 x 2 = 4 combinations of design and weight. What power do we have to detect a difference of 12 inches for design "b" with a significance level of 0.01, assuming a sample size of 10 per group and a common standard deviation of 10 inches within the four groups?
To begin, let's just simulate one data set and one analysis.
```{r}
n <- 10 # per group
design <- rep(c("a","b"), each = n*2)
weight <- rep(c("20","28"), times = n*2)
table(design, weight)
```
Now let's simulate our response (distance flown) according to the following effects:
- design a, 20 lb paper: 72 inches of flight
- design b, 20 lb paper: 72 + 12
- design a, 28 lb paper: 72 + -4
- design b, 28 lb paper: 72 + 12 + -4
These are known as _additive_ effects. The effect of design b is to ADD 12 inches to expected flight distance. The effect of 28 lb paper is to ADD -4 inches to expected flight distance.
The `rnorm()` function at the end adds "noise" to the distance and is analogous to setting the standard deviation for each design/weight combination to 10 (ie, constant variance in each group).
```{r}
dist <- 72 +
(design == "b")*12 +
(weight == "28")*-4 +
rnorm(n*4, mean = 0, sd = 10)
```
And finally run the analysis using the `lm()` function.
```{r}
m <- lm(dist ~ design + weight)
summary(m)
```
We can extract the hypothesis test for designb and check to see if it's less than 0.01 as follows:
```{r}
coef(summary(m))['designb','Pr(>|t|)'] < 0.01
```
Now repeat 1000 times:
```{r}
results <- replicate(1000, expr = {
dist <- 72 +
(design == "b")*12 +
(weight == "28")*-4 +
rnorm(n*4, mean = 0, sd = 10)
m <- lm(dist ~ design + weight)
coef(summary(m))['designb','Pr(>|t|)'] < 0.01
})
mean(results)
```
That's fairly high power. _If our assumptions are correct_, 10 observations in each of the four design/weight combinations provides a high probability of correctly rejecting the null of no design effect at a significance level of 0.01.
## Simulate a 2-way ANOVA with an interaction
What if we wanted to detect an _interaction_ between design and weight? Interactions imply the effects are NOT additive. The effect of design depends on weight and vice versa.
Let's say it would be of interest if the interaction was 6 inches at a significance level of 0.01. For example, the effect of design b is 12 inches for 20 lb paper, but 12 + 6 = 18 inches for 28 lb paper.
We might simulate our response according to the following:
- design a, 20 lb paper: 72
- design b, 20 lb paper: 72 + 12
- design a, 28 lb paper: 72 + -4
- design b, 28 lb paper: 72 + 12 + -4 + 6
To specify the interaction in `lm()` we use the `:` operator (`design:weight`).
```{r}
dist <- 72 +
(design == "b")*12 +
(weight == "28")*-4 +
(design == "b")*(weight == "28")*6 +
rnorm(n*4, mean = 0, sd = 10)
m <- lm(dist ~ design + weight + design:weight)
coef(summary(m))['designb:weight28','Pr(>|t|)'] < 0.01
```
Let's run it 1000 times and see what proportion of times we _detect_ the interaction at the significance level 0.01:
```{r}
results <- replicate(1000, expr = {
dist <- 72 + (design == "b")*12 +
(weight == "28")*-4 +
(design == "b")*(weight == "28")*6 +
rnorm(n*4, mean = 0, sd = 10)
m <- lm(dist ~ design + weight + design:weight)
coef(summary(m))['designb:weight28','Pr(>|t|)'] < 0.01
})
mean(results)
```
We need a much bigger sample size to detect this interaction!
Let's turn this into a function so we can easily try different sample sizes.
```{r}
sim_model <- function(n){
results <- replicate(n = 1000, expr = {
design <- rep(c("a","b"), each = n*2)
weight <- rep(c("20","28"), times = n*2)
dist <- 72 +
(design == "b")*12 +
(weight == "28")*-4 +
(design == "b")*(weight == "28")*6 +
rnorm(n*4, mean = 0, sd = 10)
m <- lm(dist ~ design + weight + design:weight)
coef(summary(m))['designb:weight28','Pr(>|t|)'] < 0.01})
mean(results)
}
# try sample sizes of 50, 100, 150
sapply(c(50,100,150), sim_model)
```
We need about 15 times the sample size (150 versus 10) to detect the interaction with decent power! Again, this is all one big thought experiment. We're making lots of assumptions. Hopefully they're reasonable and conservative assumptions.
## Post-hoc power analysis: not useful
Sometimes reviewers or editors will ask authors to perform a "post-hoc power calculation" if a study fails to produce statistically significant results. This means using the observed data and sample size to estimate power. The idea is to determine if the study sample size was too small to detect an effect. However, post-hoc power calculated using the study data is just the p-value reported a different way. It doesn't tell you anything new.
I wrote a StatLab article illustrating this:
https://data.library.virginia.edu/post-hoc-power-calculations-are-not-useful/
See also:
The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis (Hoenig and Heisey, 2001)
https://www.vims.edu/people/hoenig_jm/pubs/hoenig2.pdf
## Minimum sample size required for developing a multivariable prediction model
A recent publication in the BMJ provides "guidance on how to calculate the sample size required to develop a clinical prediction model." The focus of the paper is on medical applications but the methods are applicable to any statistical analysis involving the _development_ of a linear model. That is, you intend to model an outcome as a function of various candidate predictor variables, only some of which you may end up using.
The article:
Riley, RD, et al. Calculating the sample size required for developing a clinical prediction model. BMJ 2020;368:m441
https://doi.org/10.1136/bmj.m441 (Published 18 March 2020)
The authors created an R package called {pmsampsize} to easily implement their method. It has one function called `pmsampsize()`. It can be used to calculate sample sizes for linear models with continuous outcomes, logistic regression models, and survival analysis models. We'll cover linear models and logistic regression.
### Linear models
For a continuous outcome (ie, a number possibly with decimals that can be negative and/or positive), we need to specify the following arguments:
- type of prediction model: continuous outcome (`type = "c"`)
- anticipated adjusted R-squared (`rsquared`)
- anticipated number of model parameters _under consideration_ (`parameters`)
- anticipated mean outcome value in the population (`intercept`)
- anticipated standard deviation of the outcome in the population (`sd`)
Example: Let's say we plan on building a linear model to predict GPA at the end of freshman year for a typical college student. Based on data we'll have access to, we plan to entertain 15 possible parameters (ie, coefficients in a model). We'd like a model with an R-squared of about 0.2. We have reason to believe the overall mean GPA of all college freshmen is 3.0 with a standard deviation of 0.2. What minimum sample size do we need based on these assumptions?
```{r}
library(pmsampsize)
pmsampsize(type = "c",
parameters = 15,
rsquared = 0.2,
intercept = 3.0,
sd = 0.2)
```
We're mainly interested in the _Final_ sample size. At least 515. The four criteria are documented in the help page for `pmsampsize()`. The SPP is the estimated number of subjects we'll need per _predictor parameter_. A common rule of thumb has been 10 subjects per predictor parameter. We see here that rule of thumb is not very accurate.
NOTES:
- This does NOT mean the final model will have 15 predictors. It may just have 3 or 4. The 15 number is the maximum number of parameters the model might have.
- 15 predictor _parameters_ does NOT mean 15 predictors. For example, a categorical variable with 4 levels contributes 3 parameters, so that one predictor counts as 3 parameters (see the sketch after this list).
- This is a _minimum_ sample size. If you can get more subjects, by all means get more.
- This assumes the sample is representative of the target population.
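To see why a 4-level categorical variable contributes 3 parameters, you can count the columns of a model matrix; a small sketch using a made-up factor `f`:
```{r}
f <- factor(c("a", "b", "c", "d"))
# the model matrix has an intercept column plus 3 dummy columns,
# so this one predictor contributes 3 parameters
colnames(model.matrix(~ f))
```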
### Logistic regression
For a binary outcome (eg, 1/0, yes/no, success/failure, etc), we need to specify the following arguments:
- type of prediction model: binary outcome (`type = "b"`)
- anticipated Cox-Snell R-squared (`csrsquared`)
- anticipated number of model parameters _under consideration_ (`parameters`)
- anticipated proportion of "successes" in the population (`prevalence`)
Cox-Snell R-squared is a "generalized" R-squared measure of model performance. Unfortunately, unlike the traditional R-squared often used with linear models, the Cox-Snell R-squared can have an upper bound _less than 1_. The maximum value is dictated by the overall outcome proportion in the observed data. The following table provides some common outcome proportions and associated maximum Cox-Snell R-squared values:
```
outcome maximum
proportion C-S R-squared
---------------------------
0.5 0.75
0.4 0.74
0.3 0.71
0.2 0.63
0.1 0.48
0.05 0.33
0.01 0.11
```
Thus a model with good performance could have a small Cox-Snell R-squared. See Paul Allison's post on Statistical Horizons for more information and alternatives: https://statisticalhorizons.com/r2logistic/
Example: let's say we want to build a model to predict the probability a college graduate gains full-time employment within six months of graduation. Based on data we'll have access to, we plan to entertain 18 possible parameters (ie, coefficients in a model). We have reason to believe the overall proportion to obtain employment within 6 months is about 0.50. We'd like a model with a Cox-Snell R-squared of about 0.3. What minimum sample size do we need based on these assumptions?
```{r}
pmsampsize(type = "b",
parameters = 18,
csrsquared = 0.3,
prevalence = 0.50)
```
It appears we need at least 444 subjects with about 50% (222) actually gaining employment if we want to build a predictive model from 18 candidate parameters that ultimately has a Cox-Snell R-squared of at least 0.3. The EPP is the estimated _events per candidate predictor parameter_. Notice the output indicates that the maximum Cox-Snell R-squared value is 0.75.
## We're done!
If you would like to talk more about statistics or need statistical help with your research, we would love to hear from you: `[email protected]` or `[email protected]`
## References
- Champely, S. (2020). _pwr: Basic Functions for Power Analysis_. R package version 1.3-0. https://CRAN.R-project.org/package=pwr
- Cohen, J. (1988). _Statistical Power Analysis for the Behavioral Sciences_ (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
- Hoenig, J. M., and Heisey, D. M. (2001). The abuse of power: the pervasive fallacy of power calculations for data analysis. _The American Statistician_, 55, 19-24.
- Riley, R. D., et al. (2020). Calculating the sample size required for developing a clinical prediction model. _BMJ_, 368, m441. https://doi.org/10.1136/bmj.m441 (Published 18 March 2020)
- Ensor, J., Martin, E. C., and Riley, R. D. (2021). _pmsampsize: Calculates the Minimum Sample Size Required for Developing a Multivariable Prediction Model_. R package version 1.1.0. https://CRAN.R-project.org/package=pmsampsize
- Hogg, R., and Tanis, E. (2006). _Probability and Statistical Inference_ (7th ed.). Pearson. (Ch. 9)
- Dalgaard, P. (2008). _Introductory Statistics with R_. Springer.
- Allison, P. (2013). What's the Best R-Squared for Logistic Regression? https://statisticalhorizons.com/r2logistic/ (Accessed 2023-02-10)