# Case study: `rep()` {#sec-cs-rep}
```{r}
#| include = FALSE
source("common.R")
```
## What does `rep()` do?
`rep()` is an extremely useful base R function that repeats a vector `x` in various ways.
It takes a vector of data in `x` and has arguments (`times`, `each`, and `length.out`[^cs-rep-1]) that control how `x` is repeated.
Let's start by exploring the basics:
[^cs-rep-1]: Note that the function specification is `rep(x, ...)`, and `times`, `each`, and `length.out` do not appear explicitly.
You have to read the documentation to discover these arguments.
```{r}
x <- c(1, 2, 4)
rep(x, times = 3)
rep(x, length.out = 10)
```
`times` and `length.out` replicate the vector in the same way, but `length.out` allows you to specify a non-integer number of replications.
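One way to read "a non-integer number of replications" is that `length.out` acts as if `x` were repeated enough whole times and then truncated. The sketch below checks that reading; the `ceiling()` and `head()` formulation is only an illustration (with `n` and `whole` as throwaway names), not a description of how `rep()` is implemented.

```{r}
x <- c(1, 2, 4)
n <- 10

# 10 / 3 is about 3.33 replications: repeat 4 whole times, then keep the first 10
whole <- rep(x, times = ceiling(n / length(x)))
identical(rep(x, length.out = n), head(whole, n))
```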
The `each` argument repeats individual components of the vector rather than the whole vector:
```{r}
rep(x, each = 3)
```
And you can combine that with `times`:
```{r}
rep(x, each = 3, times = 2)
```
If you supply a vector to `times`, it works in a similar way to `each`, repeating each component the specified number of times:
```{r}
rep(x, times = x)
```
## What makes this function hard to understand?
- `times` and `length.out` both control the same underlying variable in different ways, and if you set them both then `length.out` silently wins:
```{r}
rep(1:3, times = 2, length.out = 3)
```
- `times` and `each` are usually independent:
```{r}
rep(1:3, times = 2, each = 2)
```
But if you specify a vector for `times`, you can't use `each`:
```{r}
#| error = TRUE
rep(1:3, times = c(2, 2, 2), each = 2)
```
- I think using `times` with a vector is confusing because it switches from replicating the whole vector to replicating individual values, like `each` usually does.
```{r}
rep(1:3, each = 2)
rep(1:3, times = 2)
rep(1:3, times = c(2, 2, 2))
```
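The observations in this list can also be checked mechanically with `identical()`. This is a small sketch using only base R; the first two comparisons should come out `TRUE` and the last `FALSE`.

```{r}
# `length.out` silently overrides `times` when both are supplied
identical(rep(1:3, times = 2, length.out = 3), rep(1:3, length.out = 3))

# A constant vector `times` matches `each` ...
identical(rep(1:3, times = c(2, 2, 2)), rep(1:3, each = 2))

# ... and does not match a scalar `times` with the same value
identical(rep(1:3, times = c(2, 2, 2)), rep(1:3, times = 2))
```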
## How might we improve the situation?
I think these problems have the same underlying cause: `rep()` is trying to do too much in a single function.
`rep()` is really two functions in a trench coat (@sec-strategy-functions) and it would be better served by a pair of functions, one which replicates element-by-element, and one which replicates the whole vector.
The following sections consider how we might do so, starting with what we should call the functions, then what arguments they'll need, then what an implementation might look like, and then considering the downsides of this approach.
### Function names
To create the new functions, we need to first come up with names: I like `rep_each()` and `rep_full()`.
`rep_each()` was a fairly easy name to come up with because it repeats each element.
`rep_full()` was a little harder and took a few iterations: I like that `full` has the same number of letters as `each`, which makes the two functions look like they belong together.
Some other possibilities I considered:
- `rep_each()` + `rep_every()`: each and every form a natural pair, but to me at least, repeating "every" element doesn't feel very different to repeating each element.
- `rep_element()` and `rep_whole()`: I like how these capture the differences precisely, but they are maybe too long for such commonly used functions.
### Arguments
Next, we need to think about their arguments.
They both will start with `x`, the vector to repeat.
Then their arguments differ:
- `rep_each()` needs an argument that specifies the number of times to repeat each element, which can either be a single number, or a vector the same length as `x`.
- `rep_full()` has two mutually exclusive arguments (@sec-implicit-strategies), either the number of times to repeat the whole vector or the desired length of the output.
What should we call the arguments?
We've already captured the different replication strategies in the function name, so I think the argument that specifies the number of times to replicate can be the same for both functions, and `times` seems reasonable.
What about the second argument to `rep_full()` which specifies the desired length of the output vector?
I draw inspiration from `rep()` which uses `length.out`.
But I think it's obvious that the argument controls the output length, so `length` is adequate.
### Implementation
We can combine these specifications with a simple implementation that uses the existing `rep()` function.[^cs-rep-2]
[^cs-rep-2]: In real code I'd want to turn these into explicit unit tests so we can run them repeatedly as we make changes.
```{r}
rep_full <- function(x, times, length) {
  rlang::check_exclusive(times, length)

  if (!missing(length)) {
    rep(x, length.out = length)
  } else {
    rep(x, times = times)
  }
}

rep_each <- function(x, times) {
  if (length(times) == 1) {
    rep(x, each = times)
  } else if (length(times) == length(x)) {
    rep(x, times = times)
  } else {
    stop('`times` must be length 1 or the same length as `x`')
  }
}
```
We can quickly check that the functions behave as we expect:
```{r}
x <- c(1, 2, 4)
# First the common times argument
rep_each(x, times = 3)
rep_full(x, times = 3)
# Then a vector times argument to rep_each:
rep_each(x, times = x)
# Then the length argument to rep_full
rep_full(x, length = 5)
```
### Downsides
One downside of this approach is that if you want to replicate both each component *and* the entire vector, you have to use two function calls, which you might expect to be more verbose.
However, I don't think this is a terribly common use case, and if we use our usual call naming conventions, then the new call is the same length:
```{r}
rep(x, each = 2, times = 3)
rep_full(rep_each(x, 2), 3)
```
And it's only slightly longer if you use the pipe, which is maybe slightly more readable:
```{r}
x |> rep_each(2) |> rep_full(3)
```
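Splitting the work into two calls loses nothing because, when `rep()` is given both `each` and `times`, it applies `each` first and then `times`. A quick check of that equivalence, using only base `rep()` so it does not depend on the helper functions (`combined` and `nested` are throwaway names for illustration):

```{r}
x <- c(1, 2, 4)

# One call with both arguments versus two nested single-argument calls
combined <- rep(x, each = 2, times = 3)
nested <- rep(rep(x, each = 2), times = 3)
identical(combined, nested)
```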
::: callout-caution
Note that this implementation does very little input checking, so invalid inputs might work, warn, or throw an unhelpful error.
For example, since we're not checking that the `times` and `length` arguments to `rep_full()` are single integers, the following calls give suboptimal results:
```{r}
#| error: true
rep_full(1:3, 1:3)
rep_full(1:3, "x")
```
We'll come back to input checking later in the book.
:::
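As a rough illustration of the kind of checking the callout has in mind, here is a sketch of a stricter variant written in base R only. Both `is_count()` and `rep_full_checked()` are hypothetical names invented for this example, and the rule "a single, finite, non-negative whole number" is an assumption about what a valid `times` or `length` should be, not a settled design.

```{r}
# Hypothetical helper: TRUE if `n` is a single, finite, non-negative whole number
is_count <- function(n) {
  is.numeric(n) && length(n) == 1 && is.finite(n) && n >= 0 && n == trunc(n)
}

# Hypothetical variant of rep_full() with explicit argument checks
rep_full_checked <- function(x, times, length) {
  # Exactly one of `times` and `length` must be supplied
  if (missing(times) == missing(length)) {
    stop("Exactly one of `times` or `length` must be supplied.", call. = FALSE)
  }

  if (!missing(length)) {
    if (!is_count(length)) {
      stop("`length` must be a single non-negative whole number.", call. = FALSE)
    }
    rep(x, length.out = length)
  } else {
    if (!is_count(times)) {
      stop("`times` must be a single non-negative whole number.", call. = FALSE)
    }
    rep(x, times = times)
  }
}

rep_full_checked(1:3, times = 2)
rep_full_checked(1:3, length = 5)
```

With these checks in place, `rep_full_checked(1:3, 1:3)` and `rep_full_checked(1:3, "x")` should both stop with the message about `times`, rather than returning a surprising result or an error that doesn't name the argument.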