-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlife_tables.Rmd
351 lines (249 loc) · 17.2 KB
/
life_tables.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
---
title: "Life tables in R using the `tidyverse`"
author: "Monica Alexander"
output:
pdf_document:
number_sections: true
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
\newpage
Life tables are a fundamental tool in demography. A life table describes the mortality experiences for a certain population. Usually a life table is composed of sets of values showing the mortality experience of a hypothetical group of infants born at the same time and subject throughout their lifetime to the specific mortality rates of a given year. Life tables are how we calculate **life expectancy**, probably one of the most common mortality summary measures. They are useful to compare populations and also tells us something about the implied stationary population. Each column refers to a different measure of survivorship. There are different ways of describing survivorship; for example, probability still alive, life expectancy, etc, so a life table has many different columns.
This module explains the main columns of a lifetable and demonstrates how to construct a lifetable in R using the `tidyverse` syntax.
Let's load in the packages we need:
```{r, echo = TRUE}
library(tidyverse) # data manipulation and ggplot functions
library(kableExtra) # format tables
library(janitor) # to easily clean up column names
```
# Columns of the lifetable
## Survivorship $l_x$
Every row of a life table refers to a different age or age group: if the later, the table is referred to as an **abridged** life table. We will define $x$ to be age and $n$ to be the length of the interval.
A usual place to start is survivorship, $l_x$, which is defined as the number of people still left alive at age $x$. The value of $l_0$ is the starting size of the population, and is called the **radix**. In practice, the radix is usually equal to 1, 100, or 100,000. If $l_0=1$ then then $l_x$ is a probability of survival to age $x$. Note that for now we are implicitly assuming this $l_0$ relates to a **cohort** of people moving through time, so the life table documents **cohort** mortality. However, later on we will look at period mortality.
Here's the estimated $l_x$ values for females in Ontario in 2015. The data are from the [Canadian Human Mortality Database](http://www.bdlc.umontreal.ca/CHMD/). Here, the radix is 100,000. By age 110, out of the original population of 100,000, it is estimated that 28 will survive.
```{r}
lt <- read_table("http://www.prdh.umontreal.ca/BDLC/data/ont/fltper_5x5.txt", skip = 2)
lt <- lt |>
filter(Year=="2015-2019") |>
mutate(x = c(0,1,seq(5, 110, by = 5)),
n = lead(x, default = Inf)-x)
lt |>
select(x,n, lx) |>
kable()
```
Let's also plot the $l_x$ and divide through by 100,000 so the $l_x$ can be interpreted as the proportion of the population surviving at age $x$.
```{r}
lt |>
mutate(lx = lx/100000) |>
ggplot(aes(x, lx)) +
geom_line() +
xlab("age") +
theme_bw(base_size = 14) +
ggtitle("Survivorship for Ontario, 2015")
```
## Deaths $_nd_x$
The next column, $_nd_x$, is the number of deaths between ages $x$ and $x+n$. Note that we using the 'duration-age' notation, because it refers to deaths over an interval. In contrast, $l_x$ refers to survivors at a certain age $x$.
By definition, the number of deaths over an interval must be the number of survivors at the start of the interval, minus the number of survivors at the end, i.e.
$$
_nd_x = l_x - l_{x+n}.
$$
Let's look at the estimated $_nd_x$ values for Ontario in 2015:
```{r}
lt |>
select(x,n, lx, dx) |>
kable()
```
Notice the structure of the life table in terms of where the the values of $l_x$ and $_nd_x$ line up within the rows. $l_x$ always starts at the radix, so the first row represents the total population before any deaths. The first row of $_nd_x$ represents the deaths in the first interval. So the second row of $l_x$ is equal to the previous row of $l_x$ minus the previous row of $_nd_x$, and so on. Note also the last interval: everyone who survived to the last age group must die.[^9]
[^9]: *memento mori*.
## Probability of dying, $_nq_x$, and of surviving, $_np_x$
The next column, $_nq_x$ is the probability of dying between ages $x$ and $x+n$. Note that this is a conditional probability, so it's the probability of dying in that interval given you survived to age $x$. $_nq_x$ can be calculated as
$$
_nq_x = \frac{_nd_x}{l_x}
$$
The complement of $_nq_x$ is the probability of survival, $_np_x$
$$
_np_x = 1 - {_nq_x}
$$
Again, this is a conditional probability, so it's the probability of survival between ages $x$ and $x+n$ given you survived to age $x$. Given the relationship for $_nq_x$ and $_nd_x$, we can also calculate $_np_x$ as
$$
_np_x = 1 - {_nq_x} = 1 - \frac{_nd_x}{l_x} = \frac{l_x - l_x + l_{x+n}}{l_x} = \frac{l_{x+n}}{l_x}
$$
i.e. the probability of survival is the ratio of the the survivors at the end and start of the interval.
Looking again at the data for Ontario in 2015, notice the probability of death in the last age group is 1, because again, everyone must die eventually.
```{r}
lt |>
mutate(px = 1- qx) |>
select(x, n, lx, dx, qx, px) |>
kable()
```
## Average years lived, $_na_x$
$_na_x$ is the number of years lived by those who died between ages $x$ and $x +n$. So for example, if $_1a_0 = 0.25$, then for that population, those infants who died in the first year on average died after 0.25 years = 3 months. To calculate the exact value for $_na_x$ requires a lot of data: you would need to know the exact lengths of life for each individual in the cohort. Approximations to $_na_x$ are discussed below in the period life table section.
## Person-years lived, $_nL_x$
$_nL_x$ is the number of person-years lived between ages $x$ and $x + n$. The total number of person-years lived (PYL) in an interval is the sum of
1. the PYL by those who survived and
2. the PYL by those who died in the interval.
The first piece is just the interval length, $n$ multiplied by the number of survivors at the end of the intervals, $l_{x+n}$. The second piece is the average time spent alive in the interval by those who died, $_na_x$ multiplied by the number of people who died in the interval, $_nd_x$. So
\begin{equation*}
_nL_x = n\cdot l_{x+n} + _na_x \cdot _nd_x.
\end{equation*}
Note for the last interval there are no survivors, so $_{\infty}L_x = {_{\infty}a_x} \cdot {_{\infty}d_x}$.
Adding $_na_x$ and $_nL_x$ to the Ontario life table:
```{r}
lt |>
select(x,n, lx, dx, ax, Lx) |>
kable()
```
### $_nL_x$ graphically and the relationship with $l_x$
$_nL_x$ is essentially the number of survivors times the average length of time they survived in a particular interval. For a given radix $l_0$ and interval length $n$, the maximum $_nL_x$ could be is $l_0 \cdot n$, if everyone survived.
How does $_nL_x$ relate to $l_x$? It is the area under the $l_x$ curve for the interval $[x,x+n]$, as illustrated below by the red dashed lines, for $_{15}L_{15}$. It may help to think about the units here: $_nL_x$ has units person-years. $l_x$ has units of persons. The x-axis on the graph below has units years.
```{r}
lt |>
mutate(lx = lx/100000) |>
ggplot(aes(x, lx)) +
geom_line() +
xlab("age") +
theme_bw(base_size = 14) +
geom_vline(xintercept = 15, lty = 2, color = "red")+
geom_vline(xintercept = 30, lty = 2, color = "red")+
ggtitle("Survivorship for Ontario, 2015")
```
With this in mind, we can represent $_nL_x$ in continuous form as
$$
_nL_x = \frac{1}{l_0}\int_x^{x+n} l_x dx
$$
In practice we usually have to calculate $_nL_x$ in the discrete form, but it's often useful to think about it in continuous form, i.e. the area under the survivorship curve.
## Person-years lived above age $x$, $T_x$
Whereas $_nL_x$ is the person-years lived in a specific interval, $T_x$ is the person-years lived above a specific age $x$ (so notice it does not have the duration/age notation). It is defined as the sum of the relevant $_nL_x$:
\begin{equation*}
T_x = \sum_x ^ \infty {_nL_x}
\end{equation*}
In a similar fashion to $_nL_x$, $T_x$ can be thought of as the area under the $l_x$ curve above age $x$. In addition, We can write $T_x$ in continuous form as
$$
T_x = \frac{1}{l_0}\int_x^{\infty} l_x dx
$$
## Life expectancy at age $x$, $e_x$
The final column we will introduce for now is probably the most well-known: $e_x$, the average number of remaining years of life for those who reach age $x$, or the **life expectancy** at age $x$. Note that the 'expectancy' terminology is related to the expected value in the statistical sense. $e_x$ is calculated as
\begin{equation*}
e_x = \frac{T_x}{l_x}
\end{equation*}
We can do a quick check of the units here to make sure it makes sense: $T_x$ has units person-years, $l_x$ has units persons, so $e_x$ has units of years.
You are probably most familiar with life expectancy at birth, $e_0$. Note that it is again a conditional measure, that is, it's the average number of remaining years **given** a person has already survived to age $x$. As such, the value of $e_x$ need not decrease monotonically over age. In practice it usually does, unless infant mortality is relatively high.
The filled-in life table with all columns discussed:
```{r}
lt |>
select(x,n, lx, dx, ax, Lx, Tx, ex) |>
kable()
```
# Period life tables
A life table as defined above refers to tracking the mortality of a **cohort** of people as they age. However, it is often not practical or useful just to consider the mortality of a cohort, because in order to to build a complete table, we have to wait to observe everyone in the cohort die. So for the 1990 birth cohort, for example, we would probably have to wait around until at least 2090 before a reasonable cohort life table could be built. This is not very useful to study current mortality conditions.
In addition to constructing cohort life tables, we can construct **period** life tables. They refer to the period in the sense that they are constructed using mortality conditions in a particular period. This means the lifetable refers to a **synthetic cohort**, a hypothetical group of people that experience the mortality conditions of the period of interest throughout their entire life. Why is this hypothetical and potentially unrealistic? Because mortality conditions change over time (and in general, are getting better). So for example, if I live until I'm 70, the mortality conditions I am subject to in the future are likely to be different to the mortality conditions that a current 70-year-old is being subjected to. However, period life tables are still useful to compare mortality outcomes for different populations in a more up-to-date way.
## Construction from period mortality rates
The key to constructing period life tables is converting the observed period mortality rates $_nM_x$ to probabilities of death, $_nq_x$. The mortality rate is the number deaths divided by person years lived, so in life table notation this is:
$$
_nM_x = \frac{_ndx}{_nLx}
$$
We can use the following **$_nM_x$ to $_nq_x$ conversion formula** to get the $_nq_x$ column, after which all other columns can be derived based on the relationships discussed above (and choosing a radix). The formula is:
$$
_nq_x = \frac{n \cdot {_nM_x}}{1 + (n - {_na_x})\cdot {_nM_x}}
$$
How did this come about? By rewriting $$_nL_x = n\cdot l_{x+n} + {_na_x}\cdot{_nd_x} = n(l_{n} - {_nd_x}) + {_na_x}\cdot{_nd_x}$$ rearranging to get $l_x$ and rewriting $_nq_x$ with this denominator.
## Getting values for $_na_x$
The conversion formula requires information only on period mortality rates and values of the average number of years lived for those who died, $_na_x$. How do we get these values? As mentioned above, the data required to calculate $_na_x$ exactly is usually not available, so we need to approximate it somehow. Preston, Heuveline and Guilot (2000) has a good overview of the options here (section 3.2, page 44), but the most common and easiest approach is, for **most** age groups, assume
$$_na_x = n/2$$
that is, on average those who die, die half-way through the interval. So for an abridged life table with five-year intervals, $_na_x = 2.5$. This assumption is fine for most age groups, except for the very young and very old ages.
At younger ages, typically in abridged life tables the first five years is split out into the first year, and years 1-5, so we need values for $_1a_0$ and $_4a_1$. For mortality at younger ages, we expected comparatively more deaths to occur at the start of the interval, so $_na_x < n/2$. The following two approximations are common, and used e.g. by Wachter in Essential Demographic Methods (2014):
$$
_1a_0 = 0.07 + 1.7 _1M_0
$$
and
$$
_4a_1 = 1.5
$$
The equation for $_1a_0$ says that infants who die on average die at about 1 month plus a bit, where the bit depends on the over level of infant mortality. This is based on the observation that as infant mortality declines, infants that are dying are more likely to die from pre-existing conditions rather than exogenous factors, so deaths occur relatively early.
For the last age group, we can assume $_na_x$ is the inverse of the mortality rate:
$$
_{\infty}a_{\omega} = 1/_{\infty}M_{\omega}
$$
where $_{\infty}M_{\omega}$ is the age-specific mortality rate for the last age interval; $\omega$ refers to the last age. The smaller the interval of the open-ended age group, the better approximation.
## Interpretation of period life table measures
As mentioned at the start of this section, period life tables are constructed from a synthetic cohort of people that hypothetically would go through life experiencing the age-specific mortality conditions of the current period. Period life table measures, and in particular, period life expectancy at birth, are the most commonly published and discussed. However, period measures of life expectancy are often misinterpreted. The technical definition is the expected number of years of live for a newborn who would be subject to the current mortality conditions for their entire life. But life expectancy is usually just talked about as 'how long you're expected to live for'. Part of the confusion comes from the name, but the 'expected' part refers to the fact that it is an expected value in the statistical sense; in particular, $E[T_x] = e_x$.
# R: Make your own life table
Let's make a period life table for females in Quebec in 2015 using data from the Canadian Human Mortality Database. Read in the data directly from the website, and filter out what we need:
```{r}
Mx <- read_table("http://www.prdh.umontreal.ca/BDLC/data/que/Mx_5x5.txt", skip = 2, col_types = 'ccddd')
d <- Mx |>
mutate(year = as.numeric(substr(Year, 1, 4))) |>
select(year, Age, Total) |>
clean_names() |>
rename(Mx = total)
head(d)
```
The `age` column is a character, so let's make an age `x` and interval length `n` column:
```{r}
d <- d |>
mutate(x = as.numeric(str_remove(age, "-.*|\\+")),
n = lead(x, default = Inf) - x) |>
filter(x<105) |> # remove older ages that have varying data availability
select(year, age, x, n, Mx)
head(d)
```
Now we can use `tidyverse` to calculate the columns in the life table, based on the equations presented in previous sections. I set the radix $l_0$ to be one and filter to just include the year 2015. This code makes use of the `case_when` function, which allows to define different values of $_na_x$ based on age group. Formulas for other columns are based on equations stated above. The formula for $T_x$ is implemented by first reversing the $_nL_x$ column, taking the cumulative sum, and then reversing the result.
```{r}
lt_2015 <- d |>
filter(year==2015) |>
mutate(
ax = case_when(
x==0 ~ 0.07 + 1.7*Mx,
x==1 ~ 1.5,
x==110 ~ 1/Mx,
TRUE ~ 2.5
),
qx = n * Mx / (1 + (n - ax)* Mx),
px = 1 - qx,
lx = lag(cumprod(px), default = 1),
dx = lx - lead(lx, default = 0),
Lx = n * lead(lx, default = 0) + (ax* dx),
Tx = rev(cumsum(rev(Lx))),
ex = Tx / lx
)
head(lt_2015)
```
We can extend this to calculate life tables for every year using the `group_by` function
```{r}
lt_all_years <- d |>
group_by(year) |>
mutate(
ax = case_when(
x==0 ~ 0.07 + 1.7*Mx,
x==1 ~ 1.5,
x==110 ~ 1/Mx,
TRUE ~ 2.5
),
qx = n * Mx / (1 + (n - ax)* Mx),
px = 1 - qx,
lx = lag(cumprod(px), default = 1),
dx = lx - lead(lx, default = 0),
Lx = n * lead(lx, default = 0) + (ax* dx),
Tx = rev(cumsum(rev(Lx))),
ex = Tx / lx
)
head(lt_all_years)
```
Let's plot the $l_x$ curve over time; notice as mortality improves, the drop in $l_x$ after the first year of life becomes less noticeable, and the curve becomes more 'rectangular'[^11]
[^11]: Indeed, this phenomenon is called 'the rectangularization of the survival curve' and is related to the study of 'compression' of mortality, meaning that deaths become more concentration at older ages. See for example [Wilmoth and Horiuchi](https://link.springer.com/article/10.2307/2648085).
```{r, warning = FALSE}
lt_all_years |>
ggplot(aes(x, lx, color = year, group = year)) +
geom_line()+
ggtitle("Survivorship over time, Quebec")
```
We can also plot life expectancy at birth over time:
```{r, warning = FALSE}
lt_all_years |>
filter(x==0) |>
ggplot(aes(year, ex)) +
geom_line() +
ggtitle("Life expectancy at birth, Quebec")
```