---
title: "Advanced use demo: non-linear time series analysis"
output:
  html_document:
    toc: true
    toc_float: true
    theme: readable
    highlight: haddock
    fig_width: 10
    fig_height: 4
---
## 1. Background
This tutorial shows you how to use `mgcv` for non-linear time series data.
With time series, the values you observe at a current time, $y(t)$, are assumed
to depend on the previous values of the time series: **_y(t) = f(y(t-1),
y(t-2),...)_**. Compare this to standard statistical methods, which assume each
data point is independent of the others once you account for any covariates, or
spatial statistics, which assume that the value at a given point depends on all
its neighbours, not just the ones in one direction from it.
A good example of time series data would be a population of animals fluctuating
over time. If a population is currently well above its carrying capacity, it's
likely going to be above its carrying capacity the next time we survey it as
well. To be able to effectively model this population, we need to understand how
it changes over time in response to its own density. Generalized additive
models, using `mgcv`, are very useful for this as they allow you to model
current values as a non-linear function of past values.
This tutorial has two major sections. The first (section 3) covers using GAMs
to model trends in the presence of autocorrelated background noise. The second
(section 4) focuses on using GAMs to model complex nonlinear dependencies
between time points, for cases where you really care about how current
observations depend on previous ones (for instance, when estimating population
dynamics). You don't have to do both sections; each stands alone, so feel free
to work through whichever of the two is more useful for you.
## 2. Key concepts and functions:
Here are a few key ideas and R functions you should familiarize yourself with
if you haven't already encountered them. R functions will be highlighted
`like this`; use `?function_name` to look them up. If you're familiar with
time series statistics, skip to the next part.
### Cyclic smooths {#smoother}
This exercise will use a type of smoother we didn't cover in the intro: the
cyclical smooth. Standard smooth terms assume that the left and right end of the
range of your x variable have nothing to do with each other; one can be much
higher or lower than the other. However, there are several types of data where the
predictor starts and ends at the same point. Think of time of day: 11:59 and
00:00 are almost the same time, even though they occur on different days. In
this tutorial, we'll work with seasonal (monthly) data, and similarly, month 1
and month 12 are very close to each other. As such, the start and end of such a
smooth should also be very close; if our model predicts that it should be 10
degrees C at midnight, it should also predict the same temperature at 11:59:59.
`mgcv` allows us to easily model this type of data, using a cyclic smooth basis.
The standard smooth terms we've used so far are built so that their second
derivatives are zero at the ends of the data. The cyclic smooth instead requires
that the value and the first and second derivatives match at the two ends of the range.
To use a cyclic smooth, you need to code the smooth term like this: `s(x,
bs="cc")`. By default, `mgcv` assumes that the largest and smallest values of x
are the end points. If that isn't true, you can specify the end points of the
range as shown below and in the first example using the knots argument.
This code shows how the smooth basis functions vary between the default thin
plate ("tp"), cyclic ("cc") and cyclic smooth with end points outside the range
of the data. Note that no matter how you vary the cyclic smooths, the ends will
always be equal. For more info on cyclic smooths, refer to
`?smooth.construct.cc.smooth.spec`
```{r, echo=T,tidy=F,results="hide", include=T, message=FALSE,highlight=TRUE}
library(mgcv)
dat = data.frame(x = seq(0, 1, length = 100))
# Build (without fitting) the basis functions for each smoother type
tp_smooth = smooth.construct2(s(x, bs = "tp", k = 4), data = dat, knots = NULL)
cc_smooth = smooth.construct2(s(x, bs = "cc", k = 4), data = dat, knots = NULL)
cc_smooth2 = smooth.construct2(s(x, bs = "cc", k = 4), data = dat,
                               knots = list(x = c(-0.5, 1.5)))
# Plot the three sets of basis functions side by side
layout(matrix(1:3, nrow = 1))
matplot(dat$x, tp_smooth$X, type = "l", main = "Thin plate spline")
matplot(dat$x, cc_smooth$X, type = "l", main = "Cyclic spline")
matplot(dat$x, cc_smooth2$X, type = "l", main = "Cyclic spline with wider knots")
layout(1)
```
### Glossary of terms
_Trend_: The long-term value around which a time series is fluctuating. For
instance, if the habitat for a population is slowly deteriorating, it will show
a decreasing trend.
_Seasonal effect_: Many time series show a strong seasonal pattern, so the
average value of the series depends on time of year. For instance, our
hypothetical population may reproduce in spring, so population numbers would be
higher then and decline throughout the year. We can use [cyclic
smoothers](#smoother) to model these effects.
_Stationary time series_: This is a time series without any trend or seasonal
components, so if you looked at multiple realizations of the time series, the
mean and variance (and other statistical properties) averaged across
realizations would not change with time.
_Autocorrelation function (`acf`)_: This is a function that describes how
correlated a time series is with itself at different time lags; that is, the acf
of a time series $x$ is $acf(x,i) = cor(x(t), x(t-i))$, where $i$ is a given lag.
In `R`, use the function `acf(x)` to plot the acf of the series `x`.
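As a quick check of this definition (a sketch using a built-in series, not one
of this tutorial's data sets):
```{r, eval=FALSE}
# sketch: the lag-1 acf is (essentially) the correlation of the series with
# itself shifted by one step; small differences come from how acf normalizes
# by the full-series mean and length
x = as.numeric(LakeHuron)
n = length(x)
cor(x[-1], x[-n])                 # compare to:
acf(x, lag.max = 1, plot = FALSE)
```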
_`gamm`_: This function fits a generalized additive *mixed effects* model, using
the `nlme` package. We haven't discussed it much in the rest of the course, but
it's a tool for fitting more complicated mixed effects models than the
`bs="re"` method we discussed earlier allows. More importantly for this
tutorial, it also allows you to add correlations between errors into your model,
specifying how the errors are related to one another using a range of models.
Look up `?corStruct` if you're interested in the range of possible error
correlations. The `gamm` function returns a list with two items: a `gam` model
containing the smooth terms, which works like the `gam` models you've been
working with so far, and an `lme` object containing the random effects and, in
our case, the correlation structure.
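Schematically (a sketch, not runnable as-is: `y`, `x`, `group`, and `dat` are
placeholders):
```{r, eval=FALSE}
# sketch: a gamm fit is a list with $gam and $lme components
m = gamm(y ~ s(x), correlation = corAR1(form = ~1|group), data = dat)
summary(m$gam)  # smooth terms; use like any gam model
summary(m$lme)  # random effects and error-correlation structure
```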
## 3. Nonlinear decomposition of time series: trend, seasonal, and random values
In many cases, what we're interested in is estimating how things are changing
over time, and the fact that the time series is autocorrelated is a nuisance; it
makes it harder to get reliable estimates of the trends we're actually
interested in. In this section, we'll show you how to estimate non-linear
trends, while accounting for autocorrelation in residuals [^trendnote].
[^trendnote]: This is an expanded
example based on one on Gavin's blog, [available here](http://www.fromthebottomoftheheap.net/2016/03/25/additive-modeling-global-temperature-series-revisited/). His example also has some great code on how to model
the derivatives of the trend.
This example assumes that a time series can be broken down into two main
components: a mean that changes smoothly over time, and an autocorrelated
error term, so the model for the dependent variable looks like:
$$y(t) = f(t) + \epsilon(t)$$
$$\epsilon(t) = \sum_{i=1}^{p} \alpha_i \cdot \epsilon(t-i) + rnorm(0, \sigma)$$
where $f(t)$ is a smooth function of time, $\alpha_i$ are the auto-regressive
coefficients of an order-$p$ model, and $\sigma$ is the standard deviation of
the model errors.
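To make the model concrete, here's a minimal simulation sketch (all names
invented for illustration) of a smooth trend plus AR(1) errors:
```{r, eval=FALSE}
# sketch: simulate y(t) = f(t) + AR(1) noise
set.seed(1)
t = 1:200
f = sin(t/20)                                     # a smooth trend f(t)
eps = as.numeric(arima.sim(list(ar = 0.7), 200))  # autocorrelated errors
y = f + eps
plot(t, y, type = "l")
lines(t, f, col = "red")                          # trend vs. observed series
```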
### Loading and viewing the data
The first time series we'll look at is a long-term temperature and precipitation
data set from the convention center here at Fort Lauderdale, consisting of
monthly mean temperature (in degrees C), the number of days it rained each
month, and the amount of precipitation that month (in mm). The data range from
1950 to 2015, with columns for year and month (with Jan. = 1). The data come
from a great web app called FetchClim2, and the link for the data [I used is
here](http://fetchclimate2.cloudapp.net/#page=geography&dm=values&t=years&v=airt(13,6,2,1),prate(12,7,2,1),wet(1)&y=1950:1:2015&dc=1,32,60,91,121,152,182,213,244,274,305,335,366&hc=0,24&p=26.099,-80.123,Point%201&ts=2016-07-17T21:57:11.213Z).
Here we'll load it and plot the air temperature data. Note that air temperature
shows a strong seasonal pattern and is strongly auto-correlated.
```{r, echo=T,tidy=F,include=T, message=FALSE,highlight=TRUE}
library(dplyr)
library(mgcv)
library(ggplot2)
library(tidyr)
florida = read.csv("data/time_series/Florida_climate.csv",
stringsAsFactors = F)
head(florida)
ggplot(aes(x=month, y=air_temp,color=year,group=year),
data= florida)+
geom_point()+
geom_line()
acf(florida$air_temp,lag.max = 36)
```
Looking at the time series, it does look like later years are
warmer, but we should model it to see how much warmer.
### Modelling yearly and seasonal trends
The first model we'll fit will include both a yearly and a seasonal trend. Note
that the seasonal trend uses a cyclic smoother, as January and December values
should be close to one another. Also note that we've specified the `knots`
argument for the month term, with knots at 0.5 and 12.5 months. If we just left
it as is, `mgcv` would assume that month 1 and month 12 were the end points,
and should therefore have the same value; since January and December have
different mean temperatures, this would create a bias. By specifying knots like
this, we're stating that the point of equality should fall halfway between the
middle of December and the middle of January.
```{r, echo=T,tidy=F,include=T, message=FALSE}
model_temp_basic = gamm(air_temp~s(year)+s(month,bs="cc",k=12),data=florida,
knots = list(month = c(0.5, 12.5)))
plot(model_temp_basic$gam,page=1,scale=0)
summary(model_temp_basic$gam)
```
It appears that there's about an 8 degree C difference between the warmest and coolest
months, and that temperatures have been rising non-linearly, with a much faster rate
of increase since 2010. The model seems to fit well; however, if we look at the acf,
there's still substantial unexplained autocorrelation:
```{r, echo=T,tidy=F,include=T, message=FALSE}
acf(residuals(model_temp_basic$gam),lag.max = 36)
```
To incorporate the auto-correlated error structure, we need to use the slightly
more complex `gamm` function. This uses the `nlme` package to fit the model,
which allows us to add correlation functions (in this case, an auto-regressive
model) to our GAM. The auto-regressive order-$p$ model we'll add assumes that
the residuals at time $t$ follow $\epsilon_t = \sum_{i=1}^{p}
\alpha_i\epsilon_{t-i} + rnorm(0,\sigma)$. The more lags you add to the
auto-regressive model, the more complex the time dependencies you can capture.
Also note that we specify the form of the correlation as `form = ~1|year`. This
means temperatures will only be assumed to be correlated within a given year,
so there's no assumed correlation between the temperature in December of one
year and January of the next. This is just to speed up computations for the
workshop; in an actual analysis, we would not recommend ignoring these
correlations[^sequential_data].
[^sequential_data]: Also note that if your data are not organized sequentially
(so that `gamm` can assume that the prior row is the observation that occurs
immediately before the current one), you'd have to give `corARMA` a variable
specifying the time each observation occurs at. You could do that for this data
using:
`florida = mutate(florida, time = as.Date(paste(year, month, "15", sep = "-")))`
`corARMA(form = ~time|year, p = 2)`
```{r, echo=T,tidy=F,include=T, message=FALSE}
model_temp_autoreg1 = gamm(air_temp~s(year)+s(month,bs="cc",k=12),
correlation = corARMA(form=~1|year,p = 1),
data=florida, knots = list(month = c(0.5, 12.5)))
model_temp_autoreg2 = gamm(air_temp~s(year)+s(month,bs="cc",k=12),
correlation = corARMA(form=~1|year,p = 2),
data=florida, knots = list(month = c(0.5, 12.5)))
layout(matrix(1:6, nrow = 2,byrow = F))
plot(model_temp_basic$gam,scale=0,main="basic model")
plot(model_temp_autoreg1$gam,scale=0,main="lag-1 model")
plot(model_temp_autoreg2$gam,scale=0,main="lag-2 model")
layout(1)
layout(matrix(1:3, nrow = 1))
acf(residuals(model_temp_basic$lme,type="normalized"),main="basic model",lag.max = 36)
acf(residuals(model_temp_autoreg1$lme,type="normalized"),main="lag-1 model",lag.max = 36)
acf(residuals(model_temp_autoreg2$lme,type="normalized"),main="lag-2 model",lag.max = 36)
layout(1)
```
Note that including the auto-regressive model resulted in a substantially
smoother estimate of the long-term trend. Warming appears to still be speeding
up, although not at the same rapid rate as in the naive model. Adding the first
lag also seems to remove most of the auto-correlation in the data, although
there appears to be at least some long-period auto-correlation at the 12-month
scale. We can use an ANOVA to test whether adding the first or second
autoregressive term actually results in a better-fitting model than the naive
one.
```{r, echo=T,tidy=F,include=T, message=FALSE}
anova(model_temp_basic$lme, model_temp_autoreg1$lme, model_temp_autoreg2$lme)
```
It appears that adding the first autoregressive term substantially improved the model,
but the second lag term may not be necessary.
### Exercises
1. Modify the model of temperature and season to allow the seasonal effect to
change over time. Is there evidence for changing seasonal patterns? How does
this change once temperature autocorrelation is accounted for? Hint: this will
require the use of tensor product smooths (see the syntax sketch after these
exercises).
2. Construct a model to estimate seasonal and long-term trends in precipitation
for the convention center. What lags are necessary to include in the model?
Remember that precipitation is a positive variable (i.e. cannot go below zero),
so you may want to transform the variable to better suit the model's
assumptions.
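For exercise 1, a possible starting point for the tensor smooth syntax (a
sketch only; choices of `k`, knots, and the correlation structure are up to
you):
```{r, eval=FALSE}
# sketch: a tensor product of a thin plate basis for year and a cyclic
# basis for month lets the seasonal pattern change across years
model_interact = gamm(air_temp ~ te(year, month, bs = c("tp", "cc")),
                      data = florida, knots = list(month = c(0.5, 12.5)))
```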
## 4. Stationary nonlinear autoregressive models and forecasting
So far we've focused on estimating trends where the presence of auto-correlation
is just a nuisance effect we need to account for to get accurate estimates.
However, in population and community ecology, we often find the time dependence
itself to be interesting, as it can give us information on how the population is
changing in response to its own density; that is, it helps us estimate what
factors are affecting population growth rates. We can use this information to
estimate what the equilibrium (carrying capacity) population density is, whether
this equilibrium is stable, and how quickly we expect the population to return to
equilibrium after perturbation. We can also use these methods to forecast the
population into the future, which is very useful for managers[^lynxnote].
[^lynxnote]: This example is based on one by the statistician Cosma Shalizi, at
Carnegie Mellon University, from his book Advanced Data Analysis from an
Elementary Point of View (the example starts at page 517 of the pdf draft
[available here](http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf)).
This book is a great learning tool for stats in general, and for more on using
GAMs.
We also know, though, that density-dependent effects in ecology are often
strongly non-linear, so that observed densities at a given time may be a complex
function of lagged density values: $x(t) = f(x(t-1), x(t-2)...)$. Therefore, the
linear correlation equations we were just working with may not suffice to
estimate the real population dynamics. This is where `mgcv` comes in, letting us
analyze complex time-dependent relationships.
### Loading and viewing the data
The second data set we'll work on is the famous `lynx` data: the annual numbers
of lynx caught by Hudson's Bay trappers from 1821 to 1934. This is one of the
most well-known data sets in time series analysis (and in population ecology),
as it shows clearly cyclic behaviour, with a period of 9-10 years. It is also
heavily studied in time series statistics, because standard (linear) time
series methods have a hard time capturing that strong cyclic behaviour.
Here we'll load it, convert it into a data frame that `mgcv`
is able to handle, and plot the data. Note the strong cyclicity, obvious from the
acf function.
```{r, echo=T,tidy=F,include=T, message=FALSE}
library(dplyr)
library(mgcv)
library(ggplot2)
library(tidyr)
data(lynx)
lynx_full = data.frame(year=1821:1934, population = as.numeric(lynx))
head(lynx_full)
ggplot(aes(x=year, y=population), data= lynx_full)+
geom_point()+
geom_line()
acf(lynx_full$population)
```
Next we'll add some extra columns to the data, representing lagged population
values, using the `lag` function from the `dplyr` package. Note that we've
log-transformed the population data for the lag values; our model will assume
we're dealing with log-responses, so this makes interpreting plots easier. Also
note that we specified the `default` argument of `lag` as the mean log
population. This replaces the `NA`s generated at the start of the time series
(where no lagged values are possible) with the mean value, so we don't end up
throwing out data, and so we can compare models with different lags (as they'll
have the same number of data points).
We'll also split the data set into two: a training data set to fit models, and
a 40-year testing data set to check how well our models predict out of sample.
This is often a crucial step when fitting this sort of data; it is very easy to
overfit time series data, and holding data in reserve gives us a way to test
for overfitting.
```{r, echo=T,tidy=F,include=T, message=FALSE}
mean_pop_l = mean(log(lynx_full$population))
lynx_full = mutate(lynx_full,
popl = log(population),
lag1 = lag(popl,1, default = mean_pop_l),
lag2 = lag(popl,2, default = mean_pop_l),
lag3 = lag(popl,3, default = mean_pop_l),
lag4 = lag(popl,4, default = mean_pop_l),
lag5 = lag(popl,5, default = mean_pop_l),
lag6 = lag(popl,6, default = mean_pop_l))
lynx_train = filter(lynx_full, year<1895)
lynx_test = filter(lynx_full, year>=1895)
```
### Fitting and testing linear models
We'll start by fitting a series of linear auto-regressive models of increasing
order, and seeing how well each forecasts the test data, using one-step-ahead
forecasting, where we try to forecast the next step for each time point, given
the observed data series:
```{r, echo=T,tidy=F,include=T, message=FALSE,fig.width=12}
lynx_lm1 = gam(population~lag1,
data=lynx_train, family = "poisson", method="REML")
lynx_lm2 = update(lynx_lm1,formula = population~lag1+lag2)
lynx_lm3 = update(lynx_lm1,formula = population~lag1+lag2+lag3)
lynx_lm4 = update(lynx_lm1,formula = population~lag1+lag2+lag3+lag4)
round(AIC(lynx_lm1,lynx_lm2,
lynx_lm3,lynx_lm4))
lynx_predict_lm = mutate(
lynx_full,
lag1_model = as.vector(exp(predict(lynx_lm1,lynx_full))),
lag2_model = as.vector(exp(predict(lynx_lm2,lynx_full))),
lag3_model = as.vector(exp(predict(lynx_lm3,lynx_full))),
lag4_model = as.vector(exp(predict(lynx_lm4,lynx_full))),
data_type = factor(ifelse(year<1895, "training","testing"),
levels= c("training","testing"))
)
#Gathers all of the predictions into two columns: the model name and the predicted value
lynx_predict_lm = gather(lynx_predict_lm,key= model, value= pop_est,
lag1_model:lag4_model )%>%
mutate(pop_est = as.numeric(pop_est))
forecast_accuracy_lm = lynx_predict_lm %>%
group_by(model)%>%
filter(data_type=="testing")%>%
summarize(out_of_sample_r2 = round(cor(log(pop_est),log(population))^2,2))
print(forecast_accuracy_lm)
ggplot(aes(x=year, y=population), data= lynx_predict_lm)+
geom_point()+
geom_line()+
geom_line(aes(y=pop_est,color=model))+
scale_color_brewer(palette="Set1")+
facet_grid(.~data_type,scales = "free_x")+
theme_bw()
```
Note that the AIC table shows that adding each lag substantially reduced model
AIC, up to lag 4. Some work on this time series shows that you need up to 11
(!!) linear lag terms to fit the model effectively. Even the long-lagged models
under-estimate population densities when the cycle is at a peak, and seem to
mis-estimate the period of the cycle; the predicted peaks in the testing data
occur 1-2 years after they do in the observed data. Further, looking at the
out-of-sample estimates of $R^2$, $R^2$ is highest for the two-lag model.
### Fitting and testing non-linear models
Now let's look at how adding non-linear lag terms improves the model. We've
constrained the degrees of freedom of each smooth term because there are
complex correlations between the lags, and giving the smooths too much freedom
can produce complex, difficult-to-interpret fits as well as poor out-of-sample
performance.
```{r, echo=T,tidy=F,include=T, message=FALSE,fig.width=12}
lynx_gam1 = gam(population~s(lag1,k=5),
data=lynx_train, family = "poisson", method="REML")
lynx_gam2 = update(lynx_gam1,population~s(lag1,k=5)+s(lag2,k=5))
lynx_gam3 = update(lynx_gam1,population~s(lag1,k=5)+s(lag2,k=5)+s(lag3,k=5))
lynx_gam4 = update(lynx_gam1,population~s(lag1,k=5)+s(lag2,k=5)+s(lag3,k=5)+
s(lag4,k=5))
round(AIC(lynx_gam1,lynx_gam2,lynx_gam3,lynx_gam4))
lynx_predict_gam = mutate(
lynx_full,
lag1_model = as.vector(exp(predict(lynx_gam1,lynx_full))),
lag2_model = as.vector(exp(predict(lynx_gam2,lynx_full))),
lag3_model = as.vector(exp(predict(lynx_gam3,lynx_full))),
lag4_model = as.vector(exp(predict(lynx_gam4,lynx_full))),
data_type = factor(ifelse(year<1895, "training","testing"),
levels= c("training","testing"))
)
#Gathers all of the predictions into two columns: the model name and the predicted value
lynx_predict_gam = gather(lynx_predict_gam,key= model, value= pop_est,
lag1_model:lag4_model)%>%
mutate(pop_est = as.numeric(pop_est))
forecast_accuracy_gam = lynx_predict_gam %>%
group_by(model)%>%
filter(data_type=="testing")%>%
summarize(out_of_sample_r2 = round(cor(log(pop_est),log(population))^2,2))
print(forecast_accuracy_gam)
ggplot(aes(x=year, y=population), data= lynx_predict_gam)+
geom_point()+
geom_line()+
geom_line(aes(y=pop_est,color=model))+
scale_color_brewer(palette="Set1")+
facet_grid(.~data_type, scales="free_x")+
theme_bw()
```
The nonlinear models capture the timing and magnitude of the peaks in the
cycle more effectively than any of the linear models, and explain more of the
out-of-sample variance, with the 2-lag model accounting for 86% of it.
```{r, echo=T,tidy=F,include=T, message=FALSE}
plot(lynx_gam2, page=1,scale=0)
```
Our best-fitting model, the 2-lag nonlinear model, indicates that log-lynx
densities increase roughly linearly with the prior year's log-density, but
decrease nonlinearly when population densities two years previous are high,
decreasing at a faster rate when (log) densities two years previous are very
high. You'll see this sort of lagged response in predator-prey cycles.
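One way to view the two lag effects together (a sketch; `vis.gam` shows the
combined additive surface on the response scale):
```{r, eval=FALSE}
# sketch: joint (additive) effect of lag1 and lag2 on predicted population
vis.gam(lynx_gam2, view = c("lag1", "lag2"), plot.type = "contour",
        type = "response")
```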
### Exercises:
1. Focusing on the two-lag model: is there any evidence of an interaction
between the population at lag 1 and the population at lag 2? Make sure to choose
appropriate interaction terms, keeping in mind what you know about the data.
2. We ignored confidence intervals on our forecasts. For the 2-lag linear and
nonlinear models, construct appropriate confidence intervals and determine
whether the observed out-of-sample population densities fall within them.
3. We assumed that errors were Poisson-distributed for this exercise. However,
we know many natural populations are over-dispersed; that is, they show more
variation than we'd expect from a Poisson distribution. Choose a distribution to
model this extra variation. How does it affect your model estimates, and the
estimated degree of non-linearity present?
4. CHALLENGING: We focused, for simplicity, on one-step-ahead forecasting.
However, we often want to forecast over longer time horizons. If you have the
time, try figuring out how you can use this model to forecast whole trajectories
of the out-of-sample data. Do the forecast trajectories for the linear or
non-linear models have the same dynamic properties as the observed data? Hint:
this will likely require a for-loop, and you'll have to calculate new lagged
terms at each step (one possible skeleton is sketched below).
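For exercise 4, one possible loop skeleton (a sketch, not a full solution; it
assumes the 2-lag nonlinear model and the columns defined above):
```{r, eval=FALSE}
# sketch: iterated (multi-step) forecasting with lynx_gam2 -- feed each
# prediction back in as the next step's lag values
horizon = nrow(lynx_test)
traj = numeric(horizon)
lag1 = lynx_test$lag1[1]   # log-densities observed just before the test period
lag2 = lynx_test$lag2[1]
for (i in 1:horizon) {
  pred = as.numeric(predict(lynx_gam2,
                            newdata = data.frame(lag1 = lag1, lag2 = lag2)))
  traj[i] = exp(pred)      # predictions are on the log (link) scale
  lag2 = lag1              # shift the lags forward one step
  lag1 = pred
}
```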