---
title: "Input-Output Hidden Markov Model applied to financial time series"
author: "Luis Damiano, Brian Peterson, Michael Weylandt"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    number_sections: true
    self_contained: true
    includes:
      in_header: ../common/Rmd/preamble-html.Rmd
vignette: >
  %\VignetteIndexEntry{Input-Output Hidden Markov Model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: ../common/references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, fig.width = 7.2)
```
```{r echo = FALSE, results="asis"}
cat("<style>img{border: 0px !important;}</style>")
```
This work aims at replicating the Input-Output Hidden Markov Model (IOHMM) originally proposed by @hassan2005stock to forecast stock prices. The main goal is to produce public programming code in [Stan](http://mc-stan.org/) [@carpenter2016stan] for a fully Bayesian estimation of the model parameters and inference on hidden quantities, namely filtered state belief, smoothed state belief, jointly most probable state path and fitted output. The model is introduced only briefly, a more detailed mathematical treatment can be found in our [literature review](https://github.com/luisdamiano/gsoc17-hhmm/blob/master/litreview/main.pdf).
The authors acknowledge Google for financial support via the Google Summer of Code 2017 program.
---
# The Input-Output Hidden Markov Model
The IOHMM is an architecture proposed by @bengio1995input to map input sequences, sometimes called the control signal, to output sequences. It is a probabilistic framework that can deal with general sequence processing tasks such as production, classification, or prediction. The main difference with Hidden Markov Models (HMM), which belong to the unsupervised learning paradigm, is that the IOHMM learns to map the input sequence to the output sequence itself rather than learning only the distribution of the output sequence.
## Definitions
As with HMM, IOHMM involves two interconnected models,
\begin{align*}
z_{t} &= f(z_{t-1}, \mat{u}_{t}) \\
\mat{x}_{t} &= g(z_{t }, \mat{u}_{t}).
\end{align*}
The first line corresponds to the state model, which consists of discrete-time, discrete-state hidden states $z_t \in \{1, \dots, K\}$ whose transition depends on the previous hidden state $z_{t-1}$ and the input vector $\mat{u}_{t} \in \RR^M$. Additionally, the observation model is governed by $g(z_{t}, \mat{u}_{t})$, where $\mat{x}_t \in \RR^R$ is the vector of observations, emissions or output. The corresponding joint distribution is
\[
p(\mat{z}_{1:T}, \mat{x}_{1:T} | \mat{u}_{t}).
\]
In the proposed parametrization with continuous inputs and outputs, the state model involves a multinomial regression whose parameters depend on the previous state taking the value $i$,
\[
p(z_t | \mat{u}_{t}, z_{t-1} = i) = \text{softmax}(\mat{w}_i \mat{u}_{t}),
\]
and the observation model is built upon the Gaussian density with parameters depending on the current state taking the value $j$,
\[
p(\mat{x}_t | \mat{u}_{t}, z_{t} = j) = \mathcal{N}(\mat{x}_t | \mat{\mu}_j, \mat{\Sigma}_j).
\]
IOHMM adapts the logic of HMM to allow for input and output vectors, retaining its fully probabilistic quality. Hidden states are assumed to follow a multinomial distribution that depends on the input sequence. The transition probabilities $\Psi_t(i, j) = p(z_t = j | z_{t-1} = i, \mat{u}_{t})$, which govern the state dynamics, are driven by the control signal as well.
As for the output sequence, the local evidence at time $t$ now becomes $\psi_t(j) = p(\mat{x}_t | z_t = j, \mat{\eta}_t)$, where $\mat{\eta}_t = \ev{\mat{x}_t | z_t, \mat{u}_t}$ can be interpreted as the expected location parameter for the probability distribution of the emission $\mat{x}_{t}$ conditional on the input vector $\mat{u}_t$ and the hidden state $z_t$.
## Inference
There are several quantities of interest to be estimated via different algorithms. In this section, the discussion assumes that model parameters are known.
### Filtering
A filter infers the belief state at a given step based on all the information available up to that point,
\begin{align*}
\alpha_t(j)
& \triangleq p(z_t = j | \mat{x}_{1:t}, \mat{u}_{1:t}).
\end{align*}
It achieves better noise reduction than simply estimating the hidden state based on the current estimate $p(z_t | \mat{x}_{t})$. The filtering process can be run online, or recursively, as new data streams in. Filtered marginals can be computed recursively by means of the forward algorithm [@baum1967inequality].
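For reference, a minimal R sketch of this recursion follows. It assumes that the input-dependent transition matrices $\Psi_t(i, j)$ and the local evidence $\psi_t(j)$ have already been evaluated and are passed as an array `Psi` of dimension $T \times K \times K$ and a matrix `psi` of dimension $T \times K$; the implementation actually used in this work is written in Stan.
```{r}
# Minimal sketch of the forward filter, assuming Psi[t, , ] holds the
# input-dependent transition matrix at time t and psi[t, ] the local evidence.
forward_filter <- function(Psi, psi, p1) {
  T.length <- nrow(psi)
  K <- ncol(psi)
  alpha <- matrix(0, T.length, K)
  alpha[1, ] <- p1 * psi[1, ]
  alpha[1, ] <- alpha[1, ] / sum(alpha[1, ])
  for (t in 2:T.length) {
    alpha[t, ] <- (alpha[t - 1, ] %*% Psi[t, , ]) * psi[t, ]
    alpha[t, ] <- alpha[t, ] / sum(alpha[t, ])  # normalize to prevent underflow
  }
  alpha
}
```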
### Smoothing
A smoother infers the belief state at a given step based on all the observations or evidence,
\begin{align*}
\gamma_t(j)
& \triangleq p(z_t = j | \mat{x}_{1:T}, \mat{u}_{1:T}) \\
& \propto \alpha_t(j) \beta_t(j),
\end{align*}
where
\begin{align*}
\beta_{t-1}(i)
& \triangleq p(\mat{x}_{t:T} | z_{t-1} = i, \mat{u}_{t:T}).
\end{align*}
Although noise and uncertainty are reduced as a result of conditioning on both past and future data, the smoothing process can only be run offline. Inference can be run by means of the forwards-backwards algorithm, also known as the Baum-Welch algorithm [@baum1967inequality; @baum1970maximization].
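A matching sketch of the backwards recursion and of the resulting smoothed probabilities, under the same conventions as the forward sketch above:
```{r}
# Minimal sketch of the backwards recursion; gamma_t(j) is obtained by
# combining the forward and backward quantities and renormalizing.
backward_smooth <- function(Psi, psi, alpha) {
  T.length <- nrow(psi)
  K <- ncol(psi)
  beta <- matrix(0, T.length, K)
  beta[T.length, ] <- 1
  for (t in (T.length - 1):1) {
    beta[t, ] <- Psi[t + 1, , ] %*% (psi[t + 1, ] * beta[t + 1, ])
    beta[t, ] <- beta[t, ] / sum(beta[t, ])  # rescale to prevent underflow
  }
  gamma <- alpha * beta
  gamma / rowSums(gamma)
}
```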
### Most likely hidden path
It is also of interest to compute the most probable state sequence or path,
\[
\mat{z}^* = \argmax_{\mat{z}_{1:T}} p(\mat{z}_{1:T} | \mat{x}_{1:T}).
\]
The jointly most probable sequence of states can be inferred by means of maximum a posteriori (MAP) estimation or Viterbi decoding.
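A minimal sketch of Viterbi decoding in log scale, again under the conventions of the previous sketches; the Stan programs used later implement the same logic.
```{r}
# Minimal sketch of Viterbi decoding: delta accumulates the best
# log-probability of any state path ending in state j at time t.
viterbi_decode <- function(Psi, psi, p1) {
  T.length <- nrow(psi)
  K <- ncol(psi)
  delta <- matrix(-Inf, T.length, K)
  backptr <- matrix(0L, T.length, K)
  delta[1, ] <- log(p1) + log(psi[1, ])
  for (t in 2:T.length) {
    for (j in 1:K) {
      cand <- delta[t - 1, ] + log(Psi[t, , j]) + log(psi[t, j])
      backptr[t, j] <- which.max(cand)
      delta[t, j] <- max(cand)
    }
  }
  zstar <- integer(T.length)
  zstar[T.length] <- which.max(delta[T.length, ])
  for (t in (T.length - 1):1)
    zstar[t] <- backptr[t + 1, zstar[t + 1]]
  zstar
}
```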
## Parameter estimation
The parameters of the models are $\mat{\theta} = (\mat{\pi}_1, \mat{\theta}_h, \mat{\theta}_o)$, where $\mat{\pi}_1$ is the initial state distribution, $\mat{\theta}_h$ are the parameters of the hidden model and $\mat{\theta}_o$ are the parameters of the state-conditional density function $p(\mat{x}_t | z_t = j, \mat{u}_t)$. The form of $\mat{\theta}_h$ and $\mat{\theta}_o$ depends on the specification of each model. For example, state transition may be characterized by a logistic or multinomial regression with parameters $\mat{w}_k$ for $k \in \{1, \dots, K\}$, while emissions may be modeled by a linear regression with Gaussian error with parameters $\mat{b}_k$ and $\mat{\Sigma}_k$ for $k \in \{1, \dots, K\}$.
---
# Learning by simulation
Model complexity and limited software availability require that we implement our sampler from scratch. Following a very simplified version of the methodology proposed by @cook2006validation, we first create a simulation routine that generates data complying with the model assumptions and then write an MCMC sampler in Stan to recover the parameters and other hidden quantities. Only after the correctness of our software has been tested do we feed our model with real data and analyse the results.
We believe that learning by simulation has many benefits, including:
* Confirmation that the software developed to implement the algorithms works properly, i.e. that it retrieves the parameters used to generate the data.
* Enhanced interpretation of the results and deeper understanding of the model behaviour, especially because the generated data meets all the underlying assumptions and the estimates are free of the effects of data contamination originated by unmodeled phenomena.
* The possibility of calibrating different combinations of values for the parameters, the inputs and the outputs helps develop intuition and insight. This may prove valuable for the data analysis stage.
## Auxiliary files
### Math functions
We write an auxiliary R function to compute the softmax transformation of a given vector. The calculations are run in log scale for greater numerical stability, i.e. to avoid any overflow.
```{r}
source('../common/R/math.R')
```
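For reference, a minimal sketch of such a numerically stable softmax follows; it is named `softmax_sketch` here to avoid masking the sourced function, whose implementation details may differ.
```{r}
# Sketch of a log-scale softmax: subtracting the log-sum-exp (computed with
# the max shifted out) avoids overflow when exponentiating.
softmax_sketch <- function(v) {
  logsumexp <- max(v) + log(sum(exp(v - max(v))))
  exp(v - logsumexp)
}
```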
### Plot functions
As plots are extensively used, we arrange the code in an auxiliary file.
```{r}
source('../common/R/plots.R')
```
### Simulation functions
We arrange the code to generate simulated data, as explained below, in an auxiliary file.
```{r}
source('../iohmm-reg/R/iohmm-sim.R')
```
## Generative model
### Data simulation
We first write an R function for our generative model. The arguments are the sequence length $T$, the number of discrete hidden states $K$, the input matrix $\mat{u}$, a matrix $\mat{w}$ with the parameters of the multinomial regression that drives the hidden-state dynamics, the initial state distribution vector $\mat{\pi}_1$, the name of a function drawing samples from the observation distribution, and its arguments.
```{r}
iohmm_sim <- function(T, K, u, w, p.init, obs.model, obs.pars) {
  m <- ncol(u)
  if (dim(u)[1] != T)
    stop("The input matrix must have T rows.")
  if (any(dim(w) != c(K, m)))
    stop("The transition weight matrix must be of size Kxm, where m is the size of the input vector.")
  if (length(p.init) != K)
    stop("The vector p.init must have length K.")

  p.mat <- matrix(0, nrow = T, ncol = K)
  p.mat[1, ] <- p.init

  z <- vector("numeric", T)
  z[1] <- sample(x = 1:K, size = 1, replace = FALSE, prob = p.init)
  for (t in 2:T) {
    p.mat[t, ] <- softmax(sapply(1:K, function(j) {u[t, ] %*% w[j, ]}))
    z[t] <- sample(x = 1:K, size = 1, replace = FALSE, prob = p.mat[t, ])
  }

  x <- do.call(obs.model, c(list(u = u, z = z), obs.pars))

  list(
    u = u,
    z = z,
    x = x,
    p.mat = p.mat
  )
}
```
The initial hidden state is drawn from a multinomial distribution with one trial and event probabilities given by the initial state probability vector $\mat{\pi}_1$. The transition probabilities for each of the following steps $t \in \{2, \dots, T\}$ are generated from a multinomial regression with vector parameters $\mat{w}_k$, one set per possible hidden state $k \in \{1, \dots, K\}$, and covariates $\mat{u}_t$. The hidden states are subsequently sampled based on these transition probabilities.
The observation at each step may be generated from a linear regression with parameters $\mat{b}_k$ and $\mat{\Sigma}_k$, one set per possible hidden state.
```{r}
obsmodel_reg <- function(...) {
  args <- list(...)
  u <- args$u
  z <- args$z
  b <- args$b
  s <- args$s

  K <- length(unique(z))
  m <- ncol(u)
  if (any(dim(b) != c(K, m)))
    stop("The regressors matrix must be of size Kxm, where m is the size of the input vector.")

  T.length <- nrow(u)
  x <- vector("numeric", T.length)
  for (t in 1:T.length) {
    x[t] <- rnorm(1, u[t, ] %*% b[z[t], ], s[z[t]])
  }

  return(x)
}
```
Alternatively, the observation could originate in a Gaussian mixture density where $\lambda_{kl}$ is the component weight for the $l$-th component within the $k$-th state, $0 \le \lambda_{kl} \le 1 \ \forall \ l \in \{1, \dots, L\}, k \in \{1, \dots, K\}$ and $\sum_{l=1}^{L}{\lambda_{kl}} = 1 \ \forall \ k$.
```{r}
obsmodel_mix <- function(...) {
  args <- list(...)
  z <- args$z
  lambda <- args$lambda
  mu <- args$mu
  s <- args$s

  K <- length(unique(z))
  if (!all(c(nrow(lambda), nrow(mu), nrow(s)) == K))
    stop("The parameter matrices lambda, mu and s must have one row per state.")

  T.length <- length(z)
  L <- ncol(lambda)
  x <- vector("numeric", T.length)
  for (t in 1:T.length) {
    l <- sample(1:L, 1, FALSE, prob = lambda[z[t], ])
    x[t] <- rnorm(1, mu[z[t], l], s[z[t], l])
  }

  return(x)
}
```
We set up our simulated experiments for a regression observation model with arbitrary values for all the involved parameters. Additionally, we define the settings for the Markov Chain Monte Carlo sampler.
```{r}
# Data
T.length = 300
K = 3
M = 4
R = 1
u.intercept = FALSE
w = matrix(
c(1.2, 0.5, 0.3, 0.1, 0.5, 1.2, 0.3, 0.1, 0.5, 0.1, 1.2, 0.1),
nrow = K, ncol = M, byrow = TRUE)
b = matrix(
c(5.0, 6.0, 7.0, 0.5, 1.0, 5.0, 0.1, -0.5, 0.1, -1.0, -5.0, 0.2),
nrow = K, ncol = M, byrow = TRUE)
s = c(0.2, 1.0, 2.5)
p1 = c(0.4, 0.2, 0.4)
# Markov Chain Monte Carlo
n.iter = 400
n.warmup = 200
n.chains = 1
n.cores = 1
n.thin = 1
n.seed = 9000
```
We anticipate identification issues in our sampler given the nature of the model. As an arguable simplification with obvious drawbacks, we decide to rely only on one chain to avoid between-chain label switching of regression parameters. We refer the reader to @betancourt2017identifying for an in-depth treatment of the diagnostics, causes and possible solutions for label switching in Bayesian Mixture Models.
We draw random inputs from a standard Gaussian distribution and generate the dataset accordingly.
```{r}
set.seed(n.seed)
u <- matrix(rnorm(T.length*M, 0, 1), nrow = T.length, ncol = M)
if (u.intercept)
u[, 1] = 1
dataset <- iohmm_sim(T.length, K, u, w, p1, obsmodel_reg, list(b = b, s = s))
```
```{r, fig.height = 5, out.width="\\textwidth"}
plot_inputoutput(x = dataset$x, u = dataset$u, z = dataset$z)
```
We observe how the chosen values for the parameters affect the generated data. For example, the relationship between the third input $\mat{u}_3$ and the output $\mat{x}$ is positive, virtually null and negative for the hidden states 1 through 3, the true slopes being `r b[1, 3]`, `r b[2, 3]` and `r b[3, 3]` respectively.
```{r, fig.height = 6, out.width="\\textwidth"}
plot_inputprob(u = dataset$u, p.mat = dataset$p.mat, z = dataset$z)
```
We then analyse the relationship between the input and the state probabilities, which are usually hidden in applications with real data. The pairs $\{ \mat{u}_1, p(z_t = 1) \}$, $\{ \mat{u}_2, p(z_t = 2) \}$ and $\{ \mat{u}_3, p(z_t = 3) \}$ are most related due to the choice of values for the true regression parameters: those inputs take the largest weight in each state, namely $w_{11} = `r w[1, 1]`$, $w_{22} = `r w[2, 2]`$ and $w_{33} = `r w[3, 3]`$.
### Estimation
Adopting a fully Bayesian approach, we run our software to draw samples from the posterior density of model parameters and other hidden quantities.
```{r}
library(rstan)
library(shinystan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
stan.model = '../iohmm-reg/stan/iohmm-reg.stan'
stan.data = list(
T = T.length,
K = K,
M = M,
u_tm = as.array(u),
x_t = dataset$x
)
stan.fit <- stan(file = stan.model,
data = stan.data, verbose = T,
iter = n.iter, warmup = n.warmup,
thin = n.thin, chains = n.chains,
cores = n.cores, seed = n.seed)
n.samples = (n.iter - n.warmup) * n.chains
```
### Convergence
We rely on several diagnostic statistics and plots provided by rstan [@sdt2017rstan] and shinystan [@sdt2017shinystan] to assess mixing, convergence and the absence of divergences. We then extract the samples for some quantities of interest, namely the filtered probability vector $\mat{\alpha}_t$, the smoothed probability vector $\mat{\gamma}_t$, the most probable hidden path $\mat{z}^*$ and the fitted output $\hat{x}$.
```{r, message = FALSE, results = 'hide'}
# MCMC Diagnostics --------------------------------------------------------
summary(stan.fit,
pars = c('p_1k', 'w_km', 'b_km', 's_k'),
probs = c(0.50))$summary
# launch_shinystan(stan.fit)
# Extraction --------------------------------------------------------------
alpha_tk <- extract(stan.fit, pars = 'alpha_tk')[[1]]
gamma_tk <- extract(stan.fit, pars = 'gamma_tk')[[1]]
zstar_t <- extract(stan.fit, pars = 'zstar_t')[[1]]
hatx_t <- extract(stan.fit, pars = 'hatx_t')[[1]]
```
```{r, echo = FALSE}
knitr::kable(summary(stan.fit,
pars = c('p_1k', 'w_km', 'b_km', 's_k'),
probs = c(0.50))$summary,
col.names = c("Mean", "MCSE", "SE", "Med", "ESS", "$\\hat{R}$"),
digits = 2, align = "r", caption = "Estimated parameters and hidden quantities. *MCSE = Monte Carlo Standard Error, SE = Standard Error, Med = Median, ESS = Effective Sample Size*.")
```
While mixing and convergence are extremely efficient, as expected when dealing with generated data, we note that the regression parameters for the latent states are the worst performers. Their small effective sample size translates into a high Monte Carlo standard error and broader credibility intervals. One possible reason is that the softmax is invariant to shifts in location, thus the parameters of a multinomial regression do not have a natural center and become harder to estimate.
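A quick numerical check of this invariance, using the `softmax` function sourced earlier (the test vector is arbitrary):
```{r}
# Adding a constant to every element of the linear predictor leaves the
# softmax probabilities unchanged, so the weights lack a natural center.
eta <- c(1.2, 0.5, 0.3)
all.equal(softmax(eta), softmax(eta + 10))
```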
### State probability
We assess whether our software recovers the hidden states correctly. Due to label switching, the states generated under the labels 1 through 3 were recovered in inverse order. Consequently, we relabel the observations based on the best fit. This would not be a problem with real data, since the hidden states are never observed.
```{r}
# Relabelling (ugly hack edition) -----------------------------------------
dataset$zrelab <- rep(0, T.length)

hard <- sapply(1:T.length, function(t, med) {
  which.max(med[t, ])
}, med = apply(alpha_tk, c(2, 3), function(x) { quantile(x, c(0.50)) }))

tab <- table(hard = hard, original = dataset$z)
for (k in 1:K) {
  dataset$zrelab[which(dataset$z == k)] <- which.max(tab[, k])
}
print("Label re-imputation (relabeling due to switching labels)")
table(new = dataset$zrelab, original = dataset$z)
```
Point estimates and credibility intervals are provided by rstan's `summary` function.
```{r, eval = FALSE}
print("Estimated initial state probabilities")
summary(stan.fit,
pars = c('p_1k'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)]
print("Estimated probabilities in the transition matrix")
head(summary(stan.fit,
pars = c('A_ij'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)])
print("Estimated regressors of hidden states")
summary(stan.fit,
pars = c('w_km'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)]
print("Estimated regressors and standard deviation of observations in each state")
summary(stan.fit,
pars = c('b_km', 's_k'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)]
```
```{r, echo = FALSE}
knitr::kable(summary(stan.fit,
pars = c('p_1k'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)],
col.names = c("Mean", "SE", "10%", "Med", "90%"),
digits = 2, align = "r", caption = "Estimated initial state probabilities.")
knitr::kable(head(summary(stan.fit,
pars = c('A_ij'),
probs = c(0.10, 0.50, 0.90))$summary)[, c(1, 3, 4, 5, 6)],
col.names = c("Mean", "SE", "10%", "Med", "90%"),
digits = 2, align = "r", caption = "Estimated probabilities in the transition matrix.")
knitr::kable(summary(stan.fit,
pars = c('w_km'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)],
col.names = c("Mean", "SE", "10%", "Med", "90%"),
digits = 2, align = "r", caption = "Estimated regressors of hidden states.")
knitr::kable(summary(stan.fit,
pars = c('b_km', 's_k'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)],
col.names = c("Mean", "SE", "10%", "Med", "90%"),
digits = 2, align = "r", caption = "Estimated regression parameters and standard deviation of observations in each state.")
```
The samples drawn from the posterior density are reasonably close to the true values. We additionally check the estimated hidden quantities.
```{r, fig.height = 6, out.width="\\textwidth"}
plot_stateprobability(alpha_tk, gamma_tk, 0.8, dataset$zrelab)
```
```{r, eval = FALSE}
print("Estimated hidden states (hard naive classification using filtered probabilities)")
print(table(
estimated = apply(round(apply(alpha_tk, c(2, 3),
function(x) {
quantile(x, c(0.50)) })), 1, which.max),
real = dataset$zrelab))
```
```{r, echo = FALSE}
knitr::kable(table(
estimated = apply(round(apply(alpha_tk, c(2, 3),
function(x) {
quantile(x, c(0.50)) })), 1, which.max),
real = dataset$zrelab),
col.names = c("Real 1", "Real 2", "Real 3"),
row.names = TRUE,
digits = 2, align = "r", caption = "Hard classification.")
```
The filtered probability is remarkably accurate, as expected, making the additional information contained in the smoothed probability seem of little value. The confusion matrix makes it evident that, under ideal conditions, the sampler works as intended. Similarly, the Viterbi algorithm recovers the expected most probable hidden path.
```{r, fig.height = 6, out.width="\\textwidth"}
plot_statepath(zstar_t, dataset$zrelab)
```
```{r, eval = FALSE}
print("Estimated hidden states for the jointly most likely path (Viterbi decoding)")
round(table(
actual = rep(dataset$zrelab, each = n.samples),
fit = zstar_t) / n.samples, 0)
```
```{r, echo = FALSE}
knitr::kable(round(table(
  actual = rep(dataset$zrelab, each = n.samples),
  fit = zstar_t) / n.samples, 0),
  digits = 2, align = "r", caption = "Viterbi decoding.")
```
### Fitted output
```{r, fig.height = 6, out.width="\\textwidth"}
plot_outputfit(dataset$x, hatx_t, z = dataset$zrelab, TRUE)
```
All in all, we are confident that the Stan code works as specified and we find model calibration satisfactory.
```{r, message = FALSE, echo = FALSE, results = 'hide'}
rm(list = ls())
gc()
```
---
# Stock Market Forecasting Using Hidden Markov Model
## Preamble
We first load the packages and source auxiliary functions.
```{r, message = FALSE, warning = FALSE}
library(quantmod)
library(rstan)
library(shinystan)
source('R/data.R')
source('R/forecast.R')
source('R/wf-forecast.R')
source('../common/R/math.R')
source('../common/R/plots.R')
source('../iohmm-mix/R/iohmm-mix-init.R')
```
## Model
We adhere to the methodology proposed in the original work as much as possible. We set up an IOHMM with $K = 4$ hidden states to forecast stock prices. The hidden dynamics are driven by a multinomial (softmax) regression with $M = 4$ inputs, namely the opening, closing, highest and lowest prices from the previous time step. The univariate output is the stock closing price for the current time step. The observation model is a Gaussian mixture with $L = 3$ components per state,
\[
p(x_t | z_{t} = j) = \sum_{l = 1}^{L}{\lambda_{jl} \ \NN(x_t | \mu_{jl}, \sigma_{jl})},
\]
where $\lambda_{jl}$ is the component weight for the $l$-th component within the $j$-th state, $0 \le \lambda_{jl} \le 1 \ \forall \ l, j$ and $\sum_{l=1}^{L}{\lambda_{jl}} = 1 \ \forall \ j$.
The vector parameter then becomes $\mat{\theta} = (\mat{\pi}_1, \mat{w}_k, \lambda_{kl}, \mu_{kl}, \sigma_{kl})$ for $k \in \{1, \dots, K\}$ and $l \in \{1, \dots, L\}$.
```{r}
K = 4
L = 3
```
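As a point of reference, the state-conditional density above can be evaluated with a one-liner; the sketch below assumes hypothetical parameter matrices `lambda`, `mu` and `sigma` of dimension $K \times L$, with one row per state and one column per component.
```{r}
# Sketch: mixture density p(x_t | z_t = j) for hypothetical K x L parameter
# matrices lambda (weights), mu (means) and sigma (standard deviations).
mixture_density <- function(x, j, lambda, mu, sigma) {
  sum(lambda[j, ] * dnorm(x, mean = mu[j, ], sd = sigma[j, ]))
}
```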
## The sampler
We use Stan's Hamiltonian Monte Carlo to run a fully Bayesian estimation of the model parameters. Additionally, we compute many hidden quantities, including the forward probability, the future likelihood, the jointly most likely path and transition probabilities. We draw samples from the hidden state distribution as well. To this end, we write our own implementation of the forward filter, the forwards-backwards filter and the Viterbi decoding algorithms. We acknowledge the authors of the Stan Manual [@team2017stan] for the numerous examples, some of which served as a starting point for our code.
We define the following parameters for the MCMC sampler. The chosen hyperparameters correspond to diffuse priors.
```{r}
hyperparams <- c(0, 5, 5, 0, 3, 1, 1, 0, 5);
n.iter = 800
n.warmup = 400
n.chains = 1
n.cores = 1
n.thin = 1
n.seed = 9000
```
## Dataset & data transformation
We use daily OHLC prices for the stocks and time spans listed in the original paper. The data is downloaded from Yahoo Finance via quantmod. We omit two delisted stocks as their series are no longer available in public sources such as Yahoo and Google Finance.
```{r}
syms <- data.frame(
symbol = c("LUV", "RYA.L"),
name = c("Southwest Airlines Co", "Ryanair Holdings Plc"),
train.from = c("2002-12-18", "2003-05-06"),
train.to = c("2004-07-23", "2004-12-06"),
test.from = c("2004-07-24", "2004-12-07"),
test.to = c("2004-11-17", "2005-03-17"),
src = c("yahoo", "yahoo"),
stringsAsFactors = FALSE)
```
```{r, echo = FALSE}
library(knitr)
knitr::kable(t(sapply(1:nrow(syms), function(i) { c(
syms$symbol[i],
syms$name[i],
paste(syms$train.from[i], "to", syms$train.to[i]),
paste(syms$test.from[i] , "to", syms$test.to[i])
)})),
col.names = c("Symbol", "Name", "Training set", "Test set"),
align = "c", caption = "Details of the dataset.")
```
We found that feeding the raw dataset to our sampler led to convergence issues. We learnt by trial and error that centering and rescaling the variables not only improved convergence but also sped up the software by a factor of five.
```{r}
make_dataset <- function(prices, scale = TRUE) {
  T.length = nrow(prices)
  x = as.vector(Cl(prices))[2:T.length]
  u = as.matrix(prices)[1:(T.length - 1), 1:4]

  dataset <- list(
    x = as.vector(x),
    u = as.matrix(u),
    x.unscaled = as.vector(x),
    u.unscaled = as.matrix(u),
    x.center = 0,
    x.scale = 1,
    u.center = 0,
    u.scale = 1
  )

  if (scale) {
    x <- scale(x, TRUE, TRUE)
    u <- scale(u, TRUE, TRUE)

    dataset$x <- as.vector(x)
    dataset$u <- as.matrix(u)
    dataset$x.center <- attr(x, "scaled:center")
    dataset$x.scale <- attr(x, "scaled:scale")
    dataset$u.center <- attr(u, "scaled:center")
    dataset$u.scale <- attr(u, "scaled:scale")
  }

  dataset
}
```
## Methodology
The goal of the experiment is to predict the closing price for a specific stock market share. According to the proposed methodology, we estimate the model parameters $\mat{\hat{\theta}}$ using all the information available up to a given step. We compute the likelihood of the last observation in the training set conditional on the trained parameters $p(\mat{x}_{t} | \mat{\hat{\theta}})$ and we locate the observation in the training set with the nearest likelihood,
\[
t^* = \argmin_{s} \ \left| p(\mat{x}_{t} | \mat{\hat{\theta}}) - p(\mat{x}_{s} | \mat{\hat{\theta}}) \right|.
\]
We compute the change in the closing price from the located time step to the next one,
\[
\Delta_{t^*} = \mat{x}_{t^*+1} - \mat{x}_{t^*},
\]
and build our forecast assuming that a similar pattern will repeat
\[
\mat{\hat{x}}_{t+1} = \mat{x}_{t} + \Delta_{t^*}.
\]
```{r}
neighbouring_forecast <- function(x, oblik_t, h = 1, threshold = 0.05) {
  if (!is.vector(x) || length(x) != dim(oblik_t)[2])
    stop("The size of the observation vector and the width of the likelihood
         array must be equal.")

  n.samples <- dim(oblik_t)[1]
  T.length <- dim(oblik_t)[2]

  find_closest <- function(target, candidates, threshold) {
    ind <- which(abs(target - candidates) < abs(target) * threshold)
    if (!length(ind))
      ind <- which(abs(target - candidates) == min(abs(target - candidates)))
    ind
  }

  forecast = vector("numeric", n.samples)
  for (n in 1:n.samples) {
    oblik_target <- oblik_t[n, T.length]
    oblik_cand <- oblik_t[n, 1:(T.length - h)]
    closests <- find_closest(oblik_target, oblik_cand, threshold)
    d <- abs(oblik_target - oblik_t[n, closests])
    w <- exp(d)
    forecast[n] <- x[T.length] + sum((x[closests + h] - x[closests]) * w) / sum(w)
  }

  forecast
}
```
Although the authors state that one or more observations in the training dataset are selected, the paper provides no details about how many neighbouring observations are chosen, how they are selected and how they are combined. In our implementation, we locate all the observations whose estimated likelihood lies within $\pm 5\%$ of the target observation likelihood.
\[
\left\{ \mat{x}_{s} : \ \left| p(\mat{x}_{t} | \mat{\hat{\theta}}) - p(\mat{x}_{s} | \mat{\hat{\theta}}) \right| < 0.05 \times p(\mat{x}_{t} | \mat{\hat{\theta}}) \right\}.
\]
If the set is empty, we use the past observation with the most similar likelihood instead.
The procedure is run independently for each stock.
## Southwest Airlines (LUV)
We present an in-depth study of one stock, LUV, to assess the strengths and weaknesses of the model. We later present a brief analysis of RYA.L.
### Data exploration
```{r, message = FALSE, warning = FALSE, fig.height = 6}
symbol <- syms[1, ] # LUV
prices <- getSymbols(symbol$symbol,
env = NULL,
from = symbol$train.from,
to = symbol$test.to,
src = symbol$src)
T.length <- nrow(prices[paste(symbol$train.from, symbol$train.to, sep = "/")])
dataset <- make_dataset(prices[1:T.length, ], TRUE)
plot_inputoutput(x = dataset$x, u = dataset$u,
x.label = symbol$symbol,
u.label = c("Open", "High", "Low", "Close"))
```
Unsurprisingly, OHLC prices follow an almost identical trend and are highly correlated with the next-day closing price. Nonetheless, we stress that the correlation of two non-stationary variables is an unreliable measure. Although it is customary to take first differences or apply other transformations to achieve stationarity, the price level is purposely retained in the input series.
### Estimation
We attempt to ease convergence by setting approximate initial values for the component parameters. Although nested k-means is not faithful to the model we replicate, we still find this simplification worthwhile.
```{r, warning = FALSE}
rstan_options(auto_write = TRUE)
stan.model = '../iohmm-mix/stan/iohmm-hmix.stan'
stan.data = list(
K = K,
L = L,
M = ncol(dataset$u),
T = length(dataset$x),
u_tm = as.array(dataset$u),
x_t = as.vector(dataset$x),
hyperparams = as.array(hyperparams)
)
stan.fit <- stan(file = stan.model,
data = stan.data, verbose = T,
iter = n.iter, warmup = n.warmup,
thin = n.thin, chains = n.chains,
cores = n.cores, seed = n.seed,
init = function() {init_fun(stan.data)})
n.samples = (n.iter - n.warmup) * n.chains
```
Long computational times are a forewarning of poor convergence. This may be specific to the characteristics of the selected price series; despite the similar length, RYA.L samples significantly faster, as shown below. As suggested by the warning message and the extended guidelines on the Stan website, the numerical problem is negligible and we do not take any further action.
### Convergence
Graphical diagnostics (not shown) make it evident that convergence to stationarity is weak at best. Label switching complicates the assessment since we cannot run and compare several chains, as widely suggested in the standard literature. On the other hand, the number of iterations chosen seems to be enough to allow convergence of averages, as assessed by the cumulative averages curve. Nonetheless, this does not compensate for the fragile convergence to stationarity.
Relying on the shrink factor of @gelman1992inference, we identify that the component means, standard deviations and weights are the quantities that converge most slowly. More worryingly, the effective sample size of some mixture parameters lies below 10\% of the number of iterations. Analogously, the Monte Carlo standard error exceeds 10\% of the estimated standard error. Although thinning could reduce autocorrelation, its purpose is computational, namely reducing the cost of storing and processing long sampling sequences, rather than statistical [@gelman2011inference]. The mean hyperparameters sample efficiently, a clear indication that the addition of a hierarchy is a pertinent improvement.
```{r}
knitr::kable(summary(stan.fit,
pars = c('p_1k', 'w_km', 'lambda_kl', 'hypermu_k', 'mu_kl', 's_kl'),
probs = c(0.10, 0.50, 0.90))$summary,
col.names = c("Mean", "MCSE", "SE", "10\\%", "Med", "90\\%", "ESS", "$\\hat{R}$"),
digits = 2, caption = "Summary of the posterior draws from the distribution of the estimated parameters.")
```
In general terms, convergence is just about satisfactory. As shown in the following sections, the fitted values are roughly consistent with the observations, hinting that the sampler adapts to the data even in cases where the diagnostics yield mixed results. Nonetheless, we interpret the results carefully and with skepticism. The fact that the mixture parameters are the most ill-behaved leads us to the hypothesis that the observation model would improve significantly with better domain knowledge. Furthermore, we suspect that setting the same number of hidden states for all the price series analysed in the original paper may prove excessively arbitrary and hinder sampling performance.
We extract sampled hidden quantities for later use.
```{r}
alpha_tk <- extract(stan.fit, pars = 'alpha_tk')[[1]]
gamma_tk <- extract(stan.fit, pars = 'gamma_tk')[[1]]
zstar_t <- extract(stan.fit, pars = 'zstar_t')[[1]]
hatx_t <- extract(stan.fit, pars = 'hatx_t')[[1]]
oblik_t <- extract(stan.fit, pars = 'oblik_t')[[1]]
logA_ij <- extract(stan.fit, pars = 'logA_ij')[[1]]
```
### State probability
The following plot shows the probability, along with its 80\% credible interval, that an observation belongs to one of the possible hidden states.
```{r, warning = FALSE, fig.height = 7}
plot_stateprobability(alpha_tk, gamma_tk, 0.8)
```
Since states are determined by the previous step's OHLC prices, filtered and smoothed probabilities are equivalent. We note that state membership naturally becomes persistent. Given the choice of input variables, state membership depends non-linearly on price levels and will remain stable as long as price changes are stationary. In some contexts, persistence may be regarded as a very valuable feature because latent states are expected to change disruptively yet sporadically. Conversely, Markov-switching dynamics may accommodate latent states that shift more often than domain knowledge suggests, which in turn may lead to overfitting. The jointly most probable hidden path, as decoded by the Viterbi algorithm, reflects the continuity of states across time.
```{r, fig.height = 3.5}
plot_statepath(zstar_t)
```
It is interesting to study in detail how the price level on a given day informs the closing price for the following time step. As expected, the softmax transformation establishes a non-linear relationship between OHLC prices and the membership probability. The latent variables have a natural and intuitive interpretation. States 2 and 3 represent the upper price levels and contain the observations that are considerably and slightly above the mean, respectively; the analogous interpretation applies to the remaining states. Following this intuition, the number of possible states $K$ should be chosen according to the observed number of levels in the input series. We notice that the different inputs have a similar impact on the state probability due to their high cross-correlation. Even though the thresholds are estimated by maximizing the log-posterior density, out-of-sample error measures would be used if forecasting were the main goal. We expand on this in the [discussion](#further-research).
```{r, warning = FALSE, fig.height = 7.5}
plot_inputprob(dataset$u.unscaled,
apply(logA_ij, c(2, 3), function(x) { median(exp(x)) }),
u.label = c("Open", "High", "Low", "Close"))
```
### Fitted output
The following plot shows the fitted output. We first draw the hidden state according to the probabilities estimated by the softmax regression applied to the previous day's OHLC vector. We then draw one component from that state and, finally, an observation from a Gaussian distribution with the corresponding mean and standard deviation.
```{r, fig.height = 4}
plot_outputfit(dataset$x, hatx_t, z = apply(zstar_t, 2, median), K = K)
```
It is important to note that the fitted observation is not the forecast itself. The methodology proposed by the original authors relies on the likelihood of the observations instead.
### In-sample summary
All in all, the in-sample fit can be summarized in the following plot.
```{r, fig.height = 7}
plot_inputoutputprob(x = dataset$x, u = dataset$u,
stateprob = alpha_tk, zstar = zstar_t,
x.label = symbol$symbol,
u.label = c("Open", "High", "Low", "Close"),
stateprob.label = bquote("Filtered prob" ~ p(z[t] ~ "|" ~ x[" " ~ 1:t])))
```
### Forecast
The methodology produces a forecast based on the estimated likelihood of the last observation. We write a walk forward procedure where the model is refitted each time a new observation is received. Since Stan does not have a natural way to update the log-density from a previous run, we implement a reduced version of our software that includes only the minimum computations needed for forecasting.
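Conceptually, each walk-forward step refits the model on the data observed so far and turns the estimated observation likelihoods into a one-step-ahead forecast. The following sketch, which is not evaluated, illustrates the idea using the `make_dataset` and `neighbouring_forecast` functions defined above; the actual `wf_forecast()` routine sourced earlier uses a reduced Stan program and additionally handles caching and parallelism.
```{r, eval = FALSE}
# Illustrative sketch of a walk-forward loop (not the actual implementation).
walk_forward_sketch <- function(prices, T.train, T.total) {
  lapply(T.train:(T.total - 1), function(t.now) {
    dataset <- make_dataset(prices[1:t.now, ], TRUE)
    stan.data <- list(K = K, L = L,
                      M = ncol(dataset$u), T = length(dataset$x),
                      u_tm = as.array(dataset$u), x_t = as.vector(dataset$x),
                      hyperparams = as.array(hyperparams))
    stan.fit <- stan(file = stan.model, data = stan.data,
                     iter = n.iter, warmup = n.warmup, thin = n.thin,
                     chains = n.chains, seed = n.seed,
                     init = function() { init_fun(stan.data) })
    oblik_t <- extract(stan.fit, pars = 'oblik_t')[[1]]

    # One point forecast per posterior draw, in price units.
    list(dataset = dataset,
         forecast = neighbouring_forecast(dataset$x.unscaled, oblik_t, h = 1))
  })
}
```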
The following snippet runs the walk forward procedure in parallel and returns a list of objects.
```{r, message = FALSE, warning = FALSE}
luv.wf <- wf_forecast(syms = symbol, K, L, hyperparams,
n.iter, n.warmup, n.chains, n.cores, n.thin, n.seed,
cache.dir = "fore_cache/")[[1]]
luv.300904 <- luv.wf[[49]]
```
#### Forecast for 30-sep-2004
We first recreate the forecast for 30-sep-2004. Our OHLC values differ from those shown in the original work, possibly due to adjustments. We regret that the authors do not provide enough details about data sources and pre-processing.
```{r, echo = FALSE}
luv.300904.n <- nrow(luv.300904$dataset$u)
knitr::kable(luv.300904$dataset$u.unscaled[(luv.300904.n - 1):luv.300904.n, ],
digits = 2, col.names = c("Open", "High", "Low", "Close"),
caption = "OHLC values for the selected dates.")
```
We arbitrarily choose one of the samples generated by MCMC. We identify similar past observations and produce the point forecast accordingly.
```{r}
oblik_t <- luv.300904$oblik_t
n.samples <- dim(oblik_t)[1]
T.length <- dim(oblik_t)[2]
h <- 1
n <- 1  # Sample

find_closest <- function(target, candidates, threshold) {
  ind <- which(abs(target - candidates) < abs(target) * threshold)
  if (!length(ind))
    ind <- which(abs(target - candidates) == min(abs(target - candidates)))
  ind
}

oblik_target <- oblik_t[n, T.length]
oblik_cand <- oblik_t[n, 1:(T.length - h)]
threshold <- 0.05

closests <- find_closest(oblik_target, oblik_cand, threshold)
d <- abs(oblik_target - oblik_t[n, closests])
w <- exp(d)
```
```{r, echo = FALSE}
knitr::kable(
cbind(
luv.300904$dataset$u.unscaled[closests, ],
Cl(luv.300904$dataset$u.unscaled[closests + h, ]) - Cl(luv.300904$dataset$u.unscaled[closests, ]),
w / sum(w)
),
digits = 2, col.names = c("Open", "High", "Low", "Close", "Delta", "Weight"),
caption = "Neighbouring observations: OHLC values, one-step price change and forecast weight.")
```
```{r}
luv.300904.y <- luv.300904$dataset$u.unscaled[, 4]
luv.300904.delta <- luv.300904$dataset$u.unscaled[closests + h, 4] - luv.300904$dataset$u.unscaled[closests, 4]
luv.300904.yhat <- luv.300904$dataset$u.unscaled[T.length, 4] + sum(luv.300904.delta * w) / sum(w)
```
The difference between the observed closing price $x_{t+1} = `r luv.300904.n<-nrow(luv.300904$dataset$u); round(luv.300904$dataset$u.unscaled[luv.300904.n, 4], 2)`$ and the forecast $\hat{x}_{t+1} = `r round(luv.300904.yhat, 2)`$ equal `r round(luv.300904$dataset$u.unscaled[luv.300904.n, 4] - luv.300904.yhat, 2)`.
```{r, echo = FALSE, fig.height = 5}
T.length <- length(luv.300904.y)
T.ooslength <- length(luv.300904.yhat)
luv.300904.cl <- rep(NA, T.length)
luv.300904.cl[closests] <- luv.300904$dataset$u.unscaled[closests, 4]
luv.300904.alpha_tk <- extract(luv.300904$stan.fit, pars = 'unalpha_tk')[[1]]
luv.300904.hard <- sapply(1:T.length, function(t, med) {
which.max(med[t, ])
}, med = apply(luv.300904.alpha_tk, c(2, 3),
function(x) {
quantile(x, c(0.50)) }))
opar <- par(no.readonly = TRUE)
layout(matrix(c(1, 2), nrow = 2, ncol = 1), heights = c(0.95, 0.05))
plot(x = 1:T.length, y = luv.300904.y,
ylab = bquote("Output" ~ x), xlab = bquote("Time" ~ t),
main = "Point forecast for 2004-09-30 (LUV)",
type = 'l', col = 'lightgray')
points(x = 1:T.length, y = luv.300904.cl,
pch = 21, cex = 0.5,
bg = 'darkgray', col = 'darkgray')
points(x = (T.length - T.ooslength + 1):T.length, y = luv.300904.yhat,
pch = 21, cex = 0.5,
bg = 1, col = 1)
par(mai = c(0, 0, 0, 0))
plot.new()
legend(x = "center",
legend = c("Observed", "Similar", "Forecast"),
lwd = 3, col = c('lightgray', 'darkgray', 1), horiz = TRUE, bty = 'n')
par(opar)
```
We can account for parameter uncertainty by repeating these computations for all the samples corresponding to a given time step. We stress that such a measure would not constitute a true interval estimate, as the methodology only provides a point forecast.
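For instance, the quantiles of the per-draw point forecasts summarise this spread; here we assume, as in the walk-forward summary below, that each walk-forward object stores one forecast per posterior draw in its `forecast` element.
```{r}
# Spread of the 2004-09-30 point forecast across posterior draws (not a
# proper predictive interval, as noted above).
quantile(luv.300904$forecast, probs = c(0.10, 0.50, 0.90))
```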
#### Walk-forward forecasts
We now show the walk-forward forecasts for the selected stock.
```{r}
T.ooslength <- length(luv.wf)
luv.dataset <- luv.wf[[T.ooslength]]$dataset
luv.wfy <- tail(luv.dataset$x.unscaled, T.ooslength - 1)
luv.wfyhat <- head(sapply(luv.wf, function(wf) { median(wf$forecast) }), -1)
```
```{r, fig.height = 5}
plot_seqforecast(y = luv.wfy, yhat = luv.wfyhat,
main.label = symbol$symbol)
```
We report the out of sample forecast error measures in the table below. Both the coefficient of determination $R^2$ and the Mean Absolute Percentage Error (MAPE) are consistent with the findings in the original paper.
```{r, echo = FALSE}
luv.err <- luv.wfy - luv.wfyhat
knitr::kable(data.frame(
mean(luv.err^2),
paste0(round(100 * mean(abs(luv.err / luv.wfy)), 2), "%"),
as.numeric(summary(lm(luv.wfy ~ luv.wfyhat))$r.squared)
),
align = "c", digits = 4,
col.names = c("MSE", "MAPE", "$R^2$"),
caption = "Out of sample error measures for LUV.")
```
```{r, message = FALSE, echo = FALSE, results = 'hide'}
rm(stan.fit, stan.data, alpha_tk, gamma_tk, zstar_t, hatx_t, oblik_t, logA_ij)
rm(list = ls()[grep("luv", ls())])
gc()
```
## Ryanair Holdings Plc (RYA.L)
### In-sample analysis
We estimate the model for RYA.L and we then present the in-sample summary.
```{r, message = FALSE, warning = FALSE}
symbol <- syms[2, ] # RYA
prices <- getSymbols(symbol$symbol,
env = NULL,
from = symbol$train.from,
to = symbol$test.to,
src = symbol$src)
T.length <- nrow(prices[paste(symbol$train.from, symbol$train.to, sep = "/")])
dataset <- make_dataset(prices[1:T.length, ], TRUE)
```
```{r}
rstan_options(auto_write = TRUE)
stan.model = '../iohmm-mix/stan/iohmm-hmix.stan'
stan.data = list(
K = K,
L = L,
M = ncol(dataset$u),
T = length(dataset$x),
u_tm = as.array(dataset$u),
x_t = as.vector(dataset$x),
hyperparams = as.array(hyperparams)
)
stan.fit <- stan(file = stan.model,
data = stan.data, verbose = T,
iter = n.iter, warmup = n.warmup,
thin = n.thin, chains = n.chains,
cores = n.cores, seed = n.seed,
init = function() {init_fun(stan.data)})
n.samples = (n.iter - n.warmup) * n.chains
alpha_tk <- extract(stan.fit, pars = 'alpha_tk')[[1]]
gamma_tk <- extract(stan.fit, pars = 'gamma_tk')[[1]]
zstar_t <- extract(stan.fit, pars = 'zstar_t')[[1]]
hatx_t <- extract(stan.fit, pars = 'hatx_t')[[1]]
oblik_t <- extract(stan.fit, pars = 'oblik_t')[[1]]
logA_ij <- extract(stan.fit, pars = 'logA_ij')[[1]]
```
We identify two extreme observations but we decide not to take any further action. State 2 represents a period of time in which the stock suddenly reached the maximum value in its history. As observed before, the hidden states are persistent.
```{r}
knitr::kable(summary(stan.fit,
pars = c('p_1k', 'w_km', 'lambda_kl', 'hypermu_k', 'mu_kl', 's_kl'),
probs = c(0.10, 0.50, 0.90))$summary,
col.names = c("Mean", "MCSE", "SE", "10\\%", "Med", "90\\%", "ESS", "$\\hat{R}$"),
digits = 2, caption = "Summary of the posterior draws from the distribution of the estimated parameters.")
```
```{r, fig.height = 7}
plot_inputoutputprob(x = dataset$x, u = dataset$u,
stateprob = alpha_tk, zstar = zstar_t,