---
title: "Input-Output Hidden Markov Model applied to financial time series"
author: "Luis Damiano, Brian Peterson, Michael Weylandt"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    number_sections: true
    self_contained: true
    includes:
      in_header: ../common/Rmd/preamble-html.Rmd
vignette: >
  %\VignetteIndexEntry{Input-Output Hidden Markov Model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: ../common/references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, fig.width = 7.2)
```
```{r echo = FALSE, results="asis"}
cat("<style>img{border: 0px !important;}</style>")
```
This work aims at replicating the Input-Output Hidden Markov Model (IOHMM) originally proposed by @hassan2005stock to forecast stock prices. The main goal is to produce public programming code in [Stan](http://mc-stan.org/) [@carpenter2016stan] for a fully Bayesian estimation of the model parameters and inference on hidden quantities, namely filtered state belief, smoothed state belief, jointly most probable state path and fitted output. The model is introduced only briefly, a more detailed mathematical treatment can be found in our [literature review](https://github.com/luisdamiano/gsoc17-hhmm/blob/master/litreview/main.pdf).
The authors acknowledge Google for financial support via the Google Summer of Code 2017 program.
---
# The Input-Output Hidden Markov Model
The IOHMM is an architecture proposed by @bengio1995input to map input sequences, sometimes called the control signal, to output sequences. It is a probabilistic framework that can deal with general sequence processing tasks such as production, classification, or prediction. The main difference with Hidden Markov Models (HMM), which belong to the unsupervised learning paradigm, is that the IOHMM learns to map the input sequence to the output sequence itself rather than learning only the distribution of the output sequence.
## Definitions
As with HMM, IOHMM involves two interconnected models,
\begin{align*}
z_{t} &= f(z_{t-1}, \mat{u}_{t}) \\
\mat{x}_{t} &= g(z_{t }, \mat{u}_{t}).
\end{align*}
The first line corresponds to the state model, which consists of discrete-time, discrete-state hidden states $z_t \in \{1, \dots, K\}$ whose transition depends on the previous hidden state $z_{t-1}$ and the input vector $\mat{u}_{t} \in \RR^M$. Additionally, the observation model is governed by $g(z_{t}, \mat{u}_{t})$, where $\mat{x}_t \in \RR^R$ is the vector of observations, emissions or output. The corresponding joint distribution is
\[
p(\mat{z}_{1:T}, \mat{x}_{1:T} | \mat{u}_{t}).
\]
In the proposed parametrization with continuous inputs and outputs, the state model involves a multinomial regression whose parameters depend on the previous state taking the value $i$,
\[
p(z_t | \mat{u}_{t}, z_{t-1} = i) = \text{softmax}(\mat{w}_i \mat{u}_{t}),
\]
and the observation model is built upon the Gaussian density with parameters depending on the current state taking the value $j$,
\[
p(\mat{x}_t | \mat{u}_{t}, z_{t} = j) = \mathcal{N}(\mat{x}_t | \mat{\mu}_j, \mat{\Sigma}_j).
\]
IOHMM adapts the logic of HMM to allow for input and output vectors, retaining its fully probabilistic quality. Hidden states are assumed to follow a multinomial distribution that depends on the input sequence. The transition probabilities $\Psi_t(i, j) = p(z_t = j | z_{t-1} = i, \mat{u}_{t})$, which govern the state dynamics, are driven by the control signal as well.
As for the output sequence, the local evidence at time $t$ now becomes $\psi_t(j) = p(\mat{x}_t | z_t = j, \mat{\eta}_t)$, where $\mat{\eta}_t = \ev{\mat{x}_t | z_t, \mat{u}_t}$ can be interpreted as the expected location parameter for the probability distribution of the emission $\mat{x}_{t}$ conditional on the input vector $\mat{u}_t$ and the hidden state $z_t$.
## Inference
There are several quantities of interest to be estimated via different algorithms. In this section, the discussion assumes that model parameters are known.
### Filtering
A filter infers the belief state at a given step based on all the information available up to that point,
\begin{align*}
\alpha_t(j)
& \triangleq p(z_t = j | \mat{x}_{1:t}, \mat{u}_{1:t}).
\end{align*}
It achieves better noise reduction than simply estimating the hidden state based on the current estimate $p(z_t | \mat{x}_{t})$. The filtering process can be run online, or recursively, as new data streams in. Filtered marginals can be computed recursively by means of the forward algorithm [@baum1967inequality].
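For reference, a minimal R sketch of this recursion follows. It assumes that the input-dependent transition matrices $\Psi_t(i, j)$ and the local evidence $\psi_t(j)$ have already been evaluated and are passed as an array `Psi` of dimension $T \times K \times K$ and a matrix `psi` of dimension $T \times K$; the implementation actually used in this work is written in Stan.
```{r}
# Minimal sketch of the forward filter, assuming Psi[t, , ] holds the
# input-dependent transition matrix at time t and psi[t, ] the local evidence.
forward_filter <- function(Psi, psi, p1) {
  T.length <- nrow(psi)
  K <- ncol(psi)
  alpha <- matrix(0, T.length, K)
  alpha[1, ] <- p1 * psi[1, ]
  alpha[1, ] <- alpha[1, ] / sum(alpha[1, ])
  for (t in 2:T.length) {
    alpha[t, ] <- (alpha[t - 1, ] %*% Psi[t, , ]) * psi[t, ]
    alpha[t, ] <- alpha[t, ] / sum(alpha[t, ])  # normalize to prevent underflow
  }
  alpha
}
```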
### Smoothing
A smoother infers the belief state at a given step based on all the observations or evidence,
\begin{align*}
\gamma_t(j)
& \triangleq p(z_t = j | \mat{x}_{1:T}, \mat{u}_{1:T}) \\
& \propto \alpha_t(j) \beta_t(j),
\end{align*}
where
\begin{align*}
\beta_{t-1}(i)
& \triangleq p(\mat{x}_{t:T} | z_{t-1} = i, \mat{u}_{t:T}).
\end{align*}
Although noise and uncertainty are reduced as a result of conditioning on both past and future data, the smoothing process can only be run offline. Inference can be run by means of the forwards-backwards algorithm, also known as the Baum-Welch algorithm [@baum1967inequality; @baum1970maximization].
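A matching sketch of the backwards recursion and of the resulting smoothed probabilities, under the same conventions as the forward sketch above:
```{r}
# Minimal sketch of the backwards recursion; gamma_t(j) is obtained by
# combining the forward and backward quantities and renormalizing.
backward_smooth <- function(Psi, psi, alpha) {
  T.length <- nrow(psi)
  K <- ncol(psi)
  beta <- matrix(0, T.length, K)
  beta[T.length, ] <- 1
  for (t in (T.length - 1):1) {
    beta[t, ] <- Psi[t + 1, , ] %*% (psi[t + 1, ] * beta[t + 1, ])
    beta[t, ] <- beta[t, ] / sum(beta[t, ])  # rescale to prevent underflow
  }
  gamma <- alpha * beta
  gamma / rowSums(gamma)
}
```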
### Most likely hidden path
It is also of interest to compute the most probable state sequence or path,
\[
\mat{z}^* = \argmax_{\mat{z}_{1:T}} p(\mat{z}_{1:T} | \mat{x}_{1:T}).
\]
The jointly most probable sequence of states can be inferred by means of maximum a posteriori (MAP) estimation or Viterbi decoding.
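A minimal sketch of Viterbi decoding in log scale, again under the conventions of the previous sketches; the Stan programs used later implement the same logic.
```{r}
# Minimal sketch of Viterbi decoding: delta accumulates the best
# log-probability of any state path ending in state j at time t.
viterbi_decode <- function(Psi, psi, p1) {
  T.length <- nrow(psi)
  K <- ncol(psi)
  delta <- matrix(-Inf, T.length, K)
  backptr <- matrix(0L, T.length, K)
  delta[1, ] <- log(p1) + log(psi[1, ])
  for (t in 2:T.length) {
    for (j in 1:K) {
      cand <- delta[t - 1, ] + log(Psi[t, , j]) + log(psi[t, j])
      backptr[t, j] <- which.max(cand)
      delta[t, j] <- max(cand)
    }
  }
  zstar <- integer(T.length)
  zstar[T.length] <- which.max(delta[T.length, ])
  for (t in (T.length - 1):1)
    zstar[t] <- backptr[t + 1, zstar[t + 1]]
  zstar
}
```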
## Parameter estimation
The parameters of the models are $\mat{\theta} = (\mat{\pi}_1, \mat{\theta}_h, \mat{\theta}_o)$, where $\mat{\pi}_1$ is the initial state distribution, $\mat{\theta}_h$ are the parameters of the hidden model and $\mat{\theta}_o$ are the parameters of the state-conditional density function $p(\mat{x}_t | z_t = j, \mat{u}_t)$. The form of $\mat{\theta}_h$ and $\mat{\theta}_o$ depends on the specification of each model. For example, state transition may be characterized by a logistic or multinomial regression with parameters $\mat{w}_k$ for $k \in \{1, \dots, K\}$, while emissions may be modeled by a linear regression with Gaussian error with parameters $\mat{b}_k$ and $\mat{\Sigma}_k$ for $k \in \{1, \dots, K\}$.
---
# Learning by simulation
Model complexity and limited software availability require that we implement our sampler from scratch. Following a very simplified version of the methodology proposed by @cook2006validation, we first create a simulation routine that generates data complying with the model assumptions and then write an MCMC sampler in Stan to recover the parameters and other hidden quantities. Only after the correctness of our software has been tested do we feed our model with real data and analyse the results.
We believe that learning by simulation has many benefits, including:
* Confirmation that the software developed to implement the algorithms works properly, i.e. that it retrieves the parameters used to generate the data.
* Enhanced interpretation of the results and deeper understanding of the model behaviour, especially because the generated data meets all the underlying assumptions and the estimates are free of the effects of data contamination originated by unmodeled phenomena.
* The possibility of calibrating different combinations of values for the parameters, the inputs and the outputs helps develop intuition and insight. This may prove valuable for the data analysis stage.
## Auxiliary files
### Math functions
We write an auxiliary R function to compute the softmax transformation of a given vector. The calculations are run in log scale for greater numerical stability, i.e. to avoid any overflow.
```{r}
source('../common/R/math.R')
```
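For reference, a minimal sketch of such a numerically stable softmax follows; it is named `softmax_sketch` here to avoid masking the sourced function, whose implementation details may differ.
```{r}
# Sketch of a log-scale softmax: subtracting the log-sum-exp (computed with
# the max shifted out) avoids overflow when exponentiating.
softmax_sketch <- function(v) {
  logsumexp <- max(v) + log(sum(exp(v - max(v))))
  exp(v - logsumexp)
}
```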
### Plot functions
As plots are extensively used, we arrange the code in an auxiliary file.
```{r}
source('../common/R/plots.R')
```
### Simulation functions
We arrange the code to generate simulated data, as explained below, in an auxiliary file.
```{r}
source('../iohmm-reg/R/iohmm-sim.R')
```
## Generative model
### Data simulation
We first write an R function for our generative model. The arguments are the sequence length $T$, the number of discrete hidden states $K$, the input matrix $\mat{u}$, a matrix $\mat{w}$ with the parameters of the multinomial regression that drives the hidden-state dynamics, the initial state distribution vector $\mat{\pi}_1$, the name of a function drawing samples from the observation distribution, and its arguments.
```{r}
iohmm_sim <- function(T, K, u, w, p.init, obs.model, obs.pars) {
  m <- ncol(u)
  if (dim(u)[1] != T)
    stop("The input matrix must have T rows.")
  if (any(dim(w) != c(K, m)))
    stop("The transition weight matrix must be of size Kxm, where m is the size of the input vector.")
  if (length(p.init) != K)
    stop("The vector p.init must have length K.")

  p.mat <- matrix(0, nrow = T, ncol = K)
  p.mat[1, ] <- p.init

  z <- vector("numeric", T)
  z[1] <- sample(x = 1:K, size = 1, replace = FALSE, prob = p.init)
  for (t in 2:T) {
    p.mat[t, ] <- softmax(sapply(1:K, function(j) {u[t, ] %*% w[j, ]}))
    z[t] <- sample(x = 1:K, size = 1, replace = FALSE, prob = p.mat[t, ])
  }

  x <- do.call(obs.model, c(list(u = u, z = z), obs.pars))

  list(
    u = u,
    z = z,
    x = x,
    p.mat = p.mat
  )
}
```
The initial hidden state is drawn from a multinomial distribution with one trial and event probabilities given by the initial state probability vector $\mat{\pi}_1$. The transition probabilities for each of the following steps $t \in \{2, \dots, T\}$ are generated from a multinomial regression with vector parameters $\mat{w}_k$, one set per possible hidden state $k \in \{1, \dots, K\}$, and covariates $\mat{u}_t$. The hidden states are subsequently sampled based on these transition probabilities.
The observation at each step may be generated from a linear regression with parameters $\mat{b}_k$ and $\mat{\Sigma}_k$, one set per possible hidden state.
```{r}
obsmodel_reg <- function(...) {
  args <- list(...)
  u <- args$u
  z <- args$z
  b <- args$b
  s <- args$s

  K <- length(unique(z))
  m <- ncol(u)
  if (any(dim(b) != c(K, m)))
    stop("The regressors matrix must be of size Kxm, where m is the size of the input vector.")

  T.length <- nrow(u)
  x <- vector("numeric", T.length)
  for (t in 1:T.length) {
    x[t] <- rnorm(1, u[t, ] %*% b[z[t], ], s[z[t]])
  }

  return(x)
}
```
Alternatively, the observation could originate in a Gaussian mixture density where $\lambda_{kl}$ is the component weight for the $l$-th component within the $k$-th state, $0 \le \lambda_{kl} \le 1 \ \forall \ l \in \{1, \dots, L\}, k \in \{1, \dots, K\}$ and $\sum_{l=1}^{L}{\lambda_{kl}} = 1 \ \forall \ k$.
```{r}
obsmodel_mix <- function(...) {
  args <- list(...)
  z <- args$z
  lambda <- args$lambda
  mu <- args$mu
  s <- args$s

  K <- length(unique(z))
  if (!all(c(nrow(lambda), nrow(mu), nrow(s)) == K))
    stop("The parameter matrices lambda, mu and s must have one row per state.")

  T.length <- length(z)
  L <- ncol(lambda)
  x <- vector("numeric", T.length)
  for (t in 1:T.length) {
    l <- sample(1:L, 1, FALSE, prob = lambda[z[t], ])
    x[t] <- rnorm(1, mu[z[t], l], s[z[t], l])
  }

  return(x)
}
```
We set up our simulated experiments for a regression observation model with arbitrary values for all the involved parameters. Additionally, we define the settings for the Markov Chain Monte Carlo sampler.
```{r}
# Data
T.length = 300
K = 3
M = 4
R = 1
u.intercept = FALSE
w = matrix(
c(1.2, 0.5, 0.3, 0.1, 0.5, 1.2, 0.3, 0.1, 0.5, 0.1, 1.2, 0.1),
nrow = K, ncol = M, byrow = TRUE)
b = matrix(
c(5.0, 6.0, 7.0, 0.5, 1.0, 5.0, 0.1, -0.5, 0.1, -1.0, -5.0, 0.2),
nrow = K, ncol = M, byrow = TRUE)
s = c(0.2, 1.0, 2.5)
p1 = c(0.4, 0.2, 0.4)
# Markov Chain Monte Carlo
n.iter = 400
n.warmup = 200
n.chains = 1
n.cores = 1
n.thin = 1
n.seed = 9000
```
We anticipate identification issues in our sampler given the nature of the model. As an arguable simplification with obvious drawbacks, we decide to rely only on one chain to avoid between-chain label switching of regression parameters. We refer the reader to @betancourt2017identifying for an in-depth treatment of the diagnostics, causes and possible solutions for label switching in Bayesian Mixture Models.
We draw random inputs from a standard Gaussian distribution and generate the dataset accordingly.
```{r}
set.seed(n.seed)
u <- matrix(rnorm(T.length*M, 0, 1), nrow = T.length, ncol = M)
if (u.intercept)
u[, 1] = 1
dataset <- iohmm_sim(T.length, K, u, w, p1, obsmodel_reg, list(b = b, s = s))
```
```{r, fig.height = 5, out.width="\\textwidth"}
plot_inputoutput(x = dataset$x, u = dataset$u, z = dataset$z)
```
We observe how the chosen values for the parameters affect the generated data. For example, the relationship between the third input $\mat{u}_3$ and the output $\mat{x}$ is positive, virtually null and negative for the hidden states 1 through 3, the true slopes being `r b[1, 3]`, `r b[2, 3]` and `r b[3, 3]` respectively.
```{r, fig.height = 6, out.width="\\textwidth"}
plot_inputprob(u = dataset$u, p.mat = dataset$p.mat, z = dataset$z)
```
We then analyse the relationship between the input and the state probabilities, which are usually hidden in applications with real data. The pairs $\{ \mat{u}_1, p(z_t = 1) \}$, $\{ \mat{u}_2, p(z_t = 2) \}$ and $\{ \mat{u}_3, p(z_t = 3) \}$ are most related due to the choice of values for the true regression parameters: those inputs take the largest weight in each state, namely $w_{11} = `r w[1, 1]`$, $w_{22} = `r w[2, 2]`$ and $w_{33} = `r w[3, 3]`$.
### Estimation
Adopting a fully Bayesian approach, we run our software to draw samples from the posterior density of model parameters and other hidden quantities.
```{r}
library(rstan)
library(shinystan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
stan.model = '../iohmm-reg/stan/iohmm-reg.stan'
stan.data = list(
T = T.length,
K = K,
M = M,
u_tm = as.array(u),
x_t = dataset$x
)
stan.fit <- stan(file = stan.model,
data = stan.data, verbose = T,
iter = n.iter, warmup = n.warmup,
thin = n.thin, chains = n.chains,
cores = n.cores, seed = n.seed)
n.samples = (n.iter - n.warmup) * n.chains
```
### Convergence
We rely on several diagnostic statistics and plots provided by rstan [@sdt2017rstan] and shinystan [@sdt2017shinystan] to assess mixing, convergence and the absence of divergences. We then extract the samples for some quantities of interest, namely the filtered probability vector $\mat{\alpha}_t$, the smoothed probability vector $\mat{\gamma}_t$, the most probable hidden path $\mat{z}^*$ and the fitted output $\hat{x}$.
```{r, message = FALSE, results = 'hide'}
# MCMC Diagnostics --------------------------------------------------------
summary(stan.fit,
pars = c('p_1k', 'w_km', 'b_km', 's_k'),
probs = c(0.50))$summary
# launch_shinystan(stan.fit)
# Extraction --------------------------------------------------------------
alpha_tk <- extract(stan.fit, pars = 'alpha_tk')[[1]]
gamma_tk <- extract(stan.fit, pars = 'gamma_tk')[[1]]
zstar_t <- extract(stan.fit, pars = 'zstar_t')[[1]]
hatx_t <- extract(stan.fit, pars = 'hatx_t')[[1]]
```
```{r, echo = FALSE}
knitr::kable(summary(stan.fit,
pars = c('p_1k', 'w_km', 'b_km', 's_k'),
probs = c(0.50))$summary,
col.names = c("Mean", "MCSE", "SE", "Med", "ESS", "$\\hat{R}$"),
digits = 2, align = "r", caption = "Estimated parameters and hidden quantities. *MCSE = Monte Carlo Standard Error, SE = Standard Error, Med = Median, ESS = Effective Sample Size*.")
```
While mixing and convergence are extremely efficient, as expected when dealing with generated data, we note that the regression parameters for the latent states are the worst performers. Their small effective sample size translates into a high Monte Carlo standard error and broader credibility intervals. One possible reason is that the softmax is invariant to shifts in location, thus the parameters of a multinomial regression do not have a natural center and become harder to estimate.
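A quick numerical check of this invariance, using the `softmax` function sourced earlier (the test vector is arbitrary):
```{r}
# Adding a constant to every element of the linear predictor leaves the
# softmax probabilities unchanged, so the weights lack a natural center.
eta <- c(1.2, 0.5, 0.3)
all.equal(softmax(eta), softmax(eta + 10))
```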
### State probability
We assess whether our software recovers the hidden states correctly. Due to label switching, the states generated under the labels 1 through 3 were recovered in inverse order. Consequently, we relabel the observations based on the best fit. This would not be a problem with real data, since the hidden states are never observed.
```{r}
# Relabelling (ugly hack edition) -----------------------------------------
dataset$zrelab <- rep(0, T.length)

hard <- sapply(1:T.length, function(t, med) {
  which.max(med[t, ])
}, med = apply(alpha_tk, c(2, 3), function(x) { quantile(x, c(0.50)) }))

tab <- table(hard = hard, original = dataset$z)
for (k in 1:K) {
  dataset$zrelab[which(dataset$z == k)] <- which.max(tab[, k])
}
print("Label re-imputation (relabeling due to switching labels)")
table(new = dataset$zrelab, original = dataset$z)
```
Point estimates and credibility intervals are provided by rstan's `summary` function.
```{r, eval = FALSE}
print("Estimated initial state probabilities")
summary(stan.fit,
pars = c('p_1k'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)]
print("Estimated probabilities in the transition matrix")
head(summary(stan.fit,
pars = c('A_ij'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)])
print("Estimated regressors of hidden states")
summary(stan.fit,
pars = c('w_km'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)]
print("Estimated regressors and standard deviation of observations in each state")
summary(stan.fit,
pars = c('b_km', 's_k'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)]
```
```{r, echo = FALSE}
knitr::kable(summary(stan.fit,
pars = c('p_1k'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)],
col.names = c("Mean", "SE", "10%", "Med", "90%"),
digits = 2, align = "r", caption = "Estimated initial state probabilities.")
knitr::kable(head(summary(stan.fit,
pars = c('A_ij'),
probs = c(0.10, 0.50, 0.90))$summary)[, c(1, 3, 4, 5, 6)],
col.names = c("Mean", "SE", "10%", "Med", "90%"),
digits = 2, align = "r", caption = "Estimated probabilities in the transition matrix.")
knitr::kable(summary(stan.fit,
pars = c('w_km'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)],
col.names = c("Mean", "SE", "10%", "Med", "90%"),
digits = 2, align = "r", caption = "Estimated regressors of hidden states.")
knitr::kable(summary(stan.fit,
pars = c('b_km', 's_k'),
probs = c(0.10, 0.50, 0.90))$summary[, c(1, 3, 4, 5, 6)],
col.names = c("Mean", "SE", "10%", "Med", "90%"),
digits = 2, align = "r", caption = "Estimated regression parameters and standard deviation of observations in each state.")
```
The samples drawn from the posterior density are reasonably close to the true values. We additionally check the estimated hidden quantities.
```{r, fig.height = 6, out.width="\\textwidth"}
plot_stateprobability(alpha_tk, gamma_tk, 0.8, dataset$zrelab)
```
```{r, eval = FALSE}
print("Estimated hidden states (hard naive classification using filtered probabilities)")
print(table(
estimated = apply(round(apply(alpha_tk, c(2, 3),
function(x) {
quantile(x, c(0.50)) })), 1, which.max),
real = dataset$zrelab))
```
```{r, echo = FALSE}
knitr::kable(table(
estimated = apply(round(apply(alpha_tk, c(2, 3),
function(x) {
quantile(x, c(0.50)) })), 1, which.max),
real = dataset$zrelab),
col.names = c("Real 1", "Real 2", "Real 3"),
row.names = TRUE,
digits = 2, align = "r", caption = "Hard classification.")
```
The filtered probability is remarkably accurate, as expected, making the additional information contained in the smoothed probability seem of little value. The confusion matrix makes it evident that, under ideal conditions, the sampler works as intended. Similarly, the Viterbi algorithm recovers the expected most probable hidden path.
```{r, fig.height = 6, out.width="\\textwidth"}
plot_statepath(zstar_t, dataset$zrelab)
```
```{r, eval = FALSE}
print("Estimated hidden states for the jointly most likely path (Viterbi decoding)")
round(table(
actual = rep(dataset$zrelab, each = n.samples),
fit = zstar_t) / n.samples, 0)
```
```{r, echo = FALSE}
knitr::kable(round(table(
  actual = rep(dataset$zrelab, each = n.samples),
  fit = zstar_t) / n.samples, 0),
  digits = 2, align = "r", caption = "Viterbi decoding.")
```
### Fitted output
```{r, fig.height = 6, out.width="\\textwidth"}
plot_outputfit(dataset$x, hatx_t, z = dataset$zrelab, TRUE)
```
All in all, we are confident that the Stan code works as specified and we find model calibration satisfactory.
```{r, message = FALSE, echo = FALSE, results = 'hide'}
rm(list = ls())
gc()
```
---
# Stock Market Forecasting Using Hidden Markov Model
## Preamble
We first load the packages and source auxiliary functions.
```{r, message = FALSE, warning = FALSE}
library(quantmod)
library(rstan)
library(shinystan)
source('R/data.R')
source('R/forecast.R')
source('R/wf-forecast.R')
source('../common/R/math.R')
source('../common/R/plots.R')
source('../iohmm-mix/R/iohmm-mix-init.R')
```
## Model
We adhere to the methodology proposed in the original work as much as possible. We set up an IOHMM with $K = 4$ hidden states to forecast stock prices. The hidden dynamics are driven by a multinomial (softmax) regression with $M = 4$ inputs, namely the opening, closing, highest and lowest prices from the previous time step. The univariate output is the stock closing price for the current time step. The observation model is a Gaussian mixture with $L = 3$ components per state,
\[
p(x_t | z_{t} = j) = \sum_{l = 1}^{L}{\lambda_{jl} \ \NN(x_t | \mu_{jl}, \sigma_{jl})},
\]
where $\lambda_{jl}$ is the component weight for the $l$-th component within the $j$-th state, $0 \le \lambda_{jl} \le 1 \ \forall \ l, j$ and $\sum_{l=1}^{L}{\lambda_{jl}} = 1 \ \forall \ j$.
The vector parameter then becomes $\mat{\theta} = (\mat{\pi}_1, \mat{w}_k, \lambda_{kl}, \mu_{kl}, \sigma_{kl})$ for $k \in \{1, \dots, K\}$ and $l \in \{1, \dots, L\}$.
```{r}
K = 4
L = 3
```
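As a point of reference, the state-conditional density above can be evaluated with a one-liner; the sketch below assumes hypothetical parameter matrices `lambda`, `mu` and `sigma` of dimension $K \times L$, with one row per state and one column per component.
```{r}
# Sketch: mixture density p(x_t | z_t = j) for hypothetical K x L parameter
# matrices lambda (weights), mu (means) and sigma (standard deviations).
mixture_density <- function(x, j, lambda, mu, sigma) {
  sum(lambda[j, ] * dnorm(x, mean = mu[j, ], sd = sigma[j, ]))
}
```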
## The sampler
We use Stan's Hamiltonian Monte Carlo to run a fully Bayesian estimation of the model parameters. Additionally, we compute many hidden quantities, including the forward probability, the future likelihood, the jointly most likely path and transition probabilities. We draw samples from the hidden state distribution as well. To this end, we write our own implementation of the forward filter, the forwards-backwards filter and the Viterbi decoding algorithms. We acknowledge the authors of the Stan Manual [@team2017stan] for the numerous examples, some of which served as a starting point for our code.
We define the following parameters for the MCMC sampler. The chosen hyperparameters correspond to diffuse priors.
```{r}
hyperparams <- c(0, 5, 5, 0, 3, 1, 1, 0, 5);
n.iter = 800
n.warmup = 400
n.chains = 1
n.cores = 1
n.thin = 1
n.seed = 9000
```
## Dataset & data transformation
We use daily OHLC prices for the stocks and time spans listed in the original paper. The data is downloaded from Yahoo Finance via quantmod. We omit two delisted stocks as their series are no longer available in public sources such as Yahoo and Google Finance.
```{r}
syms <- data.frame(
symbol = c("LUV", "RYA.L"),
name = c("Southwest Airlines Co", "Ryanair Holdings Plc"),
train.from = c("2002-12-18", "2003-05-06"),
train.to = c("2004-07-23", "2004-12-06"),
test.from = c("2004-07-24", "2004-12-07"),
test.to = c("2004-11-17", "2005-03-17"),
src = c("yahoo", "yahoo"),
stringsAsFactors = FALSE)
```
```{r, echo = FALSE}
library(knitr)
knitr::kable(t(sapply(1:nrow(syms), function(i) { c(
syms$symbol[i],
syms$name[i],
paste(syms$train.from[i], "to", syms$train.to[i]),
paste(syms$test.from[i] , "to", syms$test.to[i])
)})),
col.names = c("Symbol", "Name", "Training set", "Test set"),
align = "c", caption = "Details of the dataset.")
```
We found that feeding the raw dataset to our sampler led to convergence issues. We learnt by trial and error that centering and rescaling the variables not only improved convergence but also sped up the software by a factor of five.
```{r}
make_dataset <- function(prices, scale = TRUE) {
  T.length = nrow(prices)
  x = as.vector(Cl(prices))[2:T.length]
  u = as.matrix(prices)[1:(T.length - 1), 1:4]

  dataset <- list(
    x = as.vector(x),
    u = as.matrix(u),
    x.unscaled = as.vector(x),
    u.unscaled = as.matrix(u),
    x.center = 0,
    x.scale = 1,
    u.center = 0,
    u.scale = 1
  )

  if (scale) {
    x <- scale(x, TRUE, TRUE)
    u <- scale(u, TRUE, TRUE)

    dataset$x <- as.vector(x)
    dataset$u <- as.matrix(u)
    dataset$x.center <- attr(x, "scaled:center")
    dataset$x.scale <- attr(x, "scaled:scale")
    dataset$u.center <- attr(u, "scaled:center")
    dataset$u.scale <- attr(u, "scaled:scale")
  }

  dataset
}
```
## Methodology
The goal of the experiment is to predict the closing price for a specific stock market share. According to the proposed methodology, we estimate the model parameters $\mat{\hat{\theta}}$ using all the information available up to a given step. We compute the likelihood of the last observation in the training set conditional on the trained parameters $p(\mat{x}_{t} | \mat{\hat{\theta}})$ and we locate the observation in the training set with the nearest likelihood,
\[
t^* = \argmin_{s} \ \left| p(\mat{x}_{t} | \mat{\hat{\theta}}) - p(\mat{x}_{s} | \mat{\hat{\theta}}) \right|.
\]
We compute the change in the closing price from the located time step to the next one,
\[
\Delta_{t^*} = \mat{x}_{t^*+1} - \mat{x}_{t^*},
\]
and build our forecast assuming that a similar pattern will repeat
\[
\mat{\hat{x}}_{t+1} = \mat{x}_{t} + \Delta_{t^*}.
\]
```{r}
neighbouring_forecast <- function(x, oblik_t, h = 1, threshold = 0.05) {
  if (!is.vector(x) || length(x) != dim(oblik_t)[2])
    stop("The size of the observation vector and the width of the likelihood
         array must be equal.")

  n.samples <- dim(oblik_t)[1]
  T.length <- dim(oblik_t)[2]

  find_closest <- function(target, candidates, threshold) {
    ind <- which(abs(target - candidates) < abs(target) * threshold)
    if (!length(ind))
      ind <- which(abs(target - candidates) == min(abs(target - candidates)))
    ind
  }

  forecast = vector("numeric", n.samples)
  for (n in 1:n.samples) {
    oblik_target <- oblik_t[n, T.length]
    oblik_cand <- oblik_t[n, 1:(T.length - h)]
    closests <- find_closest(oblik_target, oblik_cand, threshold)
    d <- abs(oblik_target - oblik_t[n, closests])
    w <- exp(d)
    forecast[n] <- x[T.length] + sum((x[closests + h] - x[closests]) * w) / sum(w)
  }

  forecast
}
```
Although the authors state that one or more observations in the training dataset are selected, the paper provides no details about how many neighbouring observations are chosen, how they are selected and how they are combined. In our implementation, we locate all the observations whose estimated likelihood lies within $\pm 5\%$ of the target observation likelihood.
\[
\left\{ \mat{x}_{s} : \ \left| p(\mat{x}_{t} | \mat{\hat{\theta}}) - p(\mat{x}_{s} | \mat{\hat{\theta}}) \right| < 0.05 \times p(\mat{x}_{t} | \mat{\hat{\theta}}) \right\}.
\]
If the set is empty, we use the past observation with the most similar likelihood instead.
The procedure is run independently for each stock.
## Southwest Airlines (LUV)
We present an in-depth study of one stock, LUV, to assess the strengths and weaknesses of the model. We later present a brief analysis of RYA.L.
### Data exploration
```{r, message = FALSE, warning = FALSE, fig.height = 6}
symbol <- syms[1, ] # LUV
prices <- getSymbols(symbol$symbol,
env = NULL,
from = symbol$train.from,
to = symbol$test.to,
src = symbol$src)
T.length <- nrow(prices[paste(symbol$train.from, symbol$train.to, sep = "/")])
dataset <- make_dataset(prices[1:T.length, ], TRUE)
plot_inputoutput(x = dataset$x, u = dataset$u,
x.label = symbol$symbol,
u.label = c("Open", "High", "Low", "Close"))
```
Unsurprisingly, OHLC prices follow an almost identical trend and are highly correlated with the next-day closing price. Nonetheless, we stress that the correlation of two non-stationary variables is an unreliable measure. Although it is customary to take first differences or apply other transformations to achieve stationarity, the price level is purposely retained in the input series.
### Estimation
We attempt to ease convergence by setting approximate initial values for the component parameters. Although nested k-means is not faithful to the model we replicate, we still find this simplification worthwhile.
```{r, warning = FALSE}
rstan_options(auto_write = TRUE)
stan.model = '../iohmm-mix/stan/iohmm-hmix.stan'
stan.data = list(
K = K,
L = L,
M = ncol(dataset$u),
T = length(dataset$x),
u_tm = as.array(dataset$u),
x_t = as.vector(dataset$x),
hyperparams = as.array(hyperparams)
)
stan.fit <- stan(file = stan.model,
data = stan.data, verbose = T,
iter = n.iter, warmup = n.warmup,
thin = n.thin, chains = n.chains,
cores = n.cores, seed = n.seed,
init = function() {init_fun(stan.data)})
n.samples = (n.iter - n.warmup) * n.chains
```
Long computational times are a forewarning of poor convergence. This may be specific to the characteristics of the selected price series; despite the similar length, RYA.L samples significantly faster, as shown below. As suggested by the warning message and the extended guidelines on the Stan website, the numerical problem is negligible and we do not take any further action.
### Convergence
Graphical diagnostics (not shown) make it evident that convergence to stationarity is weak at best. Label switching complicates the assessment since we cannot run and compare several chains, as widely suggested in the standard literature. On the other hand, the number of iterations chosen seems to be enough to allow convergence of averages, as assessed by the cumulative averages curve. Nonetheless, this does not compensate for the fragile convergence to stationarity.
Relying on the shrink factor of @gelman1992inference, we identify that the component means, standard deviations and weights are the quantities that converge most slowly. More worryingly, the effective sample size of some mixture parameters lies below 10\% of the number of iterations. Analogously, the Monte Carlo standard error exceeds 10\% of the estimated standard error. Although thinning could reduce autocorrelation, its purpose is computational, namely reducing the cost of storing and processing long sampling sequences, rather than statistical [@gelman2011inference]. The mean hyperparameters sample efficiently, a clear indication that the addition of a hierarchy is a pertinent improvement.
```{r}
knitr::kable(summary(stan.fit,
pars = c('p_1k', 'w_km', 'lambda_kl', 'hypermu_k', 'mu_kl', 's_kl'),
probs = c(0.10, 0.50, 0.90))$summary,
col.names = c("Mean", "MCSE", "SE", "10\\%", "Med", "90\\%", "ESS", "$\\hat{R}$"),
digits = 2, caption = "Summary of the posterior draws from the distribution of the estimated parameters.")
```
In general terms, convergence is just about satisfactory. As shown in the following sections, the fitted values are roughly consistent with the observations, hinting that the sampler adapts to the data even in cases where the diagnostics yield mixed results. Nonetheless, we interpret the results carefully and with skepticism. The fact that the mixture parameters are the most ill-behaved leads us to the hypothesis that the observation model would improve significantly with better domain knowledge. Furthermore, we suspect that setting the same number of hidden states for all the price series analysed in the original paper may prove excessively arbitrary and hinder sampling performance.
We extract sampled hidden quantities for later use.
```{r}
alpha_tk <- extract(stan.fit, pars = 'alpha_tk')[[1]]
gamma_tk <- extract(stan.fit, pars = 'gamma_tk')[[1]]
zstar_t <- extract(stan.fit, pars = 'zstar_t')[[1]]
hatx_t <- extract(stan.fit, pars = 'hatx_t')[[1]]
oblik_t <- extract(stan.fit, pars = 'oblik_t')[[1]]
logA_ij <- extract(stan.fit, pars = 'logA_ij')[[1]]
```
### State probability
The following plot shows the probability, along with its 80\% credible interval, that an observation belongs to one of the possible hidden states.
```{r, warning = FALSE, fig.height = 7}
plot_stateprobability(alpha_tk, gamma_tk, 0.8)
```
Since states are determined by the previous step's OHLC prices, filtered and smoothed probabilities are equivalent. We note that state membership naturally becomes persistent. Given the choice of input variables, state membership depends non-linearly on price levels and will remain stable as long as price changes are stationary. In some contexts, persistence may be regarded as a very valuable feature because latent states are expected to change disruptively yet sporadically. Conversely, Markov-switching dynamics may accommodate latent states that shift more often than domain knowledge suggests, which in turn may lead to overfitting. The jointly most probable hidden path, as decoded by the Viterbi algorithm, reflects the continuity of states across time.
```{r, fig.height = 3.5}
plot_statepath(zstar_t)
```
It is interesting to study in detail how the price level on a given day informs the closing price for the following time step. As expected, the softmax transformation establishes a non-linear relationship between OHLC prices and the membership probability. The latent variables have a natural and intuitive interpretation. States 2 and 3 represent the upper price levels and contain the observations that are considerably and slightly above the mean, respectively; the analogous interpretation applies to the remaining states. Following this intuition, the number of possible states $K$ should be chosen according to the observed number of levels in the input series. We notice that the different inputs have a similar impact on the state probability due to their high cross-correlation. Even though the thresholds are estimated by maximizing the log-posterior density, out-of-sample error measures would be used if forecasting were the main goal. We expand on this in the [discussion](#further-research).
```{r, warning = FALSE, fig.height = 7.5}
plot_inputprob(dataset$u.unscaled,
apply(logA_ij, c(2, 3), function(x) { median(exp(x)) }),
u.label = c("Open", "High", "Low", "Close"))
```
### Fitted output
The following plot shows the fitted output. We first draw the hidden state according to the probabilities estimated by the softmax regression applied to the previous day's OHLC vector. We then draw one component from that state and, finally, an observation from a Gaussian distribution with the corresponding mean and standard deviation.
```{r, fig.height = 4}
plot_outputfit(dataset$x, hatx_t, z = apply(zstar_t, 2, median), K = K)
```
It is important to note that the fitted observation is not the forecast itself. The methodology proposed by the original authors relies on the likelihood of the observations instead.
### In-sample summary
All in all, the in-sample fit can be summarized in the following plot.
```{r, fig.height = 7}
plot_inputoutputprob(x = dataset$x, u = dataset$u,
stateprob = alpha_tk, zstar = zstar_t,
x.label = symbol$symbol,
u.label = c("Open", "High", "Low", "Close"),
stateprob.label = bquote("Filtered prob" ~ p(z[t] ~ "|" ~ x[" " ~ 1:t])))
```
### Forecast
The methodology produces a forecast based on the estimated likelihood of the last observation. We write a walk forward procedure where the model is refitted each time a new observation is received. Since Stan does not have a natural way to update the log-density from a previous run, we implement a reduced version of our software that includes only the minimum computations needed for forecasting.
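Conceptually, each walk-forward step refits the model on the data observed so far and turns the estimated observation likelihoods into a one-step-ahead forecast. The following sketch, which is not evaluated, illustrates the idea using the `make_dataset` and `neighbouring_forecast` functions defined above; the actual `wf_forecast()` routine sourced earlier uses a reduced Stan program and additionally handles caching and parallelism.
```{r, eval = FALSE}
# Illustrative sketch of a walk-forward loop (not the actual implementation).
walk_forward_sketch <- function(prices, T.train, T.total) {
  lapply(T.train:(T.total - 1), function(t.now) {
    dataset <- make_dataset(prices[1:t.now, ], TRUE)
    stan.data <- list(K = K, L = L,
                      M = ncol(dataset$u), T = length(dataset$x),
                      u_tm = as.array(dataset$u), x_t = as.vector(dataset$x),
                      hyperparams = as.array(hyperparams))
    stan.fit <- stan(file = stan.model, data = stan.data,
                     iter = n.iter, warmup = n.warmup, thin = n.thin,
                     chains = n.chains, seed = n.seed,
                     init = function() { init_fun(stan.data) })
    oblik_t <- extract(stan.fit, pars = 'oblik_t')[[1]]

    # One point forecast per posterior draw, in price units.
    list(dataset = dataset,
         forecast = neighbouring_forecast(dataset$x.unscaled, oblik_t, h = 1))
  })
}
```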
The following snippet runs the walk forward procedure in parallel and returns a list of objects.
```{r, message = FALSE, warning = FALSE}
luv.wf <- wf_forecast(syms = symbol, K, L, hyperparams,
n.iter, n.warmup, n.chains, n.cores, n.thin, n.seed,
cache.dir = "fore_cache/")[[1]]
luv.300904 <- luv.wf[[49]]
```
#### Forecast for 30-sep-2004
We first recreate the forecast for 30-sep-2004. Our OHLC values differ from those shown in the original work, possibly due to adjustments. We regret that the authors do not provide enough details about data sources and pre-processing.
```{r, echo = FALSE}
luv.300904.n <- nrow(luv.300904$dataset$u)
knitr::kable(luv.300904$dataset$u.unscaled[(luv.300904.n - 1):luv.300904.n, ],
digits = 2, col.names = c("Open", "High", "Low", "Close"),
caption = "OHLC values for the selected dates.")
```
We arbitrarily choose one of the samples generated by MCMC. We identify similar past observations and produce the point forecast accordingly.
```{r}
oblik_t <- luv.300904$oblik_t
n.samples <- dim(oblik_t)[1]
T.length <- dim(oblik_t)[2]
h <- 1
n <- 1  # Sample

find_closest <- function(target, candidates, threshold) {
  ind <- which(abs(target - candidates) < abs(target) * threshold)
  if (!length(ind))
    ind <- which(abs(target - candidates) == min(abs(target - candidates)))
  ind
}

oblik_target <- oblik_t[n, T.length]
oblik_cand <- oblik_t[n, 1:(T.length - h)]
threshold <- 0.05

closests <- find_closest(oblik_target, oblik_cand, threshold)
d <- abs(oblik_target - oblik_t[n, closests])
w <- exp(d)
```
```{r, echo = FALSE}
knitr::kable(
cbind(
luv.300904$dataset$u.unscaled[closests, ],
Cl(luv.300904$dataset$u.unscaled[closests + h, ]) - Cl(luv.300904$dataset$u.unscaled[closests, ]),
w / sum(w)
),
digits = 2, col.names = c("Open", "High", "Low", "Close", "Delta", "Weight"),
caption = "Neighbouring observations: OHLC values, one-step price change and forecast weight.")
```
```{r}
luv.300904.y <- luv.300904$dataset$u.unscaled[, 4]
luv.300904.delta <- luv.300904$dataset$u.unscaled[closests + h, 4] - luv.300904$dataset$u.unscaled[closests, 4]
luv.300904.yhat <- luv.300904$dataset$u.unscaled[T.length, 4] + sum(luv.300904.delta * w) / sum(w)
```
The difference between the observed closing price $x_{t+1} = `r luv.300904.n<-nrow(luv.300904$dataset$u); round(luv.300904$dataset$u.unscaled[luv.300904.n, 4], 2)`$ and the forecast $\hat{x}_{t+1} = `r round(luv.300904.yhat, 2)`$ equal `r round(luv.300904$dataset$u.unscaled[luv.300904.n, 4] - luv.300904.yhat, 2)`.
```{r, echo = FALSE, fig.height = 5}
T.length <- length(luv.300904.y)
T.ooslength <- length(luv.300904.yhat)
luv.300904.cl <- rep(NA, T.length)
luv.300904.cl[closests] <- luv.300904$dataset$u.unscaled[closests, 4]
luv.300904.alpha_tk <- extract(luv.300904$stan.fit, pars = 'unalpha_tk')[[1]]
luv.300904.hard <- sapply(1:T.length, function(t, med) {
which.max(med[t, ])
}, med = apply(luv.300904.alpha_tk, c(2, 3),
function(x) {
quantile(x, c(0.50)) }))
opar <- par(no.readonly = TRUE)
layout(matrix(c(1, 2), nrow = 2, ncol = 1), heights = c(0.95, 0.05))
plot(x = 1:T.length, y = luv.300904.y,
ylab = bquote("Output" ~ x), xlab = bquote("Time" ~ t),
main = "Point forecast for 2004-09-30 (LUV)",
type = 'l', col = 'lightgray')
points(x = 1:T.length, y = luv.300904.cl,
pch = 21, cex = 0.5,
bg = 'darkgray', col = 'darkgray')
points(x = (T.length - T.ooslength + 1):T.length, y = luv.300904.yhat,
pch = 21, cex = 0.5,
bg = 1, col = 1)
par(mai = c(0, 0, 0, 0))
plot.new()
legend(x = "center",
legend = c("Observed", "Similar", "Forecast"),
lwd = 3, col = c('lightgray', 'darkgray', 1), horiz = TRUE, bty = 'n')
par(opar)
```
We can account for parameter uncertainty by repeating these computations for all the samples corresponding to a given time step. We stress that such a measure would not constitute a true interval estimate, as the methodology only provides a point forecast.
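For instance, the quantiles of the per-draw point forecasts summarise this spread; here we assume, as in the walk-forward summary below, that each walk-forward object stores one forecast per posterior draw in its `forecast` element.
```{r}
# Spread of the 2004-09-30 point forecast across posterior draws (not a
# proper predictive interval, as noted above).
quantile(luv.300904$forecast, probs = c(0.10, 0.50, 0.90))
```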
#### Walk-forward forecasts
We now show the walk-forward forecasts for the selected stock.
```{r}
T.ooslength <- length(luv.wf)
luv.dataset <- luv.wf[[T.ooslength]]$dataset
luv.wfy <- tail(luv.dataset$x.unscaled, T.ooslength - 1)
luv.wfyhat <- head(sapply(luv.wf, function(wf) { median(wf$forecast) }), -1)
```
```{r, fig.height = 5}
plot_seqforecast(y = luv.wfy, yhat = luv.wfyhat,
main.label = symbol$symbol)
```
We report the out of sample forecast error measures in the table below. Both the coefficient of determination $R^2$ and the Mean Absolute Percentage Error (MAPE) are consistent with the findings in the original paper.
```{r, echo = FALSE}
luv.err <- luv.wfy - luv.wfyhat
knitr::kable(data.frame(
mean(luv.err^2),
paste0(round(100 * mean(abs(luv.err / luv.wfy)), 2), "%"),
as.numeric(summary(lm(luv.wfy ~ luv.wfyhat))$r.squared)
),
align = "c", digits = 4,
col.names = c("MSE", "MAPE", "$R^2$"),
caption = "Out of sample error measures for LUV.")
```
```{r, message = FALSE, echo = FALSE, results = 'hide'}
rm(stan.fit, stan.data, alpha_tk, gamma_tk, zstar_t, hatx_t, oblik_t, logA_ij)
rm(list = ls()[grep("luv", ls())])
gc()
```
## Ryanair Holdings Plc (RYA.L)
### In-sample analysis
We estimate the model for RYA.L and we then present the in-sample summary.
```{r, message = FALSE, warning = FALSE}
symbol <- syms[2, ] # RYA
prices <- getSymbols(symbol$symbol,
env = NULL,
from = symbol$train.from,
to = symbol$test.to,
src = symbol$src)
T.length <- nrow(prices[paste(symbol$train.from, symbol$train.to, sep = "/")])
dataset <- make_dataset(prices[1:T.length, ], TRUE)
```
```{r}
rstan_options(auto_write = TRUE)
stan.model = '../iohmm-mix/stan/iohmm-hmix.stan'
stan.data = list(
K = K,
L = L,
M = ncol(dataset$u),
T = length(dataset$x),
u_tm = as.array(dataset$u),
x_t = as.vector(dataset$x),
hyperparams = as.array(hyperparams)
)
stan.fit <- stan(file = stan.model,
data = stan.data, verbose = T,
iter = n.iter, warmup = n.warmup,
thin = n.thin, chains = n.chains,
cores = n.cores, seed = n.seed,
init = function() {init_fun(stan.data)})
n.samples = (n.iter - n.warmup) * n.chains
alpha_tk <- extract(stan.fit, pars = 'alpha_tk')[[1]]
gamma_tk <- extract(stan.fit, pars = 'gamma_tk')[[1]]
zstar_t <- extract(stan.fit, pars = 'zstar_t')[[1]]
hatx_t <- extract(stan.fit, pars = 'hatx_t')[[1]]
oblik_t <- extract(stan.fit, pars = 'oblik_t')[[1]]
logA_ij <- extract(stan.fit, pars = 'logA_ij')[[1]]
```
We identify two extreme observations but we decide not to take any further action. State 2 represents a period of time in which the stock suddenly reached the maximum value in its history. As observed before, the hidden states are persistent.
```{r}
knitr::kable(summary(stan.fit,
pars = c('p_1k', 'w_km', 'lambda_kl', 'hypermu_k', 'mu_kl', 's_kl'),
probs = c(0.10, 0.50, 0.90))$summary,
col.names = c("Mean", "MCSE", "SE", "10\\%", "Med", "90\\%", "ESS", "$\\hat{R}$"),
digits = 2, caption = "Summary of the posterior draws from the distribution of the estimated parameters.")
```
```{r, fig.height = 7}
plot_inputoutputprob(x = dataset$x, u = dataset$u,
stateprob = alpha_tk, zstar = zstar_t,