\documentclass[a4paper,11pt,english]{article}
\usepackage{babel}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{amsmath}
\usepackage{microtype}
\usepackage{url}
\urlstyle{same}
\usepackage[bookmarks=false]{hyperref}
\hypersetup{%
bookmarksopen=true,
bookmarksnumbered=true,
pdftitle={Bayesian data analysis},
pdfsubject={Comments},
pdfauthor={Aki Vehtari},
pdfkeywords={Bayesian probability theory, Bayesian inference, Bayesian data analysis},
pdfstartview={FitH -32768},
colorlinks=true,
linkcolor=black,
citecolor=black,
filecolor=black,
urlcolor=black
}
% if not draft, smaller printable area makes the paper more readable
\topmargin -4mm
\oddsidemargin 0mm
\textheight 225mm
\textwidth 160mm
%\parskip=\baselineskip
\DeclareMathOperator{\E}{E}
\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\Sd}{Sd}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\Bin}{Bin}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Invchi2}{Inv-\chi^2}
\DeclareMathOperator{\NInvchi2}{N-Inv-\chi^2}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\N}{N}
\DeclareMathOperator{\U}{U}
\DeclareMathOperator{\tr}{tr}
%\DeclareMathOperator{\Pr}{Pr}
\DeclareMathOperator{\trace}{trace}
\pagestyle{empty}
\begin{document}
\thispagestyle{empty}
\section*{Bayesian data analysis -- reading instructions ch 5}
\smallskip
{\bf Aki Vehtari}
\smallskip
\subsection*{Chapter 5}
Outline of chapter 5
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item 5.1 Lead-in to hierarchical models
\item 5.2 Exchangeability (a useful theoretical concept)
\item 5.3 Bayesian analysis of hierarchical models
\item 5.4 Hierarchical normal model
\item 5.5 Example: parallel experiments in eight schools (uses hierarchical normal model, details of computation can be skipped)
\item 5.6 Meta-analysis (can be skipped in this course)
\item 5.7 Weakly informative priors for hierarchical variance parameters
\end{list}
The hierarchical models in the chapter are kept simple so that the
computation stays simple. More advanced computational tools are
presented in Chapters 10--12 (part of the course) and 13 (not part of
the course).
Demos
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item demo5\_1: Rats example
\item demo5\_2: SAT example
\end{list}
Find all the terms and symbols listed below. When reading the chapter,
write down questions about anything that is unclear to you or that you
think might be unclear to others.
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item population distribution
\item hyperparameter
\item overfit
\item hierarchical model
\item exchangeability
\item invariant to permutations
\item independent and identically distributed
\item ignorance
\item the mixture of independent identical distributions
\item de Finetti's theorem
\item partially exchangeable
\item conditionally exchangeable
\item conditional independence
\item hyperprior
\item different posterior predictive distributions
\item the conditional probability formula
\end{list}
\subsection*{Computation}
Examples in Sections 5.3 and 5.4 continue computation with
factorization and a grid, but there is no need to go deep into the
computational details, as we can use MCMC and Stan instead. The
hierarchical model exercises are done with Stan. A sketch of the grid
idea is shown below.
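If you want to see the grid idea in code, below is a minimal sketch
(with hypothetical data; this is not the course demo code) for a
hierarchical beta-binomial model in the style of the rat tumor example
of Section 5.3. The group-level parameters integrate out analytically,
so the marginal posterior of the hyperparameters can be evaluated on a
grid:
\begin{verbatim}
# Minimal sketch of the factorization-and-grid approach of Section 5.3
# for a hierarchical beta-binomial model; data values are hypothetical.
import numpy as np
from scipy.special import betaln

y = np.array([0, 1, 2, 2, 5])       # successes per group (hypothetical)
n = np.array([20, 20, 20, 20, 20])  # trials per group (hypothetical)

# Grid over the hyperparameters (alpha, beta) of the Beta population prior
A, B = np.meshgrid(np.linspace(0.5, 10, 200), np.linspace(0.5, 40, 200))

# Each theta_j integrates out analytically, leaving a ratio of beta
# functions per group; the hyperprior is p(alpha, beta) proportional to
# (alpha + beta)^(-5/2), as used in Section 5.3 of the book.
log_post = -2.5 * np.log(A + B)
for yj, nj in zip(y, n):
    log_post += betaln(A + yj, B + nj - yj) - betaln(A, B)
post = np.exp(log_post - log_post.max())  # unnormalized posterior on grid
\end{verbatim}
From this grid one can sample $(\alpha,\beta)$ and then each $\theta_j$
from its conditional Beta posterior; in the exercises the same kind of
model is instead written in Stan and sampled with MCMC.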
\subsection*{Exchangeability vs. independence}
Exchangeability and independence are two separate concepts; neither
necessarily implies the other. Independent identically distributed
variables/parameters are exchangeable, and exchangeability is a less
strict condition than independence. Often we may assume that
observations or unobserved quantities are in fact dependent, but if we
cannot get information about these dependencies, we may treat those
observations or unobserved quantities as exchangeable: ``ignorance
implies exchangeability.''
In the case of exchangeable observations, we may sometimes act \emph{as
  if} the observations were independent, if the additional information
potentially gained from the dependencies is very small. This is
related to de Finetti's theorem (p.~105), which applies formally
only when $J\rightarrow\infty$, but in practice the difference may be
small if $J$ is finite but relatively large (see the examples below).
\begin{list}{$\bullet$}{\itemsep=0pt\parsep=4pt\topsep=4pt}
\item If no information other than the data $y$ is available to
  distinguish the $\theta_j$ from each other, and the parameters cannot
  be ordered or grouped, we may assume symmetry between the parameters
  in their prior distribution
\item This symmetry can be represented with exchangeability
\item Parameters $\theta_1,\ldots,\theta_J$ are exchangeable
  in their joint distribution if $p(\theta_1,\ldots,\theta_J)$ is
  invariant to permutations of the indices $(1,\ldots,J)$, as written
  out below
\end{list}
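In symbols, the definition above says that for any permutation $\pi$
of $(1,\ldots,J)$,
\[
p(\theta_1,\ldots,\theta_J) = p(\theta_{\pi(1)},\ldots,\theta_{\pi(J)}),
\]
and de Finetti's representation (formally in the limit
$J\rightarrow\infty$) writes the joint distribution of exchangeable
quantities as a mixture of independent identical distributions,
\[
p(y_1,\ldots,y_J) = \int \prod_{j=1}^{J} p(y_j|\theta)\, p(\theta)\, d\theta.
\]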
Here are some examples you may consider.
Ex 5.1. Exchangeability with known model parameters: For each of the following three examples, answer: (i) Are observations $y_1$ and $y_2$ exchangeable? (ii) Are observations $y_1$ and $y_2$ independent? (iii) Can we act {\em as if} the two observations are independent?
\begin{enumerate}
\item A box has one black ball and one white ball. We pick a ball $y_1$ at random, put it back, and pick another ball $y_2$ at random.
\item A box has one black ball and one white ball. We pick a ball
$y_1$ at random, we do not put it back, then we pick ball
$y_2$.
\item A box has a million black balls and a million white balls. We
pick a ball $y_1$ at random, we do not put it back, then we pick ball
$y_2$ at random.
\end{enumerate}
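If you want to check your intuition numerically, here is a minimal
simulation sketch (not part of the course demos) for case 2, where the
draws turn out to be exchangeable but not independent:
\begin{verbatim}
# Simulation sketch for Ex 5.1, case 2: two balls, no replacement.
import random

random.seed(1)
draws = []
for _ in range(100_000):
    box = ['black', 'white']
    random.shuffle(box)            # draw y1, then y2, without replacement
    draws.append((box[0], box[1]))

# Exchangeable: both orderings have the same probability (~0.5 each)
print(sum(d == ('black', 'white') for d in draws) / len(draws))
print(sum(d == ('white', 'black') for d in draws) / len(draws))
# Not independent: P(y2 = black | y1 = black) = 0, but P(y2 = black) = 0.5
\end{verbatim}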
Ex 5.2. Exchangeability with unknown model parameters: For each of the
following three examples, answer: (i) Are observations $y_1$ and $y_2$
exchangeable? (ii) Are observations $y_1$ and $y_2$ independent? (iii)
Can we act {\em as if} the two observations are independent?
\begin{enumerate}
\item A box has $n$ black and white balls but we do not know how many of each color.
We pick a ball $y_1$ at random, put it back, and pick another ball $y_2$ at random.
\item A box has $n$ black and white balls but we do not know how many of each color. We pick a ball
$y_1$ at random, we do not put it back, then we pick ball
$y_2$ at random.
\item Same as case 2, but we know that there are many balls of each color in the box.
\end{enumerate}
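To connect Ex 5.2 to the modeling assumptions: let $\theta$ denote the
unknown proportion of black balls. Given $\theta$, draws with
replacement are independent, but marginally they are not. For example,
if $\theta$ is $0$ or $1$ with prior probability $1/2$ each, then
\[
\Pr(y_2 = \text{black}) = \tfrac{1}{2}
\qquad \text{but} \qquad
\Pr(y_2 = \text{black} \mid y_1 = \text{black}) = 1,
\]
so the draws are exchangeable and conditionally independent given
$\theta$, yet not independent.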
Note that, for example, in opinion polls the ``balls'', i.e., humans,
are not put back, and there is a large but finite number of humans.
\pagebreak
The following complements the divorce example in the book by
discussing the effect of additional observations.
\begin{list}{$\bullet$}{\itemsep=0pt\parsep=4pt\topsep=4pt}
\item Example: divorce rate per 1000 population in 8 states of the
USA in 1981
\begin{list}{-}{\itemsep=0pt\parsep=0pt\topsep=0pt}
\item without any other knowledge $y_1,\ldots,y_8$ are exchangeable
\item it is reasonable to assume prior independence given the
  population distribution $p(y_i|\theta)$
\end{list}
\item The divorce rates in the first seven states are $5.6, 6.6, 7.8, 5.6,
  7.0, 7.2, 5.4$
\begin{list}{-}{\itemsep=0pt\parsep=0pt\topsep=0pt}
\item now we have some additional information, but changing the
  indexing still does not affect the joint distribution. For example,
  if we were told that the rates were not for the first seven but for
  the last seven states, the joint distribution would not change, and
  thus $y_1,\ldots,y_8$ are exchangeable
\item a sensible assumption is prior independence given the
  population distribution $p(y_i|\theta)$
\item if the ``true'' $\theta_0$ were known, $y_1,\ldots,y_8$ would be
  independent given that ``true'' $\theta_0$
\item since $\theta$ is estimated from the observations, the $y_i$ are
  a posteriori dependent, which is obvious, e.g., from the predictive
  density $p(y_8|y_1,\ldots,y_7,M)$: if $y_1,\ldots,y_7$ are large,
  then $y_8$ is probably large as well (see the sketch after this
  list)
\item if we were told that the given rates were for the last seven
  states, then $p(y_1|y_2,\ldots,y_8,M)$ would be exactly the same as
  $p(y_8|y_1,\ldots,y_7,M)$ above, i.e., changing the indexing has no
  effect since $y_1,\ldots,y_8$ are exchangeable
\end{list}
\item Additionally, we know that $y_8$ is Nevada and that the rates of
  the other states are $5.6, 6.6, 7.8, 5.6, 7.0, 7.2, 5.4$
\begin{list}{-}{\itemsep=0pt\parsep=0pt\topsep=0pt}
\item based on what we were told about Nevada, the predictive density
  $p(y_8|y_1,\ldots,y_7,M)$ should take into account that the
  probability $p(y_8>\max(y_1,\ldots,y_7)|y_1,\ldots,y_7)$ should be
  large
\item if we were told instead that Nevada is $y_3$ (not $y_8$ as
  above), then the new predictive density $p(y_8|y_1,\ldots,y_7,M)$
  would be different, because $y_1,\ldots,y_8$ are no longer
  exchangeable
\end{list}
\end{list}
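As a concrete illustration of the predictive density discussed above,
the sketch below computes $p(y_8|y_1,\ldots,y_7,M)$ for the seven
given rates under a simple normal model; the model choice and the
assumed known observation standard deviation $s=1$ are my own
illustrative assumptions, not from the book:
\begin{verbatim}
# Sketch: posterior predictive p(y8 | y1..y7) under a normal model with
# unknown mean (flat prior) and an assumed known sd s = 1 (illustrative).
import numpy as np

y = np.array([5.6, 6.6, 7.8, 5.6, 7.0, 7.2, 5.4])  # first seven states
s = 1.0                                 # assumed known observation sd

ybar = y.mean()                         # posterior: mu | y ~ N(ybar, s^2/7)
pred_sd = np.sqrt(s**2 + s**2/len(y))   # predictive: y8 ~ N(ybar, s^2+s^2/7)
print(ybar, pred_sd)                    # about 6.46 and 1.07

# Under exchangeability this predictive treats y8 like any other state;
# the extra knowledge that y8 is Nevada cannot be used by this model.
\end{verbatim}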
\subsection*{What if observations are not exchangeable}
Often observations are not fully exchangeable, but are partially or
conditionally exchangeable. There are two basic cases (written out in
symbols after the list):
\begin{list}{-}{\itemsep=0pt\parsep=4pt\topsep=4pt}
\item[1)] If observations can be grouped, we may make a hierarchical
  model, where each group has its own subpart, but the group properties
  are unknown. If we assume that the group properties are exchangeable,
  we may use a common prior for the group properties.
\item[2)] If $y_i$ is associated with additional information $x_i$ so
  that the $y_i$ are not exchangeable but the pairs $(y_i,x_i)$ still
  are, then we can make a joint model for $(y_i,x_i)$ or a conditional
  model for $(y_i|x_i)$.
\end{list}
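In symbols, case 1 is the hierarchical model of Sections 5.3--5.4; for
example, the hierarchical normal model is
\[
y_j \mid \theta_j \sim \N(\theta_j, \sigma_j^2), \qquad
\theta_j \mid \mu,\tau \sim \N(\mu, \tau^2), \qquad
p(\mu,\tau) \ \text{a hyperprior},
\]
and case 2 corresponds to conditional modeling,
\[
p(y_1,\ldots,y_n \mid x_1,\ldots,x_n,\theta)
= \prod_{i=1}^{n} p(y_i \mid x_i, \theta).
\]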
Here are additional examples (Bernardo \& Smith, Bayesian Theory,
1994) that illustrate the basic cases above. Think of an old-fashioned
thumb pin. This kind of pin can lie flat on its base, or slanted so
that the pin head and the edge of the base touch the table. Such a pin
is a realistic version of an ``unfair'' coin.
\begin{list}{-}{\itemsep=0pt\parsep=4pt\topsep=4pt}
\item[1)] Let's throw the pin $n$ times and mark $x_i=1$ when the pin
  lands on its base. Let's assume that the throwing conditions stay the
  same all the time. Most would accept the throws as exchangeable.
\item[2)] Same experiment, but the odd-numbered throws are made with a
  full-metal pin and the even-numbered throws with a plastic-coated
  pin. Most would accept exchangeability for all the odd and all the
  even throws separately, but not necessarily for both series combined.
  Thus we have partial exchangeability.
\item[3)] Laboratory experiments $x_1,\ldots,x_n$ are real-valued
  measurements of a chemical property of some substance. If all the
  measurements are from the same sample, in the same laboratory, with
  the same procedure, most would accept exchangeability. If the
  experiments were made, for example, in different laboratories, we
  could assume partial exchangeability.
\item[4)] $x_1,\ldots,x_n$ are real-valued measurements of the
  physiological reactions to a certain medicine. Different test
  subjects get different amounts of the medicine. The test subjects are
  males and females of different ages. If the attributes of the test
  subjects were known, most would not accept the results as
  exchangeable. Within a group with a certain dose, sex, and age, the
  measurements could be assumed exchangeable. We could use grouping,
  or, if the doses and attributes are continuous, we could use
  regression, i.e., assume conditional independence.
\end{list}
\subsection*{Weakly informative priors for hierarchical variance parameters}
Our thinking has advanced since Section 5.7 was written. Section 5.7
(p.~128--) recommends use of the half-Cauchy as a weakly informative
prior for hierarchical variance parameters. A more recent
recommendation is a half-normal if you have substantial information on
the high-end values, or a half-$t_4$ if there might be a possibility
of surprise. Often we don't have enough prior information to define
the exact tail shape of the prior well, but a half-normal usually
produces more sensible prior predictive distributions and is thus
better justified. A half-normal also usually leads to easier
inference. See the Prior Choice Wiki
(\url{https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations})
for more recent general discussion and model-specific recommendations.
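To see why the tail shape matters, one can compare prior draws
directly; a minimal sketch, where the scale 5 is an arbitrary
illustrative choice:
\begin{verbatim}
# Sketch: tail behavior of half-normal vs half-Cauchy priors for a
# group-level sd parameter tau; the scale 5 is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(0)
tau_halfnormal = np.abs(rng.normal(0, 5, size=100_000))
tau_halfcauchy = np.abs(5 * rng.standard_cauchy(size=100_000))

# The half-Cauchy occasionally produces huge values of tau, which can
# make the implied prior predictive distribution of the data unreasonable.
print(np.percentile(tau_halfnormal, 99))  # roughly 13
print(np.percentile(tau_halfcauchy, 99))  # typically hundreds
\end{verbatim}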
\end{document}
%%% Local Variables:
%%% TeX-PDF-mode: t
%%% TeX-master: t
%%% End: