\documentclass[a4paper,11pt,english]{article}
\usepackage{babel}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{amsmath}
\usepackage{microtype}
\usepackage{url}
\urlstyle{same}
\usepackage[bookmarks=false]{hyperref}
\hypersetup{%
bookmarksopen=true,
bookmarksnumbered=true,
pdftitle={Bayesian data analysis},
pdfsubject={Comments},
pdfauthor={Aki Vehtari},
pdfkeywords={Bayesian probability theory, Bayesian inference, Bayesian data analysis},
pdfstartview={FitH -32768},
colorlinks=true,
linkcolor=black,
citecolor=black,
filecolor=black,
urlcolor=black
}
% if not draft, smaller printable area makes the paper more readable
\topmargin -4mm
\oddsidemargin 0mm
\textheight 225mm
\textwidth 160mm
%\parskip=\baselineskip
\DeclareMathOperator{\E}{E}
\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\Sd}{Sd}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\Bin}{Bin}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Invchi2}{Inv-\chi^2}
\DeclareMathOperator{\NInvchi2}{N-Inv-\chi^2}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\N}{N}
\DeclareMathOperator{\U}{U}
\DeclareMathOperator{\tr}{tr}
%\DeclareMathOperator{\Pr}{Pr}
\DeclareMathOperator{\trace}{trace}
\pagestyle{empty}
\begin{document}
\thispagestyle{empty}
\section*{Bayesian data analysis -- reading instructions ch 5}
\smallskip
{\bf Aki Vehtari}
\smallskip
\subsection*{Chapter 5}
Outline of chapter 5
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item 5.1 Lead-in to hierarchical models
\item 5.2 Exchangeability (a useful theoretical concept)
\item 5.3 Bayesian analysis of hierarchical models
\item 5.4 Hierarchical normal model
\item 5.5 Example: parallel experiments in eight schools (uses hierarchical normal model, details of computation can be skipped)
\item 5.6 Meta-analysis (can be skipped in this course)
\item 5.7 Weakly informative priors for hierarchical variance parameters
\end{list}
The hierarchical models in the chapter are kept simple so that the
computation stays simple. More advanced computational tools are
presented in Chapters 10--12 (part of the course) and 13 (not part of
the course).
Demos
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item demo5\_1: Rats example
\item demo5\_2: SAT example
\end{list}
Find all the terms and symbols listed below. When reading the chapter,
write down questions about anything that is unclear to you or that you
think might be unclear to others.
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item population distribution
\item hyperparameter
\item overfit
\item hierarchical model
\item exchangeability
\item invariant to permutations
\item independent and identically distributed
\item ignorance
\item the mixture of independent identical distributions
\item de Finetti's theorem
\item partially exchangeable
\item conditionally exchangeable
\item conditional independence
\item hyperprior
\item different posterior predictive distributions
\item the conditional probability formula
\end{list}
\subsection*{Computation}
Examples in Sections 5.3 and 5.4 continue computation with
factorization and a grid, but there is no need to go deep into the
computational details, as we can use MCMC and Stan instead. The
hierarchical model exercises are done with Stan. A sketch of the grid
idea is shown below.
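If you want to see the grid idea in code, below is a minimal sketch
(with hypothetical data; this is not the course demo code) for a
hierarchical beta-binomial model in the style of the rat tumor example
of Section 5.3. The group-level parameters integrate out analytically,
so the marginal posterior of the hyperparameters can be evaluated on a
grid:
\begin{verbatim}
# Minimal sketch of the factorization-and-grid approach of Section 5.3
# for a hierarchical beta-binomial model; data values are hypothetical.
import numpy as np
from scipy.special import betaln

y = np.array([0, 1, 2, 2, 5])       # successes per group (hypothetical)
n = np.array([20, 20, 20, 20, 20])  # trials per group (hypothetical)

# Grid over the hyperparameters (alpha, beta) of the Beta population prior
A, B = np.meshgrid(np.linspace(0.5, 10, 200), np.linspace(0.5, 40, 200))

# Each theta_j integrates out analytically, leaving a ratio of beta
# functions per group; the hyperprior is p(alpha, beta) proportional to
# (alpha + beta)^(-5/2), as used in Section 5.3 of the book.
log_post = -2.5 * np.log(A + B)
for yj, nj in zip(y, n):
    log_post += betaln(A + yj, B + nj - yj) - betaln(A, B)
post = np.exp(log_post - log_post.max())  # unnormalized posterior on grid
\end{verbatim}
From this grid one can sample $(\alpha,\beta)$ and then each $\theta_j$
from its conditional Beta posterior; in the exercises the same kind of
model is instead written in Stan and sampled with MCMC.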
\subsection*{Exchangeability vs. independence}
Exchangeability and independence are two separate concepts; neither
necessarily implies the other. Independent identically distributed
variables/parameters are exchangeable, and exchangeability is a less
strict condition than independence. Often we may assume that
observations or unobserved quantities are in fact dependent, but if we
cannot get information about these dependencies, we may treat those
observations or unobserved quantities as exchangeable: ``ignorance
implies exchangeability.''
In the case of exchangeable observations, we may sometimes act \emph{as
  if} the observations were independent, if the additional information
potentially gained from the dependencies is very small. This is
related to de Finetti's theorem (p.~105), which applies formally
only when $J\rightarrow\infty$, but in practice the difference may be
small if $J$ is finite but relatively large (see the examples below).
\begin{list}{$\bullet$}{\itemsep=0pt\parsep=4pt\topsep=4pt}
\item If no information other than the data $y$ is available to
  distinguish the $\theta_j$ from each other, and the parameters cannot
  be ordered or grouped, we may assume symmetry between the parameters
  in their prior distribution
\item This symmetry can be represented with exchangeability
\item Parameters $\theta_1,\ldots,\theta_J$ are exchangeable
  in their joint distribution if $p(\theta_1,\ldots,\theta_J)$ is
  invariant to permutations of the indices $(1,\ldots,J)$, as written
  out below
\end{list}
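In symbols, the definition above says that for any permutation $\pi$
of $(1,\ldots,J)$,
\[
p(\theta_1,\ldots,\theta_J) = p(\theta_{\pi(1)},\ldots,\theta_{\pi(J)}),
\]
and de Finetti's representation (formally in the limit
$J\rightarrow\infty$) writes the joint distribution of exchangeable
quantities as a mixture of independent identical distributions,
\[
p(y_1,\ldots,y_J) = \int \prod_{j=1}^{J} p(y_j|\theta)\, p(\theta)\, d\theta.
\]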
Here are some examples you may consider.
Ex 5.1. Exchangeability with known model parameters: For each of the following three examples, answer: (i) Are observations $y_1$ and $y_2$ exchangeable? (ii) Are observations $y_1$ and $y_2$ independent? (iii) Can we act {\em as if} the two observations are independent?
\begin{enumerate}
\item A box has one black ball and one white ball. We pick a ball $y_1$ at random, put it back, and pick another ball $y_2$ at random.
\item A box has one black ball and one white ball. We pick a ball
$y_1$ at random, we do not put it back, then we pick ball
$y_2$.
\item A box has a million black balls and a million white balls. We
pick a ball $y_1$ at random, we do not put it back, then we pick ball
$y_2$ at random.
\end{enumerate}
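If you want to check your intuition numerically, here is a minimal
simulation sketch (not part of the course demos) for case 2, where the
draws turn out to be exchangeable but not independent:
\begin{verbatim}
# Simulation sketch for Ex 5.1, case 2: two balls, no replacement.
import random

random.seed(1)
draws = []
for _ in range(100_000):
    box = ['black', 'white']
    random.shuffle(box)            # draw y1, then y2, without replacement
    draws.append((box[0], box[1]))

# Exchangeable: both orderings have the same probability (~0.5 each)
print(sum(d == ('black', 'white') for d in draws) / len(draws))
print(sum(d == ('white', 'black') for d in draws) / len(draws))
# Not independent: P(y2 = black | y1 = black) = 0, but P(y2 = black) = 0.5
\end{verbatim}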
Ex 5.2. Exchangeability with unknown model parameters: For each of the
following three examples, answer: (i) Are observations $y_1$ and $y_2$
exchangeable? (ii) Are observations $y_1$ and $y_2$ independent? (iii)
Can we act {\em as if} the two observations are independent?
\begin{enumerate}
\item A box has $n$ black and white balls but we do not know how many of each color.
We pick a ball $y_1$ at random, put it back, and pick another ball $y_2$ at random.
\item A box has $n$ black and white balls but we do not know how many of each color. We pick a ball
$y_1$ at random, we do not put it back, then we pick ball
$y_2$ at random.
\item Same as case 2, but we know that there are many balls of each color in the box.
\end{enumerate}
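To connect Ex 5.2 to the modeling assumptions: let $\theta$ denote the
unknown proportion of black balls. Given $\theta$, draws with
replacement are independent, but marginally they are not. For example,
if $\theta$ is $0$ or $1$ with prior probability $1/2$ each, then
\[
\Pr(y_2 = \text{black}) = \tfrac{1}{2}
\qquad \text{but} \qquad
\Pr(y_2 = \text{black} \mid y_1 = \text{black}) = 1,
\]
so the draws are exchangeable and conditionally independent given
$\theta$, yet not independent.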
Note that, for example, in opinion polls the ``balls'', i.e., humans,
are not put back, and there is a large but finite number of humans.
\pagebreak
The following complements the divorce example in the book by
discussing the effect of additional observations.
\begin{list}{$\bullet$}{\itemsep=0pt\parsep=4pt\topsep=4pt}
\item Example: divorce rate per 1000 population in 8 states of the
USA in 1981
\begin{list}{-}{\itemsep=0pt\parsep=0pt\topsep=0pt}
\item without any other knowledge $y_1,\ldots,y_8$ are exchangeable
\item it is reasonable to assume prior independence given the
  population distribution $p(y_i|\theta)$
\end{list}
\item The divorce rates in the first seven states are $5.6, 6.6, 7.8, 5.6,
  7.0, 7.2, 5.4$
\begin{list}{-}{\itemsep=0pt\parsep=0pt\topsep=0pt}
\item now we have some additional information, but changing the
  indexing still does not affect the joint distribution. For example,
  if we were told that the rates were not for the first seven but for
  the last seven states, the joint distribution would not change, and
  thus $y_1,\ldots,y_8$ are exchangeable
\item a sensible assumption is prior independence given the
  population distribution $p(y_i|\theta)$
\item if the ``true'' $\theta_0$ were known, $y_1,\ldots,y_8$ would be
  independent given that ``true'' $\theta_0$
\item since $\theta$ is estimated from the observations, the $y_i$ are
  a posteriori dependent, which is obvious, e.g., from the predictive
  density $p(y_8|y_1,\ldots,y_7,M)$: if $y_1,\ldots,y_7$ are large,
  then $y_8$ is probably large as well (see the sketch after this
  list)
\item if we were told that the given rates were for the last seven
  states, then $p(y_1|y_2,\ldots,y_8,M)$ would be exactly the same as
  $p(y_8|y_1,\ldots,y_7,M)$ above, i.e., changing the indexing has no
  effect since $y_1,\ldots,y_8$ are exchangeable
\end{list}
\item Additionally, we know that $y_8$ is Nevada and that the rates of
  the other states are $5.6, 6.6, 7.8, 5.6, 7.0, 7.2, 5.4$
\begin{list}{-}{\itemsep=0pt\parsep=0pt\topsep=0pt}
\item based on what we were told about Nevada, the predictive density
  $p(y_8|y_1,\ldots,y_7,M)$ should take into account that the
  probability $p(y_8>\max(y_1,\ldots,y_7)|y_1,\ldots,y_7)$ should be
  large
\item if we were told instead that Nevada is $y_3$ (not $y_8$ as
  above), then the new predictive density $p(y_8|y_1,\ldots,y_7,M)$
  would be different, because $y_1,\ldots,y_8$ are no longer
  exchangeable
\end{list}
\end{list}
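As a concrete illustration of the predictive density discussed above,
the sketch below computes $p(y_8|y_1,\ldots,y_7,M)$ for the seven
given rates under a simple normal model; the model choice and the
assumed known observation standard deviation $s=1$ are my own
illustrative assumptions, not from the book:
\begin{verbatim}
# Sketch: posterior predictive p(y8 | y1..y7) under a normal model with
# unknown mean (flat prior) and an assumed known sd s = 1 (illustrative).
import numpy as np

y = np.array([5.6, 6.6, 7.8, 5.6, 7.0, 7.2, 5.4])  # first seven states
s = 1.0                                 # assumed known observation sd

ybar = y.mean()                         # posterior: mu | y ~ N(ybar, s^2/7)
pred_sd = np.sqrt(s**2 + s**2/len(y))   # predictive: y8 ~ N(ybar, s^2+s^2/7)
print(ybar, pred_sd)                    # about 6.46 and 1.07

# Under exchangeability this predictive treats y8 like any other state;
# the extra knowledge that y8 is Nevada cannot be used by this model.
\end{verbatim}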
\subsection*{What if observations are not exchangeable}
Often observations are not fully exchangeable, but are partially or
conditionally exchangeable. There are two basic cases (written out in
symbols after the list):
\begin{list}{-}{\itemsep=0pt\parsep=4pt\topsep=4pt}
\item[1)] If observations can be grouped, we may make a hierarchical
  model, where each group has its own subpart, but the group properties
  are unknown. If we assume that the group properties are exchangeable,
  we may use a common prior for the group properties.
\item[2)] If $y_i$ is associated with additional information $x_i$ so
  that the $y_i$ are not exchangeable but the pairs $(y_i,x_i)$ still
  are, then we can make a joint model for $(y_i,x_i)$ or a conditional
  model for $(y_i|x_i)$.
\end{list}
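In symbols, case 1 is the hierarchical model of Sections 5.3--5.4; for
example, the hierarchical normal model is
\[
y_j \mid \theta_j \sim \N(\theta_j, \sigma_j^2), \qquad
\theta_j \mid \mu,\tau \sim \N(\mu, \tau^2), \qquad
p(\mu,\tau) \ \text{a hyperprior},
\]
and case 2 corresponds to conditional modeling,
\[
p(y_1,\ldots,y_n \mid x_1,\ldots,x_n,\theta)
= \prod_{i=1}^{n} p(y_i \mid x_i, \theta).
\]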
Here are additional examples (Bernardo \& Smith, Bayesian Theory,
1994) that illustrate the basic cases above. Think of an old-fashioned
thumb pin. This kind of pin can lie flat on its base, or slanted so
that the pin head and the edge of the base touch the table. Such a pin
is a realistic version of an ``unfair'' coin.
\begin{list}{-}{\itemsep=0pt\parsep=4pt\topsep=4pt}
\item[1)] Let's throw the pin $n$ times and mark $x_i=1$ when the pin
  lands on its base. Let's assume that the throwing conditions stay the
  same all the time. Most would accept the throws as exchangeable.
\item[2)] Same experiment, but the odd-numbered throws are made with a
  full-metal pin and the even-numbered throws with a plastic-coated
  pin. Most would accept exchangeability for all the odd and all the
  even throws separately, but not necessarily for both series combined.
  Thus we have partial exchangeability.
\item[3)] Laboratory experiments $x_1,\ldots,x_n$ are real-valued
  measurements of a chemical property of some substance. If all the
  measurements are from the same sample, in the same laboratory, with
  the same procedure, most would accept exchangeability. If the
  experiments were made, for example, in different laboratories, we
  could assume partial exchangeability.
\item[4)] $x_1,\ldots,x_n$ are real-valued measurements of the
  physiological reactions to a certain medicine. Different test
  subjects get different amounts of the medicine. The test subjects are
  males and females of different ages. If the attributes of the test
  subjects were known, most would not accept the results as
  exchangeable. Within a group with a certain dose, sex, and age, the
  measurements could be assumed exchangeable. We could use grouping,
  or, if the doses and attributes are continuous, we could use
  regression, i.e., assume conditional independence.
\end{list}
\subsection*{Weakly informative priors for hierarchical variance parameters}
Our thinking has advanced since Section 5.7 was written. Section 5.7
(p.~128--) recommends use of the half-Cauchy as a weakly informative
prior for hierarchical variance parameters. A more recent
recommendation is a half-normal if you have substantial information on
the high-end values, or a half-$t_4$ if there might be a possibility
of surprise. Often we don't have enough prior information to define
the exact tail shape of the prior well, but a half-normal usually
produces more sensible prior predictive distributions and is thus
better justified. A half-normal also usually leads to easier
inference. See the Prior Choice Wiki
(\url{https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations})
for more recent general discussion and model-specific recommendations.
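To see why the tail shape matters, one can compare prior draws
directly; a minimal sketch, where the scale 5 is an arbitrary
illustrative choice:
\begin{verbatim}
# Sketch: tail behavior of half-normal vs half-Cauchy priors for a
# group-level sd parameter tau; the scale 5 is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(0)
tau_halfnormal = np.abs(rng.normal(0, 5, size=100_000))
tau_halfcauchy = np.abs(5 * rng.standard_cauchy(size=100_000))

# The half-Cauchy occasionally produces huge values of tau, which can
# make the implied prior predictive distribution of the data unreasonable.
print(np.percentile(tau_halfnormal, 99))  # roughly 13
print(np.percentile(tau_halfcauchy, 99))  # typically hundreds
\end{verbatim}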
\end{document}
%%% Local Variables:
%%% TeX-PDF-mode: t
%%% TeX-master: t
%%% End: