\section{Materials and Methods}
\subsection*{Ethics statement}
All procedures were conducted in accordance with the ethical guidelines of the National Institutes of Health and were approved by the Baylor College of Medicine IACUC.
\subsection*{Surgery and two-photon imaging}
The surgical procedures and data acquisition were performed as described in \citet{Cotton:2013}: C57BL/6J mice (aged p40--60) were used. For surgery, animals were initially anesthetized with isoflurane (3\%). During the experiments, animals were sedated with a mixture of fentanyl (0.05 mg/kg), midazolam (5 mg/kg), and medetomidine (0.5 mg/kg), with boosts of half the initial dose every 3 hours. A craniotomy was performed over the right primary visual cortex. The membrane-permeant calcium indicator Oregon Green 488 BAPTA-1 AM (OGB-1, Invitrogen) was loaded by bolus injection. The craniotomy was sealed using a glass coverslip secured with dental cement.
Calcium imaging began 1 hour after dye injection. All imaging was performed using 3D-RAMP two-photon microscopy \citep{Cotton:2013}. First, a 3D stack was acquired and cells were manually segmented. Then calcium signals were collected by sampling in the center of each cell at rates of 100 Hz or higher, depending on the number of cells.
\subsection*{Visual stimulus}
The visual stimulus consisted of full-field drifting gratings with 90\% contrast, luminance of 10 cd/m$^2$, spatial frequency of 0.08 cycles/degree, and temporal frequency of 2 cycles/s. Two types of stimuli were presented at each imaging site: First, directional tuning was mapped using a pseudo-random sequence of drifting gratings at sixteen directions of motion, 500 ms per direction, without blanks, with 12--30 trials for each direction of motion. Second, to measure correlations, the stimulus was modified to include only two directions of motion (in 9 datasets) or five directions (in 22 datasets); the gratings were presented for 1 s each, separated by 1 s blanks, with 100--300 trials for each direction of motion.
\subsection*{Data processing}
All data were processed in MATLAB using the DataJoint data processing chain toolbox (http://datajoint.github.com).
The measured fluorescence traces were deconvolved to reconstruct the firing rates for each neuron: First, the first principal component was subtracted from the raw traces in order to reduce common mode noise related to small cardiovascular movements \citep{Cotton:2013}. The resulting traces were high-pass filtered above 0.1 Hz and downsampled to 20 Hz (Fig.~\ref{fig:3} C). Then, the firing rates were estimated by nonnegative deconvolution \citep{Vogelstein:2010}.
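For illustration, the following MATLAB sketch recovers a nonnegative rate estimate from a single preprocessed trace by nonnegative least squares; the exponential kernel and its time constant \texttt{tau} are simplifying assumptions rather than the exact generative model of \citep{Vogelstein:2010}.
\begin{verbatim}
% Minimal sketch of nonnegative deconvolution for one preprocessed trace.
% Assumes y is a high-pass filtered trace downsampled to 20 Hz (column
% vector); tau is an assumed indicator decay constant, not from the text.
fs  = 20;                                       % sampling rate (Hz)
tau = 0.5;                                      % assumed decay constant (s)
t   = (0:numel(y)-1)'/fs;
k   = exp(-t/tau);                              % exponential impulse response
K   = toeplitz(k, [k(1) zeros(1,numel(y)-1)]);  % dense convolution matrix;
                                                % suitable only for short traces
r   = lsqnonneg(K, y);                          % nonnegative rate estimate
\end{verbatim}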
Orientation tuning was computed by fitting the mean firing rates for each direction of motion $\phi$ with two-peaked von Mises tuning functions $f(\phi)=a + b\exp\left[\frac 1 w(\cos(\phi-\theta)-1) \right] + c\exp\left[\frac 1 w(\cos(\phi-\theta+\pi)-1) \right]$, where $b\ge c$ are the amplitudes of the two respective peaks, $w$ is the tuning width, and $\theta$ is the preferred direction. The significance of the fit was determined by a permutation test: the direction labels were randomly permuted 10,000 times, and the $p$-value of the fit was computed as the fraction of permutations that yielded $R^2$ equal to or higher than that of the original data. Cells with $p<0.05$ were considered tuned.
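The following sketch illustrates the fit and the permutation test for a single cell; \texttt{phi} and \texttt{m} (column vectors with one entry per trial: direction and mean rate) are assumed inputs, the constraint $b \ge c$ is not enforced, and \texttt{fminsearch} stands in for whatever optimizer was actually used.
\begin{verbatim}
% Sketch of the two-peaked von Mises fit with a permutation test.
% phi: direction per trial (rad); m: mean firing rate per trial.
f   = @(q,phi) q(1) + q(2)*exp((cos(phi-q(5))-1)/q(4)) ...
             + q(3)*exp((cos(phi-q(5)+pi)-1)/q(4));  % q = [a b c w theta]
sse = @(q,y) sum((y - f(q,phi)).^2);
q0  = [mean(m), max(m)-min(m), (max(m)-min(m))/2, 1, 0];
qF  = fminsearch(@(q) sse(q,m), q0);
R2  = 1 - sse(qF,m)/sum((m-mean(m)).^2);
nPerm = 10000; R2p = zeros(nPerm,1);
for i = 1:nPerm
    mp = m(randperm(numel(m)));              % shuffle direction labels
    qP = fminsearch(@(q) sse(q,mp), q0);
    R2p(i) = 1 - sse(qP,mp)/sum((mp-mean(mp)).^2);
end
p = mean(R2p >= R2);                         % permutation p-value
\end{verbatim}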
For covariance estimation, the analysis was limited to the period with 2 or 5 stimulus conditions, which lasted between 14 and 27 minutes (mean 22 minutes). Cells that did not have substantial spiking activity (those whose variance was less than 1\% of the median across the site) or whose activity was unstable (those whose variance in the least active quarter of the recording did not exceed 1\% of the variance in the most active quarter) were excluded from the analysis.
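A sketch of these inclusion criteria, assuming a matrix \texttt{r} of inferred rates (time bins $\times$ cells) and contiguous temporal quarters:
\begin{verbatim}
% Sketch of the activity and stability criteria (r: time bins x cells).
v = var(r, 1);                            % total variance of each cell
active = v > 0.01*median(v);              % >1% of the median across the site
q = floor(size(r,1)/4);                   % length of one quarter, in bins
vq = zeros(4, size(r,2));
for k = 1:4
    vq(k,:) = var(r((k-1)*q+1 : k*q, :), 1);  % variance in each quarter
end
stable = min(vq,[],1) > 0.01*max(vq,[],1);    % least vs. most active quarter
keep = active & stable;                       % cells retained for analysis
\end{verbatim}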
\subsection*{Cross-validation}
To compare the performance of the estimators, we used conventional 10-fold cross-validation: Trials were randomly divided into 10 subsets with approximately equal numbers of trials of each condition in each subset. Each subset was then used as the testing sample with the rest of the data used as the training sample for estimating the covariance matrix. The average validation loss over the 10 folds was reported.
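A sketch of the stratified fold assignment, assuming a vector \texttt{cond} with the condition label of each trial:
\begin{verbatim}
% Sketch of stratified 10-fold assignment (cond: condition per trial).
nFolds = 10;
fold = zeros(size(cond));
for c = unique(cond(:))'                 % loop over conditions
    idx = find(cond == c);
    idx = idx(randperm(numel(idx)));     % shuffle this condition's trials
    fold(idx) = mod(0:numel(idx)-1, nFolds) + 1;  % spread across folds
end
\end{verbatim}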
Since each of the regularized estimators had one or two hyperparameters, we used \emph{nested cross-validation}: The outer loop evaluated the performance of the estimators with the hyperparameter values optimized by cross-validation within the inner loop. Hyperparameters were optimized by a two-phase search algorithm: a random search to find a good starting point, followed by a pattern search to converge to the minimum. The inner cross-validation loop subdivided the training dataset from the outer loop to perform 10-fold cross-validation in order to evaluate each choice of the hyperparameter values. Thus the training dataset within the inner loop comprised 81\% of the entire recording. Fig.~S1 illustrates the dependence of the validation loss on the hyperparameters of the $C_{\sf sparse+latent}$ estimator for the example site shown in Figures \ref{fig:3} and \ref{fig:5}, along with the optimal value found by the pattern search algorithm.
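The following self-contained sketch illustrates the nested structure on a trials $\times$ cells matrix \texttt{X}, reusing the \texttt{fold} vector from the previous sketch. For brevity it uses a grid search over a single shrinkage hyperparameter in place of the random/pattern search, an unstratified inner split, and assumes that Eq.~\ref{eq:vloss} is the normal negative log-likelihood loss.
\begin{verbatim}
% Sketch of nested cross-validation with a one-parameter shrinkage
% estimator standing in for the regularized estimators of the paper.
loss   = @(C,Cv) log(det(C)) + trace(C\Cv);   % assumed form of Eq. vloss
shrink = @(Cs,lam) (1-lam)*Cs + lam*diag(diag(Cs));
lams = 0.05:0.05:0.95; vloss = zeros(10,1);
for outer = 1:10
    Xtr = X(fold ~= outer, :); Xte = X(fold == outer, :);
    inner = mod(0:size(Xtr,1)-1, 10)' + 1;    % inner 10-fold split
    il = zeros(numel(lams), 1);
    for k = 1:10                              % inner loop: tune lambda
        Ci = cov(Xtr(inner ~= k, :), 1); Cv = cov(Xtr(inner == k, :), 1);
        for j = 1:numel(lams)
            il(j) = il(j) + loss(shrink(Ci, lams(j)), Cv);
        end
    end
    [~, j] = min(il);                         % best inner-loop lambda
    vloss(outer) = loss(shrink(cov(Xtr,1), lams(j)), cov(Xte,1));
end
meanValidationLoss = mean(vloss);
\end{verbatim}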
When the validation loss was not required, only the inner loop of cross-validation was used on the entire dataset. This approach was used to compute the covariance matrix estimates and their true loss in the simulation study (Fig.~\ref{fig:1} Rows 4 and 5) and to analyze the partial correlation structure of the $C_{\sf sparse+latent}$ estimator (Fig.~\ref{fig:5}--\ref{fig:7}).
\subsection*{Covariance estimation}
Within the inner loop of cross-validation, regularized covariance matrix estimation required only the sample covariance matrix $C_{\sf sample}$ of the training dataset and the hyperparameter values provided by the outer loop.
Estimator $C_{\sf diag}$ (Eq.~\ref{eq:c-diag}) used two hyperparameters: the covariance shrinkage intensity $\lambda \in [0,1]$ and variance shrinkage intensity $\alpha \in [0,1]$. The variances (the diagonal of $C_{\sf sample}$) were shrunk linearly toward their mean value $\frac 1 p \Tr(C_{\sf sample})$:
\begin{equation}
D = (1-\alpha)\mbox{diag}(C_{\sf sample}) + \alpha \frac 1 p \Tr(C_{\sf sample}) I
\end{equation}
The $C_{\sf diag}$ estimate was then obtained by shrinking $C_{\sf sample}$ toward $D$ according to Eq.~\ref{eq:c-diag}.
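A sketch of this computation, assuming that Eq.~\ref{eq:c-diag} is linear shrinkage of $C_{\sf sample}$ toward the target $D$ with intensity $\lambda$:
\begin{verbatim}
% Sketch of the C_diag estimator; Csample, alpha, lambda assumed given.
p = size(Csample, 1);
D = (1-alpha)*diag(diag(Csample)) + alpha*(trace(Csample)/p)*eye(p);
Cdiag = (1-lambda)*Csample + lambda*D;        % shrink toward target D
\end{verbatim}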
In estimator $C_{\sf factor}$ (Eq.~\ref{eq:c-factor}), the low-rank matrix $L$ and the diagonal matrix $D$ are found by solving the minimization problem
\begin{equation}
(L,D) = \argmin\limits_{\hat L,\hat D} \loss{\hat L + \hat D,C_{\sf sample}},
\end{equation}
by an expectation-maximization (EM) algorithm with the rank of $L$ specified as a hyperparameter. In addition, the diagonal matrix of individual variances is shrunk toward its mean value according to Eq.~\ref{eq:c-factor}.
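A minimal sketch of such an EM iteration, operating directly on the sample covariance; the final shrinkage of the diagonal (Eq.~\ref{eq:c-factor}) is omitted, and the random initialization and fixed iteration count are simplifying assumptions.
\begin{verbatim}
% Sketch of EM for the factor model Csample ~ Lf*Lf' + diag(D).
p = size(Csample, 1); k = rankL;          % rankL: the specified rank of L
Lf = randn(p, k); D = diag(Csample);      % crude random initialization
for it = 1:200                            % fixed iteration count (assumed)
    B  = Lf' / (Lf*Lf' + diag(D));        % E-step: k x p regression matrix
    Ez = eye(k) - B*Lf + B*Csample*B';    % E[zz'] under current parameters
    Lf = (Csample*B') / Ez;               % M-step: update loadings
    D  = diag(Csample - Lf*B*Csample);    % M-step: update unique variances
end
Cfactor = Lf*Lf' + diag(D);
\end{verbatim}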
In estimator $C_{\sf sparse}$ (Eq.~\ref{eq:c-sparse}), the sparse precision matrix $S$ is found by minimizing the $L_1$-penalized loss with regularization parameter $\lambda$:
\begin{equation}
S = \argmin\limits_{\hat S \succ 0} \loss{{\hat S}^{-1},C_{\sf sample}} + \lambda \|\hat S \|_1
\end{equation}
where $\hat S\succ 0$ denotes the constraint that $\hat S$ be a positive-definite matrix and $\|\hat S\|_1$ is the element-wise $L_1$ norm of the matrix $\hat S$. This problem formulation is known as the \emph{graphical lasso} \citep{Meinshausen:2006, Friedman:2008}. To solve this minimization problem, we adapted the alternating direction method of multipliers (ADMM) \citep{Ma:2013}.
Unlike $C_{\sf diag}$ and $C_{\sf factor}$, this estimator does not include linear shrinkage: the selection of the sparsity level provides sufficient flexibility to fine-tune the regularization level.
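For illustration, a compact MATLAB sketch of the standard ADMM iterations for this problem follows (splitting $X=Z$ with scaled dual variable $U$); the penalty parameter \texttt{rho}, the fixed iteration count, and the penalization of the diagonal are simplifying assumptions and may differ from the adapted algorithm of \citep{Ma:2013}.
\begin{verbatim}
% Sketch of ADMM for the graphical lasso; Csample and lambda assumed given.
p = size(Csample, 1); rho = 1;
Z = eye(p); U = zeros(p);
soft = @(A,t) sign(A).*max(abs(A)-t, 0);         % elementwise soft threshold
for it = 1:500
    M = rho*(Z - U) - Csample; M = (M + M')/2;   % symmetrize
    [Q,E] = eig(M); e = diag(E);
    x = (e + sqrt(e.^2 + 4*rho)) / (2*rho);      % X-update eigenvalues
    X = Q*diag(x)*Q';                            % X-update: keeps X > 0
    Z = soft(X + U, lambda/rho);                 % Z-update: L1 prox
    U = U + X - Z;                               % scaled dual update
end
S = Z;                                           % sparse precision estimate
\end{verbatim}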
Estimator $C_{\sf sparse+latent}$ (Eq.~\ref{eq:c-sl}) estimates a larger sparse precision matrix $S^\ast$ of the joint distribution of the $p$ observed neurons and $d$ latent units:
\begin{equation}
S^\ast=
\begin{pmatrix}
S & S_{12} \\
S_{12}^\T & S_{22}
\end{pmatrix},
\end{equation}
where the $p\times p$ partition $S$ corresponds to the visible units.
Then the covariance matrix of the observed population is
\begin{equation}
C_{\sf sparse+latent} = \left(S-S_{12}S_{22}^{-1}S_{12}^\T\right)^{-1}
\end{equation}
The rank of the $p\times p$ matrix $L=S_{12}S_{22}^{-1}S_{12}^\T$ matches the number of latent units in the joint distribution. Rather than finding $S_{12}$ and $S_{22}$ separately, $L$ can be estimated as a low-rank positive semidefinite matrix. To simultaneously optimize the sparse component $S$ and the low-rank component $L$, we adapted the loss function with an $L_1$ penalty on $S$ combined with a penalty on the trace of $L$ \citep{Chandrasekaran:2010,Ma:2013}:
\begin{equation}\label{eq:ma}
(S,L) = \argmin\limits_{\hat S,\hat L} \loss{(\hat S-\hat L)^{-1}, C_{\sf sample}} + \alpha\|\hat S\|_1 + \beta\Tr(\hat L)
\end{equation}
The trace of a symmetric positive-semidefinite matrix equals the sum of its eigenvalues, \emph{i.e.} its \emph{nuclear norm}; the penalty on $\Tr(L)$ favors solutions with few non-zero eigenvalues or, equivalently, low-rank solutions, while preserving the convexity of the overall optimization problem \citep{Fazel:2002,Recht:2010}. This allows convex optimization algorithms such as ADMM to be applied with great computational efficiency \citep{Ma:2013}.
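Within such an ADMM scheme, the trace penalty enters through its proximal operator over the positive-semidefinite cone, which reduces to soft-thresholding the eigenvalues. The sketch below shows this single ingredient (with assumed penalty weight \texttt{beta} and ADMM parameter \texttt{rho}), not the full sparse+latent solver of \citep{Ma:2013}.
\begin{verbatim}
% Sketch: proximal step for the trace penalty over PSD matrices,
% prox(A) = argmin over L >= 0 of (beta/rho)*Tr(L) + 0.5*||L - A||_F^2.
A = (A + A')/2;                     % symmetrize before eigendecomposition
[Q,E] = eig(A);
e = max(diag(E) - beta/rho, 0);     % soft-threshold the eigenvalues at 0
L = Q*diag(e)*Q';                   % low-rank PSD solution
\end{verbatim}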
The partial correlation matrix (Eq.~\ref{eq:partial}) computed from $C_{\sf sparse+latent}$ includes interactions between the visible and latent units and was used in Fig.~\ref{fig:5} C and D, Fig.~\ref{fig:6} C, and Fig.~\ref{fig:7} A--C. The partial correlation matrix computed from $S$ alone expresses the strengths of pairwise interactions
\begin{equation}
P_{\sf sparse} = -(\mbox{diag}(S))^{-\frac 1 2} S (\mbox{diag}(S))^{-\frac 1 2}
\end{equation}
and was used in Fig.~\ref{fig:5} F--H.
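In MATLAB, this scaling amounts to elementwise normalization by the square roots of the diagonal of $S$; in the sketch below the diagonal is set to 1 by the usual convention (the formula itself yields $-1$ on the diagonal).
\begin{verbatim}
% Sketch: pairwise partial correlations from the sparse precision matrix S.
d = 1 ./ sqrt(diag(S));
Psparse = -(d*d') .* S;                 % -(diag S)^(-1/2) S (diag S)^(-1/2)
Psparse(1:size(S,1)+1:end) = 1;         % unit diagonal, by convention
\end{verbatim}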
The MATLAB code for these computations is available online at http://github.com/atlab/cov-est.
\subsection*{Cross-validation with conditioned variances}
Special attention was given to estimating the variances.
All evaluations and optimization in this study were defined with respect to the covariance matrices.
However, neuroscientists often estimate a common correlation matrix across multiple stimulus conditions when the variances of responses are conditioned on the stimulus \citep{Vogels:1989, Ponce:2013}. In this study, we too conditioned the variances on the stimulus but estimated a single correlation matrix across all conditions.
Here we describe the computation of the validation loss (Eq.~\ref{eq:vloss}) when the variances were allowed to vary with the stimulus condition.
Let $T_c$ and $T_c^\prime$ denote the time bin indices for the training and testing samples, respectively, limited to condition $c$. Here, the prime symbol marks quantities estimated from the testing sample.
Similar to Eq.~\ref{eq:sample}, the training and testing sample covariance matrices for condition $c$ are
\begin{equation}
C_{c,{\sf sample}}
= \frac 1{n_c} \sum\limits_{t\in T_c}\left(x(t)-\bar x_c\right)\left(x(t)-\bar x_c\right)^\T
\end{equation}
and
\begin{equation}
C_{c,{\sf sample}}^\prime
= \frac 1{n_c^\prime} \sum\limits_{t\in T_c^\prime}\left(x(t)-\bar x_c\right)\left(x(t)-\bar x_c\right)^\T
\end{equation}
Here $n_c$ and $n_c^\prime$ denote the sizes of $T_c$ and $T_c^\prime$, respectively.
Note that $\bar x_c= \frac 1 {n_c} \sum\limits_{t \in T_c}x(t)$ is estimated from the training sample but used in both estimates, making $C_{c,\sf sample}^\prime$ an unbiased estimate of the true covariance matrix, $\Sigma$. As such, $C_{c,\sf sample}^\prime$ can be used for validation.
The common correlation matrix $R_{\sf sample}$ is estimated by averaging the condition-specific correlations:
\begin{equation}
R_{\sf sample}
= \frac 1 n \sum\limits_c n_c \left(V_{c,\sf sample}^{-\frac 1 2}C_{c,\sf sample}V_{c,\sf sample}^{-\frac 1 2}\right)
= \frac 1 n \sum\limits_c \sum\limits_{t \in T_c} z(t)z(t)^\T,
\end{equation}
where $n=\sum\limits_c n_c$ and $V_{c,\sf sample} = \mbox{diag}(C_{c,\sf sample})$ is the diagonal matrix containing the sample variances. Then $R_{\sf sample}$ is simply the covariance matrix of the $z$-score signal $z(t) = V_{c,\sf sample}^{-\frac 1 2} \left(x(t) - \bar x_c\right)$ of the training sample.
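A sketch of this computation, assuming a matrix \texttt{X} (time bins $\times$ cells) and a condition label \texttt{cond} per time bin:
\begin{verbatim}
% Sketch: common correlation matrix from condition-wise z-scores.
Z = zeros(size(X));
for c = unique(cond(:))'
    idx = (cond == c);
    mu  = mean(X(idx,:), 1);                    % training-sample mean
    sd  = sqrt(mean((X(idx,:) - mu).^2, 1));    % per-condition sample SD
    Z(idx,:) = (X(idx,:) - mu) ./ sd;           % z-score within condition
end
Rsample = (Z'*Z) / size(X,1);                   % covariance of z-scores
\end{verbatim}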
For consistency with prior work, we applied regularization to covariance matrices rather than to correlation matrices. The common covariance matrix was estimated by scaling $R_{\sf sample}$ by the average variances across conditions $V_{\sf sample} = \frac 1 n \sum\limits_c n_c V_{c,\sf sample}$:
\begin{equation}
C_{\sf sample} = V_{\sf sample}^{\frac 1 2} R_{\sf sample} V_{\sf sample}^{\frac 1 2}
\end{equation}
Note that $C_{\sf sample}$ differs from the sample covariance matrix computed without conditioning the variances on $c$; this computation avoids biases that would otherwise be introduced by ignoring stimulus-dependent changes in variance.
The covariance matrix estimators $C_{\sf diag}$, $C_{\sf factor}$, $C_{\sf sparse}$ or $C_{\sf sparse+latent}$ convert $C_{\sf sample}$ into its regularized counterpart denoted here as $C_{\sf reg}$.
To evaluate the estimators, we regularized the conditioned variances by linear shrinkage toward their mean value across all conditions. This was done by scaling $C_{\sf reg}$ by the conditioned variance adjustment matrix $Q_c = \delta I + (1-\delta)V_{\sf sample}^{-1} V_{c,\sf sample}$ to produce the conditioned regularized covariance matrix estimate:
\begin{equation}
C_{c,\sf reg} = Q_c^{\frac 1 2} C_{\sf reg} Q_c^{\frac 1 2}
\end{equation}
The variance regularization parameter $\delta \in [0,1]$ was optimized in the inner loop of cross-validation along with the other hyperparameters.
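Because $Q_c$ is diagonal, the rescaling reduces to an elementwise operation; a sketch, with \texttt{v} and \texttt{vc} the diagonals of $V_{\sf sample}$ and $V_{c,\sf sample}$ as column vectors:
\begin{verbatim}
% Sketch: condition-specific variance adjustment of the regularized matrix.
q = delta + (1-delta)*(vc ./ v);     % diagonal of Q_c (column vectors)
Ccreg = sqrt(q*q') .* Creg;          % Q_c^(1/2) * Creg * Q_c^(1/2)
\end{verbatim}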
The overall validation loss is obtained by averaging the validation losses across all conditions:
\begin{equation}\label{eq:full-loss}
\frac 1 {\sum\limits_c n_c^\prime}\sum\limits_c n_c^\prime \loss{C_{c,\sf reg}, C_{c,\sf sample}^\prime}
\end{equation}
With the negative normal log-likelihood as the validation loss (Eq.~\ref{eq:vloss}) and the unbiased validation covariance matrix $C_{c,\sf sample}^\prime$, the loss function in Eq.~\ref{eq:full-loss} is an unbiased estimate of the true loss. Hence, it was used for the evaluations reported in Fig.~\ref{fig:4}.
\subsection*{Simulation}
For the simulations, ground truth covariance matrices were produced by taking 150 independent samples from an artificial population of 50 independent, identically distributed normal units. The resulting sample covariance matrix was then subjected to each of the respective regularizations to produce the ground truth matrices for the simulation studies (Fig.~\ref{fig:1} Row 2). Samples were then drawn from multivariate normal distributions with the respective true covariance matrices, to be estimated by each of the estimators.
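A sketch of this pipeline for one ground-truth model; the $C_{\sf diag}$-style shrinkage used as the regularization, its hyperparameter values, and the simulated sample size \texttt{n} are illustrative assumptions.
\begin{verbatim}
% Sketch of the simulation pipeline for one ground-truth model.
p = 50; n0 = 150; n = 1000;            % n: simulated sample size (assumed)
X0 = randn(n0, p);                     % 150 samples of 50 iid normal units
C0 = cov(X0, 1);                       % raw sample covariance
lam = 0.5; alph = 0.5;                 % illustrative hyperparameters
D  = (1-alph)*diag(diag(C0)) + alph*(trace(C0)/p)*eye(p);
Ctrue = (1-lam)*C0 + lam*D;            % C_diag-style ground-truth matrix
X = mvnrnd(zeros(1,p), Ctrue, n);      % samples handed to the estimators
\end{verbatim}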