intro.tex

\label{c1:intro}
\section{Automated Algorithm Design}
A vast majority of algorithms are typically developed with parameters that can be configured by users depending on their use cases. Although it is convenient to use a default configuration for every application, such a tactic is often sub-optimal when performance is sensitive to the choice of configurations\cite{balcan2021much,thornton2013auto}. In Bayesian statistics, choosing an appropriate kernel function to model data correlation can strongly influence the outcome of probabilistic inference \cite{Duvenaud13}. In deep learning, the design of a neural network architecture (e.g., the number of layers, types of activation functions and number of neurons in each layer) must be carefully selected to achieve optimal performance on specific scenarios \cite{pham2018,he2020fednas,zoph2016}. For example, convolutional neural network architectures that excel on computer vision tasks \cite{lecun2015lenet,he2016deep} might under-perform in language processing problems, which have been found to favor attention-based modelling \cite{vaswani2017attention}. 

Due to this inconsistent behavior, a more practical approach is therefore to calibrate the design of an algorithm on a \emph{per-task} basis, which seeks to find the most suitable model configuration for each new problem instance. Despite the potential improvements that optimally configuring an algorithm can have on its performance, this calibration step is traditionally driven by domain expert experience and manual heuristics, and hence inefficient and difficult to scale or reproduce. This absence of a principled approach for configuring algorithmic solutions has motivated the study of automated algorithm design (AAD) through systematic optimization frameworks, which was first considered in the seminal work of \citet{rice1976algorithm} and subsequently in various algorithmic domains such as deep learning \cite{pham2018,he2020fednas,Idrissi16}, non-parametric Bayesian methods \cite{Duvenaud13,Malkomes16,Lu18} and discrete algorithms \cite{zheng20miniception,zheng21}. Formally, the general AAD problem is described as follows:

\begin{definition}[Automated Algorithm Design]
Let $\mathcal{T}$ be an arbitrary space of computational tasks and $\mathcal{M}$ be a likely infinite set of algorithms capable of solving any task $\tau \in \mathcal{T}$. Given a performance evaluation function $F: \mathcal{T} \times\mathcal{M} \rightarrow \mathbb{R}$, we say that an algorithm $m \in \mathcal{M}$ outperforms $m' \in \mathcal{M}$ on some task $\tau \in \mathcal{T}$ if and only if $F(\tau, m) > F(\tau, m')$. Then, for any task $\tau \in \mathcal{T}$, the AAD problem can be written as the following optimization task, which seeks to find the optimal algorithm $m_\ast \in \mathcal{M}$ such that $F(\tau, m')$ is maximized:
\begin{eqnarray}
m_\ast &\in& \underset{m \in \mathcal{M}}{\mathrm{argmax}} \ F(\tau, m) \ .
\label{aad-problem}
\end{eqnarray}
\end{definition}
\section{Related Work}
Due to the general nature of the performance measuring function $F$, the AAD task in Eq.~\eqref{aad-problem} is commonly viewed as a black-box optimization (BBO) problem, where $F$ is taken for an oracle that can be queried at will given some input configuration. Most existing black-box optimization methods tend to approach this task via a sequential optimization strategy which alternates between evaluating observations and making informed decision about subsequent probings of the oracle. Typically, this decision is either guided by practical heuristics or a surrogate model that estimates the relationship between configurations and evaluated performances. We summarize a few typical BBO methods below:

\subsubsection{Heuristic search}
One of the most classical approach to optimize this black-box function is grid search, which systematically evaluates all input configurations projected on a lattice to find the best performing candidate. For continuous configurations, this lattice is obtained by discretizing the search space, whereas for categorical configurations, this routine simply means exhaustively trying all combinations of values in each dimension. Alternatively, another widely-used approach for black-box optimization is random search, where inputs are chosen completely at random to evaluate. Both grid search and random search are straight-forward to implement and have been used in several AAD tasks \cite{bergstra2011algorithms,liashchynskyi2019grid,bergstra2012random}. These methods, however, only focus on exploring the search space and have no built-in mechanism to exploit the collected observations, thus tend to scale poorly with the complexity of the input domain. 

In contrast, other heuristic optimization methods such as coordinate ascent~\cite{friedman2010regularization,wright2015coordinate} and simulated annealing~\cite{bertsimas1993simulated} focus on refining the best candidate found so far in a greedy manner. Coordinate ascent achieves this by successively optimizing along a specific dimension while fixing every other dimension of the parameter space, repeating for all dimensions until convergence. Instead of making queries along a single coordinate, the simulated annealing algorithm alternatively evaluates a small neighborhood around the current best estimate and tries to move in the direction that yield the best improvement. Both methods have also been used in many AAD tasks~\cite{bergstra2011algorithms,bellio2016feature,deblasio2020more}, but are typically vulnerable to being trapped in local optima due to a lack of exploration mechanism.

\subsubsection{Evolutionary strategies}
Evolutionary strategies (ES) are optimization techniques inspired by nature, in which a population of configurations is set to evolve over time and improve its averaged performance while doing so. For every generation of this population, most evolutionary algorithms will conduct the following three steps: (1)~estimate the fitness of each individual in the population; (2)~generate new individuals using certain reproduction and/or mutation operators; and (3)~replacing unfit individuals with new individuals. 

Evolutionary algorithms~\cite{fogel1994introduction,banzhaf1998genetic} have been widely applied in automated algorithm design, such as model selection for deep neural networks~\cite{olson2016tpot,lorenzo2017particle,miikkulainen2019evolving,liang2019evolutionary} and support vector machines (SVM)~\cite{lin2008particle}; or finding clinical interventions~\cite{miikkulainen2021prediction}. These techniques fundamentally differ from the above heuristic algorithms by balancing between exploration and exploitation. For example, in genetic optimization, high-fitness individuals can partially pass down their representations to the next generation via a crossover operator (i.e., exploitation), in hope that good performances can be preserved and improved. At the same time, new representations are continually generated via the random mutation operator (i.e., exploration), thus ensuring the optimization is not trapped in local optima.

\subsubsection{Sequential model-based optimization}
Unlike the above methods which are considered model-free, sequential model-based optimization (SMBO) is an optimization paradigm which iteratively uses collected information to estimate a surrogate model for the black-box function $F$. This model is then used to guide the acquisition of subsequent observations. Different SMBO methods have been used to address various AAD tasks such as configuring parameterized tree search algorithms and local SAT solvers~\cite{hutter2011sequential}. We now give a description of the Bayesian optimization (BO) algorithm~\cite{Snoek12}, which is a widely-used variant of SMBO.

The general BO algorithm usually prescribes a Gaussian Process (GP) prior~\cite{Rasmussen06} over the black-box objective function, i.e. $F \sim \mathcal{GP}(\mu, k)$ where $\mu: \mathcal{M} \rightarrow \mathbb{R}$ and $k: \mathcal{M}\times\mathcal{M} \rightarrow \mathbb{R}$ are respectively the prior GP mean and covariance functions. This prior implies that for any finite subset of candidate configurations $\{m_1, m_2 \dots m_T\}$ the corresponding performance vector $\left[F(m_1) F(m_2) \dots F(m_T)\right]$ is normally distributed \emph{a priori} with mean $\left[\mu(m_1), \mu(m_2), \dots ,\mu(m_T)\right]$ and covariance $\left[k(m_i, m_j)\right]_{i,j\in[T]}$. At any iteration $t$, the BO algorithm then uses this prior distribution and the set $\mathcal{D}_t$ of collected observations so far to derive a posterior predictive distribution $p(F(m_\ast) \mid m_\ast, \mathcal{D}_{t})$ for any subsequent candidate $m_\ast$. 

The posterior predictive mean can be used directly to estimate the expected performance of subsequent candidates, thus allowing us to exploit high-performing configurations without actually evaluating $F$. However, since the posterior naturally has high uncertainty in unobserved regions, doing so would discourage thorough exploration of the search space and risk missing out on good solutions. To address this issue, the BO algorithm instead constructs an \emph{acquisition function} that incorporates both the posterior mean and covariance to balance this exploitation and exploration trade-off. For example, the upper confidence bound (UCB) acquisition function proposed by \citet{Srinivas10} is given by:
\begin{eqnarray}
\alpha_{\textsc{UCB}}(m; M_t, \sigma_t) &\triangleq& M_t(m) + \beta \sqrt{\sigma_t(m)} \ ,
\end{eqnarray}
where $M_t(m)$ and $\sigma_t(m)$ respectively denote the posterior mean and variance at iteration $t$ given some candidate $m$. Here, the parameter $\beta$ reflects the trade-off between exploiting candidates with high expected performance and exploring candidates with high uncertainty. Finally, the BO algorithm can be described via the following update rules:
\begin{eqnarray}
m_{t+1} &=& \underset{m \in \mathcal{M}}{\argmax} \ \alpha_{\textsc{UCB}}\left(m; M_t, \sigma_t\right) \ , \nonumber \\
\mathcal{D}_{t+1} &=& \mathcal{D}_{t} \cup \{(m_{t+1}, F(m_{t+1})\} \ , \nonumber \\
M_{t+1}, \sigma_{t+1} &\leftarrow& p\left(F(m_\ast) \mid m_\ast, \mathcal{D}_{t+1}\right) \ .
\end{eqnarray}
Nonetheless, we remark that the vanilla BO algorithm is best suited for low-dimensional and continuous input domains. In practical AAD tasks where the space of configuration is structured and discrete, there is generally no unifying approach to parameterize the mean and covariance function of the GP surrogate. Furthermore, optimizing the acquisition functions will also become non-trivial and require special modelling considerations. 

\subsubsection{Domain-specific AAD approaches}
In many specific AAD problems, $\mathcal{M}$ can be specified as a sub-class of algorithms that is described by discrete data structures such as sub-trees \cite{Duvenaud13}, graphs \cite{he2020fednas, pham2018, bender2018understanding} or permutations \cite{marcais17}. This choice of representation implicitly prescribes a correlation structure among candidate models via the topology of $\mathcal{M}$ and thus allows the general AAD objective to subsequently be cast as optimization/search tasks on these structured domains. 

This thesis will focus on three such prototypical classes of AAD tasks that can be unified through the lens of structured optimization, namely kernel selection (KS), minimizer sketch design (MSD) and neural architecture search (NAS). In particular, the KS problem in Bayesian statistics \cite{Duvenaud13,Malkomes16} can be formulated as a tree search routine to find the composite function that best models the covariance structure of a stochastic process. The NAS problem is approached via setting $\mathcal{M}$ to be a set of computation graphs that share the same vertex and edge space, which respectively denotes all intermediate feature representations and all possible transformations \cite{zoph2016,pham2018}. Lastly, the MSD problem, which seeks to find an optimal sketching scheme for biological sequences, is expressed as finding the optimal permutation of all fixed length substrings induced by the sequence vocabulary \cite{marcais17,zheng20miniception,zheng21}.

Even though this domain restriction strategy has enabled more meaningful formulations of the AAD problem, their resulting discrete/combinatorial objectives are still challenging to solve due to the prohibitive sizes of their search domains. Concretely, without any further restriction, the tree search space in the KS problem naturally has unbounded depth, whereas the graph and permutation domains of the MSD/NAS problems are virtually infinite in most practical settings. 

In practice, this large amount of admissible candidate configurations would render the majority of standard approaches such as integer programming~\cite{balcan2017learning,balcan2018learning} and heuristic search inefficient, and thus necessitates better informed search strategies to navigate these massive search spaces. In addition, another major challenge arises due to the fact that none of the respective performance measuring functions have a closed-form expression, nor are they computationally cheap to evaluate. For example, measuring the fitness of a neural network architecture would typically entail training all layer parameters to convergence before measuring its predictive loss/accuracy with respect to some validation dataset. As such, many strategies that rely on repeatedly probing this evaluation function are inefficient and not applicable in practice. 

To address these challenges, a significant portion of the existing AAD literature is centered around the idea of developing an efficient knowledge transfer mechanism that can generalize algorithmic behaviors across different configuration instances. In particular, given some existing observed performance evaluations, methods under this paradigm often aim to~(a) infer unseen regions of high-performing configurations; or~(b) reduce the cost of evaluating new candidates. Examples of the former include sequential model-based optimization approaches which iteratively refine a surrogate performance function and leverage this to guide the acquisition of new observations~\cite{Kandasamy18, Lu18}. 

On the other hand, the latter direction focuses on making structural assumptions about the search space to facilitate reusing knowledge from past evaluations. A typical example of this approach is the weight sharing scheme in neural architecture search~\cite{pham2018, he2020fednas}, where every candidate architecture is constructed from the same pool of building-block layers. Each layer in this search space is continually optimized throughout all search iterations (i.e., whenever selected by the search algorithm), thus will alleviate the need to re-train each architecture from scratch. For ease of exposition, we will defer the review of these domain-specific methods to their corresponding technical chapter.

\section{Research goal and contributions}
Most approaches in the various AAD domains that we investigate (i.e., KS, MSD and NAS) either still suffer from the scalability issues above; or must rely on additional domain restrictions to sufficiently prune their massive search spaces. For instance, existing KS methods \cite{Duvenaud13,Malkomes16,Lu18} typically require that all candidate kernel functions must have upper-bounded complexities (e.g., functions with short expressions), whereas the permutation domain in the MSD problem is predominantly approximated by the space of sparse hitting sets \cite{zheng20miniception,ekim20pasha}. In practice, while such assumptions serve to ensure the feasibility of their respective optimization objectives, they are typically heuristics that do not factor in the optimalities of the pruned candidates. In addition, we note that most AAD frameworks only focus on exploiting the observed performances of configurations in the same task domain (i.e., fixing $\tau$ in Eq.~\eqref{aad-problem}). In practice, this is not always the sole source of information to guide exploration of the candidate space. For instance, in applications that require tuning for many related tasks, it would be more intuitive to leverage the combined knowledge from all task domains, rather than to independently approach these problems with their respective observations. Motivated by these shortcomings, this thesis therefore seeks to address the following research question: \textbf{Can we design information-efficient frameworks that can better exploit both intra-task and inter-task information, thus improving the scalability and performance for automated algorithm design?}

Specifically, this thesis proposes two new directions for information-efficient AAD frameworks, which are grounded in the settings of three practical AAD problems, namely kernel selection (KS), minimizer sketch design (MSD), and neural architecture search (NAS). The first direction seeks to improve the exploitation of intra-task information through learning complementary components that can reason about the desirability of candidate configurations without fully committing expensive performance evaluations. In the context of the KS problem, this manifests as an early-stopping policy for a recurrent kernel generator. This policy outputs the optimal pruning point in an infinitely long generative trajectory of kernel expressions (Chapter~\ref{c3:kernel}). For the MSD problem, this component takes form as a template model that learns to generate desirable substring ranking patterns which can subsequently be refined into high-performing candidate permutations (Chapter~\ref{c4:minimizer}). In both settings, we show that these are more effective formulations of the AAD objective that yield better performing configurations than other state-of-the-art methods.

Orthogonal to the above approaches, the second direction further investigates a \emph{multi-task} knowledge transfer paradigm that leverages information sharing across \emph{different tasks} to improve the performance and efficiency of concurrent AAD instances. Specifically, we study an extension of the neural architecture search problem in a federated learning setting \cite{McMahan17}, where an ensemble of predictive tasks collaborate to find their respective optimal architectures (Chapter~\ref{c5:nas}). Finally, we propose to investigate a meta-learning approach \cite{Finn17} that can effectively leverage existing model configuration experiences to address unseen and heterogeneous AAD instances, thus improving the scalability of AAD in multi-task learning settings (Chapter~\ref{c6:future}). The remainder of this thesis will be organized as follows:
\begin{itemize}
    \item Chapter~\ref{c3:kernel} investigates the kernel selection (KS) problem, which restricts the candidate domain to an infinite-depth tree induced by the kernel composition language \cite{Duvenaud13}. Our contribution is a novel Bayesian optimization framework that optimizes a recurrent kernel generator that outputs infinitely long sequences of kernel expressions. We then jointly learn a termination policy model which decides the optimal stopping point along each infinite generative trajectory. We show that this strategy results in better performing kernel functions on a wide range of predictive tasks.
    
    \item Chapter~\ref{c4:minimizer} studies the minimizer sketch design (MSD) problem, which aims to find a permutation of substrings that induces the optimal sketch of a sequence via the minimizer algorithm~\cite{schleimer03,marcais17}. Our contribution is a novel optimization framework, called~\textsc{DeepMinimizer}, which formulates as a pair of substring ranking networks which guarantee different properties of a good solution. Through negotiating for a consensus outcome, these networks result in a continuous relaxation of the original discrete permutation learning objective. We show that our method significantly improves the state-of-the-art performance and scalability of MSD. We further extend our algorithm to unify and provide sketch designs for other sequence sketching methods, such as the syncmer approach~\cite{edgar2021syncmers,shaw2021theory}.
    
    \item Chapter~\ref{c5:nas} investigates the neural architecture search (NAS) problem in a federated learning setting~\cite{McMahan17}, which seeks to concurrently optimize the solution architectures for multiple related tasks without exposing their private data. Our contribution is a novel personalized learning objective which distills the aggregated problem solving experiences into hierarchical architecture components which can be fine-tuned to solve each specific task. We show that this knowledge sharing scheme is capable of maintaining a high degree of personalization in the solution ensemble and thus significantly improves the efficiency of multi-task neural architecture search.

    \item Chapter~\ref{c6:future} discusses a future work which will extend the \emph{multi-task} AAD paradigm to deal with unseen and heterogeneous tasks. Specifically, drawing inspiration from the meta-learning paradigm for multi-task scenarios, we aim to learn a common meta-configuration that can be adapted quickly to solve any specific task. Unlike the standard meta-learning setting, we consider the scenario where the AAD task distribution in question is assumed to be multi-modal, hence there might not be a single configuration that suits every task. To address this issue, we propose extending the meta-learning framework with population-based optimization to train an evolving ensemble of meta-configurations, which reflects a more fine-grained hierarchical clustering of the diverse task space. 
\end{itemize}