\section{Introduction}
Reproducibility is a cornerstone of the scientific method~\cite{borgman2012data}.
Its importance lies in its ability to advance science: reproducing a result by verifying and validating it leads to improved understanding, which in turn increases the possibilities of reusing or extending the result.
Ensuring the reproducibility of a scientific result, however, often entails detailed documentation and specification of the scientific method involved. Historically, text and proofs in a publication have achieved this end.
As computation pervades the sciences and transforms the scientific method, simple text and static images are no longer sufficient.
In particular, apart from textual (and numeric) descriptions of the result, a reproducible result must also include several computational artifacts, such as the software, data, environment variables, platform dependencies, and state of the computation involved in the adopted scientific method~\cite{Sole}.
Virtualization has emerged as a promising technology for reproducing computational scientific results. One such approach is to conduct the entire computation relating to a scientific result within a virtual machine image (VMI), and then preserve and share the resulting image. In this way, VMIs become an authoritative, encapsulated, and executable record of the computation, especially for computations whose results are destined for publication and/or reuse. Virtual machine images, like files, can then be shared~\cite{Lampoudi}.
The resulting image, however, may be too large to share or distribute widely. An alternative, lightweight form of virtualization is to encapsulate only the application software along with all its necessary dependencies into a self-contained package. The encapsulation is achieved by operating system-level sandboxing techniques that interpose application system calls and copy the necessary dependencies (data, libraries, code, etc.) into a package, making it lighter weight than a VMI~\cite{guo2011cde}. Yet the package is no longer an executable record of the computation and still requires an accompanying operating system for execution.
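To make the dependency-capture step concrete, the following is a minimal sketch, not the actual CDE or PTU implementation, of how an application's file dependencies might be collected by tracing its system calls and copied into a self-contained package directory; the package layout, helper name, and reliance on \texttt{strace} are illustrative assumptions.
\begin{verbatim}
import re
import shutil
import subprocess
from pathlib import Path

def build_package(cmd, pkg_dir="package"):
    """Illustrative dependency capture: run `cmd` under strace and copy
    every file it successfully opens into pkg_dir/cde-root. Real sandboxes
    also record symlinks, environment variables, and relative accesses."""
    root = Path(pkg_dir) / "cde-root"
    trace = subprocess.run(
        ["strace", "-f", "-e", "trace=open,openat"] + cmd,
        capture_output=True, text=True)
    for line in trace.stderr.splitlines():
        if "ENOENT" in line:                 # skip files that were not found
            continue
        m = re.search(r'"(/[^"]+)"', line)   # quoted absolute path argument
        if not m:
            continue
        src = Path(m.group(1))
        if not src.is_file():
            continue
        dst = root / src.relative_to("/")
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)               # preserve timestamps and modes

build_package(["python3", "-c", "print('hello')"])
\end{verbatim}
Tracing at the system-call boundary is what keeps such a package small relative to a VMI: only the files the application actually touches are copied.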
While both approaches provide mechanisms for encapsulating the computations associated with a scientific result, neither form of virtualization provides any guarantee that the included pieces of software will indeed reproduce the associated scientific result. In general, in the absence of reproducibility policy guidelines, such guarantees are difficult to provide. Preserving the encapsulated computations in such a way that they remain reproducible improves upon these guarantees. A preservation mechanism can increase the ease of image or package installation, alter dependencies implicit in the computation as software components evolve or become deprecated, and provide mechanisms for documentation that make computations easy to understand after the fact.
Two approaches address the preservation challenge: first, the introduction of tools that help document dependencies and provide software attribution within VMIs or packages; and second, the use of software delivery mechanisms such as centralized package management, Linux containers, and the more recent Docker framework. We examined the first approach previously in~\cite{SoftProv}.
In this paper we examine the second approach. In particular, we consider lightweight virtualization because we believe that, combined with more standardized software delivery mechanisms, it can address the reproducibility challenge for a wide variety of scientific researchers. A package created by these lightweight approaches encapsulates all the necessary dependencies of an application, and can be used to repeat the application through different sandbox mechanisms, including Parrot~\cite{thain2005parrot}, CDE~\cite{guo2011cde}, PTU~\cite{PTU}, chroot, and Docker~\cite{boettiger2015introduction}.
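As a hedged illustration of what repeating a preserved package through different sandbox mechanisms can look like, the sketch below re-runs a hypothetical package root under plain \texttt{chroot} and under Docker; the package path, entry point, and base image are assumptions for illustration, not part of any particular tool.
\begin{verbatim}
import subprocess

# Hypothetical preserved package root and entry point (illustrative only).
PKG_ROOT = "/tmp/analysis-package/cde-root"
APP_CMD = ["/usr/bin/python3", "/home/analyst/run_analysis.py"]

def repeat_with_chroot():
    # Re-root the filesystem at the preserved package; requires root
    # privileges and a package containing every runtime dependency.
    subprocess.run(["sudo", "chroot", PKG_ROOT] + APP_CMD, check=True)

def repeat_with_docker():
    # Mount the preserved package read-only into a minimal base image and
    # chroot into it inside the container (the image name is an assumption).
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", PKG_ROOT + ":/pkg:ro",
         "ubuntu:20.04",
         "chroot", "/pkg"] + APP_CMD,
        check=True)
\end{verbatim}
Either route leaves the captured dependencies untouched; only the sandbox used to interpose on them changes.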
Of course, our solution represents only one way to preserve applications. Broadly, two different approaches to preserving applications have been adopted: force cleanliness or measure the mess. The former forces users to specify the execution environment for an application in a well-organized way. The latter lets end users construct the environment as desired, and the complexity of the environment is measured in terms of its dependencies. Our objective here is to measure the mess as-is and then preserve it over time.
To conduct a thorough examination, we consider real-world complex high energy physics (HEP) applications, independently developed by two groups, that must be reproduced so that the entire HEP community can benefit from the analysis. We describe challenges faced in reproducing the applications, and we consider the extent to which reproducibility requirements can be satisfied with lightweight virtualization approaches and software delivery mechanisms. We propose an invariant framework for computational reproducibility that combines lightweight virtualization with software delivery mechanisms for efficiently capturing, invariantly preserving, and practically deploying applications. We measure the performance overhead of lightweight virtualization and software delivery approaches, and show how the preserved packages can be distributed to allow reproduction and verification.