-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathobservation.tex
34 lines (26 loc) · 4.56 KB
/
observation.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
\section{Challenges in Reproducing HEP Applications}
The application specifications of \emph{TauRoast} and \emph{Athena} were provided to us from the CMS and ATLAS Collaborations respectively in the form of
email describing in prose how to obtain the source, build the program, and run it correctly on a specific platform type available at our home institutions. There were no explicit guarantees that it would run on alternative platforms. This minimal level of documentation about software is routine in the scientific world.
Below we describe the challenges faced when capturing application details in reproducible form and then preserving them for subsequent reuse.
\begin{itemize}
\item {\bf Identifying all dependencies.} Due to the distributed, collaborative nature of HEP software development, these applications depend on a large number of external and local software components.
External dependencies are often explicitly stated, such as when the application makes connections to Github resources or CVS servers for downloading source files.
When the application has initiated execution then implicit network connections may be present that require identification of dependencies on all machines where execution takes place.
Implicit local dependencies can arise as a result of mounted filesystems. In \emph{TauRoast}, the application data and code is distributed on five networked filesystems, and in \emph{Athena} on two networked filesystems.
Since these filesystems appear local to the application machine, it is important to check and capture mounted filesystems and their respective mount points.
\item {\bf Configuration Complexity.} In order to correctly reproduce an application requires that run-time configurations and consistency checks on the available software are effectively captured and preserved.
For CMS, the {\tt scram} software management tool is used to locate
the appropriate version of software, set environment variables such as the PATH, run any
tool-specific configurations, and do the same for all software on which it depends. A reproducible framework must capture the work of such software management tools so that the framework can conduct similar checks on a new machine.
\item {\bf High Selectivity.}
Although the total size of the resources accessed by HEP programs is very large, the size of the data and software actually used are typically much smaller. The script will often name an entire repository or data source, but the program needs only a handful of items from that source. For example, the data may be stored on an HDFS filesystem with 11.6TB of data, but the program may consume only 18GB. Thus there is a size tradeoff in terms of capturing dependencies mentioned in the program and dependencies actually used in the program. A reproducible framework must include robust rules about not including superfluous dependencies, but including unused dependencies that may potentially see much use during program execution.
\item {\bf Rapid Changes in Dependencies.}
Over the course of three months between collecting the initial email, analyzing the programs, and writing this paper, the computing environment saw continuous change. The CMSSW software distribution released a new version, the target execution environment was upgraded to a new operating system, and the application switched from CVS to Git for obtaining the software. For Athena, the computing environment has the potential for daily change since upgrades to the software framework occur on a nightly basis. While the users of this software seem accustomed to constant change, which is appropriate during algorithm development, any new technique for preservation that may rely on an external service (every one that may appear highly stable) will require caution to choose the actual releases used during the time of publication.
\item {\bf Dependencies for Reproducible Execution.} Capturing the necessary and sufficient dependencies that are part of an application is sufficient for repeatability, but possibly for not reproducibility.
Repeatability implies that if a result depends on running program X, we must be able to run exactly X again. In reproducibility the goal is rarely limited to running
\emph{precisely} what a predecessor did. Often, the objective is to
change a parameter or a data input in order to see how the result is affected. To that end, the preservation system must capture enough of the surrounding
material in order to permit modifications to succeed.
Further, a better understanding of how end users will consume preserved software will help to shape how
software should be preserved.
\end{itemize}