Software Sustainability Models
Selecting data for preservation is only part of the challenge. At the LHC, as in experiments in other domains, data must undergo reconstruction, analysis, and processing by a number of software tools that are under continuous development and refinement. A rich ecosystem of software and services is available, ranging from low-level system libraries for I/O, through numerical libraries for physics reconstruction, to graphical systems for visualizing the results. Like the data itself, software and other artifacts must also be preserved, but the goals and mechanisms of preservation vary substantially.
Verification, validation, and error quantification are necessary, but not sufficient, for the practice of the scientific method. In computational science, reproducibility of both computation and results is required. Reproducibility is the ability of others to recreate and verify computational results, given appropriate data, software, and computing resources. Unfortunately, exactly recreating a computation that ran a few years or even months ago is a very complex undertaking, both philosophically and technically. It is troublesome even to define what is meant by reproducing an execution, since every layer of software evolves according to its own needs:
- The core analysis algorithms change over time as more knowledge is gained about the underlying physics and the behavior of the detector. Does the user wish to reproduce the results using the latest scientific understanding, or using the previous state of knowledge?
- The operating system and other standard software evolve according to external pressures such as security threats and changing hardware. Does the user wish to run on the latest (most secure) operating system, or make use of the old (known reliable) one?
- The hardware architecture will not last forever. We have recently seen a shift from 32-bit to 64-bit processors, which revealed a number of bugs and incorrect assumptions in user-level software, necessitating many repairs and upgrades. Does the user want to run exactly the software that ran on the 32-bit machine, or apply the newest bug fixes?
- If the use case requires access to large-scale resources, can the original algorithms support new middleware services from either persistent grids or cloud infrastructures?
Therefore, maintaining the usability of data for processing requires sustained preservation and operation of the components that make up the required execution environment: (i) user software and its dependencies; (ii) shared (experiment-wide) software and external package dependencies; (iii) execution platforms (operating systems, compilers, and other utilities); (iv) data artifacts (of various types and formats); and (v) associated services for computation.
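To make the composition of such an environment concrete, the following sketch records the five components as a simple Python data structure; the class, field names, and example values are illustrative assumptions rather than part of any existing preservation system.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EnvironmentSpec:
    """Illustrative record of what is needed to re-run a preserved task."""
    user_software: List[str]      # (i) user code and its direct dependencies
    shared_software: List[str]    # (ii) experiment-wide frameworks and external packages
    platform: Dict[str, str]      # (iii) operating system, compiler, other utilities
    data_artifacts: List[str]     # (iv) input data sets, calibrations, other artifacts
    services: List[str] = field(default_factory=list)  # (v) services assumed at run time

# Example with placeholder values (not real release names):
spec = EnvironmentSpec(
    user_software=["my-analysis-1.2"],
    shared_software=["experiment-framework-7.4"],
    platform={"os": "scientific-linux-6", "compiler": "gcc-4.8", "arch": "x86_64"},
    data_artifacts=["run2012B_skim.root"],
    services=["conditions-db", "batch-queue"],
)
```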
Approaches to Preserving and Evolving Processing Environments
There are a number of known tools and methodologies that support reproducibility, such as standardized testbeds, open code for continuous data processing, standards and platforms for data sharing, or provenance and annotation of data and processes. Are these tools sufficient for reproducibility in general? What other tools and technologies could prove to be good solutions? Two HEP data preservation groups, both built on virtualized infrastructure frameworks, offer interesting approaches. The first, the BaBar LTDA at SLAC, captures and preserves environments using as much of the virtualized platform as possible, taking care to quarantine these (potentially insecure) environments from the public Internet. In this approach one confronts many issues associated with the lifecycle management of virtual machines and their dependent infrastructure (hypervisors and higher-level tools for image management). It remains to be seen whether the viability of the execution environment can be preserved on the timescales of the LHC and beyond.
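As an illustration of the bookkeeping this approach entails, the sketch below records a preserved image together with a checksum, the hypervisor it was validated against, and its network policy. The manifest layout and function name are hypothetical assumptions, not part of the BaBar LTDA.

```python
import hashlib, json, pathlib

def preserve_vm_image(image_path: str, hypervisor: str, network_policy: str = "isolated") -> dict:
    """Record a preserved VM image with the metadata needed to re-validate it later
    (hypothetical manifest layout)."""
    digest = hashlib.sha256()
    with open(image_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream: images are large
            digest.update(chunk)
    manifest = {
        "image": image_path,
        "sha256": digest.hexdigest(),      # detect silent corruption later
        "hypervisor": hypervisor,          # hypervisor/version the image was validated against
        "network_policy": network_policy,  # quarantined from the public Internet by default
    }
    pathlib.Path(image_path + ".manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```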
The second, pursued by the HERA experiments at DESY, seeks to evolve code and dependencies forward in time through continuous rebuilding and validation; a forward bootstrapping process is implied in this context. This approach sits closer to the other end of the spectrum, where a computational environment is specified as the composition of a number of source code packages, which can be compiled and installed on demand to create the desired filesystem structure; the task to be executed is simply a command line accompanied by input files. This solution has the potential to evolve more gracefully with underlying technology changes. There is, however, a combinatorial explosion of software packages and versions, and few, if any, combinations are guaranteed to be compatible or to produce scientifically valid results.
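The sketch below illustrates the compile-and-install-on-demand idea under stated assumptions: each package is described by a source URL and a list of build commands, and the whole set is installed into a common prefix. The package descriptions and URLs are placeholders; this is not the DESY implementation.

```python
import pathlib, subprocess

# Hypothetical description of source packages: where to fetch them and how to build them,
# listed in dependency order.
PACKAGES = [
    {"name": "libfoo-1.0", "url": "https://example.org/libfoo-1.0.tar.gz",
     "build": ["./configure --prefix={prefix}", "make", "make install"]},
    # ... additional packages ...
]

def build_environment(prefix: str) -> None:
    """Fetch, compile, and install each package into a common prefix,
    recreating the desired filesystem structure on demand."""
    pathlib.Path(prefix).mkdir(parents=True, exist_ok=True)
    for pkg in PACKAGES:
        workdir = pathlib.Path(prefix, "build", pkg["name"])
        workdir.mkdir(parents=True, exist_ok=True)
        subprocess.run(["curl", "-L", "-o", "src.tar.gz", pkg["url"]], cwd=workdir, check=True)
        subprocess.run(["tar", "xf", "src.tar.gz", "--strip-components=1"], cwd=workdir, check=True)
        for step in pkg["build"]:
            subprocess.run(step.format(prefix=prefix), shell=True, cwd=workdir, check=True)
```

In practice the package list must be kept consistent as versions evolve, which is precisely where the combinatorial problem noted above appears.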
Both approaches require an infrastructure for automating validation, reproduction, and re-analysis processes. In this context, virtualized infrastructure can be used as a temporary, high-level tool to orchestrate the resources and processes needed to perform the execution.
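A minimal sketch of what automated validation might mean in practice, assuming the preserved task declares its command line and the checksums of its reference outputs: re-run the command in the recreated environment and compare. The function names and arguments are illustrative.

```python
import hashlib, subprocess

def checksum(path: str) -> str:
    """Compute the SHA-256 digest of a file, streaming to handle large outputs."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_reproduction(command: list, reference_outputs: dict) -> bool:
    """Re-run the preserved command line and check that each declared output file
    matches the checksum recorded when the result was archived.
    `reference_outputs` maps file paths to their reference SHA-256 digests."""
    subprocess.run(command, check=True)
    return all(checksum(path) == ref for path, ref in reference_outputs.items())
```

Bitwise-identical output is a deliberately strict criterion; physics-level comparisons with tolerances may be the more appropriate notion of a scientifically valid reproduction.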
Rather than attempt to select one technology, we will explore the solution space of program semantics by implementing multiple methods of creating and preserving a software execution environment, whether by deploying virtual machines, constructing software from source on demand, or some combination of the two. These issues will be examined in particular through Workshop 4.
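One way to keep that choice open is to hide the technology behind a common interface, as in the following sketch; the class and method names are illustrative assumptions, and the backends are deliberately left as stubs.

```python
from abc import ABC, abstractmethod

class ExecutionBackend(ABC):
    """Common contract for any technology able to materialize a preserved
    environment and execute a task inside it (illustrative sketch)."""

    @abstractmethod
    def materialize(self, env_name: str) -> None:
        """Recreate the named environment (boot a VM, build from source, ...)."""

    @abstractmethod
    def run(self, command: list, input_files: list) -> int:
        """Insert the inputs and command into the environment and execute them."""

class VirtualMachineBackend(ExecutionBackend):
    """Backend that boots a preserved virtual-machine image."""
    def materialize(self, env_name: str) -> None:
        ...  # fetch the preserved image for env_name and start it under a hypervisor
    def run(self, command: list, input_files: list) -> int:
        ...  # copy inputs into the guest, invoke the command, return its exit code

class SourceBuildBackend(ExecutionBackend):
    """Backend that rebuilds the environment from source packages on demand."""
    def materialize(self, env_name: str) -> None:
        ...  # compile and install the packages that constitute env_name
    def run(self, command: list, input_files: list) -> int:
        ...  # run the command against the freshly built installation prefix
```

An orchestration layer could then select a backend per use case, or chain them, for example by building from source inside a freshly instantiated virtual machine.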
Sustainability Challenges
Regardless of which technical means is chosen, several challenges are common:
- Naming scheme or user interface. A user must have a simple way of specifying the environment that is common to all users (when desired), customizable by individuals (when necessary), and composable for scalability. For example, a user might specify an environment of cms-software-34.85, which consists of the environments scientific-linux-16, cernlib-2015, and cmssim-84, each of which contains additional constituent packages (see the sketch after this list).
- Dependency analysis. Tools will be needed to evaluate a given task or software installation to determine which elements of the environment are actually needed.
- Task insertion. To actually carry off the execution, the desired input files, command line, and any other specific items must be inserted into a known place in the environment and then invoked. While not technically complicated, this procedure must be well defined and consistent across technologies, experiments, and, eventually, disciplines.
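The sketch below, with a hypothetical registry and helper functions, indicates how these challenges might fit together: composable names resolve to their constituent environments (dependency analysis), and a task is inserted into a known sandbox location and invoked consistently. The environment names are taken from the example above; everything else is an assumption for illustration.

```python
import pathlib, shutil, subprocess

# Hypothetical registry mapping composable environment names to their constituents.
REGISTRY = {
    "cms-software-34.85": ["scientific-linux-16", "cernlib-2015", "cmssim-84"],
    "scientific-linux-16": [],
    "cernlib-2015": [],
    "cmssim-84": [],
}

def resolve(name, seen=None):
    """Dependency analysis: expand a named environment into the ordered,
    duplicate-free list of constituents it requires."""
    seen = [] if seen is None else seen
    for dep in REGISTRY.get(name, []):
        resolve(dep, seen)
    if name not in seen:
        seen.append(name)
    return seen

def insert_and_run(command, input_files, workdir="sandbox"):
    """Task insertion: copy the inputs into a known place in the (already
    materialized) environment and invoke the command there."""
    sandbox = pathlib.Path(workdir)
    sandbox.mkdir(exist_ok=True)
    for f in input_files:
        shutil.copy(f, sandbox)
    return subprocess.run(command, cwd=sandbox).returncode

# resolve("cms-software-34.85")
#   -> ['scientific-linux-16', 'cernlib-2015', 'cmssim-84', 'cms-software-34.85']
```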