Goal and Scope of Project
The goal of DASPOS is to “scout out” solutions to the most pressing technical problems and make them available to those constructing preservation systems. In particular, this project will:
- Establish a dialogue with other fields facing preservation and re-use issues with Big Data. Identify areas of commonality and outline where solutions diverge due to specific needs.
- Develop metadata to support the preservation and re-use of HEP data and of its related software and computational algorithms. Design the metadata to meet the needs of as many other fields as possible, so that it can be widely re-used.
- Define a reference architecture for a data preservation system targeted for HEP but coordinated with other fields. Include decision points where policy choices impact the architectural structure.
- Develop a preservation validation test-bed on which a technical implementation of the reference architecture can be constructed.
- Perform a Curation Challenge, where a physics data analysis is conducted based solely on curated and archived data.
The choice of a relatively narrow technical focus limits the scope of the project to the eventual demonstration of a targeted set of technologies, and is commensurate with the size of the team. Nevertheless, the longitudinal nature of this effort will allow us to gain experience with, and to evaluate, the issues and solutions associated with a full example of data preservation and access. We must emphasize that a complete preservation solution, even for HEP alone, is an extremely large undertaking and beyond the scope of this project. A solution for the field at large will require full participation from the HEP experiments, and will likely require a new generation of storage hardware, a refined software infrastructure for global data availability, a new lexicon for data description, and the rationalization of the problems of public access. Such a project will likely take many years and considerable support to build and implement.
We will consider this project a success if we have created a clear intellectual structure and useful prototypes that enable others to carry the effort forward with greater resources. For example, in a previous project involving both CS and HEP personnel, we investigated the problem of deploying simulation code to computation sites where the necessary software was not installed. We designed a global-scale filesystem for distributing the needed data on the fly. We developed a prototype implementation and demonstrated that it could work at large scale with CDF code. After this success, personnel at CERN developed our prototypes into a much more ambitious system known as CVMFS, which is now used globally in production by multiple LHC experiments. We consider this a successful example of exploratory CS/HEP research leading the way for the broader community.