Data Preservation in High Energy Physics
The Study Group for Data Preservation in High Energy Physics (DPHEP), formed under the auspices of the International Committee for Future Accelerators (ICFA), has produced a study outlining the current state of data preservation within HEP, including an extensive overview of data preservation in other disciplines. The study suggests a series of guidelines for HEP data preservation efforts, as well as a framework for global coordination. Its conclusions recognize the scientific potential of data re-use and, in particular, the desirability of preserving full analysis capability. They also emphasize the urgency of beginning and sustaining global, coordinated data preservation efforts.
For the purposes of discussing preservation efforts, the DPHEP studies have identified different types of HEP data that span the full set of possibilities: publications, metadata, associated documentation of all types, software, digital information (the data themselves), and finally expertise and human resources. Digital information can be categorized into four tiers, covering the scope from publications to the raw data and the software used to process them. As outlined by DPHEP, the four tiers are:
- Tier 1: Published results, along with additional analysis-related information, leading to more complete documentation of a given analysis
- Tier 2: Processed data available in a simplified format (i.e., particle four-vectors) that can be used for outreach and simplified additional analyses
- Tier 3: The full processed experimental data and simulated data, and the associated software for accessing and analyzing the data
- Tier 4: The full raw data of the experiment and all of the reconstruction software necessary for processing the data into a form useful for analysis, as well as the simulation software needed for modeling
In this description, Tiers 3 and 4 are the most complex. They are also the most important for experiments such as those currently running at the LHC, where data may be repeatedly accessed, reprocessed, and analyzed over the 20+ year lifetime of the experiments. Since the LHC community has developed a worldwide computing and data access grid expressly for processing and analyzing the data, it seems natural that preservation efforts, especially for Tiers 3 and 4, should incorporate elements of the existing grid infrastructure, extending them where necessary. In parallel, models of data preservation have emerged that support the desired use cases and are useful in characterizing each of the four tiers. An experiment that is still active though no longer collecting data, like DØ, presents a slightly different set of challenges. While facing difficulties similar to those of the LHC experiments, DØ must ensure that it can take advantage of modern and evolving resources to access its Tier 3 and 4 data well into the future.
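As a purely illustrative sketch, and not something defined by the DPHEP study itself, the four-tier classification described above can be written as an ordered enumeration. The names `PreservationTier` and `includes_software` are hypothetical, chosen only for this example; the helper simply encodes the observation that the descriptions of Tiers 3 and 4 explicitly include software alongside the data.

```python
from enum import IntEnum


class PreservationTier(IntEnum):
    """The four DPHEP tiers, numbered in the order the list gives them.

    The member names are illustrative assumptions, not DPHEP terminology.
    """
    PUBLISHED_RESULTS = 1    # publications plus extra analysis documentation
    SIMPLIFIED_DATA = 2      # simplified formats such as particle four-vectors
    ANALYSIS_LEVEL_DATA = 3  # processed and simulated data plus analysis software
    RAW_DATA = 4             # raw data plus reconstruction and simulation software


def includes_software(tier: PreservationTier) -> bool:
    """Return True for tiers whose description explicitly includes software.

    Hypothetical helper: in the list above, only Tiers 3 and 4 mention
    software that must be preserved together with the data.
    """
    return tier >= PreservationTier.ANALYSIS_LEVEL_DATA


if __name__ == "__main__":
    for tier in PreservationTier:
        print(tier.value, tier.name, includes_software(tier))
```

Using an ordered enumeration keeps the tier numbers used in the text (for example "Tiers 3 and 4") directly comparable in code.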
These concepts provide a context for discussing the structural underpinnings of this project. In particular, DASPOS is a technical preservation project: it provides input to the HEP community that is informed by perspectives from other science communities as well as by research in computer science and digital data preservation.