WORKSHOP 7
Abstracts and Speaker Bios
Rob Gardner, DASPOS Co-PI
Senior Fellow, Computation Institute
Senior Scientist, Enrico Fermi Institute
University of Chicago
Co-PI Gardner has been supported by NSF as PI for the ATLAS Midwest Tier 2 Center at the University of Chicago (NSF Cooperative Agreement, “U.S. ATLAS Operations: Empowering University Physicists to make Discoveries at the Energy Frontier”, PHY-11-19200 and PHY-06-12811).
Jarek Nabrzyski, DASPOS Co-PI
Director
Center for Research Computing, University of Notre Dame
Co-PI Nabrzyski is a co-PI on several NSF- and NIH-funded awards. The EarthCube Research Coordination Network for High-Performance Distributed Computing in the Polar Sciences supports advances in computing tools and techniques that will enable the polar sciences community to address significant challenges in both the short and long term. Nabrzyski is a co-PI (with M. Hildreth as the PI) on the NSF award to organize a series of workshops to gauge community requirements for public access to data from NSF-funded research. Nabrzyski is also the PI on a Blue Waters allocation award, “Strategies for Topology and Application Aware Job Resource Management in 3D Torus-based Blue Waters System”, and co-PI on an NIH award, VectorBase: A Bioinformatics Resource Center for Invertebrate Vectors of Human Pathogens. In DASPOS, Nabrzyski and his team develop container-based architectures for preserving data and software so that an analysis can be repeated using only the archived data, software, and analysis description.
Douglas Thain, DASPOS Co-PI
Associate Professor, Computer Science and Engineering
Director of Graduate Studies, University of Notre Dame
Co-PI Thain has been supported by six NSF research grants and one research infrastructure grant since 2004. The most relevant to this work is the CAREER grant (“CAREER: Data Intensive Grid Computing on Active Storage Clusters”, 2007-2012). A significant output of this research project was the production of software prototypes that have been taken up by various user communities. For example, the Parrot interposition agent has seen use with the CDF, CMS, and MINOS experiments; the Chirp filesystem has seen use with ATLAS; and the Makeflow workflow system has been adopted in the bioinformatics community.
Combining Containers and Workflow Systems for Reproducible Execution
Reproducibility of results is central to the scientific method, yet reproducibility is astonishingly hard in the computational sciences. A significant part of the problem is implicit dependencies -- complex programs depend on all sorts of files, operating system services, and network services that are not easily perceived by the end user. In this talk, I'll show two examples of how container technologies can be worked into complex scientific workflows. One example is Umbrella, a tool for generating portable environment specifications across a variety of technologies. The other is PRUNE, a Preserving Run Environment that tracks all the dependencies of a complex workflow as it is created. I'll show how these tools have been applied to the creation and preservation of complex applications in high energy physics, malaria research, and genealogy studies.
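To make the implicit-dependency problem concrete, the sketch below writes out the kind of explicit environment specification that Umbrella-style tools capture so that an execution engine can later reconstruct the environment. It is a minimal illustration in Python; the field names and URLs are hypothetical, not Umbrella's actual schema.

```python
import json

# Hypothetical, Umbrella-style environment specification: every implicit
# dependency (OS, software, data, environment variables) is made explicit
# so the environment can be reconstructed on a different machine later.
spec = {
    "os": {"name": "centos", "version": "6.6"},
    "software": {
        "analysis-toolkit": {
            "source": "http://example.org/toolkit-1.2.tar.gz",  # placeholder URL
            "mountpoint": "/opt/toolkit",
        }
    },
    "data": {
        "input.dat": {"source": "http://example.org/input.dat", "checksum": "sha256:<digest>"}
    },
    "environ": {"PATH": "/opt/toolkit/bin:/usr/bin:/bin"},
}

with open("environment-spec.json", "w") as f:
    json.dump(spec, f, indent=2)
```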
Vincent Batts
Principal Software Engineer
Red Hat
A mindful polyglot, Vincent Batts has spent the last 15 years participating in the Linux and open source community. He is presently involved in the Open Container Initiative as a maintainer and a member of the technical board. He is still a member of the Slackware Core Team and has been a maintainer on the Docker project as well as of the Go programming language for Fedora and Red Hat. He is currently working on all things container architecture in Red Hat's Office of Technology.
What is it we want in containers anyway?
Containers are bringing new standards, a plethora of tooling dynamics, and loads of interest from organizations big and small. The common theme is that everyone sees something about containers that solves a use case of theirs:
- sharing repeatable steps
- lowering the barrier-to-entry
- immutable environment
- continuous integration and delivery
- build, export, import
- isolation and performance
In this talk, Vincent Batts will cover the basics of where containers are today, what standards could enable for container adoption, common assumptions about what "containers" mean, and the gaps that still need to be improved on.
Euan Cochrane
Digital Preservation Manager
Yale University Library
Euan brings a wealth of practical knowledge and experience in digital preservation, developed across a diverse range of positions including establishing the data archive for official statistics at Statistics New Zealand, working in the Digital Continuity team at Archives New Zealand, and consulting for Deloitte in Australia on information management.
In his current role as Digital Preservation Manager at Yale University Library, he established digital preservation services through Preservica and Emulation as a Service. He has a particular interest in software preservation and the use of emulation to maintain access to born digital content and is currently involved in grant funded work in both of these areas. Euan is interested in discussing the long term preservation of containers and the content they contain.
Emulation Remote Access Technology
Container technology provides a very promising avenue for ensuring that all dependencies of computation-dependent science can be packaged together for preservation and sharing once a study has been completed. Containers help ensure that the software, custom code, and configuration of the workflows used in undertaking computation-dependent science can be easily shared, and in doing so they enable the reproduction of the workflows they contain. There are numerous implementations of containers and container-like solutions, and while the benefits of their use in science are readily apparent, there are concerns about their use for long-term preservation of the content they are intended to package. All existing containerization technologies rely to some degree on an underlying operating system (OS). Over the long term, as that OS changes, the containers risk becoming obsolete and unusable.

To mitigate this risk of loss, efforts should be undertaken to preserve access to the underlying operating systems of containers using emulation. Emulation solutions are being implemented for the wider digital preservation community and provide a long-term mitigation of the risk of container obsolescence. An ecosystem of ‘canonical’ preserved operating systems, accessible via emulation, should be established, and it should be possible to test containers against the preserved OSs as part of the review when containers are submitted as part of the publication process. Emulation remote access technologies, such as the bwFLA Emulation as a Service framework from the University of Freiburg, would enable this proposed container preservation approach to be rapidly implemented and scaled to meet the needs of the scientific community.
Rafael Ferreira da Silva
Computer Scientist
Collaborative Computing Group, University of Southern California, Information Sciences Institute
Rafael Ferreira da Silva is a Computer Scientist in the Collaborative Computing Group at the USC Information Sciences Institute. His research interests include scientific workflows, cloud and grid computing, data science, distributed computing, reproducibility, and machine learning. He received his PhD in Computer Science from INSA-Lyon, France, in 2013. In 2010, he received his Master's degree in Computer Science from Universidade Federal de Campina Grande, Brazil, and in 2007 his BS degree in Computer Science from Universidade Federal da Paraiba. See http://www.rafaelsilva.com for further information.
Reproducibility of Execution Environments in Scientific Workflows using Semantics and Containers
Rafael Ferreira da Silva and Ewa Deelman
Reproducibility of results of scientific experiments is a cornerstone of the scientific method. Therefore, the scientific community has been encouraging researchers to publish their contributions in a verifiable and understandable way. In computational science, or in-silico science, reproducibility often requires that researchers make code and data publicly available so that the data can be analyzed in a similar manner as in the original work described in the publication. Code must be made available, and data must be accessible in a readable format. Scientific workflows have emerged as a flexible representation to declaratively express such complex applications with data and control dependencies. Workflows have become mainstream for conducting large-scale scientific research in domains such as astronomy, physics, climate science, earthquake science, biology, and others. Since workflows formally describe the sequence of computational and data management tasks, it is easy to trace the origin of the data produced. Many workflow systems capture provenance at runtime, which provides the lineage of data products and as such underpins scientific understanding and data reuse by providing the basis on which trust and understanding are built. A scientist would be able to look at the workflow and provenance data, retrace the steps, and arrive at the same data products. However, this information is not sufficient for achieving full reproducibility.
In recent years, we have addressed the reproducibility of the execution environment for a scientific workflow by using a logical-oriented approach to conserve computational environments, in which the capabilities of the resources (virtual machines or containers) are described. From this description, any scientist interested in reproducing an experiment is able to reconstruct the former infrastructure (or an equivalent one) in any cloud computing infrastructure (either private or public). One may argue that it would be easier to keep and share VM or container images with the research community through a common repository; however, the high storage demand of keeping these images remains a challenging problem.
Our approach uses semantic-annotated workflow descriptions to generate lightweight scripts for an experiment management API that can reconstruct the required infrastructure. We propose to describe the resources involved in the execution of the experiment, using a set of semantic vocabularies, and use those descriptions to define the infrastructure specification. This specification can then be used to derive instructions that can be executed to obtain a new equivalent infrastructure. We defined and implemented four domain ontologies. From these models, we defined a process for documenting workflow applications, the workflow management system, and their dependencies.
We have used a practical experimentation process to describe workflow applications (astronomy and genomic processing applications) and their environments using a set of semantic models. These descriptions are then used by an experiment management tool to reproduce a workflow execution on different cloud platforms using virtual machines and containers. Experimental results show that our approach can reproduce an equivalent execution environment of predefined VM or container images on academic (NSF Chameleon), public (Amazon and Google), and local (with Vagrant) cloud platforms.
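As a rough illustration of the approach, the sketch below describes an execution environment semantically with rdflib; an experiment-management tool could then derive provisioning instructions from such a description. The namespace and property names are placeholders, not the ontologies defined in the cited work.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Placeholder vocabulary standing in for the domain ontologies used in the work.
ENV = Namespace("http://example.org/execution-environment#")

g = Graph()
env = ENV["montage-execution-environment"]
g.add((env, RDF.type, ENV.ExecutionEnvironment))
g.add((env, ENV.hasOperatingSystem, Literal("CentOS 7")))
g.add((env, ENV.requiresSoftware, Literal("montage-4.0")))
g.add((env, ENV.requiresSoftware, Literal("pegasus-wms")))
g.add((env, ENV.deployedOn, Literal("docker")))

# An experiment-management tool would consume this description and emit the
# commands needed to provision an equivalent VM or container.
print(g.serialize(format="turtle"))
```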
References
1. I. Santana-Perez, R. Ferreira da Silva, M. Rynge, E. Deelman, M. S. Pérez-Hernández, and O. Corcho, “Reproducibility of Execution Environments in Computational Science Using Semantics and Clouds,” Future Generation Computer Systems, in press, 2016.
2. I. Santana-Perez, R. Ferreira da Silva, M. Rynge, E. Deelman, M. S. Pérez-Hernández, and O. Corcho, “A Semantic-Based Approach to Attain Reproducibility of Computational Environments in Scientific Workflows: A Case Study,” in Euro-Par 2014: Parallel Processing Workshops, vol. 8805, pp. 452-463, 2014.
Evan Hazlett
Senior Software Engineer
Docker
Evan Hazlett is a Senior Software Engineer at Docker and is currently the tech lead on the Docker Universal Control Plane. Evan has been working with Docker since the 0.5 release and also maintains several open source Docker solutions ranging from secret management to container load balancing.
Solving Issues of Consistency
One common problem in software delivery is environment consistency. How do you manage software releases across developer machines, test infrastructure, and production? How do you manage application dependencies across platforms? How do you streamline QA and software delivery while maintaining high efficiency in your QA team? This talk will demonstrate how Docker can help solve these problems by providing a consistent development and deployment strategy.
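As a minimal illustration of that consistency (a sketch with the Docker SDK for Python; the image tag and command are arbitrary examples), the same pinned image yields the same runtime whether it runs on a developer laptop, a CI runner, or a production host:

```python
import docker

# Pin an exact tag (or digest) so every environment runs the same image.
IMAGE = "python:3.6-slim"

client = docker.from_env()
output = client.containers.run(
    IMAGE,
    ["python", "-c", "import sys; print(sys.version)"],
    remove=True,  # clean up the container after it exits
)
print(output.decode())  # identical interpreter version on dev, test, and prod hosts
```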
Lukas Heinrich
PhD Student
New York University
Lukas Heinrich is a PhD student at NYU and part of the ATLAS collaboration, working on analysis reproducibility with a focus on re-execution in new physics contexts (RECAST).
LHC data analyses consist of workflows that utilize a diverse set of software tools to produce physics results. These tools range from large software frameworks like Gaudi/Athena to single-purpose scripts written by the analysis teams. Currently, analyses are de facto not reproducible without significant input from the original analysis team, which severely limits re-executing them in the context of a new physics model.
The rapid progress of container technology enables us to easily and efficiently capture the data analysis code and its required dependencies in a portable fashion. Using these technologies, we developed an extensible workflow definition framework that allows complex workflows to be described as arbitrary execution graphs of such containerized data analysis steps. Additionally, a fully containerized demonstrator backend was provisioned that re-executes these analysis workflows based on the workflow descriptions.
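The sketch below is not the ATLAS framework itself, only a minimal illustration of the idea: a workflow expressed as an execution graph of containerized steps and run in dependency order with the Docker SDK for Python. The images, commands, and step names are placeholders.

```python
import docker

client = docker.from_env()

# Each step names an image, a command, and the steps it depends on.
workflow = {
    "generate":    {"image": "busybox", "cmd": "echo generating events", "needs": []},
    "reconstruct": {"image": "busybox", "cmd": "echo reconstructing",    "needs": ["generate"]},
    "fit":         {"image": "busybox", "cmd": "echo statistical fit",   "needs": ["reconstruct"]},
}

def run_in_order(steps):
    """Run each containerized step once all of its dependencies have finished."""
    done = set()
    while len(done) < len(steps):
        for name, step in steps.items():
            if name not in done and all(d in done for d in step["needs"]):
                out = client.containers.run(step["image"], step["cmd"], remove=True)
                print(name, "->", out.decode().strip())
                done.add(name)

run_in_order(workflow)
```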
Da Huo
Department of Computer Science and Engineering
University of Notre Dame
Da Huo is a third-year graduate student in computer science and engineering at the University of Notre Dame. He received his bachelor's degree in Computer Science and Mathematics from DePauw University in 2013. He does research in the areas of data preservation, semantic web technology, and ontology engineering. He is interested in using Docker as a preservation tool and virtualization platform to assist with the preservation, publication, and reproducibility of computational science experiments. He also has strong interests in machine learning and cognitive computing with semantic web applications.
Smart Docker Containers, because provenance should be contagious
Charles F. Vardeman II, Da Huo, Michelle Cheatham, James Sweet, and Jaroslaw Nabrzyski
For data to be useful to scientists, it must be accompanied by the context of how it is captured, processed, and analyzed, and by other provenance information that identifies the people and tools involved in this process. In the computational sciences, some of this context is provided by the identity of software, workflows, and the computational environment where these computational activities take place. Smart Containers is a tool that wraps the standard docker command line tool with the intent to capture some of the context that is naturally associated with a Docker-based infrastructure. We capture this metadata using linked open data principles and web standard vocabularies, such as the W3C PROV-O recommendation, to facilitate interoperability and reuse. This provenance information is attached directly to a docker container label using JSON-LD, thus "infecting" containers and images derived from the original container resource with the contextual information necessary to understand the identity of the contained computational environment and the activities that environment affords. The use of linked data principles allows us to link to other vocabularies and incorporate other efforts such as Mozilla Science's Code as a Research Object, Schema.org, DBpedia, software vocabularies, and ORCID to provide broader context for how a Docker container may be provisioned, following the "Five Stars of Linked Data Vocabulary Use" recommendations [1]. We have extended the PROV-O notion of Activity by creating the formal ontology pattern of Computational Activity and a taxonomy to capture Computational Environment [2]. Lastly, we provide the ability for scientific data to be published and preserved, along with its provenance, using a docker container as a "research bundle". We utilize ideas from the W3C Linked Data Platform recommendation and the W3C work on "Linked Data Fragments" using the Hydra Core Vocabulary, which is still in the development stage, to provide metadata for data entry points inside the docker container as well as the ability to attach RDF metadata to non-RDF dataset resources, which is a common use case in the sciences.
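A minimal sketch of the label-based approach follows; the JSON-LD fragment, the label key "smartcontainer", and the identifiers are illustrative assumptions, not the exact metadata Smart Containers emits.

```python
import json
import subprocess

# A small PROV-O fragment in JSON-LD: the image is an Entity generated by a
# build Activity associated with an (example) ORCID identifier.
provenance = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "@id": "urn:example:analysis-image",
    "@type": "prov:Entity",
    "prov:wasGeneratedBy": {
        "@type": "prov:Activity",
        "prov:wasAssociatedWith": {"@id": "https://orcid.org/0000-0000-0000-0000"},
    },
}

# Attach the JSON-LD to the image as a label at build time.
subprocess.run(
    ["docker", "build",
     "--label", "smartcontainer=" + json.dumps(provenance),
     "-t", "analysis:provenance", "."],
    check=True,
)
```

Because image labels are inherited by images built on top of a labeled base, derived containers carry the provenance record forward unless it is explicitly overridden, which is the "infection" behaviour described above.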
References
1. Krzysztof Janowicz, Pascal Hitzler, Benjamin Adams, Dave Kolas, and Charles Vardeman. Five Stars of Linked Data Vocabulary Use. Semantic Web 5 (2014). http://dx.doi.org/10.3233/SW-140135.
2. Da Huo, Jarek Nabrzyski, and Charles F. Vardeman II. An Ontology Design Pattern towards Preservation of Computational Experiments. Proceedings of the 5th Workshop on Linked Science 2015: Best Practices and the Road Ahead (LISC 2015), co-located with the 14th International Semantic Web Conference (ISWC 2015), October 12, 2015. http://ceur-ws.org/Vol-1572/
Rick Johnson and Cynthia Hudson-Vitale
Co-Director, Digital Initiatives and Scholarship
Hesburgh Libraries, University of Notre Dame
Rick Johnson is the Co-Director of Digital Initiatives and Scholarship at Hesburgh Libraries of the University of Notre Dame. In this role, he directs the design and development of the libraries' data curation and digital library services. In addition, Rick currently serves as a Visiting Program Officer for SHARE with the Association of Research Libraries. Rick has contributed to several collaborations such as DASPOS (Data and Software Preservation for Open Science), and he spearheaded the implementation of the University of Notre Dame’s institutional repository, CurateND, with an emphasis on archiving all research output together. Over the years, he has been active in the multi-institutional Hydra collaboration as both a code committer and technical manager on several digital repository related projects.
Digital Data Outreach Librarian
Data & GIS Services, Washington University in St. Louis Libraries
Cynthia Hudson-Vitale is the Digital Data Outreach Librarian in Data & GIS Services at Washington University in St. Louis Libraries. In this position, Hudson-Vitale leads research data services and curation efforts for the Libraries. Since coming into this role in 2012, Hudson-Vitale has worked on several funded faculty projects to facilitate data sharing and interoperability, while also providing scalable curation services for the entire University population. Hudson-Vitale currently serves as the Visiting Program Officer for SHARE with the Association of Research Libraries.
SHARE: Indexing, Linking, and Sharing Research Artifacts and Workflows
SHARE is an academic, community-driven initiative that is developing an open, structured (meta)dataset of scholarly outputs across the research lifecycle, including software. SHARE is gathering, cleaning, linking, and enhancing metadata that describes research activities and outputs—from data management plans and grant proposals to preprints, presentations, software, journal articles, and data repository deposits—in order to make this metadata openly and freely available. Its primary service, SHARE Notify, indexes from over 100 data providers, including the Open Science Framework (OSF). This lightning talk will give an overview of the diversity of data providers, types of indexed research events and artifacts, and future opportunities. In addition, we will discuss how integration work between NDS and the OSF can feed into SHARE.
SHARE is an initiative led by the Association of Research Libraries (ARL) and the Center for Open Science (COS) with the support of the Association of American Universities (AAU) and the Association of Public and Land-grant Universities (APLU). SHARE is underwritten, in part, by generous funding from the Institute of Museum and Library Services (IMLS) and the Alfred P. Sloan Foundation.
Maciej Malawski
Department of Computer Science
AGH University of Science and Technology, Krakow, Poland
Maciej Malawski holds a Ph.D. in Computer Science and M.Sc. degrees in Computer Science and in Physics. He is an assistant professor and a researcher at the Department of Computer Science AGH and at ACC Cyfronet AGH, Krakow, Poland. He was a postdoc and a visiting faculty member at the Center for Research Computing, University of Notre Dame, USA. He is a coauthor of over 50 international publications, including journal and conference papers and book chapters. He participated in the EU ICT Cross-Grid, ViroLab, CoreGrid, VPH-Share, and PaaSage projects. His scientific interests include parallel computing, grid and cloud systems, resource management, and scientific applications.
Deployment of scientific workflows into containers with PaaSage
Maciej Malawski, Bartosz Balis, Kamil Figiela, Maciej Pawlik, Marian Bubak
PaaSage is a model-based cloud platform to provision resources from multiple clouds, deploy applications, and automatically scale them according to application demands. PaaSage utilizes the Cloud Modeling and Execution Language (CAMEL), a DSL describing various aspects of the application, including its deployment model and scalability rules. PaaSage includes the Upperware subsystem, responsible for processing the CAMEL model and planning the deployment according to the requirements and optimization objectives specified in CAMEL. Subsequently, the refined model is passed to the Executionware subsystem, responsible for the deployment and steering of the application in the cloud. Executionware recently introduced Docker containers to deploy individual application components in virtual machines in the cloud.
One of the interesting use cases for PaaSage is large-scale scientific workflows, which can take advantage of resources available from clouds. HyperFlow is a lightweight workflow engine that enables the execution of scientific workflow tasks on available computing resources (e.g. virtual machines in a cloud). HyperFlow is based on Node.js and allows combining execution of tasks in the cloud with advanced programming capability within the workflow engine.
To combine PaaSage with HyperFlow, we developed a CAMEL model of a workflow application to be deployed on the cloud. This model describes the components comprising the workflow management system, including the HyperFlow engine, the Redis database for storing workflow state, the RabbitMQ messaging system, and the HyperFlow Executor component running on worker nodes. In addition to modeling in CAMEL, we prepared a set of scripts supporting all stages of the component lifecycle, such as installation, configuration, and start or stop actions. Application binaries executed by workflow tasks are also deployed in a similar way. Our deployment scripts use Chef recipes, which facilitate automation and portability, as well as reusability of the deployment in environments other than PaaSage. We have adapted our scripts to be compatible with Docker, and we also prepared custom Docker images to simplify and speed up the deployment of our workflow system.
Combining PaaSage with HyperFlow has several benefits for users of workflow applications. Thanks to PaaSage, the whole workflow system, together with the application, can be deployed on multiple clouds (e.g. Amazon EC2, OpenStack, or Flexiant) using the same high-level description. Automation of deployment helps achieve portability across infrastructures. It is not required to maintain a separate machine hosting the workflow management system, since all the components are deployed on demand and can be shut down after completing the workflow execution. Our future work includes experiments with scientific applications provided by PaaSage partners, and performance comparison with other solutions, such as Docker Compose.
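In PaaSage the deployment itself is driven by the CAMEL model and the Executionware; purely as an illustration of the component topology described above, the same pieces can be brought up as containers with the Docker SDK for Python. The HyperFlow engine image name and environment variables below are assumptions.

```python
import docker

client = docker.from_env()
client.networks.create("hyperflow-net")  # shared network so components reach each other by name

# Redis (workflow state) and RabbitMQ (messaging) use official images.
client.containers.run("redis:3", detach=True, name="hf-redis", network="hyperflow-net")
client.containers.run("rabbitmq:3", detach=True, name="hf-rabbit", network="hyperflow-net")

# Hypothetical HyperFlow engine image, pointed at the two services above.
client.containers.run(
    "example/hyperflow-engine",
    detach=True,
    name="hf-engine",
    network="hyperflow-net",
    environment={"REDIS_URL": "redis://hf-redis:6379", "AMQP_URL": "amqp://hf-rabbit"},
)
```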
References
1. K. Jeffery, G. Horn, and L. Schubert, “A vision for better cloud applications,” in Proceedings of the 2013 international workshop on Multi-cloud applications and federated clouds. ACM, 2013, pp. 7–12.
2. B. Balis, “HyperFlow: A model of computation, programming approach and enactment engine for complex distributed workflows,” Futur. Gener. Comput. Syst., vol. 55, pp. 147–162, Sep. 2016
Tanu Malik
Computer Scientist
Computation Institute, University of Chicago
Tanu Malik is a Fellow at the Computation Institute, a joint initiative between the University of Chicago and Argonne National Laboratory. Her research is in data science and data management, focusing on problems encountered in large-scale scientific analyses. She actively collaborates with astronomers, geoscientists, and urban scientists across several institutions. Her research is funded by the National Science Foundation, the Department of Energy, and the Sloan Foundation. She has a doctorate from the Johns Hopkins University and a bachelor's degree from the Indian Institute of Technology, Kanpur.
Reproducibility-as-a-Service
Joint work with Ian Foster, University of Chicago and Jonathan Goodall, Bakinam Essawy, University of Virginia
Container-based technology is becoming mainstream as a means to ensure reproducibility of scientific applications. But container-based technology itself is not sufficient. Scientists need the containers to be descriptive, well documented, easy to verify, and simple to run. This lightning talk will present SciDataspace, a service framework that integrates a variety of services with containers. These services include sharing, authentication, annotation, provenance querying, and re-execution. Using use cases from geoscience applications, we will demonstrate how the service-integrated framework can dramatically reduce the cost of reproducing scientific applications. The lightning talk will further present the notion of a sciunit, which is a descriptive, annotated, and provenance-enabled container, and issues with respect to managing such containers on data hubs.
Sharing and Reproducing Database Applications
Sharing and repeating scientific applications is crucial for verifying claims, reproducing experimental results (e.g., to repeat a computational experiment described in a publication), and promoting reuse of complex applications. The predominant methods of sharing and making applications repeatable are building a companion web site and/or provisioning a virtual machine image (VMI). Recently, application virtualization (AV) has emerged as a lightweight alternative for sharing and efficient repeatability. AV approaches such as Linux Containers create a chroot-like environment, while approaches such as CDE trace system calls during application execution to copy all binaries, data, and software dependencies into a self-contained package. In principle, application virtualization techniques can also be applied to DB applications, i.e., applications that interact with a relational database. However, these techniques treat a database system as a black-box application process and are thus oblivious to the query statements or database model supported by the database system.

To overcome this shortcoming, and to leverage database semantics, we have introduced light-weight database virtualization (LDV), a tool for creating packages of DB applications. An LDV package is light-weight as it encapsulates only the application and its necessary and relevant dependencies (input files, binaries, and libraries), as well as only the necessary and relevant data from the database with which the application interacted. LDV relies on data provenance to determine which database tuples and input files are relevant. While monitoring an application to create a package, we incrementally construct an execution trace (provenance graph) that records dependencies across OS and DB boundaries. In addition to providing a detailed record of how files and tuples have been produced by the application, we use the trace to determine what should be included in the package.

The primary objective of this demonstration is to show the benefits of using LDV for repeating and understanding DB applications. For this, we consider real-world data sharing scenarios that involve a database and highlight the sharing and reproducibility challenges associated with them. We give an overview of our LDV approach to show how it can be used to build a light-weight package of a DB application that can be easily shared and reproduced. During the demonstration, the audience will experience three key features of LDV: (i) its ability to create self-contained packages of a DB application that can be shared and run on different machine configurations without the need to install a database system and set up a database, (ii) how LDV extracts a slice of the database accessed by an application, and (iii) how LDV's execution traces can be used to understand how the files, processes, SQL operations, and database content of an application are related to each other.
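To illustrate the role of the provenance graph (a toy sketch using networkx, not LDV's internal representation; the node names are made up), the execution trace links OS-level and DB-level artifacts, and the slice that must go into the package is whatever the final outputs transitively depend on:

```python
import networkx as nx

# Edges point from an artifact to what it depends on, across OS and DB boundaries.
trace = nx.DiGraph()
trace.add_edge("file:results.csv", "process:analysis.py", relation="wasGeneratedBy")
trace.add_edge("process:analysis.py", "file:config.ini", relation="used")
trace.add_edge("process:analysis.py", "sql:SELECT * FROM obs WHERE site='A'", relation="issued")
trace.add_edge("sql:SELECT * FROM obs WHERE site='A'", "tuples:obs(site='A')", relation="read")

# Everything the output transitively depends on -- including the relevant
# database slice -- is what the package needs to contain.
needed = nx.descendants(trace, "file:results.csv") | {"file:results.csv"}
print(needed)
```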
Kenton McHenry
Senior Research Scientist
National Center for Supercomputing Applications (NCSA)
Kenton McHenry received a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2008 after completing a B.S. in Computer Science from California State University, San Bernardino. Kenton's background is in computer vision and artificial intelligence, having done work in the areas of image segmentation, object/material recognition, and 3D reconstruction. He is currently a Senior Research Scientist at the National Center for Supercomputing Applications (NCSA), where he serves as Deputy Director for the Scientific Software and Applications division, co-leads the Innovative Software and Data Analysis (ISDA) group, and holds an Adjunct Assistant Professor position in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Kenton has applied his experience in computer vision, AI, and machine learning towards research and development in software cyberinfrastructure for digital preservation, auto-curation, and providing access to contents in large unstructured digital collections (e.g. image collections). Kenton serves as PI/Co-PI on a number of awards from a variety of agencies and organizations, including NSF, NIH, NEH, and private sector partners. Kenton currently serves as the Project Director and PI of NSF CIF21 DIBBs - Brown Dog, where his team works on means of making data agnostic to the file formats in which they are stored and providing general-purpose, easy-to-use tools to access uncurated collections via the automatic extraction of metadata and signatures from raw file contents.
Towards a National Data Service: Leveraging Tools, Frameworks, and Services Across Communities and Efforts
Nearly all scientific fields have reached a point where they rely to some degree on computation in the research they carry out, whether it be saving the outputs of the work, analyzing data with a software package, running computer models, maintaining digital data collections, or performing computationally expensive simulations. The need for each scientific community to support work involving computation and digital data, compounded by the rapid development of computer technology, has resulted in a very diverse set of software and services being utilized by science overall. This diversity, and the often overlapping activity it entails, makes supporting these efforts efficiently and over the long term challenging, in particular where the reproduction of scientific results is involved, a key aspect of the definition of science itself. Agencies such as NSF, NIH, and NIST have been actively investing in technology, both hardware and software, to address these needs. Recently, towards addressing what many refer to as the “Big Data” problem, a number of efforts have been undertaken to provide data management, access, and analysis tools both at the community level and more broadly.
The National Data Service Consortium (NDSC) is a collaboration of providers, developers, and users of these tools and services, as well as academic publishers and archives, in an attempt to bring these diverse capabilities together. The NDSC aims not only to increase exposure for the tools being developed but also to address the reuse and broad interoperability of these technologies, as a means of both addressing needs within scientific communities and allowing for the publication, sharing, discovery, and reuse of scientific data itself towards enabling novel discoveries. Towards these goals, container technology has been playing a larger and larger role both within the efforts that make up the NDSC and in the tools being developed by it to address the exposure and interoperability of these technologies. I will review a number of the efforts that are making use of containers as part of the services they provide, how the containers are used, and show the current state of the NDS Labs Workbench.
Elliot Metsger
Senior Software Engineer
Digital Research and Curation Center, Sheridan Libraries, Johns Hopkins University
Elliot Metsger is a Senior Software Engineer at the Digital Research and Curation Center, Sheridan Libraries, Johns Hopkins University. He develops software for the Data Conservancy, an organization dedicated to digital archiving and preservation, and is active in the Fedora community. His current work focuses on digital packaging as an enabler of curation and provenance activities.
Packaging Digital Content is Not a New Concept
Packaging of digital content is not a new concept. Existing packaging specifications provide a foundation for describing the contents (e.g. file checksums and manifests) of a package, but lack the richness required to express the semantics and relationships of objects within the package.
This lightning talk introduces three complementary outputs of the Data Conservancy, enabling semantic description and ingestion of packages:
- The Data Conservancy Packaging Specification as a standard mechanism for including semantic descriptions of package content,
- the DC Package Tool GUI, a JavaFX application enabling the description and production of specification-compliant packages, and
- the DC Package Ingest Service, an Apache Camel-based engine for consuming specification-compliant packages and depositing them into a Fedora 4 digital repository.
Extending BagIt and leveraging OAI-ORE, the Data Conservancy Packaging Specification provides the ability to include semantic descriptions of package content. The specification is agnostic with respect to the domain models used to describe package content, and may be used with any RDF-based domain model. The Package Tool GUI supports multiple domain models, enabling semantic enrichment of a package using a point-and-click interface. Finally, the Package Ingest Service deposits the contents of a package into a Fedora 4 digital repository, exposing the contents of the package using Linked Data principles.
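The BagIt layer that the specification builds on can be sketched with the Python bagit library (the directory name and metadata values are illustrative; the OAI-ORE/domain-model description that the Data Conservancy specification adds on top is only indicated in a comment):

```python
import bagit

# Turn an existing directory of payload files into a BagIt bag with checksums,
# manifests, and bag-info metadata.
bag = bagit.make_bag(
    "my-dataset",
    {"Source-Organization": "Example Lab", "Contact-Name": "Jane Researcher"},
)

# The Data Conservancy specification would additionally include an OAI-ORE/RDF
# description of the package contents before the package is deposited.
print(bag.is_valid())
```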
Natalie Meyers
E-Research Librarian
Digital Initiatives and Scholarship, Hesburgh Libraries, University of Notre Dame
Ms. Meyers is an E-Research Librarian in Digital Initiatives and Scholarship, where she helps pioneer and provide research data consulting services, including more in-depth data management services in support of grant-funded research. Natalie is currently Partnerships and Collaborations Manager at the Center for Open Science during a part-time leave from Notre Dame. She is thankful to have an opportunity to advance the work of this young organization, which is making great strides in promoting scientific openness, reproducibility, and data sharing. Natalie devotes a significant part of her regular time as an embedded e-research librarian for grant-funded research by faculty members and serves as an advisor to groups and individuals regarding data and digital content management. She provides advice and works with units across campus and externally to provide collaborative, team-based support for data management needs, including the development of GIS, as well as data and metadata services for the Center for Digital Scholarship. She also advises other library initiatives as needed and hosts, conducts, and designs related workshops.
Daniel Nüst
Researcher, Spatio-temporal Modelling Lab
Institute for Geoinformatics, University of Münster
Daniel is a researcher at the Spatio-temporal Modelling Lab at the Institute for Geoinformatics (ifgi), University of Münster, Germany. He pursues a PhD in the context of the DFG project Opening Reproducible Research (http://o2r.info). Before that, he was a consultant, researcher, and software developer at 52°North Initiative for Geospatial Software Solutions GmbH, after getting a Diploma in Geoinformatics from ifgi. His professional interest is making a little bit more sense of the world with new information technology (of course with open source software) and hopefully having a positive impact. Topics of his interest are reproducibility, containerization, geoprocessing, standardization, and the sensor web. Daniel enjoys playing Ultimate Frisbee, the greatest sport ever invented by men.
Opening Reproducible Research
Reproducibility of computational research is a challenge of overwhelming complexity. The research project "Opening Reproducible Research" (http://o2r.info/, DFG-funded with six person-years) builds a platform on top of simple, focused, and powerful building blocks: a BagIt bag carries a Docker image as well as the corresponding Dockerfile. The image executes an analysis when it is started and validates the generated output, based on a single working directory with data and code. This gives us two levels of reproducibility: the first builds the software environment from a complete scripted definition; the second falls back to the original self-contained run-time image created at the time of submission of the research paper.
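A rough sketch of the two levels with the Docker SDK for Python follows (paths, tags, and the expected checksum are placeholders; o2r's actual tooling and validation logic differ):

```python
import hashlib

import docker

client = docker.from_env()

# Level 1: rebuild the environment from the archived, fully scripted Dockerfile.
client.images.build(path="compendium/", tag="paper-analysis")
# Level 2 (fallback): load the archived run-time image instead, e.g.
#   with open("compendium/image.tar", "rb") as f:
#       client.images.load(f.read())

# Run the analysis against the archived working directory...
client.containers.run(
    "paper-analysis",
    remove=True,
    volumes={"/abs/path/compendium/workdir": {"bind": "/workdir", "mode": "rw"}},
)

# ...and validate the regenerated output against a checksum recorded at submission.
produced = hashlib.sha256(open("/abs/path/compendium/workdir/figure1.png", "rb").read()).hexdigest()
expected = "<checksum recorded at submission time>"
print("reproduced" if produced == expected else "output differs")
```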
The more challenging goals of the project, which include switching out data and code between archived papers and well-documented licensing for all data, text, and code, are approached with a mixture of semantic metadata and conventions. We will implement open-source prototypes for a cloud-based infrastructure to create, validate, and interact with reproducible open-access publications. The architecture and implementations will not reinvent publication or archival processes, but will integrate with existing open-source platforms.
We focus on the actual users and their needs in the domain where we are most active: geospatial analysis using R. The technical challenges are simple compared to the required shift in scientists' mindsets, the required education, and the adjustments of workflows. Therefore, the efforts in the project are divided equally among the perspectives of archival and preservation, usability and evaluation, and architecture and standardisation.
Coming from the architecture role, I will talk about technical solutions and details, but would highly appreciate sharing our overall vision and approach to data and software preservation with the workshop participants. In turn, I am sure our project will greatly benefit from the collected experience at the workshop, and the potential to start new collaborations.
Vicky Steeves and Rémi Rampin
Librarian for Research Data Management and Reproducibility
NYU Division of Libraries & Center for Data Science
Vicky Steeves is the Librarian for Research Data Management and Reproducibility, a dual appointment between the New York University Division of Libraries and the NYU Center for Data Science. In this role, she supports researchers in creating well-managed, high-quality, and reproducible research by facilitating the use of tools such as ReproZip.
Research Engineer
NYU Tandon School of Engineering
Rémi Rampin is a Research Engineer at the NYU Tandon School of Engineering, where he maintains VisTrails, a scientific workflow and provenance management system, and develops ReproZip, a tool for creating reproducible packages of experiments and environments.
ReproZip
Reusing and reproducing research is a foundation of science that has become increasingly difficult in a digital environment. The evolution of tools, libraries, and formats makes it hard to recreate an environment in which the original work will yield the same results, or work at all. When researchers are asked for data or details on the required environment, the standard response is “it’s in the paper”, and those seeking to reproduce the work often end up having to make do with tables and figures loosely describing the original process, with no way to try it for themselves.

Beyond the technical limitations, there is a lack of documentation for digital data: the standard response is “it’s in the paper” when, in reality, details such as the version of Python used or the software associated with the data files are not. This is because documenting them takes significant effort (and is error-prone), and it is not seen as adding value to the research. This lack of documentation, coupled with the fast-paced landscape of research technology, makes reproducibility all the more difficult.
This talk features ReproZip, an open source tool developed at NYU that seeks to lower the barrier to making research reproducible by allowing reproducibility to be an afterthought. ReproZip allows researchers to create a compendium of their research environment by automatically tracing programs and identifying all their required dependencies (data files, libraries, configuration files, etc.). After two commands, the researcher ends up with a neat archived package of their research that they can then share with anyone else, regardless of operating system or configuration. Community members can unpack the package using ReproUnzip and reproduce the findings.
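For a sense of the two-command capture and the replay step (invoked here via subprocess; the experiment script and archive names are placeholders), the capture side is reprozip trace and pack, and the replay side uses one of the reprounzip unpackers, for example the Docker one:

```python
import subprocess

# Capture: trace the experiment to record programs, files, and libraries it uses.
subprocess.run(["reprozip", "trace", "python", "my_experiment.py"], check=True)
# Package: bundle the traced environment into a single .rpz archive.
subprocess.run(["reprozip", "pack", "my_experiment.rpz"], check=True)

# Replay (on any machine): unpack the archive and re-run it, here with Docker.
subprocess.run(["reprounzip", "docker", "setup", "my_experiment.rpz", "rerun-dir"], check=True)
subprocess.run(["reprounzip", "docker", "run", "rerun-dir"], check=True)
```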
James Sweet
Graduate Student
Department of Computer Science and Engineering, University of Notre Dame
SmartContainers (sc) for Docker-enabled software and data preservation
SmartContainers is a Python wrapper for Docker that facilitates the recording and tracking of provenance information using the W3C PROV-O recommendation. SmartContainers is being developed as part of the Data and Software Preservation for Open Science (DASPOS) project.
Current build status: SmartContainers provides a command line tool, sc, that acts as a surrogate for the docker command line tool. sc docker will create a Docker label with provenance metadata, using the W3C PROV-O vocabulary, describing the computational environment created or provided by a particular Docker container. A Python setup file is provided for installation of the command line utility. It is recommended to install the tool in a Python virtual environment, since the tool is under heavy development; running "pip install ." will install the tool and its dependencies in that virtual environment.
User Identity setup
On first use after installation, the sc command will guide the user through connecting the tool with the user's ORCID. It is recommended to set up an ORCID account to connect to the tool. If the user chooses not to create an ORCID account, the tool will prompt for a first and last name, email, and organization to use in provenance information. A global configuration file containing this information will be created so that it only needs to be entered once. The configuration file is written to a .sc directory created in the user's home directory. In the future, the configuration file location will be a user option.
Purpose
For data to be useful to scientists, it must be accompanied by the context of how it is captured, processed, and analyzed, and by other provenance information that identifies the people and tools involved in this process. In the computational sciences, some of this context is provided by the identity of software, workflows, and the computational environment where these computational activities take place. Smart Containers is a tool that wraps the standard docker command line tool with the intent to capture some of the context that is naturally associated with a Docker-based infrastructure. We capture this metadata using linked open data principles and web standard vocabularies, such as the W3C PROV-O recommendation, to facilitate interoperability and reuse. This provenance information is attached directly to a docker container label using JSON-LD, thus "infecting" containers and images derived from the original container resource with the contextual information necessary to understand the identity of the contained computational environment and the activities that environment affords. The use of linked data principles allows us to link to other vocabularies and incorporate other efforts such as Mozilla Science's Code as a Research Object, Schema.org, DBpedia software vocabularies, and ORCID to provide broader context for how a Docker container may be provisioned, following the "Five Stars of Linked Data Vocabulary Use" recommendation. We have extended the PROV-O notion of Activity by creating the formal ontology pattern of Computational Activity and a taxonomy to capture Computational Environment. Lastly, we provide the ability for scientific data to be published and preserved, along with its provenance, using a docker container as a "research bundle". We utilize ideas from the W3C Linked Data Platform recommendation and the W3C work on "Linked Data Fragments" using the Hydra Core Vocabulary, which is still in the development stage, to provide metadata for data entry points inside the docker container as well as the ability to attach RDF metadata to non-RDF dataset resources, which is a common use case in the sciences.
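Downstream tools can read the attached provenance back from the image label; the sketch below uses the Docker SDK for Python, and the image tag and the label key "smartcontainer" are assumptions for illustration rather than the exact key sc writes.

```python
import json

import docker

client = docker.from_env()

# Fetch the labels of a previously built, provenance-labeled image and parse
# the JSON-LD provenance document stored under the assumed label key.
labels = client.images.get("analysis:provenance").labels
prov = json.loads(labels["smartcontainer"])
print(prov.get("prov:wasGeneratedBy"))
```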
Jian Tao
Research scientist / IT consultant
Center for Computation and Technology, Louisiana State University
Jian Tao is a research scientist / IT consultant at the Center for Computation and Technology (CCT) at Louisiana State University (LSU). He received his Ph.D. in Computational Astrophysics from Washington University in St. Louis.
Before becoming a research scientist, he worked at CCT as a postdoc on the NSF XiRel project to build the next-generation infrastructure to model gravitational waves, and on the NSF CyberTools project to develop the infrastructure needed for interdisciplinary research in the computational sciences. He helped manage the cyberinfrastructure development of the NSF Northern Gulf Coastal Hazards Collaboratory (NG-CHC) project at LSU, where he led the development of SIMULOCEAN, a Service-Oriented Architecture (SOA) for deploying coastal models on High Performance Computing (HPC) systems. He is a member of the Cactus framework group and the PI of an ongoing NSF BIGDATA (SMALL) project to improve both the performance and usability of the HDF5 library, which is widely used in the scientific computing community. He is also a Co-PI of the LSU CUDA Research Center and works on an OpenACC-to-OpenCL translator (RoseACC) that helps port legacy code to GPGPUs, Xeon Phi, and FPGAs.
Orchestrating Containerized Scientific Applications with SIMULOCEAN
Compared to the quick adoption of cloud computing technology in industry, the academic community, and especially the computational science community as a whole, has been slow to make the move, partially because of the lack of investment in cloud-ready systems from NSF and other major funding agencies. For years, many researchers and engineers who did not run large-scale applications regularly were inhibited by the effort needed to gain the specialized knowledge required to effectively use HPC resources for their research. Their time could be better spent on their research if they did not have to worry about how to run their applications. It was not a surprise that, with the NSF Cloud initiative, NSF recently announced two $10 million projects, “Chameleon” and “CloudLab”, to enable the academic research community to drive research on a new generation of innovative applications for cloud computing and cloud computing architectures.
The Scientific Application Repository (SAR) targets such cloud and cloud-like architectures to enable quick deployment of scientific applications and their working environments. SAR will serve as a community repository for precompiled open source applications that are widely used by computational science researchers. While source code for various executables and libraries will be available, SAR will also introduce the distribution of containerized scientific applications, which can run on any cloud-like architecture directly and with negligible system overhead. The idea of containerization of cloud-ready applications is not new, but it has become a viable solution given the rapid development of kernel-level virtualization technologies. One such technology, Docker, is an open platform allowing developers to build, ship, and run distributed applications in self-contained environments. Docker enables executable applications to be quickly assembled from components and then run by a user without the need to rebuild or satisfy any external dependencies. As a result, a Docker-enabled app can be reliably executed in a known operating system environment on any system that supports Docker containers. With the help of SAR, a researcher can start running state-of-the-art scientific applications on the latest cloud-ready computing systems in minutes. Workflow management tools, such as SIMULOCEAN, can take advantage of SAR to quickly deploy scientific applications on academic and commercial cloud platforms, while supporting certain applications on traditional HPC systems for large-scale execution. This work is supported by NSF Award CCF-1539567 and in part by HPC computing resources at LSU. We acknowledge the assistance and support of the CSDMS Integration Facility and the XSEDE Extended Collaborative Support Service (ECSS) program.
Ian Taylor
Research Associate Professor and Reader
University of Notre Dame, Center for Research Computing and Cardiff University, UK
Ian Taylor is an adjunct research associate professor at the CRC, Notre Dame, and a Reader at Cardiff University, UK. He also consults often for the Naval Research Lab (NRL) and has led the IT development and infrastructures for several startups and redevelopment projects for existing businesses. Ian has a degree in Computing Science and a Ph.D. researching and implementing artificial neural network types for the determination of musical pitch. After his Ph.D., he joined the gravitational-wave group at Cardiff, where he designed, procured, and engineered the implementation of the data acquisition system for the GEO 600 gravitational wave detector. He also wrote the Triana workflow system and managed it thereafter. Ian's research over the last 25 years has covered a broad range of distributed computing areas, but he now specializes in Web interaction and APIs, big data applications, open data access, distributed scientific workflows, and data distribution, with application areas ranging from audio, astrophysics, and engineering to bioinformatics and healthcare. He has managed over 15 research and industrial projects, published over 150 papers and 3 books, acted as guest editor for several special issues of journals, and chairs the WORKS workflow workshop yearly at Supercomputing. Ian won the Naval Research Lab best research paper (Alan Berman) prize in 2010, 2011, and 2015.
A Web Dashboard for Repeating and Reusing Research
In this presentation I will provide an overview of a dashboard we have been developing, which provides researchers a means of interacting with existing research. This work was motivated by the National Data Service (NDS), an emerging vision of how scientists and researchers across all disciplines can find, reuse, and publish data. NDS intends to provide an international federation of data providers, data aggregators, community-specific federations, publishers, and cyberinfrastructure providers by linking data archiving and sharing efforts together with a common set of tools. This talk will provide the status of the two existing proof-of-concept pilot dashboard implementations and how we plan to evolve this work. The researcher dashboard aims to provide an intuitive Web-based interface to expose fully interactive research containers that support the lifecycle of scholarly communication. Research containers enable executable and repeatable research by supporting methods, source code, and data within dynamically created Docker containers. We will discuss its two versions: a Yii and SQL engine dashboard integrated with the NDS Labs Epiphyte API, and a version that interoperates with the Open Science Framework (OSF), an environment that supports open materials, data, tools to connect projects and initiatives, and easy online publishing of results. Using the latter system, a researcher can create a project on the OSF, connect data management tools to it (e.g. Google Drive, Dropbox, Box, Dataverse, and so on), and then use the dashboard (implemented as an add-on) to execute methods on OSF data in a container using the Boatload API. Boatload is an API for automating the deployment and operation of Docker containers on clusters. Looking forward, we plan to adopt a more lightweight federated approach by using a single-page application (SPA) framework, built with Ember, to integrate with multiple authentication and data infrastructures.
Alexander Vyushkov
Research Programmer
Center for Research Computing, University of Notre Dame
Alexander Vyushkov is a research programmer at the Center for Research Computing, University of Notre Dame. He received an M.S. in physics from Novosibirsk State University in Russia. He works closely with Notre Dame faculty to develop scientific software and web portals. He is a lead developer for the Data and Software Preservation for Open Science (DASPOS) project.
Abstract Coming Soon
Dave Wilkinson
Computer Science
University of Pittsburgh
Dave Wilkinson is a digital archivist and software preservation developer at the University of Pittsburgh. Their interests are in the building of tools for the preservation of science, art, and culture and also the design of distributed social systems.
OCCAM – Live Interactive Archival
The problem of software preservation is generally solved in one of two ways: simply hosting files and a README, or providing a full-scale virtual machine image. OCCAM solves the preservation problem by meeting in the middle. Objects are described with enough metadata to build a virtual machine image that runs a piece of software as well as it can, given the native machine you own. OCCAM uses information it learns over time and from other servers to decide how to build virtual machines in the future. You can teach it, for instance, how to run DOS programs and games, and even give it some means of running Super Nintendo games. It will build a virtual machine on your laptop, but it won't necessarily be the same virtual machine for your workstation, server, or your new computer 20 years from now. OCCAM has the ability to make new decisions at a later date.
OCCAM objects can record how they are built, how they are run, and how they can be combined with other tools in workflows. An example workflow would be combining objects such as a set of applications, a trace generator, and a simulator to encapsulate a scientific experiment for a computer architecture study. One can be pointed to that experiment at any time, duplicate it, and change the parameters, all with a reasonable assurance that it will still run. Combined with the ability to create versatile virtual machine images over time, this system can preserve the ability to repeat and replicate digital experiments and reproduce results.
Anita de Waard
Senior Product Manager for Research Data / Vice President Research Data Collaborations
Elsevier Research Data Services
Research Elements Give Credit for your Software
Planning experiments starts a cycle of work, which includes creating experimental designs, tweaking methods, developing protocols, writing code, collecting and processing experimental data, etc. A large part of this process does not get published, which makes experiments difficult to reproduce. To address this concern, Elsevier has launched a series of peer-reviewed journal titles grouped under the umbrella name ‘Research Elements', which allow researchers to publish their data, software, materials and methods and other elements of the research cycle as a brief article.
In particular, many researchers spend a great amount of time developing software, but do not receive recognition for this work. Their software is not as easily accessible, citable, and findable as other scientific results are. The journal SoftwareX aims to address this issue by giving software a stamp of approval. The interdisciplinary nature of SoftwareX allows it to showcase use cases of software that offers transferable tools and services which may be of use to communities beyond the one it was developed for. SoftwareX also ensures software preservation, as the published version of the source code is stored in a dedicated GitHub repository. The journal is open access and publishes open source software. The novelty of this journal has been officially recognized by the Professional and Scholarly Publishing division of the Association of American Publishers: in 2016, SoftwareX received the prestigious PROSE Award for Innovation in Journal Publishing.
The same holds for research data: many types of data, such as replication data, negative datasets, or data from “intermediate experiments”, don't get published because they are not within the scope of a research journal. The multidisciplinary journal Data in Brief publishes data articles and gives researchers the opportunity to share data directly, provided its utility to the rest of the scientific community can be justified. This gives researchers a way to share their data in a manner that is peer reviewed, citable, and easily discoverable. The journal is open access as well, and all the data associated with data articles is made publicly available upon publication. The benefits of publishing in a data or software journal are that the data and software benefit from the peer review process, and researchers are given credit in a format that grant and faculty review panels recognize. At the same time, the data and software articles are discoverable through the same search engines that researchers use to search the published scientific literature, and thus improve reproducibility and data and software reuse.