The best way to learn about the final results of this project is to read our paper:
Sara Alspaugh and Ann Chervenak. Policy-Driven Data Management for Distributed Scientific Collaborations Using a Rule Engine. CRA-W Distributed Mentor Project Final Report. September 2008. (pdf)
Also check out the poster that was accepted to SC08!
Sara Alspaugh, Ann Chervenak, and Ewa Deelman. Policy-Driven Data Management for Distributed Scientific Collaborations Using a Rule Engine. ACM Student Competition Finalist Poster Paper to appear at The International Conference for High Performance Computing, Networking, Storage and Analysis. Austin, Texas, November 2008. (summary: pdf)
This work focuses on researching policy-driven data management services for use by scientific virtual organizations that use Grid infrastructure and technology. In particular, we are interested in developing policy-driven data placement services to work within the Grid framework espoused and implemented by the
Globus Alliance, a community of developers and scientists creating and using Grid technologies.
This research will seek to identify an appropriate method for implementing policy-driven data management and to compare its performance, in terms of scalability, availability, efficiency, and other metrics, with alternative methods for incorporating policy concerns into data management schemes.
Initial research will focus on using a rule engine to enforce policies and oversee data movement. In the long term, however, we are interested not only in the feasibility and effectiveness of this approach in terms of the metrics mentioned previously, but also in how policy-driven data management might conflict with other data-related concerns, such as workflow execution. We also want to determine what complexities arise when policy-driven data management schemes are put in place in more realistic situations, namely a dynamic Grid environment in which many virtual organizations (VOs) and their applications compete for resources whose availability may change over time.
Many scientific collaborations make use of shared, distributed computational and storage resources to execute large scientific applications that process and generate petabytes of data, and to manage and distribute the volumes of data that these applications create. The problems posed by these and similar scenarios fall under the domain of Grid computing. (See
The Anatomy of the Grid, by Foster, Kesselman, and Tuecke, for a definition and thorough overview of this field.) In Grid computing parlance, such scientific collaborations, consisting of many scientists and individuals from independent institutions contributing their resources to achieve the goals of the collaboration, are known as virtual organizations (VOs).
VO collaborations are formed in many fields, including high energy physics, computational genomics, and climate modeling. Scientists in these fields have begun to conduct what is known as petascale science, in which numerical simulations or experimental apparatus produce petabytes of data. This data is increasingly becoming an important community resource, but to be used effectively it must be managed so that the scientific community can access, visualize, reproduce, analyze, and learn from it. Thus, data management has become a critical aspect of science.
Data management is driven by many concerns, such as performance, reliability, security, and availability. Data management goals and requirements can vary from VO to VO, and can be considered policies of the VO. Data management policies, for example, might specify how data is distributed among participating sites within the VO, or the degree of data replication that is maintained for the purposes of reliability and availability. Other policies, in the interest of security, might govern which storage systems data can be placed on, or who is allowed to access the data and how.
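To make the idea of a policy more concrete, here is a minimal sketch of how a replication-degree policy and a permitted-sites policy might be written as rules and checked against the current placement of a dataset. It is an illustration only, not the implementation described in the paper: the names ReplicaCatalog, PlacementPolicy, and plan_actions, along with the site and dataset names, are hypothetical, and a real deployment would consult Grid catalog and transfer services rather than an in-memory dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReplicaCatalog:
    """Hypothetical in-memory record of which sites hold each dataset."""
    locations: dict[str, set[str]] = field(default_factory=dict)

    def sites_for(self, dataset: str) -> set[str]:
        return self.locations.get(dataset, set())


@dataclass(frozen=True)
class PlacementPolicy:
    """Hypothetical VO policy: a minimum replica count and a set of permitted sites."""
    min_replicas: int
    allowed_sites: frozenset[str]


def plan_actions(catalog: ReplicaCatalog, policy: PlacementPolicy,
                 dataset: str) -> list[tuple[str, str, str]]:
    """Return the (action, dataset, site) steps needed to satisfy the policy."""
    current = catalog.sites_for(dataset)
    actions = []

    # Rule 1: replicas must not live on sites the policy does not permit.
    for site in sorted(current - policy.allowed_sites):
        actions.append(("remove", dataset, site))

    # Rule 2: keep at least min_replicas copies on permitted sites.
    compliant = current & policy.allowed_sites
    missing = max(policy.min_replicas - len(compliant), 0)
    candidates = sorted(policy.allowed_sites - compliant)
    for site in candidates[:missing]:
        actions.append(("replicate", dataset, site))

    return actions


if __name__ == "__main__":
    catalog = ReplicaCatalog({"run42": {"siteA", "siteX"}})
    policy = PlacementPolicy(
        min_replicas=3,
        allowed_sites=frozenset({"siteA", "siteB", "siteC", "siteD"}),
    )
    for step in plan_actions(catalog, policy, "run42"):
        print(step)
```

The sketch only plans actions and does not move any data. Keeping the policy separate from the code that evaluates it is what would allow different VOs to state different requirements without changing the enforcement logic.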
Currently, in many cases, these important policies cannot be enforced except through the manual management of data. Such a process costs scientists in these situations a great deal of time and effort, discouraging effective widespread sharing of datasets and distracting many from working toward their true scientific aims. (For a real-world example of how this can occur, see the fourth paragraph of
End-to-End Data Solutions for Distributed Petascale Science by Schopf et al.) Thus, there is a demonstrable and important need for policy-driven data management solutions for Grid computing, which motivates this work.
My work this summer was made possible by the
CRA-W's
Distributed Mentor Project. My mentor is
Dr. Ann Chervenak. If you are a student interested in applying and have any questions for me about my experience, please don't hesitate to email me.