[Published originally in the May 2003 edition of Computing Research News, Vol. 15/No. 3, pp. 5, 9.]
Cyberinfrastructure: Challenges for Computer Science and Engineering Research
By Peter A. Freeman and Lawrence L. Landweber, NSF
The NSF vision for the future of cyberinfrastructure will only be achieved if there is continuous progress in basic CS&E research. The opportunity to provide an array of cutting-edge computational and information resources as common infrastructure for all of science and engineering stretches the envelope in many CS&E disciplines. Indeed, cyberinfrastructure will be an important, if not the most important, driver for CS&E research in the next decade.
In our column in the March issue of CRN, "Cyberinfrastructure: The Critical Role of Computer Science and Engineering Research," we postulated that the future of cyberinfrastructure (CI) is contingent upon CS&E research. In this column we examine some of the exciting challenges for CS&E researchers.
We define CI to be the cutting-edge, distributed computing and communications environment that can be built at any particular time to support a broad range of scientific and engineering research and education. For the purpose of this article, we target the CI of 2013. This will likely involve large numbers of teraflop computing systems and petabyte data stores, augmented by instruments such as colliders and telescopes and vast collections of sensors. There will also be lots of highly distributed small processors, sensors, and data stores with highly variable network requirements. A high-speed network, carrying petabits per second in its core, will connect these systems, services, and tools. The solution to a particular science problem may involve the distributed use of these resources. This CI environment will be available to the country's science and engineering research community. It is this environment that we hope the CS&E community will keep in mind as we formulate our CS&E research direction for the coming years.
This future CI will encompass many of the sub-disciplinary elements of CS&E. It will pose many very difficult research questions in networking, data management, distributed systems, software systems, and others. In the following, we describe some of the challenges that will be faced by researchers in these broad research categories.
CI will require networks that allow scientific collaborators to share resources on an unprecedented scale and allow geographically distributed groups to work together effectively. To address these issues, a scientific foundation to advance our understanding of the increasing complexity of large-scale networks is required. Advances must be made to create and sustain the science and technology needed for the effective engineering, control, and management of a ubiquitous network infrastructure designed to provide high-performance mechanisms for discovering and negotiating access to remote resources.
Next-generation networks are likely to exhibit unpredictable and complex behavior and dynamics, giving rise to a new set of exciting and challenging network problems. Research challenges include:
A new architecture for the core of the network to accommodate orders-of-magnitude increases in traffic.
Security, from network trust models to anonymity and privacy issues, to the social and management issues surrounding information assurance.
In addition, the assumptions and requirements that underlie the CI applications of the future require new attention to problems related to scalability, adaptability, level and quality of service, routing and congestion control, reliability, and interoperability.
Database Management Systems
Database management technology provides facilities to enable the efficient location, transformation, replication, combination, and understanding of large, massively distributed data sets. It lets consumers of data focus on what they want to discover from the data, instead of the details of how and where it is stored, accessed, and processed.
Current DBMSs are not sufficient for the task ahead. One must first define a relational schema, make the data conform to that schema, and then load it into the system. Once data is in a DBMS, it can be accessed only through the DBMS's SQL interface (non-SQL applications are out of luck). In addition, existing DBMSs are not network-aware and have only rudimentary support for distributed data. Lastly, they can be so painful to use that the vast majority of scientists use file systems instead, accomplishing data management tasks "by hand."
The CI vision requires new DBMS technology that is built with the fundamental assumption that data is massively distributed among autonomous heterogeneous sites. It should allow "schema later" processing, in which the system need not know a complete rigid schema before managing the data. It should be able to manage data in a format dictated by the scientists rather than by the DBMS. Most importantly, it should be self-tuning and self-configuring so that a scientist can use it without taking courses to become a certified database administrator.
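To make the "schema later" idea concrete, the following is a minimal sketch in Python, not a description of any existing DBMS. It assumes records arrive as free-form dictionaries in whatever shape the scientist produces; the store accepts them as-is and derives a schema only afterward, from the data it already holds. The function names (infer_schema, query) and the sample instrument records are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def infer_schema(records):
    """Derive a field -> set-of-type-names map from records already stored."""
    schema = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return dict(schema)


def query(records, **conditions):
    """Return records that contain every named field with the given value."""
    return [
        r for r in records
        if all(k in r and r[k] == v for k, v in conditions.items())
    ]


# Hypothetical observations: no schema is declared before loading,
# and the records do not share a common set of fields.
store = []
store.append({"instrument": "telescope-a", "ra": 10.68, "dec": 41.27})
store.append({"instrument": "sensor-17", "temperature_c": 3.5, "site": "ridge"})
store.append({"instrument": "telescope-a", "ra": 83.82, "dec": -5.39, "filter": "r"})

# The schema is discovered after the fact, from the data itself.
schema = infer_schema(store)

# Queries work across heterogeneous records without a prior schema.
telescope_rows = query(store, instrument="telescope-a")
```

The contrast with the schema-first workflow described above is that nothing here forces the sensor record to conform to the telescope record's fields before it can be stored; the cost, which a real system would have to address, is that type conflicts and missing fields surface at query time rather than at load time.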
The ultimate goal of CI is to achieve a transparent and seamless computation and resource-sharing execution environment for user-centric applications. The challenge from the system designer's viewpoint lies in the development of the theoretical foundation, methodologies and models, and system implementations to facilitate collaboration and cooperation of interacting user applications. Much of current research on pervasive/grid/mobile computing is a step in this direction, but falls short in achieving the goal, mainly due to the complexity of managing the massive scale of heterogeneous computation, communication, and storage resources in the future networked environment.
Innovations and fundamental breakthroughs are critical in the following areas:
As with essentially all applications of computing technology, software will ultimately be the element of CI that enables it or causes it to fall short of its envisioned potential. Many of the issues discussed above, to say nothing of hundreds of others, will result in software implementations. The languages, constructs, techniques, and structures that are used will be key, but our current stock of software elements is no doubt insufficient. As a simple example, consider the basic mechanisms we have for describing data in terms that a domain-scientist can easily work with.
Software engineering (SE)--the tools, techniques, and processes for creating complex software systems--is clearly inadequate to the task ahead. Most software is still created by people with little or no knowledge of proven SE approaches, and while the result is often acceptable initially, the lifetime costs of modification and repair are often horrendous and prevent the kind of progress that we should be making. Even if everyone used the very best SE, there is ample evidence that the results would still be much less than appropriate.
In short, we need better software "building blocks" and better software engineering. If you consider other disciplines that ultimately produce engineered or constructed artifacts, you will note that they are based on a body of scientific knowledge and coherent, systemized experience. While we certainly have some aspects of this, by and large software and, more generally, computing-intensive systems, are not built on any such foundation.
Creating a true "science of design," along the lines indicated above, has to be a top priority for CS&E. (Other terms may be better, but some, such as "software science," are either taken or have a certain historical connotation that may not be appropriate here.) As with most of the other examples we have cited, this should be a goal of fundamental CS&E research, independent of the need for it in creating CI. Building CI, however, presents a wonderful opportunity for advancing toward a science of design.
These challenges are well within the expected envelope of CS&E research over the next decade, regardless of the specific overarching strategic initiatives at play. The rate of technological change will continue at the exceptional speed at which it has progressed over past decades. These changes will drive and be driven by the research questions that are of utmost interest to our community. For the most part, these questions will be indistinguishable from those that cyberinfrastructure demands. Certainly research with no direct application to CI will occur, and we must be careful to seek out and support work on important questions that may fall outside the current demands of CI. Unstructured, investigator-driven research will always be the bedrock of future advances.
The application of leading-edge research to create integrated cyber resources began years ago, and the increased focus on CI need not change the research agenda. To achieve the goals of CI, we undertake a journey with neither a specified end-point nor a single path. It will be defined by the research that the CS&E community undertakes and by the needs of the domain scientists and engineers. It will test the capabilities of our advanced CS&E researchers, while providing focused goals and funding for our research. We are confident, colleagues, that you are equal to this test.
Peter A. Freeman is Assistant Director and Lawrence L. Landweber is a Senior Advisor in the Computer and Information Science and Engineering directorate at the National Science Foundation.
Copyright © 2007 Computing Research Association. All Rights Reserved.