C.6 Dependability and Manageability

Principal Authors:

Richard Baseil, Barry R. Lewin and M. Satyanarayanan

Additional Contributors:

Ed Balkovich, Tom Brand, Gary Campbell, Timothy Chou, Harold T. Daugherty, Coyne Gibson, Bradford Glade, Roger Haskin, John Healy, John H. Howard, Ming-Yee Lai, Barbara Liskov, John McCormack, J. Eliot B. Moss, Sushil G. Munshi, Brian Noble, Louis Scerbo, Ian Service, Tom Soller, James Spencer, Arshad Syed, David Vereeke, John Wilkes and Bernard Ziegler


1. Introduction

While speed, ubiquity and functionality may dominate the headlines, the success of the National Information Infrastructure will rest ultimately on whether it meets users' expectations of dependability. This section exposes the fundamental threats to dependability and manageability (D&M) in the NII, and details research recommendations for coping with these threats in the NII's architecture, initial deployment and ongoing successful operation. The scope of the section is broader than its title might imply. Our intent is to span all facets of providing high grades of service to users.

The word "infrastructure" connotes dependability to many users. Indeed, society becomes alerted to the need to invest in infrastructure when its dependability is compromised (e.g., national highways, water distribution systems). In general, users will expect high levels of dependability for many services and will expect flawless delivery of those services. Each of the sample application areas (health care, manufacturing, education, commerce and government) presupposes an NII whose dependability is beyond question.

Imagine the impact of NII outages on these and other critical application areas. Can a patient undergo remote treatment if communication with the doctor is uncertain or if the doctor is unable to view critical diagnostic data in an emergency? Will any company entrust a key aspect of its manufacturing process to the NII if it is unreliable? What kind of a curriculum can educators hope to establish if the timeliness or quality of delivery of lesson material via the NII is poor? If the NII is not dependable, how can any government agency use it to deliver services? With millions of dollars of electronic commerce at stake each second, can the nation tolerate significant NII downtime?

The NII will consist of products built by many suppliers and managed by many network operators, each working from its own understanding of standards and requirements. This multiparty interaction invites interoperability problems that can affect users' perceptions of dependable service. These problems can range from the inability to plug-and-play because of multiple interpretations of requirements, to coding errors that cause failures to propagate nationally.

Meeting user expectations of service quality will depend on how the NII is architected, constructed, implemented and maintained on an ongoing basis. To achieve high performance and dependability in the NII, it is necessary to consider all facets of D&M from the start and include them in the basic building blocks of the system. The NII will be neither dependable nor manageable if D&M issues are an afterthought.

2. A Plan for Dependability and Manageability

2.1 The Challenge of Growth and Evolution

As use of the NII grows, so will expectations of its dependability. Unfortunately, so will the threats to its stability. Although we have considerable experience in managing telecommunications and wide-area computer networks, our current solutions in their present forms, based on current technology and scale, are not going to work for long in the NII. Each quantum increase in scale and service complexity will stress old solutions and render some of them inadequate.

For the NII to remain viable, ongoing research is needed to scale up current solutions, to improve them substantially and to explore new ones. It is important to emphasize that refinements to solutions will be as important as their initial development and incorporation into the NII. A one-time, up-front research investment will not sustain the NII as a dependable and manageable entity forever.

A systemwide perspective is essential when addressing these issues. The NII will be a sophisticated and interdependent combination of hardware, software and storage media spanning many levels of the system. Problems can arise due to individual failures at any of these levels or because of unanticipated interactions across levels.

2.2 Balancing Progress With Stability

Clearly, the NII should be built with maximum flexibility to allow for deployment of as many new services as possible. But it is also important to recognize that precautions need to be taken to assure that these services can successfully coexist. In effect, the strategy for coping with growth and evolution must walk a thin line between two kinds of biases. One extreme is being excessively liberal, imposing no controls at all on increases in scale or introduction of new experimental services. The other extreme is being excessively conservative, opposing changes unless their safety and efficacy have been proven beyond doubt. A liberal bias will encourage innovation, lower entry barriers, and allow competition and market forces to play their legitimate role in realizing the full potential of the NII. On the other hand, a conservative bias is more likely to yield an NII that is dependable and well managed.

One approach to coping with this dilemma is to classify services and to offer different levels of confidence in the dependability and manageability of those classes. In its simplest form this would involve two classes: "core" and "peripheral." Core services are those considered essential to the NII and for which centralized resources (such as management attention) may have to be committed for meeting service guarantees. Examples of core services might include those needed for a connection-oriented service, access to and use of selected servers of critical national importance, and services related to the NII's management. Adding a core service would be non-trivial and would require "certification." That certification process will need to be defined, but should include important aspects of interoperability. In contrast, peripheral services are those regarded as valuable but less critical and for which availability is offered on a best-effort basis. Adding a peripheral service involves minimal bureaucratic overhead, and no centrally funded resources are committed to its sustenance. During emergencies, peripheral services may be dropped or degraded in favor of core services.
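The two-class scheme can be illustrated with a minimal sketch. It assumes a single shared capacity pool measured in abstract units; the names Service, ServiceClass and admit, and the "demand" field, are hypothetical and introduced only for illustration. The sketch shows one possible reading of "peripheral services may be dropped or degraded in favor of core services": core demand is satisfied first, and peripheral services receive whatever remains.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceClass(Enum):
    CORE = "core"              # certified; resources committed to meet guarantees
    PERIPHERAL = "peripheral"  # best-effort; no centrally funded resources


@dataclass(frozen=True)
class Service:
    name: str
    service_class: ServiceClass
    demand: int  # hypothetical capacity units requested (an assumption)


def admit(services, capacity):
    """Return {service name: units granted} for a given total capacity.

    Core services are served first. Peripheral services share whatever
    capacity is left, so they are degraded or dropped (granted 0) when
    capacity shrinks, for example during an emergency.
    """
    granted = {}
    remaining = capacity
    # False sorts before True, so core services come first; sorted() is
    # stable, which keeps the caller's order within each class.
    ordered = sorted(
        services, key=lambda s: s.service_class is not ServiceClass.CORE
    )
    for s in ordered:
        units = min(s.demand, remaining)
        granted[s.name] = units
        remaining -= units
    return granted


services = [
    Service("video-on-demand", ServiceClass.PERIPHERAL, 50),
    Service("emergency-dispatch", ServiceClass.CORE, 40),
    Service("network-management", ServiceClass.CORE, 30),
]
normal = admit(services, capacity=100)    # peripheral service is degraded
emergency = admit(services, capacity=60)  # peripheral service is dropped
```

A real policy would need far more than strict priority (fairness among core services, certification status, time-varying demand); the point here is only that the class of a service, not its arrival order, decides who is degraded first.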

What is core and what is peripheral will change over time. Since today's luxury often becomes tomorrow's necessity, some peripheral services will be promoted to core. Other core services may be created in response to new perceived needs. There will have to be a process of achieving consensus between users, service providers and NII administrators to determine what services are core and what are peripheral. Because the mechanism for achieving this consensus involves public policy rather than research, we do not address it further in this brief. But it will be an important aspect of the overall design and operation of the NII.

2.3 Architecture and Initial Deployment

D&M must be designed into the NII. The architecture needs to be sufficiently robust to help reduce the effects of failures, and it should act as an aid toward fault recovery, not a hindrance. Today, products and networks often have extensive delays in their recoveries from failures because D&M support is built on top of these systems instead of being built in as integral parts of the systems. The areas that should include D&M considerations from the beginning are:

2.4 Perennial Problems

While sound architecture is a good first step, the NII will not remain problem-free forever. We have identified a number of problems that we characterize as perennial problems for the NII. Although not exhaustive, they typify the threats to the D&M of the NII. These problems are never going to go away; they will always be in the wings, waiting to strike. Constant expenditure of resources and attention will be needed to hold them at bay. A program of continuous research and development will be needed to address these problems as the NII grows and evolves.
  • Failure containment is critical because a failed subsystem that is not rapidly isolated may easily bring down other parts of the network. An especially challenging task is establishing that the recovery mechanisms of the NII are indeed capable of handling anticipated failures. This requires proper simulation of the full range of abnormal operational conditions to ensure that recovery actions are triggered and stressed.
  • This implies that all changes in the system will have to be introduced gradually rather than atomically. While difficult enough with routine upgrades, this becomes a particularly challenging problem when the motivation for the upgrade is an emergency fix for security or reliability reasons.
2.5 Research Goals

    The D&M research agenda for the NII should stimulate and nurture any activity that will improve our ability to cope with the perennial problems listed in the previous section. Some of the detailed recommendations in this brief, such as research on replication, caching and load balancing, follow directly from this broad goal. The value of such research activities is already recognized today; the creation of the NII will undoubtedly increase their importance.

    But our discussions also identified a number of critical research areas for the NII where there is a dearth of current activity. These areas are best described and understood in terms of the goals they support. We list these goals below:

    We wish to emphasize the importance of continuous improvement as well as radical innovation. Specifically, we recommend a balanced portfolio of research activities that 1) scale up and bullet-proof deployed mechanisms and subsystems, 2) extend and refine existing technologies and 3) develop and validate new enabling technologies. D&M are characteristics that will often require in situ study of implementations as well as of system usage and behavior. Hence, the research plan for the NII should recognize that the traditional distinctions between "research," "development" and "deployment" will be fuzzy in the context of D&M.

3. Research and Development Recommendations

    The research required to meet the D&M goals of the NII can be grouped along three distinct dimensions. All three dimensions are important, and research on them will be required throughout the life of the NII. The next three sections list specific topics pertinent to each of these three research dimensions. For brevity, we list each topic only once even though it may be relevant to more than one research dimension. These topics are not intended to be exhaustive. Rather, they are meant to be examples of the kind of research that must be done to preserve and enhance the D&M of the NII.

3.1 Characterization and Validation of Service Quality

    Unless we can crisply specify and quantify the resource requirements and performance of a service, we will have to rely solely on anecdotal evidence to decide if that service is being delivered satisfactorily. Without such characterization, it will be impossible to assess the impact of a new service on the NII. Developing the specifications is not enough; efficient runtime techniques that can confirm that the specifications are being met must also be developed.
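As a rough illustration of the difference between specifying service quality and confirming it at runtime, the sketch below pairs a small specification with a check against measurements. Everything in it is an assumption made for illustration: the two metrics chosen (95th-percentile latency and availability), the names ServiceSpec, percentile and meets_spec, and the nearest-rank percentile rule. It is not a proposal for what the NII's metrics should be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical service quality specification (illustrative fields)."""
    max_p95_latency_ms: float  # 95% of requests must finish within this
    min_availability: float    # minimum fraction of successful requests


def percentile(values, percent):
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100 avoids floating-point rounding.
    rank = -(-percent * len(ordered) // 100)
    return ordered[rank - 1]


def meets_spec(spec, latencies_ms, successes, attempts):
    """Return (ok, details) comparing measurements against the spec."""
    p95 = percentile(latencies_ms, 95)
    availability = successes / attempts
    ok = (p95 <= spec.max_p95_latency_ms
          and availability >= spec.min_availability)
    return ok, {"p95_latency_ms": p95, "availability": availability}


spec = ServiceSpec(max_p95_latency_ms=100.0, min_availability=0.99)

# One slow request out of twenty does not move the 95th percentile.
good, good_details = meets_spec(spec, [10.0] * 19 + [500.0], 995, 1000)

# Two slow requests out of twenty do.
bad, bad_details = meets_spec(spec, [10.0] * 18 + [500.0] * 2, 995, 1000)
```

The check is deliberately cheap (one sort per measurement window), in the spirit of the efficient runtime techniques called for above; choosing and validating the metrics themselves is the harder research problem.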

    1) Developing and Validating Metrics to Describe Service Quality:

    2) Measuring Service Quality:

    3) Incorporating Service Quality into Interface Specifications:

3.2 Continuous System Operation

    Techniques to improve the reliability and availability of hardware and software components of the system are clearly needed. To complement this effort, research is also needed on techniques to offer viable fallback options for services. The Titanic mentality ("It can never happen.") and the mentality that "there is no escape anyway" must be avoided. An overall approach that combines failure avoidance with contingency handling is likely to be more robust. Research on techniques to simplify routine system management and to help in troubleshooting and crash recovery is also important.

    1) Replication Strategies for Masking Failures:

    2) Fallback Mechanisms and Graceful Degradation:

    3) Software "Black-Box" Technology:

    4) Configuration Management, Resource Optimization and Security Administration:

    5) Resource Control and Accounting:

    6) Reduction and Visualization of System Management Data:

    7) Management Tools and Techniques:

3.3 Orderly Growth and Evolution

    Avoiding problems before they arise will be an essential component of the NII's overall strategy for dependability and manageability. Toward this end, research in tools and techniques to simplify development and stress testing of robust services will be valuable, as will research to develop mechanisms for certifying services. Empirical research on the NII to identify imminent bottlenecks and predict future traffic patterns will also be required.

    1) Design and Development Methodologies:

    2) Development and Validation Tools for Robust Services:

    3) Modeling Based on Analytical Techniques or Simulation:

    4) Long-Term Empirical Studies: