C.6 Dependability and Manageability
Principal Authors:
Richard Baseil, Barry R. Lewin and M. Satyanarayanan
Additional Contributors:
Ed Balkovich, Tom Brand, Gary Campbell, Timothy Chou, Harold T. Daugherty,
Coyne Gibson, Bradford Glade, Roger Haskin, John Healy, John H. Howard,
Ming-Yee Lai, Barbara Liskov, John McCormack, J. Eliot B. Moss, Sushil G.
Munshi, Brian Noble, Louis Scerbo, Ian Service, Tom Soller, James Spencer,
Arshad Syed, David Vereeke, John Wilkes and Bernard Ziegler
1. Introduction
While speed, ubiquity and functionality may dominate the headlines, the success
of the National Information Infrastructure will rest ultimately on whether it
meets users' expectations of dependability. This section identifies the fundamental threats to dependability and manageability (D&M) in the NII and details research recommendations for coping with those threats during the NII's architecture, initial deployment and ongoing operation. The scope of
the section is broader than its title might imply. Our intent is to span all
facets of providing high grades of service to users.
The word "infrastructure" connotes dependability to many users. Indeed, society
becomes alerted to the need to invest in infrastructure when its dependability
is compromised (e.g., national highways, water distribution systems). In
general, users will expect many services to be highly dependable and flawlessly delivered. Each of the sample application
areas (health care, manufacturing, education, commerce and government)
presupposes an NII whose dependability is beyond question.
Imagine the impact of NII outages on these and other critical application
areas. Can a patient undergo remote treatment if communication with the doctor
is uncertain or if the doctor is unable to view critical diagnostic data in an
emergency? Will any company entrust a key aspect of its manufacturing process
to the NII if it is unreliable? What kind of a curriculum can educators hope to
establish if the timeliness or quality of delivery of lesson material via the
NII is poor? If the NII is not dependable, how can any government agency use it
to deliver services? With millions of dollars of electronic commerce at stake
each second, can the nation tolerate significant NII downtime?
The NII will consist of products built by many suppliers and managed by many network operators, each acting on its own understanding of standards and requirements. This multiparty interaction invites interoperability problems that can affect users' perceptions of dependable service. These problems range from an inability to plug and play, caused by differing interpretations of requirements, to coding errors that allow failures to propagate nationally.
Meeting user expectations of service quality will depend on how the NII is architected, implemented and maintained over time. To achieve high performance and dependability in the NII, all facets of D&M must be considered from the start and included in the basic building blocks of the system. The NII will be neither dependable nor manageable if D&M issues are an afterthought.
2. A Plan for Dependability and Manageability
2.1 The Challenge of Growth and Evolution
As use of the NII grows, so will expectations of its dependability.
Unfortunately, so will the threats to its stability. Although we have
considerable experience in managing telecommunications and wide-area computer
networks, our current solutions, grounded in today's technology and scale, will not work for long in the NII. Each quantum
increase in scale and service complexity will stress old solutions and render
some of them inadequate.
For the NII to remain viable, continuing research is needed in scaling up current solutions, in improving them substantially and in exploring new ones. It is important to emphasize that refinements to solutions will
be as important as their initial development and incorporation into the NII. A
one-time, up-front research investment will not sustain the NII as a dependable
and manageable entity forever.
A systemwide perspective is essential when addressing these issues. The NII
will be a sophisticated and interdependent combination of hardware, software
and storage media spanning many levels of the system. Problems can arise due to
individual failures at any of these levels or because of unanticipated
interactions across levels.
2.2 Balancing Progress With Stability
Clearly, the NII should be built with maximum flexibility to allow for
deployment of as many new services as possible. But it is also important to
recognize that precautions need to be taken to assure that these services can
successfully coexist. In effect, the strategy for coping with growth and evolution must walk a fine line between two kinds of biases. One extreme is being excessively liberal, imposing no controls at all on increases in scale or the introduction of new experimental services. The other extreme is being excessively conservative, opposing changes unless their safety and efficacy have been proven beyond doubt. A liberal bias will encourage innovation, lower entry barriers, and allow competition and market forces to play their legitimate role in realizing the full potential of the NII. On the other hand, a
conservative bias is more likely to yield an NII that is dependable and well
managed.
One approach to coping with this dilemma is to classify services and to offer
different levels of confidence in the dependability and manageability of those
classes. In its simplest form this would involve two classes: "core" and
"peripheral." Core services are those considered essential to the NII and for
which centralized resources (such as management attention) may have to be committed to meet service guarantees. Examples of core services might include those needed for connection-oriented service, access to and use of selected servers of critical national importance, and services related to the NII's own management. Adding a core service would be non-trivial and would require
"certification." That certification process will need to be defined, but should
include important aspects of interoperability. In contrast, peripheral services
are those regarded as valuable but less critical and for which availability is
offered on a best-effort basis. Adding a peripheral service involves minimal
bureaucratic overhead, and no centrally funded resources are committed to its
sustenance. During emergencies, peripheral services may be dropped or degraded
in favor of core services.
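A small sketch can make the two-class scheme concrete. The Python fragment below is illustrative only: the ServiceClass enumeration, the shed_load policy and the example service names are assumptions invented here, not part of any NII specification. It shows one way peripheral services might be dropped first when an emergency forces load shedding.

    # Illustrative sketch of two-class load shedding; names and policy are
    # assumptions, not part of any NII specification.
    from dataclasses import dataclass
    from enum import Enum

    class ServiceClass(Enum):
        CORE = 1        # certified; centrally managed resources committed
        PERIPHERAL = 2  # best effort; no centrally funded sustenance

    @dataclass
    class Service:
        name: str
        svc_class: ServiceClass
        active: bool = True

    def shed_load(services: list[Service], emergency: bool) -> None:
        """During an emergency, drop peripheral services first so that
        remaining capacity is preserved for core services."""
        for s in services:
            s.active = not (emergency and s.svc_class is ServiceClass.PERIPHERAL)

    services = [Service("connection-setup", ServiceClass.CORE),
                Service("video-postcards", ServiceClass.PERIPHERAL)]
    shed_load(services, emergency=True)
    print([(s.name, s.active) for s in services])
    # [('connection-setup', True), ('video-postcards', False)]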
What is core and what is peripheral will change over time. Since today's luxury
often becomes tomorrow's necessity, some peripheral services will be promoted
to core. Other core services may be created in response to new perceived needs.
There will have to be a process for achieving consensus among users, service providers and NII administrators to determine which services are core and which are peripheral. Because the mechanism for achieving this consensus involves
public policy rather than research, we do not address it further in this brief.
But it will be an important aspect of the overall design and operation of the
NII.
2.3 Architecture and Initial Deployment
D&M must be designed into the NII. The architecture needs to be sufficiently robust to help reduce the effects of failures, and it should aid fault recovery rather than hinder it. Today, products and networks often suffer long delays in recovering from failures because D&M support is layered on top of them instead of being built in as an integral part. The areas that should include D&M considerations from the beginning are:
- Architecture: The NII's architecture must be robust enough to ensure that reliability requirements will be met. It must include management
systems that receive accurate and timely information to monitor and maintain
components of the NII. Management information must not be relegated to such a
low priority that it is unavailable at the very times when it is critically
needed.
- Requirements and standards: Incomplete or ambiguous requirements and standards are major threats to the NII's reliability. It is vitally important that requirements and standards be complete and unambiguous, and that they stress the importance of reliability.
- Distributed management: The NII will clearly be managed and updated by many parties. We must avoid having each party optimize its own operation to the detriment of others or at the expense of the NII as a whole. Automated and manual operational safeguards need to be constructed to ensure that distributed management can be carried out safely and effectively.
- Service characterization: The NII's services need to be built with
dependability in mind. Parameters for characterizing reliability as a part of
the service definition are needed. Not all services may require the same level
of dependability, but we must avoid designing less dependable services
initially and deluding ourselves into thinking that improved D&M can easily
be accomplished later.
- Postmortem capability: Because failures will happen, the ability to understand those failures and make corresponding improvements needs to be part of the NII's capabilities. We must recognize the eventuality of NII failure and anticipate it by designing in mechanisms to trap key failure data and by establishing processes to recreate NII faults in a controlled setting for further study. The airliner "black box" is an example of this concept being adopted and used effectively in another industry (a minimal sketch of such a recorder follows this list).
- Measurements: The manageability of the NII depends heavily on what
characteristics of the services (and networks) need to be measured, how they
are measured and how those data are used. Too much data is almost as useless as too little if the uses for the data are not clearly understood from the outset.
- Service deployment: The NII's success will depend on integrating
new, reliable services into the existing infrastructure with minimal service
disruption. The trade-off between speed and reliability needs to be better
understood.
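To make the postmortem capability above more tangible, the sketch below shows one minimal form a software "black box" could take: a bounded ring buffer of recent events that is dumped after a failure for later analysis. The EventRecorder class and its interface are assumptions for illustration, not a proposed NII mechanism.

    # Minimal sketch of a software "black box": a fixed-size ring buffer
    # of recent events, dumped after a failure for postmortem analysis.
    # The class and its interface are illustrative assumptions.
    import time
    from collections import deque

    class EventRecorder:
        def __init__(self, capacity: int = 1024):
            # Old entries are discarded automatically, keeping cost bounded.
            self._events = deque(maxlen=capacity)

        def record(self, source: str, message: str) -> None:
            self._events.append((time.time(), source, message))

        def dump(self) -> list:
            """Called after a failure: return the retained history so the
            fault can be studied or recreated in a controlled setting."""
            return list(self._events)

    recorder = EventRecorder(capacity=4)
    for i in range(6):
        recorder.record("router-17", f"congestion level {i}")
    print(recorder.dump())   # only the last four events survive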
2.4 Perennial Problems
While sound architecture is a good first step for the NII, it will not remain
problem-free forever. We have identified a number of problems that we
characterize as perennial problems for the NII. Although not exhaustive, they
typify the threats to the D&M of the NII. These problems are never going to
go away; they will always be in the wings, waiting to strike. Constant
expenditure of resources and attention will be needed to hold them at bay. A
program of continuous research and development will be needed to address these
problems as the NII grows and evolves.
- Failures of networks, servers or environment: Every hardware
element of the NII is subject to mechanical or electrical failure. Further,
there is already considerable evidence that software failures overshadow
hardware failures; this trend is likely to continue as software in the NII
grows more complex. Other failures, such as power outages, earthquakes, floods, operator errors and fiber cuts from accidental dig-ups, will also threaten elements of the NII. Unless masked by sufficient redundancy, these failures will result in unacceptable service outages (a back-of-the-envelope redundancy calculation follows this list).
Failure containment is critical because a failed subsystem that is not
rapidly isolated may easily bring down other parts of the network. An
especially challenging task is establishing that the recovery mechanisms of the
NII are indeed capable of handling anticipated failures. This requires proper
simulation of the full range of abnormal operational conditions to ensure that
recovery actions are triggered and stressed.
- Overloading of the NII: For an application that is time critical,
delivering information late may be as bad as not delivering it at all. In
electronic commerce, for example, a market opportunity may be lost in
milliseconds. For remote medical consultation, on the other hand, images of an exploratory procedure need to be delivered jitter-free to the specialist's screen.
Overloading of NII components will be a major threat to timely delivery of
information. Such overloading can be steady-state or transient, and may arise
from extraordinary traffic characteristics, unplanned growth of the system, new
applications or unexpected service interactions.
- Security violations including denial of service: As the NII assumes
an increasingly vital role in society, it will also become an attractive target
for criminal activity. Unauthorized disclosure and modification of information
are obvious threats, but denial of service caused by unauthorized system load
and circumvention of accounting mechanisms will be an equally serious threat.
Denial of service is especially difficult to deal with because it is often
indistinguishable from accidental overloading of the system.
- Incompatibility and interoperability: Even at the initial scale of
the NII, it will be virtually impossible to pause the system for upgrades.
Increased scale will only worsen the problem. As the NII grows, it is likely
that upgrades will occur when the network is under stress, leading to even more
reliability problems.
This implies that all changes in the system will have to be introduced
gradually rather than atomically. While difficult enough with routine upgrades,
this becomes a particularly challenging problem when the motivation for the
upgrade is an emergency fix for security or reliability reasons.
- Clashes and inconsistencies between global and local policies:
Because the NII will be used by a diversity of individuals and
organizations, it will be necessary to decentralize policies as far as
possible. For example, one organization may centrally coordinate the attachment
of new machines to the NII. Others may take a more laissez-faire approach and leave this task to individuals or groups. As another example, one organization may bear the entire cost of traffic originating within it, while another may require individuals or internal cost centers to bear that cost. Unfortunately, a proliferation of policies renders system management more complex, yet at least some systemwide policies will have to be adopted if the NII is to be manageable. Balancing this tension between centralization and
localization will be a constant challenge as the system evolves.
- Scale-related increase in management complexity: Operator error is
already a major source of failures in high-availability systems. As the NII
increases in scale, it will become increasingly difficult to administer and
manage. A key problem will be information overload. It is easy to present NII
operators with so much information that they are overwhelmed. But excessive
data reduction is also harmful: Problems may be masked and fault isolation will
be difficult. Visualizing this huge and complex system in a manner that allows
timely and effective decisions to be made by operational staff will be a major
challenge.
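Returning to the first of these perennial problems, the back-of-the-envelope calculation below applies two standard reliability formulas, availability = MTTF / (MTTF + MTTR) and 1 - (1 - a)^n for n independent replicas, to show why redundancy is the principal mask for such failures and why its benefit evaporates if failures are correlated or if failover itself fails. The numeric values are illustrative assumptions, not measurements of any system.

    # Back-of-the-envelope redundancy arithmetic; the numbers are
    # illustrative assumptions, not measurements of any real system.
    def availability(mttf_hours: float, mttr_hours: float) -> float:
        """Steady-state availability of one component: MTTF / (MTTF + MTTR)."""
        return mttf_hours / (mttf_hours + mttr_hours)

    def replicated(a: float, n: int) -> float:
        """Availability of n independent replicas, any one of which can
        serve requests: 1 - (probability that all n are down)."""
        return 1.0 - (1.0 - a) ** n

    a = availability(mttf_hours=1000.0, mttr_hours=10.0)   # ~0.9901 per server
    for n in (1, 2, 3):
        print(n, round(replicated(a, n), 6))
    # One server gives roughly two nines; three replicas give roughly six --
    # but only if failures are independent and the failover path actually
    # works, which is why recovery mechanisms must be exercised under
    # simulated abnormal conditions.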
2.5 Research Goals
The D&M research agenda for the NII should stimulate and nurture any
activity that will improve our ability to cope with the perennial problems
listed in the previous section. Some of the detailed recommendations in this
brief, such as research on replication, caching and load balancing, follow
directly from this broad goal. The value of such research activities is already
recognized today; the creation of the NII will undoubtedly increase their
importance.
But our discussions also identified a number of critical research areas for the
NII where there is a dearth of current activity. These areas are best described
and understood in terms of the goals they support. We list these goals below:
- Make it easier to characterize dependable infrastructure and services.
This includes quantifying currently intangible factors such as quality of
service, ease of management and usability.
- Support pre-implementation design, analysis and verification of
dependability in hardware and software building blocks of the NII and in
compositions of them. Especially important is the ability to model applications
that adapt to changing NII conditions.
- Guide the deployment and evolution of the NII by modeling of
infrastructure and services, and developing techniques for risk and
cost-benefit analyses.
- Substantially automate management of security, resource optimization and
configuration control, and better understand the human role in complex NII
systems and services. Such automation should include continuous monitoring of
system health as well as anticipatory actions to forestall problems.
- Enable validation of metrics, models and architectures through prototype
construction.
- Allow non-disruptive introduction and reconfiguration of services and
provisioning of service databases.
We wish to emphasize the importance of continuous improvement as well as
radical innovation. Specifically, we recommend a balanced portfolio of research
activities that 1) scale up and bullet-proof deployed mechanisms and
subsystems, 2) extend and refine existing technologies and 3) develop and
validate new enabling technologies. D&M are characteristics that will often
require in situ study of implementations as well as of system usage and
behavior. Hence, the research plan for the NII should recognize that the
traditional distinctions between "research," "development" and "deployment"
will be fuzzy in the context of D&M.
3. Research and Development Recommendations
The research required to meet the D&M goals of the NII can be grouped along
three distinct dimensions. All three dimensions are important, and research on
them will be required throughout the life of the NII.
- Characterization and validation of service quality.
- Continuous system operation.
- Orderly growth and evolution.
The next three sections list specific topics pertinent to each of these three
research dimensions. For brevity, we list each topic only once even though it
may be relevant to more than one research dimension. These topics are not
intended to be exhaustive. Rather, they are meant to be examples of the kind of
research that must be done to preserve and enhance the D&M of the NII.
3.1 Characterization and Validation of Service Quality
Unless we can crisply specify and quantify the resource requirements and
performance of a service, we will have to rely solely on anecdotal evidence to
decide if that service is being delivered satisfactorily. Without such
characterization, it will be impossible to assess the impact of a new service
on the NII. Developing the specifications is not enough; efficient runtime techniques that confirm the specifications are being met must also be developed (a minimal sketch of such runtime confirmation follows the lists below).
1) Developing and Validating Metrics to Describe Service Quality:
- Performance specification.
- Reliability and availability specification.
- Quantifying manageability.
- Characterization of other service parameters (such as jitter, bandwidth, availability and reliability).
- Evaluating effectiveness/appropriateness of metrics.
2) Measuring Service Quality:
- Development of efficient and accurate measurement techniques for service
metrics.
- Design and standardization of benchmark suites for service metrics.
- Development of techniques to efficiently monitor service quality in the
running system.
3) Incorporating Service Quality into Interface Specifications:
- Development of specification techniques for service quality.
- Design methodology for incorporating and validating appropriate service
quality metrics into interfaces.
- Techniques for empirical substantiation of specifications for individual
services.
- Research on balancing transparency with user awareness in service specifications, especially for fault tolerance.
- Development of cost/benefit models for different levels of service
quality.
- Interoperability and standardization activities across suppliers and
operators.
- Compositional techniques to assess service quality from component
qualities.
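As a deliberately simplified illustration of the runtime confirmation mentioned in the introduction to this section, the sketch below compares measured latency samples against a declared bound and an allowed violation rate. The QualitySpec structure and its thresholds are assumptions for illustration, not proposed NII metrics.

    # Illustrative runtime check of a service-quality specification; the
    # QualitySpec fields and thresholds are assumptions, not proposed metrics.
    from dataclasses import dataclass

    @dataclass
    class QualitySpec:
        max_latency_ms: float          # bound declared in the interface spec
        allowed_violation_rate: float  # e.g., 0.01 = 1% of samples may exceed it

    def conforms(samples_ms: list[float], spec: QualitySpec) -> bool:
        """Return True if the observed samples satisfy the specification."""
        if not samples_ms:
            return True
        violations = sum(1 for s in samples_ms if s > spec.max_latency_ms)
        return violations / len(samples_ms) <= spec.allowed_violation_rate

    spec = QualitySpec(max_latency_ms=100.0, allowed_violation_rate=0.01)
    print(conforms([42.0, 87.5, 103.2, 55.1], spec))   # one of four exceeds -> False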
3.2 Continuous System Operation
Techniques to improve the reliability and availability of hardware and software
components of the system are clearly needed. To complement this effort,
research is also needed on techniques to offer viable fallback options for
services (a sketch of one such fallback, exploiting caching, follows the lists below). The Titanic mentality ("It can never happen.") and the mentality that "there is no escape anyway" must both be avoided. An overall approach that combines failure avoidance with contingency handling is likely to be more robust. Research on techniques to simplify routine system management, as well as to help in troubleshooting and crash recovery, is also important.
1) Replication Strategies for Masking Failures:
- Service replication techniques for availability and performance.
- Hardware and software redundancy techniques for environmental failures.
- Transactional techniques.
- Fault-tolerant replication protocols.
2) Fallback Mechanisms and Graceful Degradation:
- Exploitation of caching.
- Adaptive techniques for coping with changing conditions.
- Exploration of trade-offs between effort expended to sustain a given
quality of service and cost of resorting to fallbacks.
- Validation techniques to ensure degraded service expectations are indeed
being met.
- Techniques for failure containment.
3) Software "Black-Box" Technology:
- Efficient techniques to record detailed event histories in compact form.
- Postmortem analysis techniques to determine causes of failure.
- Effective feedback into design and implementation phases.
4) Configuration Management, Resource Optimization and Security
Administration:
- Techniques for non-disruptive service introduction and reconfiguration.
- Reliable, fast and non-disruptive database provisioning techniques.
- Load balancing techniques.
5) Resource Control and Accounting:
- Strategies for efficient billing and quota enforcement.
- Dynamic inquiry and negotiation of service costs by applications.
- Anonymous electronic payment strategies.
- Price-based congestion control strategies.
6) Reduction and Visualization of System Management Data:
- Graphical presentation techniques to avoid operator overload.
- Ability to visualize effects of proposed changes.
- Modeling of traffic patterns to distinguish normal and abnormal
situations.
7) Management Tools and Techniques:
- Failure detection and isolation techniques in hardware and software.
- Network and service monitoring tools and systems, including those
supporting real-time tracing and diagnosis.
- Better understanding of the human role in administering complex NII systems and services.
- Self-management techniques to reduce the need for highly trained system administration personnel.
- Early-warning techniques to predict service disruptions.
- Intelligent techniques and expert systems to assist and partly automate
management.
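One of the simplest fallbacks listed above exploits caching: when the authoritative server is unreachable, a possibly stale cached copy is served rather than no service at all. In the sketch below, fetch_from_server and ServerUnavailable are purely illustrative placeholders for a real transport interface.

    # Illustrative cache-based fallback (graceful degradation); the fetch
    # function and exception are placeholders, not real NII interfaces.
    class ServerUnavailable(Exception):
        """Raised by the (hypothetical) transport layer when a request fails."""

    _cache: dict[str, str] = {}

    def fetch_from_server(key: str) -> str:
        raise ServerUnavailable(key)     # stand-in for a real remote fetch

    def lookup(key: str) -> tuple[str, bool]:
        """Return (value, fresh). Fall back to the cache when the server is
        unreachable, so the user sees degraded rather than absent service."""
        try:
            value = fetch_from_server(key)
            _cache[key] = value
            return value, True
        except ServerUnavailable:
            if key in _cache:
                return _cache[key], False   # stale but usable
            raise                           # no fallback available

    _cache["lesson-plan/42"] = "cached copy"
    print(lookup("lesson-plan/42"))         # ('cached copy', False)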
3.3 Orderly Growth and Evolution
Avoiding problems before they arise will be an essential component of the NII's
overall strategy for dependability and manageability. Toward this end, research
in tools and techniques to simplify development and stress testing of robust
services will be valuable. Research to develop mechanisms for certifying services will also be important. Empirical research on the NII to identify imminent bottlenecks and predict future traffic patterns will also be required (a small illustration of the analytical modeling listed below follows the lists).
1) Design and Development Methodologies:
- Incremental construction techniques to reduce cost and enhance
reliability.
- Abstraction techniques to reduce apparent complexity of highly available
services.
- Techniques for reducing and surviving Byzantine failures.
- Development methodologies supporting change and extensibility of the
NII.
- Simulation and emulation methodologies to understand vulnerability of NII
to specific types of failures.
- Validation of metrics, models and architectures through prototypes.
- Research supporting system reliability analysis, including risk and cost-benefit analyses.
2) Development and Validation Tools for Robust Services:
- Interoperability and regression testing tools.
- Tools and analytical techniques to identify and correct undesirable
service interactions.
- Reliability analysis tools.
- Service quality assurance tools.
- Frameworks and tools to simplify future implementations of new and
existing protocols.
- In situ and standalone stress testing techniques for hardware and
software.
3) Modeling Based on Analytical Techniques or Simulation:
- Techniques to evaluate long-term cost-effectiveness of alternative
resource allocation strategies and to compare design choices before
investment.
- Infrastructure and service modeling to guide the design, deployment and
evolution of the NII.
- Modeling and analysis of adaptive applications.
- Techniques for assessing software reliability.
- Tools and techniques for reliability, performance and quality-of-service
modeling.
- Analytical approaches to network and server traffic analysis, demand
modeling, and capacity measurement and management.
- Measurement, modeling and prediction of hardware and software failures.
4) Long-Term Empirical Studies:
- Workload and traffic characterization.
- Postmortem analysis and understanding of system failures.
- Historical analysis of data and prediction of future evolution.
- Empirical approaches to demand modeling, and capacity measurement and
management.
- Validation of analytic and simulation models.
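As a small illustration of the analytical modeling called for in item 3 above, the sketch below applies the textbook M/M/1 delay formula, W = 1/(mu - lambda), to show how mean delay grows without bound as utilization approaches capacity; the service rate and utilization levels are assumed values used only for illustration.

    # Illustrative M/M/1 delay model of overload; the rates are assumed
    # values, not NII measurements.
    def mm1_mean_delay(arrival_rate: float, service_rate: float) -> float:
        """Mean time in an M/M/1 system: W = 1 / (mu - lambda), for lambda < mu."""
        if arrival_rate >= service_rate:
            return float("inf")         # overloaded: delay grows without bound
        return 1.0 / (service_rate - arrival_rate)

    service_rate = 100.0                # requests per second (assumed)
    for utilization in (0.5, 0.9, 0.99):
        delay = mm1_mean_delay(utilization * service_rate, service_rate)
        print(f"utilization {utilization:.2f}: mean delay {delay * 1000:.0f} ms")
    # 0.50 -> 20 ms, 0.90 -> 100 ms, 0.99 -> 1000 ms: the knee near saturation
    # is why imminent bottlenecks must be identified before users feel them.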