C.3 Information Access

Principal Authors:

Robert L. Simpson and Jeffrey D. Ullman

Additional Contributors:

Yigal Arens, Jose A. Blakeley, David Crocker, Peter Danzig, Julio Escobar, Michael Genesereth, James Gettys, David Maier, Calton Pu, Marek Rusinkiewicz, Norman K. Sondheimer, Ram Sriram, Lynn Streeter, Subu Subramanian, John Vittal, Darrell Woelk and Gregory Zack

1. Introduction

Information access concerns the aspect of the National Information Infrastructure that enables any American to ask a reasonably complicated question and expect a useful answer to be obtained from information resources available over the network. Important applications of this capability involve medical care or the conduct of business, as well as personal access to the options and opportunities available to us.

Getting to such a state will not be easy. We see four classes of technical problems that must be addressed by a strong research effort:

The creation of effective user interfaces to make a myriad of information resources understandable.
The discovery of ways to search information repositories and obtain answers to specific queries.
The creation of the information resources themselves, including the modeling of resource content to assist access.
The appropriate architecture of systems to integrate information from multiple sources.

Section 2 describes the problems in these four areas, Section 3 contains our research and development recommendations in each of the areas, and Section 4 prioritizes the agenda from the perspective of the applications we foresee.

2. Technical Challenges for Information Access in the NII

2.1 User Interfaces

Interfaces are the mechanisms by which a user searches, browses, produces and consumes information on the NII. Americans today use computers, ATMs, libraries, televisions, fax machines, telephones, electronic bulletin boards and other mechanisms to obtain a desired piece of information (e.g., test results, airline flights, books, pictures, software, bank balances). The user is required to negotiate a variety of dissimilar user interfaces, and to understand the limitations of each service and how the service can be queried. The effectiveness of the response is directly related to the training and ability of the user, which disenfranchises most of the population. The systems are literal and do not tolerate errors or imprecision. Results are returned in a variety of predetermined formats, and incomplete results offer no opportunity for reformulation. The systems generally offer limited interactivity, with no support for a natural dialogue, and do not adapt to the user's behavior.

2.2 Search, Discovery and Update

We take "search" to mean locating relevant information anywhere in the NII, which is based on a highly distributed architecture. The criteria used to define what is relevant will have to be very flexible and will stress the current state of the art. Any search against such a large space will have to be highly optimized. "Discovery" is the automatic detection of interesting patterns in the data to derive new knowledge as well as the capability to guide a user in the process of finding "what's out there." "Update" includes issues related to changing existing information resources or appending new information to resources. Together, the mechanisms for search, discovery and update must lead to a manageable and scalable information utility across the NII.

2.3 Information Resource Modeling and Creation

The value of a resource lies not only in its content, but in the ability of people and programs to discern:

What the resource provides.
How the information therein was derived, and what the quality and reliability of the information is.
Whether the information is relevant to the task at hand.
What the data types of the information are and how they can be converted to a desired form for viewing and editing.

Models for representing information products need to support these features, and tools for their creation should help produce the ancillary information needed to supply them.

For example, a heart association might offer an information product that lets medical researchers and community hospitals access a collection of EKGs. This information product may need to be catalogued or advertised in several places so different user classes (e.g., researchers, practitioners) can find it.

A potential user needs to assess whether this information resource is germane to his or her inquiry (e.g., heart problems in geriatric females with diabetes). If so, the user's system needs to understand the data type of EKG waveforms in this resource. One user may be working in an environment that knows this EKG representation and the operations that can be performed on it, while another user may have to download the type information and code for the operations on that type. Finally, the user may need to know about the origins of the information. For example, it could be relevant to a medical researcher that the EKG collection contains mainly samples taken from patients between 85 and 89 using three different brands of monitoring equipment.

For effective information access we need a research program supporting the creation of resources and the modeling or description of these resources.

2.4 Interoperation Architectures

We believe that many information resources will be the result of combining existing sources in useful ways. Search will usually involve a combination of resources. Thus, it is crucial that information sources and products be able to work together.

We use the term "interoperation" for the integration of information resources, although that term is often used to refer to integration at a lower level of abstraction, on a level that is more syntactic than semantic. For example, great strides have been made recently in making different vendors' versions of UNIX or SQL "interoperate," i.e., work together seamlessly.

A framework for dealing with information access from such disparate sources is very close to what is needed for integrating services as well as information. Thus, the following observations are relevant to interoperation among heterogeneous systems of all kinds. An interoperation architecture must support the following capabilities:

Seamlessness: From a user's or an application program's perspective, heterogeneous, distributed resources should be accessible as though developed under a single system design and, to the extent possible, should appear to be available from a single resource.
Semantic interoperation: The data type of information should not obscure its inherent meaning. Rather, information should be converted from one representation to another automatically, while its meaning is preserved.
Scalability: The architecture must offer performance that does not degrade as the number of information sources or participants grows.
Extensibility: Ability to accommodate new resources and services as needed by applications.
Robustness: Activities must be maintained in the face of system failures, other changes in available resources, or incompleteness of information and services.
Efficiency: The architecture must support rapid exchange of data when needed, offer bounded-time performance if desired, and support access to lowest-cost providers.
Reliability: The architecture must support security, privacy, billing, audit trails, inconsistency detection and resolution, pedigrees regarding data origin and quality, and other services needed for a useful information infrastructure. The architecture should allow proprietary information or intellectual property to be shared in a secure fashion.
Controllability: The architecture should accept and use preference information for routing, should the user so desire.
Transaction management: The architecture should provide support for both long- and short-term transactions, very high transaction volume and intermittent connectivity. Long-term transactions may include subtransactions that require human response as well as response from a variety of automated information resources.
Support for negotiation: The architecture should provide the opportunity for resources to compete as servers and to cooperate in economically appropriate ways.

3. Research and Development Recommendations

3.1 User Interfaces

We believe that the following are fertile research issues for creating user interfaces that are significantly more flexible than today's, while at the same time offering the user capabilities that today are either unavailable or usable only with great difficulty.

Design and Implementation of Advanced Query Languages

New query languages must provide information access that is independent of location, content and information organization/format. The language implementation should:

Act as an electronic agent or intermediary, tolerating inaccurate or imprecise requests and mapping them from the user's domain to the sources.
Support a constructive dialogue to reformulate the request if unsuccessful.
Anticipate user requests by pre-fetching information that meets the pattern of activity of the user.

Adaptive Interfaces

The user interface must adapt to user preferences, limitations and behavior. The adaptation may be through observation and learning on the part of the interface software or it may be through user input. Interfaces must be able to deal with context (e.g., different users ascribe different meanings to a term like "board") and adapt to different media for presentation (e.g., a weather map cannot be displayed on a telephone; visual presentation is inappropriate on any medium if the user is blind).

Information Filtering

Without mechanisms to reduce, digest and summarize information, access to the NII will be a nightmare for the average user. Usable filtering tools are needed that constrain searches, filter out unwanted information, and digest and present the useful information in a meaningful way.

3.2 Search, Discovery and Update

The following are the principal research issues associated with the goals of creating a manageable, flexible and scalable information utility.

Self-Describing Resources

Because the NII will contain a vast array of information sources, it is not possible to rely on the presence of a centralized schema. In fact, since data can move around within the network, we cannot rely on the site at which a data item currently resides to have a full description of that data item's behavior and meaning. Thus, we believe that data must carry its own description.

For example, it should be possible for data to carry an indication of its type and a way to locate and access the metadata associated with that type. This metadata might include:

The operations that can be performed on the data.
Constraints on the data.
The expected sizes of the results obtained by applying specific operations to the data.

Research is needed on formal ways to express this metadata and on the minimal metadata needed to use information effectively for query processing and optimization.
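As a minimal sketch of data that carries its own description, the following Python fragment attaches a type name to each data item and looks up the operations, constraints and expected result sizes for that type. The names TypeMetadata, SelfDescribingItem and REGISTRY, and the "ekg_waveform" type, are hypothetical and chosen only for illustration. The in-memory registry stands in for whatever network-wide mechanism would actually locate the metadata.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class TypeMetadata:
    """Hypothetical metadata record for one data type."""
    name: str
    operations: Dict[str, Callable[[Any], Any]]  # operations that can be performed
    constraints: List[Callable[[Any], bool]]     # predicates the data must satisfy
    expected_result_size: Dict[str, int]         # operation name -> estimated result size


# Stand-in for a type registry; a real one would be distributed (assumption).
REGISTRY: Dict[str, TypeMetadata] = {}


@dataclass
class SelfDescribingItem:
    """A data item that carries its type name so its metadata can be located."""
    type_name: str
    payload: Any

    def metadata(self) -> TypeMetadata:
        return REGISTRY[self.type_name]

    def is_valid(self) -> bool:
        # The item is valid when every constraint of its type holds.
        return all(check(self.payload) for check in self.metadata().constraints)

    def apply(self, operation: str) -> Any:
        # Operations are found through the metadata, not hard-coded by the caller.
        return self.metadata().operations[operation](self.payload)


REGISTRY["ekg_waveform"] = TypeMetadata(
    name="ekg_waveform",
    operations={"peak": max, "length": len},
    constraints=[lambda samples: len(samples) > 0],
    expected_result_size={"peak": 1, "length": 1},
)

item = SelfDescribingItem("ekg_waveform", [0.1, 0.9, 0.3])
print(item.is_valid(), item.apply("peak"))
```

A query optimizer could consult expected_result_size before choosing a plan. How such estimates should be expressed formally is exactly the open question raised above.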

Another kind of metadata that must be supported in the NII is data that describes system resources. For example, if a user is trying to assemble a collection of information sources that support atomic transactions, it is important to be able to determine the transaction protocols that are supported by each of the components.

Efficiency of Access

The use of databases in a large-scale information system requires sophisticated optimization techniques that go beyond what is available in commercial, usually relational, systems today. Without such optimization, searching the collective resources of the network could be prohibitively expensive.

For example, we need optimization techniques that can deal with data types like text strings, images, signals, and arbitrary tree and graph structures. Many of these optimization problems are motivated by multimedia data and are discussed in more detail in the Multimedia section.

We also need optimization techniques that are effective--even for conventional, record-based data--on wide-area networks that grow dynamically and that support such capabilities as:

Replication of data.
Local caching.
Migration of data.

Automatic Indexing and Categorization

There is a need to create tools that make it easier to search the vast amount of information available. For example, we need to create indexes automatically, to help channel queries to potentially useful sources. There is the potential for query systems to adapt to the environment--learning from experience which sources are useful to answer which kinds of queries, and keeping potentially useful information cached nearby. There is the potential for query systems to discover relationships among sources and their data (often called "data mining").
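One simple way a query system might learn from experience which sources are useful is to keep per-category success counts and rank sources by observed success rate. The sketch below assumes that each query has already been assigned a category and that the usefulness of an answer can be recorded as a yes/no outcome. The class SourceRouter and the source names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class SourceRouter:
    """Ranks sources for a query category by their observed success rate."""

    def __init__(self) -> None:
        # category -> source -> [successes, attempts]
        self._stats: Dict[str, Dict[str, List[int]]] = defaultdict(
            lambda: defaultdict(lambda: [0, 0])
        )

    def record(self, category: str, source: str, useful: bool) -> None:
        entry = self._stats[category][source]
        entry[0] += int(useful)
        entry[1] += 1

    def rank(self, category: str) -> List[str]:
        sources = self._stats[category]
        # Highest success rate first; ties broken alphabetically for determinism.
        return sorted(sources, key=lambda s: (-sources[s][0] / sources[s][1], s))


router = SourceRouter()
router.record("flights", "airline_db", True)
router.record("flights", "airline_db", True)
router.record("flights", "news_archive", False)
router.record("flights", "travel_agent", True)
router.record("flights", "travel_agent", False)
print(router.rank("flights"))
```

A practical system would also need to age out old observations and to explore sources it has not yet tried; both are omitted here.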

Advanced Query Capabilities

Search systems must support a number of new forms of query. Among these are the following:

Approximate query forms: We anticipate many situations where there is no answer to a query as stated. In addition to assisting the user in reformulating the query, as discussed in Section 3.1, a search system should be able to relax the query in appropriate ways. For example, a person who requests a flight from Boston to San Francisco leaving at 8 A.M. should be told about flights leaving at 8:30 A.M. and about connecting flights leaving at 8:10 A.M., but not about flights leaving at 8 A.M. to Los Angeles.
Query capability for non-traditional data: Users should be able to describe patterns in an EKG, offer sketches of images that they wish retrieved, find a musical score by humming a few bars and in general perform queries on information that is not in the traditional record-oriented forms. Notions of approximation to queries as stated are especially important for "multimedia" information.
Continuum between querying and browsing: New query facilities should support information access through a spectrum of approaches, ranging from menu-driven browsing through traditional queries, to the implementation of general-purpose search programs.
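The flight example under approximate query forms can be sketched as follows. This is only an illustration, under the assumption that departure time and the nonstop requirement may be relaxed while the origin and destination may not. The Flight record, the airport codes, and the 45-minute slack are hypothetical choices, not part of any existing reservation system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Flight:
    origin: str
    destination: str
    departs_min: int       # departure time in minutes after midnight
    nonstop: bool = True


def find_flights(flights: List[Flight], origin: str, destination: str,
                 departs_min: int, slack_min: int = 45) -> List[Flight]:
    """Return exact matches if any; otherwise relax time and the nonstop requirement."""
    on_route = [f for f in flights
                if f.origin == origin and f.destination == destination]
    exact = [f for f in on_route if f.departs_min == departs_min and f.nonstop]
    if exact:
        return exact
    # Relaxation step: accept nearby departure times and connecting flights,
    # but never a different origin or destination.
    relaxed = [f for f in on_route if abs(f.departs_min - departs_min) <= slack_min]
    return sorted(relaxed, key=lambda f: abs(f.departs_min - departs_min))


flights = [
    Flight("BOS", "SFO", 8 * 60 + 30),                 # nonstop at 8:30 A.M.
    Flight("BOS", "SFO", 8 * 60 + 10, nonstop=False),  # connecting at 8:10 A.M.
    Flight("BOS", "LAX", 8 * 60),                      # 8 A.M., wrong destination
]
results = find_flights(flights, "BOS", "SFO", 8 * 60)
print([(f.departs_min, f.nonstop) for f in results])
```

Deciding which attributes may be relaxed, and by how much, is application knowledge; here it is fixed in the code, whereas a general search system would need it supplied as metadata or learned.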

Data Quality

There are a number of interesting issues that address the fact that data is of varying quality. How do we deal with data that may be out of date? What if two sources conflict on their responses to a query? How does the choice of using old, easy-to-get data versus hard-to-get, more current data impact query processing and optimization?

Time-Critical Delivery

How can we offer real-time guarantees for queries that must search an unknown and changing universe? More specifically, how can we gracefully degrade the quality of the response in order to meet a time limit? For example, can we retrieve images in whatever resolution is appropriate to the expected response time and cost?
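One naive form of graceful degradation is to pick the richest rendition of an image whose estimated transfer time still fits the deadline. The sketch below assumes that the size of each rendition and the available bandwidth are known in advance, which will often not hold on a real network. The function name and the sample figures are hypothetical.

```python
from typing import Dict, Optional


def choose_resolution(sizes_bytes: Dict[str, int],
                      bandwidth_bytes_per_s: float,
                      deadline_s: float) -> Optional[str]:
    """Return the largest rendition whose estimated transfer time meets the deadline."""
    feasible = {
        name: size
        for name, size in sizes_bytes.items()
        if size / bandwidth_bytes_per_s <= deadline_s
    }
    if not feasible:
        return None  # nothing can be delivered in time
    return max(feasible, key=feasible.get)


renditions = {"thumbnail": 20_000, "medium": 400_000, "full": 8_000_000}
print(choose_resolution(renditions, bandwidth_bytes_per_s=100_000, deadline_s=5.0))
```

The same selection could be driven by cost instead of time by replacing the transfer-time estimate with a price estimate.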

Transaction Processing Issues

It is unlikely that information search requires the classical kind of transaction, where information is locked for a brief period while the transaction runs. Rather, search activities will extend over a long period of time and will probably be able to tolerate some inconsistencies and/or out-of-date information. However, when there are updates or other actions that are taken in response to the information discovered as part of a search, there is a need for specialized forms of control. For example:

An operation to plan a trip may search airline, hotel and other information sources. Suboperations may book flights that then need to be rolled back because there is no available hotel.
Active databases support rules that are triggered when certain events occur. For example, when a library receives an image of a new document, it may generate notifications to clients who have requested to see the document. It may also notify other libraries and expect them to note in their own databases that the document is available. We must be able to control cascades of triggered rules, avoid cycles and separate the rule-generated consequences of an action from the action itself.
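The trip-planning case above can be illustrated with compensating actions: each suboperation is paired with an action that undoes it, and a failure causes the completed suboperations to be undone in reverse order. This is one well-known pattern, shown here as a sketch and not as the form of control the research must arrive at. The booking functions are hypothetical placeholders that do no real work.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse order."""
    undo_stack: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            while undo_stack:
                done_name, undo = undo_stack.pop()
                undo()
                log.append(f"undone:{done_name}")
            return False
        log.append(f"done:{name}")
        undo_stack.append((name, compensate))
    return True


def book_flight() -> None:
    pass  # placeholder: assume the booking succeeds


def cancel_flight() -> None:
    pass  # placeholder: releases the booked seat


def book_hotel() -> None:
    raise RuntimeError("no rooms available")


def cancel_hotel() -> None:
    pass  # placeholder: never reached in this example


log: List[str] = []
ok = run_with_compensation(
    [("flight", book_flight, cancel_flight), ("hotel", book_hotel, cancel_hotel)],
    log,
)
print(ok, log)
```

Unlike a classical transaction, nothing is locked while the trip is being planned, so other users may observe the flight booking before it is undone.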

3.3 Information Resource Modeling and Creation

The following research issues are particularly important if we are to have authoring tools that assist in the creation of usable information products.

Modeling Content

We must develop standard descriptions for content and coverage for particular information types and user communities. We need to maintain registries of types and how to represent type descriptions so they can be exported along with base information. Because query processors accessing information of this type must develop efficient query plans, the type description must supply enough information that queries about that type can be optimized.

Manipulating Ancillary Information

When information resources are combined to form a new product, the ancillary information (relevance, coverage, type, origin, reliability, etc.) for the new product must be created, preferably by the same authoring tool that creates the new product itself. We need to invent approaches to combining ancillary information from several sources in a way that will maintain the usability and understandability of the result.

Tools to Understand Ancillary Information

Users need help in understanding the content of an information resource. One approach is to develop display and summary tools for the ancillary information. Ontologies and other resources for formally defining the meaning of terms must be created and integrated into the user interfaces discussed in Subsection 2.1.

3.4 Interoperation Architectures

The final piece of the research picture involves the integration of resources.

3.4.1 Connectivity Issues

There are a number of proposals and experiments that point the way to architectures for achieving the high-level integration that is necessary to support information access and interoperability of resources and services for the NII. The architectures can generally be placed in three groups:

Point-to-point connectivity: The system software provides reliable bit streams between any two processes, as suggested by Figure 3.1. Interoperation is assured by a common communication language. An example of this type of architecture is CNRI's Knowbot.
Mediators: A "mediator" is a facility that provides an information service. It accepts queries from users (or other mediators) and answers them by retrieving, combining and/or otherwise manipulating data in raw databases or information provided by other mediators. Mediators eliminate the need for users of information resources to interact with them directly, or even to know the precise identity of the sources of the information they require. In addition to simple retrieval of information, a mediator could summarize or abstract information in response to a user's request.

Mediators can be organized into hierarchies or layers, although the illustration in Figure 3.2 shows only a single layer. Examples of mediators include the LIM/IDI system (Unisys) and the SIMS system (USC/ISI).

Federated architectures: The mediator approach replaces direct linking of applications to databases. It allows applications to be programmed in terms of their information needs, instead of specific queries to specific sources for the data they require. It can be used to support dynamic identification of relevant data sources, and is thus much more extensible than incorporating database calls directly into applications. To be accessible to applications, new data sources need only be identified and made accessible to relevant mediators.
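To make the mediator idea concrete, the following sketch shows an application asking one facility for what it needs while the facility consults whichever sources are registered with it. It assumes that every source can be queried by a single key and returns flat records. The Mediator class and the sample sources are hypothetical and do not describe the LIM/IDI or SIMS systems.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Source = Callable[[str], List[Record]]  # a source answers a lookup by key


class Mediator:
    """Answers a query by combining records from all registered sources."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def register(self, name: str, source: Source) -> None:
        # New sources become usable without changing any application code.
        self._sources[name] = source

    def query(self, key: str) -> Record:
        merged: Record = {}
        for name in sorted(self._sources):
            for record in self._sources[name](key):
                for field_name, value in record.items():
                    # On conflict, the first source in name order wins (a simplification).
                    merged.setdefault(field_name, value)
        return merged


lab_results = {"p1": [{"test": "EKG", "result": "normal"}]}
billing = {"p1": [{"insurer": "Acme"}]}

mediator = Mediator()
mediator.register("lab", lambda key: lab_results.get(key, []))
mediator.register("billing", lambda key: billing.get(key, []))
print(mediator.query("p1"))
```

Because a source is just a callable, another mediator can be registered as a source by wrapping its query method, which gives the layering mentioned above.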

There is a great range of expressiveness in proposed facilitator communication languages. These range from scripting languages (e.g., Tcl), through languages that support patterns with types and variables (e.g., BMS, ToolTalk, CORBA), to "knowledge" as represented by logical expressions (e.g., ABSI). Another axis along which different types of facilitators can be distinguished is the range of services they provide in support of interoperation. These can range from "yellow-pages" service or interprogram mail through general inference and routing techniques that infer the relationship between requests and offered services. There is support at this level for control of access and billing.

3.4.2 Research Issues in Interoperation Architectures

Translation methodologies: We need to develop tools and techniques for translating among the many notations and standards that exist or will appear in the future. Simple, syntactic transformations (e.g., between TeX and RTF) are commercial matters, not research matters. However, higher-level translations involving the semantics of information are a matter for research. Relevant questions include translation among different data models (e.g., logical or relational versus networks or object-structures) and translation among different protocols (e.g., streams of replies versus bundled replies).

Equally important is the need to understand and translate among different "ontologies," that is, collections of terms and their formal definitions. These ontologies are associated either with a field, e.g., cardiology, or with a particular user.

Content languages: As opposed to languages that describe structure and type, as discussed in Section 2.3 on resource modeling, there is a need to convey content among resources. We therefore need to develop standards for the exchange of information. These new languages must build upon what has been developed for specific kinds of information such as data (e.g., SQL and its successors), object-oriented messages (e.g., IDL), knowledge (e.g., KIF/KQML of the ARPA knowledge-sharing effort) and programs (e.g., scripting languages).
Ontology development: Communication among resources in an application domain requires the development of suitable ontologies for the domain and making these ontologies widely available. These ontologies will offer both human users and the integration system a formal definition of domain terms and their relationships. Their creation requires both the development of languages and tools to capture domain knowledge (that is, to specify meaning) and the careful design of the definitions themselves. We expect to see research into the development of ontologies for medicine, engineering, manufacturing and process control, transportation planning, commerce and financial management, among many other domains.
Support for important system requirements: Research is needed on ways to support the capabilities listed in Section 2.4, such as reliability, controllability and transaction management.
Runtime support software: When using a federated approach, most decisions need to be made at runtime, because we cannot predict in advance what resources will participate in a search or other interaction. Tools to monitor system performance and debug failures are essential to the effective use of integration services.
Off-line support software: Not all integration will be at runtime, or "on line." We recognize also that there is a great need for "off-line" design of integrated information resources. For example, we mentioned above the need for tools to capture domain knowledge in ontologies. Section 2.3 discussed the need for tools to create representations of the content of databases and other information resources. We also see the need for tools to configure integrated systems from components, along the lines of Figs. 3.1 and 3.2.

4. Prioritizing the Information Access R&D Agenda

We believe that the development of the information access tools and technologies needs to be coupled closely to the creation of specific applications. Without a body of experience, it is hard to prioritize the various desiderata. Thus, we began with a thought experiment in which we examined the predicted or expected early applications of the NII in the areas of health care, electronic commerce and electronic libraries. Our goal was to identify research challenges that appear important for all of these applications. Examples of the research issues that were considered crucial to most, if not all, applications are:

Support for approximate queries and automatic query relaxation.
Real-time or time-constrained delivery of information.
Active elements: alerting and triggering.
Support for location of relevant information resources, description of available resources and automatic delivery of information by subscription.
Support for finding information of best quality or most recent information.

On the other hand, some issues seem particularly important to specific applications, such as:

Adaptive user profiles and filtering of information (electronic libraries).
Ultra-high transaction volume (electronic commerce).
Access to non-traditional information resources (health care).
Support for audit trails (commerce, health care).
Support for exchange of confidential information (commerce, health care).