C.3 Information Access

Principal Authors:

Robert L. Simpson and Jeffrey D. Ullman

Additional Contributors:

Yigal Arens, Jose A. Blakeley, David Crocker, Peter Danzig, Julio Escobar, Michael Genesereth, James Gettys, David Maier, Calton Pu, Marek Rusinkiewicz, Norman K. Sondheimer, Ram Sriram, Lynn Streeter, Subu Subramanian, John Vittal, Darrell Woelk and Gregory Zack


1. Introduction

Information access concerns the aspect of the National Information Infrastructure that enables any American to ask a reasonably complicated question and expect a useful answer to be obtained from information resources available over the network. Important applications of this capability involve medical care or the conduct of business, as well as personal access to the options and opportunities available to us.

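Getting to such a state will not be easy. We see four classes of technical problems that must be addressed by a strong research effort:

  • User interfaces
  • Search, discovery and update
  • Information resource modeling and creation
  • Interoperation architectures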

Section 2 describes the problems in these four areas, Section 3 contains our research and development recommendations in each of the areas, and Section 4 prioritizes the agenda from the perspective of the applications we foresee.

2. Technical Challenges for Information Access in the NII

2.1 User Interfaces

Interfaces are the mechanisms by which a user searches, browses, produces and consumes information on the NII. Americans today use computers, ATMs, libraries, televisions, fax machines, telephones, electronic bulletin boards and other mechanisms to obtain a desired piece of information (e.g., test results, airline flights, books, pictures, software, bank balances). The user is required to negotiate a variety of dissimilar user interfaces and to understand the limitations of each service and how the service can be queried. The effectiveness of the response depends directly on the training and ability of the user, which disenfranchises most of the population. The systems are literal and do not tolerate errors or imprecision. Results are returned in a variety of predetermined formats, and incomplete results do not offer opportunities for reformulation. The systems generally offer limited interactivity, with no support for a natural dialogue, and do not adapt to the user's behavior.

2.2 Search, Discovery and Update

We take "search" to mean locating relevant information anywhere in the NII, which is based on a highly distributed architecture. The criteria used to define what is relevant will have to be very flexible and will stress the current state of the art. Any search against such a large space will have to be highly optimized. "Discovery" is the automatic detection of interesting patterns in the data to derive new knowledge as well as the capability to guide a user in the process of finding "what's out there." "Update" includes issues related to changing existing information resources or appending new information to resources. Together, the mechanisms for search, discovery and update must lead to a manageable and scalable information utility across the NII.

2.3 Information Resource Modeling and Creation

The value of a resource lies not only in its content, but in the ability of people and programs to discern such features as its relevance, coverage, type, origin and reliability. Models for representing information products need to support these features, and tools for their creation should help produce the ancillary information needed to supply them.

For example, a heart association might offer an information product that lets medical researchers and community hospitals access a collection of EKGs. This information product may need to be catalogued or advertised in several places so different user classes (e.g., researchers, practitioners) can find it.

A potential user needs to assess whether this information resource is germane to his or her inquiry (e.g., heart problems in geriatric females with diabetes). If so, the user's system needs to understand the data type of EKG waveforms in this resource. One user may be working in an environment that knows this EKG representation and the operations that can be performed on it, while another user may have to download the type information and code for the operations on that type. Finally, the user may need to know about the origins of the information. For example, it could be relevant to a medical researcher that the EKG collection contains mainly samples taken from patients between 85 and 89 using three different brands of monitoring equipment.

For effective information access we need a research program supporting the creation of resources and the modeling or description of these resources.

2.4 Interoperation Architectures

We believe that many information resources will be the result of combining existing sources in useful ways. Search will usually involve a combination of resources. Thus, it is crucial that information sources and products be able to work together.

We use the term "interoperation" for the integration of information resources, although that term is often used to refer to integration at a lower level of abstraction, on a level that is more syntactic than semantic. For example, great strides have been made recently in making different vendors' versions of UNIX or SQL "interoperate," i.e., work together seamlessly.

A framework for dealing with information access from such disparate sources is very close to what is needed for integrating services as well as information. Thus, the following observations are relevant to interoperation among heterogeneous systems of all kinds. An interoperation architecture must support the following capabilities:

3. Research and Development Recommendations

3.1 User Interfaces

We believe that the following are fertile research issues for creating user interfaces that are significantly more flexible than today's, while at the same time offering the user capabilities that are not available today at any level of effort.

Design and Implementation of Advanced Query Languages

New query languages must provide information access transparent of location, content and information organization/format. The language implementation should:

Adaptive Interfaces

The user interface must adapt to user preferences, limitations and behavior. The adaptation may be through observation and learning on the part of the interface software or it may be through user input. Interfaces must be able to deal with context (e.g., different users ascribe different meanings to a term like "board") and adapt to different media for presentation (e.g., a weather map cannot be displayed on a telephone; visual presentation is inappropriate on any medium if the user is blind).

Information Filtering

Without mechanisms to reduce, digest and summarize information, access to the NII will be a nightmare for the average user. Usable filtering tools are needed that constrain searches, filter out unwanted information, and digest and present the useful information in a meaningful way.

3.2 Search, Discovery and Update

The following are the principal research issues associated with the goals of creating a manageable, flexible and scalable information utility.

Self-Describing Resources

Because the NII will contain a vast array of information sources, it is not possible to rely on the presence of a centralized schema. In fact, since data can move around within the network, we cannot rely on the site at which a data item currently resides to have a full description of that data item's behavior and meaning. Thus, we believe that data must carry its own description.

For example, it should be possible for data to carry an indication of its type and a way to locate and access the metadata associated with that type. This metadata might include:

Research is needed on formal ways to express this metadata and on the minimal metadata needed to use information effectively for query processing and optimization.

Another kind of metadata that must be supported in the NII is data that describes system resources. For example, if a user is trying to assemble a collection of information sources that support atomic transactions, it is important to be able to determine the transaction protocols that are supported by each of the components.
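The idea of data that carries its own description can be sketched in a few lines. This is only an illustration under stated assumptions: the registry, the `registry://` reference format, the field names and the protocol labels are all hypothetical, and a real network would resolve metadata through a remote service, not a local dictionary.

```python
from dataclasses import dataclass

# Hypothetical registry: metadata reference -> metadata for that type.
# A dictionary stands in for what would be a network lookup.
TYPE_REGISTRY = {
    "registry://types/ekg-waveform": {
        "operations": ["plot", "heart_rate"],
    },
}


@dataclass(frozen=True)
class SelfDescribingItem:
    type_name: str     # indication of the item's type
    metadata_ref: str  # where the metadata for that type can be located
    payload: bytes     # the data itself


def resolve_metadata(item, registry=TYPE_REGISTRY):
    """Follow the reference carried by the item to its type metadata."""
    try:
        return registry[item.metadata_ref]
    except KeyError:
        raise LookupError(
            f"no metadata registered for {item.metadata_ref!r}"
        ) from None


def all_support(resources, protocol):
    """True if every resource description lists the transaction protocol."""
    return all(
        protocol in r.get("transaction_protocols", ()) for r in resources
    )
```

Because the item names its own metadata, a site that has never seen the type before can still find out which operations apply to it. The `all_support` helper illustrates the second kind of metadata, where a user checks whether every component of a proposed collection supports a given transaction protocol.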

Efficiency of Access

The use of databases in a large-scale information system requires sophisticated optimization techniques that go beyond what is available in commercial, usually relational, systems today. Without such optimization, searching the collective resources of the network could be prohibitively expensive.

For example, we need optimization techniques that can deal with data types like text strings, images, signals, and arbitrary tree and graph structures. Many of these optimization problems are motivated by multimedia data and are discussed in more detail in the Multimedia section.

We also need optimization techniques that are effective--even for conventional, record-based data--on wide-area networks that grow dynamically and that support such capabilities as:

Automatic Indexing and Categorization

There is a need to create tools that make it easier to search the vast amount of information available. For example, we need to create indexes automatically, to help channel queries to potentially useful sources. There is the potential for query systems to adapt to the environment--learning from experience which sources are useful to answer which kinds of queries, and keeping potentially useful information cached nearby. There is the potential for query systems to discover relationships among sources and their data (often called "data mining").
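As a rough illustration of how an automatically built index might channel queries to potentially useful sources, the sketch below builds a term-to-source index and ranks sources by the number of query terms they contain. The function names and the whitespace tokenization are assumptions made for the example; a practical system would need far better text analysis and ranking.

```python
from collections import defaultdict


def build_source_index(sources):
    """Map each term to the set of sources whose documents contain it.

    `sources` is a dict: source name -> list of document strings.
    """
    index = defaultdict(set)
    for name, docs in sources.items():
        for doc in docs:
            for term in doc.lower().split():
                index[term].add(name)
    return index


def route_query(index, query):
    """Return source names ordered by how many query terms they match."""
    hits = defaultdict(int)
    for term in set(query.lower().split()):
        for name in index.get(term, ()):
            hits[name] += 1
    # Most matching terms first; ties broken alphabetically.
    return sorted(hits, key=lambda n: (-hits[n], n))
```

A query is then sent only to the sources that `route_query` returns, so sources with no matching terms are never contacted.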

Advanced Query Capabilities

Search systems must support a number of new forms of query. Among these are the following:

Data Quality

There are a number of interesting issues that address the fact that data is of varying quality. How do we deal with data that may be out of date? What if two sources conflict on their responses to a query? How does the choice of using old, easy-to-get data versus hard-to-get, more current data impact query processing and optimization?
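One way to make these questions concrete is the sketch below. It assumes, purely for illustration, that each answer records its source, its age in days and an access cost; the `Answer` fields and the policies chosen (prefer the freshest answer, flag disagreement, take the cheapest answer that is fresh enough) are hypothetical choices, not recommendations from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    source: str
    value: str
    age_days: float  # how out of date the data may be
    cost: float      # hypothetical cost of obtaining the answer


def reconcile(answers, max_age_days):
    """Pick the freshest acceptable answer and report whether sources conflict."""
    if not answers:
        raise ValueError("no answers to reconcile")
    fresh = [a for a in answers if a.age_days <= max_age_days]
    candidates = fresh or list(answers)  # fall back to stale data if needed
    best = min(candidates, key=lambda a: a.age_days)
    conflict = len({a.value for a in candidates}) > 1
    return best, conflict


def cheapest_acceptable(answers, max_age_days):
    """Prefer the cheapest answer that is fresh enough, else the freshest."""
    if not answers:
        raise ValueError("no answers to choose from")
    fresh = [a for a in answers if a.age_days <= max_age_days]
    if fresh:
        return min(fresh, key=lambda a: a.cost)
    return min(answers, key=lambda a: a.age_days)
```

The staleness bound acts as the knob that trades easy-to-get old data against harder-to-get current data: loosening it lets the cheaper source win, tightening it forces the more expensive lookup.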

Time-Critical Delivery

How can we offer real-time guarantees for queries that must search an unknown and changing universe? More specifically, how can we gracefully degrade the quality of the response in order to meet a time limit? For example, can we retrieve images in whatever resolution is appropriate to the expected response time and cost?
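The image example suggests a simple form of graceful degradation, sketched below. The sketch assumes that each image is available in several variants of known size and that the transfer rate can be estimated in advance; both are simplifying assumptions, and the function name is hypothetical.

```python
def pick_resolution(variants, bytes_per_second, deadline_seconds):
    """Choose the largest variant whose estimated transfer fits the deadline.

    `variants` is a list of (label, size_in_bytes) pairs. If nothing fits,
    the smallest variant is returned so the user still gets some answer.
    """
    if not variants:
        raise ValueError("no variants available")
    if bytes_per_second <= 0:
        raise ValueError("transfer rate must be positive")
    ordered = sorted(variants, key=lambda v: v[1])
    choice = ordered[0]  # fallback: the smallest variant
    for variant in ordered:
        if variant[1] / bytes_per_second <= deadline_seconds:
            choice = variant  # keep upgrading while the estimate fits
    return choice[0]
```

The same pattern could in principle apply to other data types: order the possible responses by quality, estimate the time each would take, and return the best one that meets the limit.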

Transaction Processing Issues

It is unlikely that information search requires the classical kind of transaction, where information is locked for a brief period while the transaction runs. Rather, search activities will extend over a long period of time and will probably be able to tolerate some inconsistencies and/or out-of-date information. However, when there are updates or other actions that are taken in response to the information discovered as part of a search, there is a need for specialized forms of control. For example:

3.3 Information Resource Modeling and Creation

The following research issues are particularly important if we are to have authoring tools that assist in the creation of usable information products.

Modeling Content

We must develop standard descriptions for content and coverage for particular information types and user communities. We need to maintain registries of types and how to represent type descriptions so they can be exported along with base information. Because query processors accessing information of this type must develop efficient query plans, the type description must supply enough information that queries about that type can be optimized.

Manipulating Ancillary Information

When information resources are combined to form a new product, the ancillary information (relevance, coverage, type, origin, reliability, etc.) for the new product must be created, preferably by the same authoring tool that creates the new product itself. We need to invent approaches to combining ancillary information from several sources in a way that will maintain the usability and understandability of the result.
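A minimal sketch of what combining ancillary information might look like follows. The record layout and the combination rules are assumptions chosen for illustration: origins are concatenated, coverage is taken as the union of the sources' coverage, and reliability is taken conservatively as the lowest of the sources. Other products (for example, one formed by intersecting sources) would call for different rules, which is part of the research question.

```python
def combine_ancillary(records):
    """Derive ancillary information for a product built from several sources.

    Each record is a dict with keys "origin" (str), "coverage" (set of str)
    and "reliability" (float between 0 and 1). All keys are hypothetical.
    """
    if not records:
        raise ValueError("at least one source record is required")
    coverage = set()
    for record in records:
        coverage |= record["coverage"]
    return {
        "origin": [r["origin"] for r in records],          # keep full provenance
        "coverage": sorted(coverage),                       # union of coverage
        "reliability": min(r["reliability"] for r in records),  # weakest link
    }
```

An authoring tool could run such a step automatically when a new product is assembled, so that the product is published with its ancillary information already attached.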

Tools to Understand Ancillary Information

Users need help in understanding the content of an information resource. One approach is to develop display and summary tools for the ancillary information. Ontologies and other resources for formally defining the meaning of terms must be created and integrated into the user interfaces discussed in Subsection 2.1.

3.4 Interoperation Architectures

The final piece of the research picture involves the integration of resources.

3.4.1 Connectivity Issues

There are a number of proposals and experiments that point the way to architectures for achieving the high-level integration that is necessary to support information access and interoperability of resources and services for the NII. The architectures can generally be placed in three groups:

Mediators can be organized into hierarchies or layers, although the illustration in Figure 3.2 shows only a single layer. Examples of mediators include the LIM/IDI system (Unisys) and the SIMS system (USC/ISI).

There is a great range of expressiveness in proposed facilitator communication languages. These range from scripting languages (e.g., Tcl), through languages that support patterns with types and variables (e.g., BMS, ToolTalk, CORBA), to "knowledge" as represented by logical expressions (e.g., ABSI). Another axis along which different types of facilitators can be distinguished is the range of services they provide in support of interoperation. These can range from "yellow-pages" service or interprogram mail through general inference and routing techniques that infer the relationship between requests and offered services. There is support at this level for control of access and billing.
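To make the simplest end of this range concrete, the sketch below shows a "yellow-pages" style directory in which providers advertise capabilities and requesters look them up, together with a single-layer mediator that forwards a query to several sources and merges the results. All class and function names are hypothetical; this does not describe LIM/IDI, SIMS or any of the languages named above.

```python
class YellowPages:
    """Directory in which providers advertise the capabilities they offer."""

    def __init__(self):
        self._providers = {}  # capability -> list of provider names

    def advertise(self, provider, capabilities):
        for capability in capabilities:
            self._providers.setdefault(capability, []).append(provider)

    def lookup(self, capability):
        """Return providers of a capability, in the order they advertised."""
        return list(self._providers.get(capability, []))


def mediate(query, sources):
    """Single-layer mediator: send the query to each source, merge results.

    `sources` maps a source name to a callable that takes the query and
    returns a list of results. Each result is tagged with its source.
    """
    merged = []
    for name, fetch in sources.items():
        for item in fetch(query):
            merged.append((name, item))
    return merged
```

A layered arrangement could be obtained by passing a function that itself calls `mediate` as one of the sources. The inference and routing services at the richer end of the range would replace the exact-match `lookup` with reasoning about how requests relate to offered services.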

3.4.2 Research Issues in Interoperation Architectures

  • Equally important is the need to understand and translate among different "ontologies," that is, collections of terms and their formal definitions. These ontologies are associated either with a field, e.g., cardiology, or with a particular user.

4. Prioritizing the Information Access R&D Agenda

We believe that the development of the information access tools and technologies needs to be coupled closely to the creation of specific applications. Without a body of experience, it is hard to prioritize the various desiderata. Thus, we began with a thought experiment in which we examined the predicted or expected early applications of the NII in the areas of health care, electronic commerce and electronic libraries. Our goal was to identify research challenges that appear important for all of these applications. Examples of the research issues that were considered crucial to most, if not all, applications are:

On the other hand, other issues seem to be particularly important to some applications such as: