C.3 Information Access
Principal Authors:
Robert L. Simpson and Jeffrey D. Ullman
Additional Contributors:
Yigal Arens, Jose A. Blakeley, David Crocker, Peter Danzig, Julio Escobar,
Michael Genesereth, James Gettys, David Maier, Calton Pu, Marek Rusinkiewicz,
Norman K. Sondheimer, Ram Sriram, Lynn Streeter, Subu Subramanian, John Vittal,
Darrell Woelk and Gregory Zack
1. Introduction
Information access concerns the aspect of the National Information
Infrastructure that enables any American to ask a reasonably complicated
question and expect a useful answer to be obtained from information resources
available over the network. Important applications of this capability involve
medical care or the conduct of business, as well as personal access to the
options and opportunities available to us.
Getting to such a state will not be easy. We see four classes of technical
problems that must be addressed by a strong research effort:
- The creation of effective user interfaces to make a myriad of information
resources understandable.
- The discovery of ways to search information repositories and obtain
answers to specific queries.
- The creation of the information resources themselves, including the
modeling of resource content to assist access.
- The appropriate architecture of systems to integrate information from
multiple sources.
Section 2 describes the problems in these four areas, Section 3 contains our
research and development recommendations in each of the areas, and Section 4
prioritizes the agenda from the perspective of the applications we foresee.
2. Technical Challenges for Information Access in the NII
2.1 User Interfaces
Interfaces are the mechanisms by which a user searches, browses, produces and
consumes information on the NII. Americans today use computers, ATM machines,
libraries, televisions, fax machines, telephones, electronic bulletin boards
and other mechanisms to obtain a desired piece of information (e.g., test
results, airline flights, books, pictures, software, bank balances). The user
is required to negotiate through a variety of dissimilar user interfaces, and
understand the limitations of each service and how the service can be queried.
The effectiveness of the response is directly related to the training and
ability of the user, disenfranchising most of the population. The systems are
literal and do not tolerate errors or imprecision. The results are returned in
a variety of predetermined formats, and incomplete results do not offer
opportunities for reformulation. The systems generally offer limited
interactivity with no support for a natural dialogue. The systems generally do
not adapt to the user's behavior.
2.2 Search, Discovery and Update
We take "search" to mean locating relevant information anywhere in the NII,
which is based on a highly distributed architecture. The criteria used to
define what is relevant will have to be very flexible and will stress the
current state of the art. Any search against such a large space will have to be
highly optimized. "Discovery" is the automatic detection of interesting
patterns in the data to derive new knowledge as well as the capability to guide
a user in the process of finding "what's out there." "Update" includes issues
related to changing existing information resources or appending new information
to resources. Together, the mechanisms for search, discovery and update must
lead to a manageable and scalable information utility across the NII.
2.3 Information Resource Modeling and Creation
The value of a resource lies not only in its content, but in the ability of
people and programs to discern:
- What the resource provides.
- How the information therein was derived, and what the quality and
reliability of the information is.
- Whether the information is relevant to the task at hand.
- What the data types of the information are and how they can be converted
to a desired form for viewing and editing.
Models for representing information products need to support these features,
and tools for their creation should help produce the ancillary information
needed to supply them.
For example, a heart association might offer an information product that lets
medical researchers and community hospitals access a collection of EKGs. This
information product may need to be catalogued or advertised in several places
so different user classes (e.g., researchers, practitioners) can find it.
A potential user needs to assess whether this information resource is germane
to his or her inquiry (e.g., heart problems in geriatric females with
diabetes). If so, the user's system needs to understand the data type of EKG
waveforms in this resource. One user may be working in an environment that
knows this EKG representation and the operations that can be performed on it,
while another user may have to download the type information and code for the
operations on that type. Finally, the user may need to know about the origins
of the information. For example, it could be relevant to a medical researcher
that the EKG collection contains mainly samples taken from patients between 85
and 89 using three different brands of monitoring equipment.
For effective information access we need a research program supporting the
creation of resources and the modeling or description of these resources.
2.4 Interoperation Architectures
We believe that many information resources will be the result of combining
existing sources in useful ways. Search will usually involve a combination of
resources. Thus, it is crucial that information sources and products be able to
work together.
We use the term "interoperation" for the integration of information resources,
although that term is often used to refer to integration at a lower level of
abstraction, on a level that is more syntactic than semantic. For example,
great strides have been made recently in making different vendors' versions of
UNIX or SQL "interoperate," i.e., work together seamlessly.
A framework for dealing with information access from such disparate sources is
very close to what is needed for integrating services as well as information.
Thus, the following observations are relevant to interoperation among
heterogeneous systems of all kinds. An interoperation architecture must support
the following capabilities:
- Seamlessness: From a user's or an application program's
perspective, heterogeneous, distributed resources should be accessible as
though developed under a single system design and, to the extent possible,
should appear to be available from a single resource.
- Semantic interoperation: The data type of information should not
obscure its inherent meaning. Rather, information should be converted from one
representation to another automatically, while its meaning is preserved.
- Scalability: The architecture must offer performance that does not
degrade as the number of information sources or participants grows.
- Extensibility: Ability to accommodate new resources and services as
needed by applications.
- Robustness: Activities must be maintained in the face of system
failures, other changes in available resources, or incompleteness of
information and services.
- Efficiency: The architecture must support rapid exchange of data
when needed, offer bounded-time performance if desired, and support access to
lowest-cost providers.
- Reliability: The architecture must support security, privacy,
billing, audit trails, inconsistency detection and resolution, pedigrees
regarding data origin and quality, and other services needed for a useful
information infrastructure. The architecture should allow proprietary
information or intellectual property to be shared in a secure fashion.
- Controllability: The architecture should accept and use preference
information for routing, should the user so desire.
- Transaction management: The architecture should provide support for
both long- and short-term transactions, very high transaction volume and
intermittent connectivity. Long-term transactions may include subtransactions
that require human response as well as response from a variety of automated
information resources.
- Support for negotiation: The architecture should provide the
opportunity for resources to compete as servers and to cooperate in
economically appropriate ways.
3. Research and Development Recommendations
3.1 User Interfaces
We believe that the following are fertile research issues for creating user
interfaces that are significantly more flexible than today's, while at the same
time offering the user capabilities that cannot be obtained today at any level
of effort.
Design and Implementation of Advanced Query Languages
New query languages must provide information access that is transparent with
respect to location, content and information organization/format. The language
implementation should:
- Act as an electronic agent or intermediary, tolerating inaccurate or
imprecise requests and mapping them from the user's domain to the sources.
- Support a constructive dialogue to reformulate the request if
unsuccessful.
- Anticipate user requests by pre-fetching information that meets the
pattern of activity of the user.
Adaptive Interfaces
The user interface must adapt to user preferences, limitations and behavior.
The adaptation may be through observation and learning on the part of the
interface software or it may be through user input. Interfaces must be able to
deal with context (e.g., different users ascribe different meanings to a term
like "board") and adapt to different media for presentation (e.g., a weather
map cannot be displayed on a telephone; visual presentation is inappropriate on
any medium if the user is blind).
Information Filtering
Without mechanisms to reduce, digest and summarize information, access to the
NII will be a nightmare for the average user. Usable filtering tools are needed
that constrain searches, filter out unwanted information, and digest and
present the useful information in a meaningful way.
3.2 Search, Discovery and Update
The following are the principal research issues associated with the goals of
creating a manageable, flexible and scalable information utility.
Self-Describing Resources
Because the NII will contain a vast array of information sources, it is not
possible to rely on the presence of a centralized schema. In fact, since data
can move around within the network, we cannot rely on the site at which a data
item currently resides to have a full description of that data item's behavior
and meaning. Thus, we believe that data must carry its own description.
For example, it should be possible for data to carry an indication of its type
and a way to locate and access the metadata associated with that type. This
metadata might include:
- The operations that can be performed on the data.
- Constraints on the data.
- The expected sizes of the results obtained by applying specific operations
to the data.
Research is needed on formal ways to express this metadata and on the minimal
metadata needed to use information effectively for query processing and
optimization.
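As a minimal sketch of data that carries its own description, the Python
fragment below attaches a type name to each item and resolves it to a metadata
record holding the three kinds of metadata listed above. The class names
(TypeMetadata, SelfDescribingItem), the in-memory registry and the EKG sample
values are illustrative assumptions, not a proposed format.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class TypeMetadata:
    """Hypothetical metadata record for one data type."""
    name: str
    operations: Dict[str, Callable[[Any], Any]]   # operations that can be performed
    constraints: List[Callable[[Any], bool]]      # predicates the data must satisfy
    expected_result_size: Dict[str, int] = field(default_factory=dict)  # per operation


@dataclass
class SelfDescribingItem:
    """A data item that carries its type name and a way to reach its metadata."""
    type_name: str
    payload: Any
    registry: Dict[str, TypeMetadata]   # stands in for "a way to locate the metadata"

    def metadata(self) -> TypeMetadata:
        return self.registry[self.type_name]

    def is_valid(self) -> bool:
        # The item is valid when every constraint for its type holds.
        return all(check(self.payload) for check in self.metadata().constraints)

    def apply(self, op: str) -> Any:
        # Look the operation up in the metadata rather than hard-coding it.
        return self.metadata().operations[op](self.payload)


# Illustrative registry with a single, made-up type.
registry = {
    "ekg_waveform": TypeMetadata(
        name="ekg_waveform",
        operations={"peak": max, "length": len},
        constraints=[lambda samples: len(samples) > 0],
        expected_result_size={"peak": 1, "length": 1},
    )
}

item = SelfDescribingItem("ekg_waveform", [0.1, 0.9, 0.3], registry)
print(item.is_valid(), item.apply("peak"), item.apply("length"))
```

A query optimizer could, under these assumptions, read expected_result_size
before deciding where to run an operation; what the minimal useful metadata is
remains the open research question.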
Another kind of metadata that must be supported in the NII is data that
describes system resources. For example, if a user is trying to assemble a
collection of information sources that support atomic transactions, it is
important to be able to determine the transaction protocols that are supported
by each of the components.
Efficiency of Access
The use of databases in a large-scale information system requires sophisticated
optimization techniques that go beyond what is available in commercial, usually
relational, systems today. Without such optimization, searching the
collective resources of the network could be prohibitively expensive.
For example, we need optimization techniques that can deal with data types like
text strings, images, signals, and arbitrary tree and graph structures. Many of
these optimization problems are motivated by multimedia data and are discussed
in more detail in the Multimedia section.
We also need optimization techniques that are effective--even for conventional,
record-based data--on wide-area networks that grow dynamically and that support
such capabilities as:
- Replication of data.
- Local caching.
- Migration of data.
Automatic Indexing and Categorization
There is a need to create tools that make it easier to search the vast amount
of information available. For example, we need to create indexes automatically,
to help channel queries to potentially useful sources. There is the potential
for query systems to adapt to the environment--learning from experience which
sources are useful to answer which kinds of queries, and keeping potentially
useful information cached nearby. There is the potential for query systems to
discover relationships among sources and their data (often called "data
mining").
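One simple way to channel queries to potentially useful sources is an
automatically built index from terms to the sources whose descriptions mention
them. The sketch below assumes each source offers a short text description; the
function names and the sample sources are hypothetical, and term overlap is
only a stand-in for a real relevance measure.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_source_index(descriptions: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the set of sources whose description contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for source, text in descriptions.items():
        for term in text.lower().split():
            index[term].add(source)
    return index


def route_query(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Rank sources by how many query terms their description shares."""
    scores: Dict[str, int] = defaultdict(int)
    for term in query.lower().split():
        for source in index.get(term, ()):
            scores[source] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda s: (-scores[s], s))


# Made-up source descriptions for illustration.
descriptions = {
    "heart_assoc": "ekg waveforms cardiology patients",
    "airline": "flights schedules fares",
    "library": "cardiology journals books",
}
index = build_source_index(descriptions)
print(route_query(index, "cardiology ekg"))
```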
Advanced Query Capabilities
Search systems must support a number of new forms of query. Among these are the
following:
- Approximate query forms: We anticipate many situations where there
is no answer to a query as stated. In addition to assisting the user in
reformulating the query, as discussed in Section 2.1, a search system should be
able to relax the query in appropriate ways. For example, a person who requests
a flight from Boston to San Francisco leaving at 8 A.M. should be told about
flights leaving at 8:30 A.M. and about connecting flights leaving at 8:10 A.M.,
but not about flights leaving at 8 A.M. to Los Angeles.
- Query capability for non-traditional data: Users should be able to
describe patterns in an EKG, offer sketches of images that they wish retrieved,
find a musical score by humming a few bars and in general perform queries on
information that is not in the traditional record-oriented forms. Notions of
approximation to queries as stated are especially important for "multimedia"
information.
- Continuum between querying and browsing: New query facilities should
support information access through a spectrum of approaches, ranging from
menu-driven browsing through traditional queries, to the implementation of
general-purpose search programs.
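The flight example in the first item above can be sketched as follows. This is
a minimal illustration, assuming that the route is a hard constraint and that
only the departure time may be relaxed; the Flight record, the relax_query
function and the 45-minute window are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Flight:
    origin: str
    destination: str
    depart_minutes: int   # minutes after midnight
    connecting: bool      # True if the itinerary involves a connection


def relax_query(flights: List[Flight], origin: str, destination: str,
                depart_minutes: int, max_shift: int = 45) -> List[Flight]:
    """Return exact matches if any exist; otherwise relax the departure time.

    The route is never relaxed, so a flight to another city is not offered
    even if it leaves at exactly the requested time.
    """
    same_route = [f for f in flights
                  if f.origin == origin and f.destination == destination]
    exact = [f for f in same_route if f.depart_minutes == depart_minutes]
    if exact:
        return exact
    near = [f for f in same_route
            if abs(f.depart_minutes - depart_minutes) <= max_shift]
    # Closest departure time first.
    return sorted(near, key=lambda f: abs(f.depart_minutes - depart_minutes))


flights = [
    Flight("BOS", "SFO", 8 * 60 + 30, connecting=False),
    Flight("BOS", "SFO", 8 * 60 + 10, connecting=True),
    Flight("BOS", "LAX", 8 * 60, connecting=False),
]
for f in relax_query(flights, "BOS", "SFO", 8 * 60):
    print(f)
```

Deciding which attributes may be relaxed, and by how much, is exactly the part
that depends on the application; here it is fixed by hand.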
Data Quality
A number of interesting issues arise from the fact that data is of
varying quality. How do we deal with data that may be out of date? What if two
sources conflict on their responses to a query? How does the choice of using
old, easy-to-get data versus hard-to-get, more current data impact query
processing and optimization?
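One possible policy for the last question, sketched under strong assumptions:
each candidate answer carries an age and a cost of retrieval, and the query
processor takes the cheapest answer that is fresh enough, falling back to the
freshest one when none qualifies. The Answer record, its fields and the policy
itself are hypothetical; they illustrate the trade-off and do not settle it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Answer:
    source: str
    value: str
    age_hours: float   # how old the underlying data is
    cost: float        # hypothetical cost of obtaining the answer


def choose_answer(candidates: List[Answer],
                  max_age_hours: float) -> Optional[Answer]:
    """Cheapest answer that is fresh enough; otherwise the freshest available."""
    if not candidates:
        return None
    fresh = [a for a in candidates if a.age_hours <= max_age_hours]
    if fresh:
        return min(fresh, key=lambda a: a.cost)
    return min(candidates, key=lambda a: a.age_hours)


# Two sources that conflict on the answer to the same query.
candidates = [
    Answer("local_cache", "gate B4", age_hours=12.0, cost=0.1),
    Answer("airline_db", "gate C2", age_hours=0.5, cost=2.0),
]
print(choose_answer(candidates, max_age_hours=24.0).source)
print(choose_answer(candidates, max_age_hours=1.0).source)
```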
Time-Critical Delivery
How can we offer real-time guarantees for queries that must search an unknown
and changing universe? More specifically, how can we gracefully degrade the
quality of the response in order to meet a time limit? For example, can we
retrieve images in whatever resolution is appropriate to the expected response
time and cost?
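A minimal sketch of the image example, assuming the sizes of the available
renditions and the usable bandwidth are known in advance (in practice both
would be estimates): pick the largest rendition whose transfer fits the time
limit. The function name and the numbers are illustrative.

```python
from typing import Dict, Optional


def pick_resolution(sizes_bytes: Dict[str, int],
                    bandwidth_bytes_per_s: float,
                    deadline_s: float) -> Optional[str]:
    """Return the largest rendition that can be transferred within the deadline.

    Returns None when even the smallest rendition would miss the deadline.
    """
    budget = bandwidth_bytes_per_s * deadline_s   # bytes deliverable in time
    feasible = {name: size for name, size in sizes_bytes.items()
                if size <= budget}
    if not feasible:
        return None
    return max(feasible, key=feasible.get)


# Made-up rendition sizes for a single image.
renditions = {"thumbnail": 20_000, "medium": 400_000, "full": 6_000_000}
print(pick_resolution(renditions, bandwidth_bytes_per_s=100_000, deadline_s=5.0))
```

The same shape of decision applies to cost limits: replace the byte budget with
a price budget and the transfer sizes with per-rendition prices.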
Transaction Processing Issues
It is unlikely that information search requires the classical kind of
transaction, where information is locked for a brief period while the
transaction runs. Rather, search activities will extend over a long period of
time and will probably be able to tolerate some inconsistencies and/or
out-of-date information. However, when there are updates or other actions that
are taken in response to the information discovered as part of a search, there
is a need for specialized forms of control. For example:
- An operation to plan a trip may search airline, hotel and other
information sources. Suboperations may book flights that then need to be rolled
back because there is no available hotel.
- Active databases support rules that are triggered when certain events
occur. For example, when a library receives an image of a new document, it may
generate notifications to clients who have requested to see the document. It
may also notify other libraries and expect them to note in their own databases
that the document is available. We must be able to control cascades of
triggered rules, avoid cycles and separate the rule-generated consequences of
an action from the action itself.
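The trip-planning case can be sketched with compensating actions, a technique
we substitute here for classical locking: each suboperation is paired with an
action that undoes it, and a failure triggers the undo actions for completed
steps in reverse order. The function names and the simulated hotel failure are
assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes it.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log: List[str] = []


def book_flight() -> None:
    log.append("flight booked")


def cancel_flight() -> None:
    log.append("flight cancelled")


def book_hotel() -> None:
    # Simulated failure: no hotel is available.
    raise RuntimeError("no rooms available")


def cancel_hotel() -> None:
    log.append("hotel cancelled")


ok = run_with_compensation([(book_flight, cancel_flight),
                            (book_hotel, cancel_hotel)])
print(ok, log)
```

Unlike a locked transaction, the flight booking is briefly visible to others
before it is cancelled, which is the kind of inconsistency long-running
activities may have to tolerate.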
3.3 Information Resource Modeling and Creation
The following research issues are particularly important if we are to have
authoring tools that assist in the creation of usable information products.
Modeling Content
We must develop standard descriptions for content and coverage for particular
information types and user communities. We need to maintain registries of types
and to determine how to represent type descriptions so they can be exported
along with base information. Because query processors accessing information of
this type must
develop efficient query plans, the type description must supply enough
information that queries about that type can be optimized.
Manipulating Ancillary Information
When information resources are combined to form a new product, the ancillary
information (relevance, coverage, type, origin, reliability, etc.) for the new
product must be created, preferably by the same authoring tool that creates the
new product itself. We need to invent approaches to combining ancillary
information from several sources in a way that will maintain the usability and
understandability of the result.
Tools to Understand Ancillary Information
Users need help in understanding the content of an information resource.
One approach is to develop display and summary tools for the ancillary
information. Ontologies and other resources for formally defining meaning of
terms must be created and integrated into the user interfaces discussed in
Subsection 2.1.
3.4 Interoperation Architectures
The final piece of the research picture involves the integration of resources.
3.4.1 Connectivity Issues
There are a number of proposals and experiments that point the way to
architectures for achieving the high-level integration that is necessary to
support information access and interoperability of resources and services for
the NII. The architectures can generally be placed in three groups:
- Point-to-point connectivity: The system software provides reliable
bit streams between any two processes, as suggested by Figure 3.1.
Interoperation is assured by a common communication language. An example of
this type of architecture is CNRI's Knowbot.
- Mediators: A "mediator" is a facility that provides an information
service. It accepts queries from users (or other mediators) and answers them by
retrieving, combining and/or otherwise manipulating data in raw databases or
information provided by other mediators. Mediators eliminate the need for users
of information resources to interact with them directly, or even to know the
precise identity of the sources of the information they require. In addition to
simple retrieval of information, a mediator could summarize or abstract
information in response to a user's request.
Mediators can be organized into hierarchies or layers, although the
illustration in Figure 3.2 shows only a single layer. Examples of mediators
include the LIM/IDI system (Unisys) and the SIMS system (USC/ISI).
- Federated architectures: The mediator approach replaces direct
linking of applications to databases. It allows applications to be programmed
in terms of their information needs, instead of specific queries to specific
sources for the data they require. It can be used to support dynamic
identification of relevant data sources, and is thus much more extensible than
incorporating database calls directly into applications. To be accessible to
applications, new data sources need only be identified and made accessible to
relevant mediators.
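The mediator idea described in the list above can be sketched in a few lines.
This is an illustration under simplifying assumptions: a source is any
function that answers a topic with a list of records, and the Mediator class,
its method names and the sample sources are hypothetical.

```python
from typing import Callable, Dict, List

# A source answers a topic with a list of records.
Source = Callable[[str], List[dict]]


class Mediator:
    """Answers queries by combining results from registered sources."""

    def __init__(self) -> None:
        self._sources: Dict[str, List[Source]] = {}

    def register(self, topic: str, source: Source) -> None:
        # New sources become available without changing any application.
        self._sources.setdefault(topic, []).append(source)

    def query(self, topic: str) -> List[dict]:
        # The caller never learns which sources were consulted.
        results: List[dict] = []
        for source in self._sources.get(topic, []):
            results.extend(source(topic))
        return results


def hospital_db(topic: str) -> List[dict]:
    return [{"id": 1, "origin": "hospital"}]


def association_db(topic: str) -> List[dict]:
    return [{"id": 2, "origin": "association"}]


lower = Mediator()
lower.register("ekg", hospital_db)
lower.register("ekg", association_db)

# Mediators can be layered: another mediator's query method is itself a source.
upper = Mediator()
upper.register("ekg", lower.query)
print(upper.query("ekg"))
```

A real mediator would also translate, summarize or abstract the retrieved
information; this sketch only concatenates it.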
There is a great range of expressiveness in proposed facilitator communication
languages. These range from scripting languages (e.g., Tcl), through languages
that support patterns with types and variables (e.g., BMS, ToolTalk, CORBA), to
"knowledge" as represented by logical expressions (e.g., ABSI). Another axis
along which different types of facilitators can be distinguished is the range
of services they provide in support of interoperation. These can range from
"yellow-pages" service or interprogram mail through general inference and
routing techniques that infer the relationship between requests and offered
services. There is support at this level for control of access and billing.
3.4.2 Research Issues in Interoperation Architectures
- Translation methodologies: We need to develop tools and techniques
for translating among the many notations and standards that exist or will
appear in the future. Simple, syntactic transformations (e.g., between TeX and
RTF) are commercial matters, not research matters. However, higher-level
translations involving the semantics of information are a matter for research.
Relevant questions include translation among different data models (e.g.,
logical or relational versus networks or object-structures) and translation
among different protocols (e.g., streams of replies versus bundled replies).
Equally important is the need to understand and translate among different
"ontologies," that is, collections of terms and their formal definitions. These
ontologies are associated either with a field, e.g., cardiology, or with a
particular user.
- Content languages: As opposed to languages that describe structure
and type, as discussed in Section 2.3 on resource modeling, there is a need to
convey content among resources. We therefore need to develop standards for the
exchange of information. These new languages must build upon what has been
developed for specific kinds of information such as data (e.g., SQL and its
successors), object-oriented messages (e.g., IDL), knowledge (e.g., KIF/KQML of
the ARPA knowledge-sharing effort) and programs (e.g., scripting languages).
- Ontology development: Communication among resources in an
application domain requires the development of suitable ontologies for the
domain and making these ontologies widely available. These ontologies will
offer both human users and the integration system a formal definition of domain
terms and their relationships. Their creation requires both the development of
languages and tools to capture domain knowledge (that is, to specify meaning)
and the careful design of the definitions themselves. We expect to see
research into the development of ontologies for medicine, engineering,
manufacturing and process control, transportation planning, commerce and
financial management, among many other domains.
- Support for important system requirements: Research is needed on
ways to support the capabilities listed in Section 2.4,
such as reliability, controllability and transaction management.
- Runtime support software: When using a federated approach, most
decisions need to be made at runtime, because we cannot predict in advance what
resources will participate in a search or other interaction. Tools to monitor
system performance and debug failures are essential to the effective use of
integration services.
- Off-line support software: Not all integration will be at runtime,
or "on line." We recognize also that there is a great need for "off-line"
design of integrated information resources. For example, we mentioned above the
need for tools to capture domain knowledge in ontologies. Section 2.3 discussed
the need for tools to create representations of the content of databases and
other information resources. We also see the need for tools to configure
integrated systems from components, along the lines of Figures 3.1 and 3.2.
4. Prioritizing the Information Access R&D Agenda
We believe that the development of the information access tools and
technologies needs to be coupled closely to the creation of specific
applications. Without a body of experience, it is hard to prioritize the
various desiderata. Thus, we began with a thought experiment in which we
examined the predicted or expected early applications of the NII in the areas
of health care, electronic commerce and electronic libraries. Our goal was to
identify research challenges that appear important for all of these
applications. Examples of the research issues that were considered crucial to
most, if not all, applications are:
- Support for approximate queries and automatic query relaxation.
- Real-time or time-constrained delivery of information.
- Active elements: alerting and triggering.
- Support for location of relevant information resources, description of
available resources and automatic delivery of information by subscription.
- Support for finding information of best quality or most recent
information.
On the other hand, other issues seem to be particularly important to some
applications such as:
- Adaptive user profiles and filtering of information (electronic
libraries).
- Ultra-high transaction volume (electronic commerce).
- Access to non-traditional information resources (health care).
- Support for audit trails (commerce, health care).
- Support for exchange of confidential information (commerce, health
care).