Much learned from project to digitize OED
A decade of annual gatherings to explore and celebrate the science of computational linguistics, text databases and how to make an electronic dictionary ended last fall with reflections on progress made and future prospects. The centerpiece of this research is the Oxford English Dictionary. Although the second edition was released in 1989, the lessons learned from digitizing the 21,728 pages packed into 20 volumes are now shaping a new generation of applications in areas such as decision-support systems and software engineering.
In 1983, the Oxford University Press circulated a call for proposals to convert the venerable OED, first published in 1928, into electronic form. With some prodding from UW President Doug Wright, a partnership between the university and Oxford University Press was struck in 1984. The New Oxford English Dictionary Project was then launched with money from the British government and with hardware, software and personnel donated by IBM-United Kingdom Ltd. The International Computaprint Corp. of Fort Washington, PA, won a contract to enter the entire OED and the supplement into electronic form, a task that occupied 120 people for 18 months. UW received support from the Natural Sciences and Engineering Research Council and the university.
The updated dictionary needed the text-aware features typically found in information retrieval systems to be incorporated into a standard database management system, supported by a data model based on formal languages. In other words, the project needed software that could search the OED and compile information rapidly.
Tompa and others at UW brought experience with non-standard databases, beginning with videotex back in 1981. "We in computer science have always been interested in the management of data," Tompa said. "Others had experience in text processing and the heuristics." The proposal was interdisciplinary, with strong support from the humanities.
The problem with text, unlike numbers, is that retrieving information is difficult because of the nuances of language itself. This is magnified when dealing with the 55 million words of the OED. According to Tompa, there is a simple formalism behind the now ubiquitous relational databases that provides structural security. Structured text is different, compromised in one way or another. But the OED has a structure all its own. Each entry has standardized parts and subparts, with paragraphs, sections, definitions, citations and so on.
This organization allowed the UW researchers to look at the text as a formal language. Of course, not every citation has all these fields. When the UW team looked at the entire dictionary, they found more than 100 different patterns for what the outline of an entry looks like.
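As a rough illustration of what "the outline of an entry" can mean, the sketch below reduces each entry to the ordered sequence of its field names and counts how many distinct sequences occur. The four sample entries, the field names and the helper `outline` are invented for this example; they are not the OED's actual tag set or the UW team's code.

```python
from collections import Counter

# Invented sample entries: each is a list of (field name, text) pairs.
# The field names are assumptions for illustration only.
entries = [
    [("headword", "computer"), ("pronunciation", "kom-PYOO-ter"),
     ("definition", "one who computes"), ("citation", "quotation 1"),
     ("citation", "quotation 2")],
    [("headword", "compute"), ("pronunciation", "kom-PYOOT"),
     ("definition", "to reckon"), ("citation", "quotation 3")],
    [("headword", "computable"), ("definition", "capable of being computed")],
    [("headword", "computist"), ("etymology", "from Latin"),
     ("definition", "a skilled reckoner"), ("citation", "quotation 4")],
]

def outline(entry):
    """Return the entry's field names in order, collapsing repeated fields."""
    names = []
    for field, _text in entry:
        if not names or names[-1] != field:
            names.append(field)
    return tuple(names)

# Count how many entries share each outline pattern.
pattern_counts = Counter(outline(entry) for entry in entries)

for pattern, count in pattern_counts.most_common():
    print(count, " > ".join(pattern))
```

Here the first two entries share one outline even though one has two citations, while the entries lacking a pronunciation or a citation fall into patterns of their own. Run over a whole dictionary, the same kind of tally is one way a list of distinct entry patterns could be produced.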
The secret to the UW search strategy was to develop parsed strings: a string of text or a sequence that is placed into a hierarchy. A search for the word "computer" would create a parsed string that would find more than just the single word "computer." It would find the first instance of the word "computer" and then include all the text that follows to the end of the dictionary. When a text-searching tool, such as the pattern searching tool developed by Tompa and co-workers, uses parsed strings, it is searching not for an individual word but for strings that begin with that word. The result is that text can be searched to yield results that were not possible before.
A. Walton Litz, a professor of English at Princeton University and a member of the Oxford University Press advisory council, told Time magazine, "I've never been associated with a project, I've never even heard of a project, that was so incredibly complicated and that met every deadline."
In 1989, a spin-off company, Open Text Systems, was formed to commercialize the technology, initially for the academic market. By 1991, the company had become Open Text Corp. and focused on larger, commercial applications.
Now Tompa has set his sights on commercial databases. Traditional database management systems have weak facilities for storing and manipulating text, and text-based systems ignore facilities for concurrent access and updates. The goal now is to create an application program interface that supports both SQL (the industry standard for relational data) and SGML (the industry standard for structured text). "We want to add features and compatibilities without them appearing to be warts," he said.
Looking back, Tompa said the interaction with the editors at Oxford University Press was critical to the success of the project. "It was very important and extremely productive. Most of us work in a very small circle," he said.
Douglas Powell is a graduate student at the University of Guelph.
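The parsed-string search described in the article can be sketched in a few lines. This is a toy version under stated assumptions, not the UW software: a plain sorted list stands in for whatever index structure the project actually used, and the function names `build_index` and `find_prefix` and the sample sentence are invented here. Each word start is treated as the beginning of a string that runs to the end of the text, and a search looks for strings that begin with the query.

```python
import re

def build_index(text):
    """Return the offset of every word start, ordered by the text that follows it."""
    starts = [m.start() for m in re.finditer(r"\w+", text)]
    # Sorting by text[i:] is fine for a toy example but far too costly
    # for a text the size of a dictionary.
    return sorted(starts, key=lambda i: text[i:])

def find_prefix(text, index, prefix):
    """Return the offsets of all indexed strings that begin with prefix."""
    n = len(prefix)
    lo, hi = 0, len(index)
    # Binary search for the first indexed string whose first n characters
    # are not less than the prefix.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[index[mid]:index[mid] + n] < prefix:
            lo = mid + 1
        else:
            hi = mid
    # All matches sit next to each other in the sorted index.
    hits = []
    while lo < len(index) and text[index[lo]:index[lo] + n] == prefix:
        hits.append(index[lo])
        lo += 1
    return sorted(hits)

text = "early computers filled rooms; a computer now fits in a pocket"
index = build_index(text)
hits = find_prefix(text, index, "computer")
for offset in hits:
    print(offset, text[offset:offset + 20])
```

Because the query is matched against the start of each string rather than against whole words, the search for "computer" also finds "computers", and a longer query such as "computer now" would work the same way without any change to the index.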