
CSC 400 Team Project: Semantic Internet Access




Team: Tagyoung Chung, Yi Chu, Jonathan Gordon, Tivadar Papai, Naushad UzZaman.

Your mission is to implement a system that can take simple text questions that have definite, factual, numerical or historical date answers (e.g. "what is the melting point of aluminum" or "when did the Hindenburg crash"), find an answer on the internet, and report the answer to the user in the form of a statement, e.g. "The melting point of aluminum is 660 degrees centigrade". The system should accept free-form questions, and not require that they fit a specific rigid template (e.g. "How (long, high, heavy) is #item-name# in #unit-term#"). You are free to have the system respond that it cannot understand a question; in fact this is probably a good idea, but it should not happen so frequently that a naive user gets frustrated. A significant part of the challenge is to get the units right. This adds a semantic element to the problem.

Note that the goal is NOT to generate pointers to web pages or fragments that contain the answer ("AskJeeves.com" (now renamed "ask.com"), etc. already do that), but to answer the question. The user does not look at any web pages. What makes this potentially feasible is that the numerical form of the answers significantly restricts the form of the questions, and provides strong constraints that can be employed in searching for an answer.

The scope of the problem extends only to questions where the numerical answer is directly present in some form on the web. In particular, computational questions such as "what is the product of 34 and 78?" are not included, nor are questions requiring reasoning such as "how many brazil nuts would fit in a wombat", unless the answer has already been figured out and posted by someone.

Using "Ask Jeeves" or similar "semantic" agents, such as Google's unit converter, as components of your system is disallowed here, though I don't actually think Jeeves will help you a lot. Use of vanilla indexers such as Google, on the other hand, is necessary: you don't have to construct your own primary web index.

On a designated day near the end of the semester, there will be a live demonstration of the system to the rest of the class. The team will illustrate the operation of the system, and then allow the rest of the class to test it. The team will also need to give a presentation on their effort (in a class after the demonstration).

The team is to create a single, coherent, written report on their approach to the project, describing in detail what approaches were used, what did not work, any comparative studies done, what considerations went into the design, where the processing time was spent, how the system would be expected to scale, etc.




It is not necessary, and probably is counterproductive, to start by creating a general purpose NL parser and language generation system. Past experience with this problem has revealed that most common questions involving numerical answers fit into a handful of common forms that have formulaic response patterns. On the other hand, it might be very useful to have special purpose parsers for interpreting general mixed numerical expressions such as 3, three, three million, 3 million, 4.5 billion, 3 x 10^12, etc., and for dates. Given this, a very simple approach that works surprisingly often is to reorder a standard form such as "how high is the empire state building" to "the empire state building is #number-expression# #unit-term# high", and search through the pages obtained by giving all the words of the query to Google for a string matching the specific pattern. It is not unlikely that somebody, somewhere, has posted an answer in the standard form.
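As a rough illustration of the two ideas above, here is a minimal Python sketch of a special purpose parser for mixed numerical expressions, plus a function that reorders one question form ("how #adjective# is #item#") into a search pattern. Everything in it is an assumption made for illustration: the word tables, the names parse_number and answer_pattern, and the single question form are hypothetical and far from complete, and the sample page text is made up.

```python
import re

# Hypothetical word tables; a real system would need many more entries.
SMALL = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
SCALE = {"hundred": 1e2, "thousand": 1e3, "million": 1e6, "billion": 1e9}

# A base number is either digits (with optional commas/decimals) or a small number word.
NUMBER = r"(?:\d[\d,]*(?:\.\d+)?|" + "|".join(SMALL) + r")"
NUM_EXPR = re.compile(
    r"(?P<base>" + NUMBER + r")"
    r"(?:\s*x\s*10\^(?P<exp>-?\d+))?"                 # optional "x 10^12"
    r"(?:\s+(?P<scale>" + "|".join(SCALE) + r"))?",   # optional "million"
    re.IGNORECASE)

def parse_number(text):
    """Return the value of a mixed numerical expression as a float, or None."""
    m = NUM_EXPR.fullmatch(text.strip())
    if m is None:
        return None
    base = m.group("base").lower()
    value = float(SMALL[base]) if base in SMALL else float(base.replace(",", ""))
    if m.group("exp"):
        value *= 10 ** int(m.group("exp"))
    if m.group("scale"):
        value *= SCALE[m.group("scale").lower()]
    return value

# Only one standard question form is handled here (an assumption).
QUESTION = re.compile(r"how (?P<adj>\w+) is (?P<item>.+?)\??$", re.IGNORECASE)

def answer_pattern(question):
    """Reorder 'how ADJ is ITEM' into a regex for 'ITEM is NUMBER UNIT ADJ'."""
    m = QUESTION.match(question.strip())
    if m is None:
        return None  # the system may say it cannot understand the question
    item = re.escape(m.group("item"))
    adj = re.escape(m.group("adj"))
    return re.compile(
        item + r" is (?P<number>" + NUM_EXPR.pattern + r") (?P<unit>[a-z]+) " + adj,
        re.IGNORECASE)

if __name__ == "__main__":
    pattern = answer_pattern("how high is the empire state building")
    page = "Fun fact: The Empire State Building is 1,454 feet high, counting the antenna."
    hit = pattern.search(page)
    if hit:
        print(parse_number(hit.group("number")), hit.group("unit"))
```

In this sketch the unit is simply whatever single word follows the number, which is the purely syntactic guess; the pattern would be run over the text of each page returned by the indexer.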

Getting units to work has some complexity, but there are actually not that many sorts of quantities that people commonly ask about. It may not even be necessary to deal with all the different units for a certain quantity. For instance, size/distance can be expressed in nanometers, microns, mils, inches, centimeters, cubits, rods, furlongs, kilometers, miles, AUs, parsecs, etc., but since the answer to any reasonable question probably occurs many times on the web in different units, you can probably get away with a handful. There are actually strong syntactical cues about what constitutes a unit term (e.g. they often occur following numerical expressions), and you can do reasonably well without ever referring to a knowledge base of specific units. However, this approach gets into trouble with statements such as "1.1 billion short tons of coal were produced in 2001 by US mines". Introducing some knowledge about units and the question words they likely relate to (e.g. "short tons" is a unit term associated with weight, mass, or quantity) can thus definitely be useful. The syntactic heuristic would remain useful in cases where unknown units were involved.
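One possible way to combine the syntactic cue with a little unit knowledge is sketched below in Python. The tables UNIT_KINDS and STOPWORDS, the function find_units, and the rule of looking at the one or two words after a number are all hypothetical choices made for illustration, not a prescribed design.

```python
import re

# Hypothetical mini knowledge base: unit term -> kinds of quantity it can express.
UNIT_KINDS = {
    "short tons": {"weight", "mass", "quantity"},
    "feet": {"length"},
    "meters": {"length"},
    "miles": {"length"},
}

# Words that often follow a number (e.g. a year) but are never units (assumed list).
STOPWORDS = {"by", "in", "of", "and", "to", "the", "was", "were"}

NUMBER = r"\d[\d,]*(?:\.\d+)?(?:\s+(?:thousand|million|billion))?"
# Syntactic cue: the one or two words right after a numerical expression.
CANDIDATE = re.compile(
    r"(?P<number>" + NUMBER + r")\s+(?P<words>[a-z]+(?:\s+[a-z]+)?)",
    re.IGNORECASE)

def find_units(text):
    """Return (number, unit, kinds) triples; kinds is None for unknown units."""
    results = []
    for m in CANDIDATE.finditer(text):
        words = m.group("words").lower()
        first = words.split()[0]
        if words in UNIT_KINDS:       # known two-word unit, e.g. "short tons"
            unit = words
        elif first in UNIT_KINDS:     # known one-word unit
            unit = first
        elif first in STOPWORDS:      # "2001 by ..." is not a quantity with a unit
            continue
        else:                         # unknown unit: fall back to the syntactic guess
            unit = first
        results.append((m.group("number"), unit, UNIT_KINDS.get(unit)))
    return results

if __name__ == "__main__":
    print(find_units("1.1 billion short tons of coal were produced in 2001 by US mines"))
    print(find_units("The rope is 12 cubits long"))
```

Without the "short tons" entry, the fallback would report "short" as the unit of the coal figure, which is the kind of trouble the example sentence causes. With the entry, the quantity kinds can also be checked against the question word (e.g. "heavy" suggests weight), while an unknown unit such as "cubits" is still picked up by the syntactic guess.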



