Note that the goal is NOT to generate pointers to web pages or fragments that contain the answer ("AskJeeves.com" (now renamed "ask.com") etc. already do that), but to answer the question. The user does not look at any web pages. What makes this potentially feasible is that the numerical form of the answers significantly restricts the form of the questions, and provides strong constraints that can be employed in searching for an answer.
The scope of the problem extends only to questions where the numerical answer is directly present in some form on the web. In particular, computational questions such as "what is the product of 34 and 78?" are not included, nor are questions requiring reasoning such as "how many Brazil nuts would fit in a wombat?", unless the answer has already been figured out and posted by someone.
Using "Ask Jeeves" or similar "semantic" agents, such as Google's unit converter, as components of your system is disallowed here, though I don't actually think Jeeves will help you a lot. Use of vanilla indexers such as Google, on the other hand, is necessary: you don't have to construct your own primary web index.
On a designated day near the end of the semester, there will be a live demonstration of the system to the rest of the class. The team will illustrate the operation of the system, and then allow the rest of the class to test it. The team will also need to give a presentation on their effort (in a class session after the demonstration).
The team is to create a single, coherent, written report on their approach to the project, describing in detail what methods were used, what did not work, any comparative studies done, what considerations went into the design, where the processing time was spent, how the system would be expected to scale, etc.
Getting units to work has some complexity, but there are actually not that many sorts of quantities that people commonly ask about. It may not even be necessary to deal with all the different units for a given quantity. For instance, size/distance can be expressed in nanometers, microns, mils, inches, centimeters, cubits, rods, furlongs, kilometers, miles, AUs, parsecs, etc., but since the answer to any reasonable question probably occurs many times on the web in different units, you can probably get away with a handful. There are actually strong syntactic cues about what constitutes a unit term (e.g. they often occur following numerical expressions), and you can do reasonably well without ever referring to a knowledge base of specific units. However, this approach gets into trouble with statements such as "1.1 billion short tons of coal were produced in 2001 by US mines", where the words that directly follow the number are a magnitude word and part of a multi-word unit. Introducing some knowledge about units and the question words they likely relate to (e.g. "short tons" is a unit term associated with weight, mass, or quantity) can thus definitely be useful. The syntactic heuristic would remain useful in cases where unknown units are involved.
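As a rough illustration of the two ideas above, here is a minimal Python sketch that treats the words following a numerical expression as a candidate unit, and optionally consults a small table of known units. Everything named here (MULTIPLIERS, KNOWN_UNITS, extract_quantities) is hypothetical and made up for illustration; the tables are deliberately tiny, and a real system would need a much more careful treatment of numbers, dates, and multi-word units.

```python
import re

# Magnitude words that scale a number rather than name a unit (illustrative only).
MULTIPLIERS = {"hundred": 1e2, "thousand": 1e3, "million": 1e6,
               "billion": 1e9, "trillion": 1e12}

# A tiny, assumed unit table mapping unit terms to the kind of quantity they measure.
KNOWN_UNITS = {
    "short tons": "mass",
    "tons": "mass",
    "kilometers": "distance",
    "miles": "distance",
    "inches": "distance",
}

# A number, followed by up to three alphabetic words that might contain a unit.
NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)((?:\s+[A-Za-z]+){0,3})")


def extract_quantities(text, known_units=KNOWN_UNITS):
    """Return a list of (value, unit, kind) tuples found in text.

    unit is taken from known_units when possible; otherwise it falls back
    to the first word after the number (the purely syntactic heuristic),
    in which case kind is None.
    """
    results = []
    for match in NUMBER.finditer(text):
        value = float(match.group(1).replace(",", ""))
        words = match.group(2).split()
        # Fold a leading magnitude word ("billion") into the value.
        if words and words[0].lower() in MULTIPLIERS:
            value *= MULTIPLIERS[words.pop(0).lower()]
        unit, kind = None, None
        # Prefer the longest known unit, so "short tons" beats "short".
        for n in (2, 1):
            candidate = " ".join(words[:n]).lower()
            if len(words) >= n and candidate in known_units:
                unit, kind = candidate, known_units[candidate]
                break
        # Syntactic fallback: guess that the next word is a unit.
        if unit is None and words:
            unit = words[0].lower()
        results.append((value, unit, kind))
    return results


if __name__ == "__main__":
    sentence = "1.1 billion short tons of coal were produced in 2001 by US mines"
    for item in extract_quantities(sentence):
        print(item)
    # Pure syntactic heuristic: no unit knowledge at all.
    print(extract_quantities(sentence, known_units={}))
```

With the unit table, the first quantity in the coal sentence is recognized as "short tons" (mass). With an empty table, the fallback picks "short" as the unit, and in both cases the year 2001 is paired with the spurious unit "by", which is the kind of error the syntactic heuristic alone cannot avoid.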