2003 ARDA Workshop

Title: Towards More Reliable Information Retrieval Technology

Leader: Donna Harman, National Institute of Standards and Technology (NIST)

Workshop Team

Donna Harman, NIST (Team Lead); Chris Buckley, SABIR; Jamie Callan, Carnegie Mellon University; Charlie Clarke, University of Waterloo; Andres Corrada-Emmanuel, University of Massachusetts, Amherst; David Evans, Clairvoyance Corporation; Warren Greiff, MITRE; Andy MacFarlane, City University; Tomek Strzalkowski, University of Albany

Problem

For many years the standard approach to question answering, or searching for information, has involved information retrieval systems. These systems were initially Boolean systems, requiring major effort from the users, but new systems allow natural language input for questions and return ranked lists of documents.

Although the question answering (QA) task has evolved beyond the ranked list approach to answers, QA systems still depend on information retrieval (IR) technology at two different stages. First, IR technology is needed to make the initial cut at finding the information from many gigabytes of text. Most QA systems make heavy use of NLP technology and therefore use IR technology to narrow the pool of potential information to 100 or 200 documents before doing more intensive processing. Second, and equally important, IR systems provide a fallback position for questions that are beyond the current abilities of the QA systems. Having a ranked list of documents is clearly better than having nothing!

Current IR systems generally provide a set of reasonable results for the QA systems to work with. However, results from these systems are unpredictable in that there are occasional failures. Even more important for the AQUAINT program, these systems are unpredictable to the system researchers; that is, the systems cannot reliably customize output on a per-question basis. This leads to lower precision in the top set of documents for some questions, and radical variance in performance when different retrieval technologies are applied to the same input question.

Approach

The ability to predict how to customize an IR system on a per-query basis has been an unrealized dream for many years. Techniques such as pseudo-relevance feedback work well on average, but fail for some queries. The use of stemming (in English) does not generally improve performance on average, but it produces dramatic improvements or degradations on a per-query basis. So the problem being proposed for this workshop is known to be a very hard problem.

The approach we plan to use is to recruit a small group of researchers, each with a different IR system that uses different technology, and then put these people and their systems through a six-week focused study of the problem. We would use the 150 questions (topics) from the last three years of the TREC ad hoc task, along with the documents used in that task. The topics would be appropriately subsetted for training and testing so that probably one-third of them would never be examined manually but only used for final testing. Earlier TREC material could also be used.
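One way to picture the proposed topic subsetting is a fixed, reproducible split in which one-third of the topics is set aside for final testing only. The sketch below is illustrative, not the workshop's actual procedure: the function name `split_topics`, the random seed, and the placeholder topic numbers 1 to 150 are all assumptions.

```python
import random


def split_topics(topic_ids, seed=0):
    """Split topics into a development set and a held-out third.

    Hypothetical helper: the held-out topics are meant to be used
    only for final testing and never examined manually.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(topic_ids)
    rng.shuffle(shuffled)
    n_held_out = len(shuffled) // 3  # one-third reserved for final testing
    held_out = sorted(shuffled[:n_held_out])
    development = sorted(shuffled[n_held_out:])
    return development, held_out


# Placeholder topic numbers; the real TREC topic IDs would be used instead.
topic_ids = list(range(1, 151))
development, held_out = split_topics(topic_ids)
print(len(development), "development topics;", len(held_out), "held-out topics")
```

Fixing the seed means every participating group can regenerate the same split, so no group inspects a topic that another group is treating as unseen.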

What is envisioned (and has been discussed by potential participants) is to start on a very focused question: how to predict when pseudo-relevance feedback will improve results and what are the appropriate parameters for each system. The various groups would work at least 2 weeks on this problem. This work would involve making appropriate runs, separately and jointly performing failure analysis, and then testing hypotheses based on this analysis.
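A natural starting point for this kind of failure analysis is a per-topic comparison of a baseline run against a pseudo-relevance feedback run. The sketch below computes (uninterpolated) average precision for each topic and reports the difference. The data structures, the function names, and the toy document and topic IDs are assumptions made for illustration; they are not taken from any participating system.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def feedback_effect(baseline_run, feedback_run, qrels):
    """Per-topic change in average precision when feedback is applied."""
    effects = {}
    for topic, relevant in qrels.items():
        base_ap = average_precision(baseline_run.get(topic, []), relevant)
        fb_ap = average_precision(feedback_run.get(topic, []), relevant)
        effects[topic] = fb_ap - base_ap
    return effects


# Toy data (hypothetical): topic -> relevant documents, topic -> ranked list.
qrels = {"T1": {"d1", "d2"}, "T2": {"d5"}}
baseline_run = {"T1": ["d1", "d3", "d2"], "T2": ["d5", "d6"]}
feedback_run = {"T1": ["d1", "d2", "d3"], "T2": ["d6", "d5"]}

effects = feedback_effect(baseline_run, feedback_run, qrels)
for topic, delta in sorted(effects.items()):
    label = "helped" if delta > 0 else "hurt" if delta < 0 else "unchanged"
    print(f"{topic}: {delta:+.3f} ({label})")
```

Sorting topics by this difference gives a short list of the queries where feedback helps or hurts most, which is where manual failure analysis is likely to be most informative.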

In the second two-week session, the groups would look at a small number of topics and address the question: what are the pieces of information that the system must learn in order to perform well on the topics? Is this information available from just the query, or from "world knowledge", or does it need to be explicitly gotten from the user by the system, or can it be gotten implicitly from the user? Some groups would use facet analysis in order to better "understand" the query and tune the results to the various types of linguistic structure implied by that facet analysis.

It is less clear what the third two-week session will involve. Depending on the earlier results, the third two-week session could be a continuation of the second session, further exploring the issues of "needed information" for a topic, but concentrating more on how that information could be obtained. Some groups could look at whether, given the knowledge that a particular type of information is needed, that information can be obtained from a set of known relevant documents (routing/filtering/relevance feedback). Alternatively, follow-on work could involve extending the results beyond the TREC data to other types of data or working with "users" in an interactive testing mode.

The overall structure of all sessions will be individual and joint failure analysis using the assembled expertise, and then task forces of several groups/systems addressing the particular issues that come up in the failure analysis. The focus will be on multiple participants working on multiple problems together rather than individual participants working separately on problems and putting them together at the end.

Workshop Duration and Format

This workshop would run for six weeks in June, July and August. This is the proposed schedule.

  • June 16-27 (two weeks)
  • break for week of July 4th
  • July 7-18 (two weeks)
  • break for 2 weeks, including SIGIR meeting
  • August 4-15 (two weeks)

Deliverables

  • Set of algorithms/guidelines for information retrieval systems that enable customization for specific questions and sets of data
  • Series of papers reporting on the results with the goal of providing guidance to non-participating IR systems
  • Suite of failure analysis tools and metrics that will be made available to the community through NIST
  • Installation and cross-comparison of 6-10 leading IR research and commercial systems
  • Monthly progress reports
  • Final paper