Project: Document Summarization by Text Mining
Student Researchers: Sarah Elizabeth Allen, Ester Gubbrud, Rachel Kirby, Lillian Sears Kittredge
Advisors: Dave Musicant, Jeff Ondich
Institution: Carleton College





Introduction

The Frequently Asked Question lists (FAQs) associated with on-line discussion groups constitute a marvelous informational resource for the on-line community. Each list is useful on its own, and they are also usable in aggregate as a searchable information source. However, it is hard work to build a useful FAQ. Someone has to read hundreds of discussion group messages, identify the common questions and themes, rephrase those questions into some canonical form, and then provide sensible answers.

We began our research with the idea that some of the labor involved in FAQ construction could be automated. In particular, we would like to develop tools for the automatic identification of frequently asked questions in newsgroup archives. Such a system could be used, for example, by a newsgroup administrator to create a first draft list of questions, or to monitor recent postings for possible additions to an existing FAQ.

Our research so far has laid some groundwork for our ultimate goal of frequent question extraction. We began by obtaining an archive of 20,000 messages from 20 Usenet newsgroups. Then, we designed a MySQL database to facilitate analysis of the messages, and developed tools for loading the messy archived messages into our database. Creating this database involved, among other things, deciding how to store discussion thread pointers and how to deal with missing discussion thread roots.
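The two storage decisions mentioned above, thread pointers and missing thread roots, can be illustrated with a small sketch. This is not our actual schema: the table and column names are hypothetical, the placeholder-row strategy is only one plausible way to handle a missing root, and the example uses Python's built-in sqlite3 module in place of MySQL so that it runs on its own.

```python
import sqlite3

# Hypothetical schema: each message stores a pointer to the message it replies to.
SCHEMA = """
CREATE TABLE message (
    message_id     TEXT PRIMARY KEY,
    newsgroup      TEXT NOT NULL,
    subject        TEXT,
    body           TEXT,
    parent_id      TEXT,               -- thread pointer; NULL marks a thread root
    is_placeholder INTEGER NOT NULL DEFAULT 0
)
"""

def build_archive(messages):
    """messages: list of (message_id, newsgroup, subject, body, parent_id or None)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO message VALUES (?, ?, ?, ?, ?, 0)", messages)
    # Find parents that are referenced but absent from the archive.
    missing = conn.execute("""
        SELECT m.parent_id, MIN(m.newsgroup)
        FROM message AS m
        LEFT JOIN message AS p ON p.message_id = m.parent_id
        WHERE m.parent_id IS NOT NULL AND p.message_id IS NULL
        GROUP BY m.parent_id
    """).fetchall()
    # Assumption: stand in for each missing parent with an empty placeholder root.
    conn.executemany(
        "INSERT INTO message VALUES (?, ?, NULL, NULL, NULL, 1)", missing)
    conn.commit()
    return conn

def thread_root(conn, message_id):
    """Follow parent pointers upward until a root (or an unknown id) is reached."""
    seen = set()
    while message_id not in seen:      # guard against malformed reply cycles
        seen.add(message_id)
        row = conn.execute(
            "SELECT parent_id FROM message WHERE message_id = ?",
            (message_id,)).fetchone()
        if row is None or row[0] is None:
            return message_id
        message_id = row[0]
    return message_id
```

With placeholder rows in place, every reply chain ends at some row in the table, so thread-level queries do not need a special case for replies whose original message was never archived.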

Once our database was built, we constructed a training set to be used for supervised learning. We randomly selected 3,650 of the 20,000 messages and manually identified all the questions they contained. About half of these messages contained at least one question. During this stage, Carleton student Janet Campbell became involved as an unsupported "fifth member" of the CREW project and played a significant role in constructing this training set.

Identifying questions turned out to be much more difficult than searching for "who, what, where, when, how" and question marks. Indeed, the most important part of this phase of our project was making decisions about what constitutes a question. We did not want to include rhetorical questions, but we did want to include many of the "questions" that message authors phrase as declarative sentences.

Once our training set was ready, we developed tools to prepare the data for use in the supervised learning packages SVMlight and WEKA. Our first supervised learning goal was to automatically identify those messages that contain at least one question, so we prepared the data with this goal in mind.
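As an illustration of this preparation step, the sketch below converts labeled messages into the sparse text format that SVMlight reads: one line per example, a target of +1 or -1, then feature:value pairs in increasing feature order. The tokenizer, the word-count features, and the function names are assumptions made for this example rather than a description of our actual tools.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase words, plus the question mark as its own token.
    return re.findall(r"[a-z]+|\?", text.lower())

def build_vocabulary(documents):
    """Map each distinct token to an integer feature id, starting at 1."""
    vocab = {}
    for tokens in documents:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def to_svmlight_line(tokens, has_question, vocab):
    """Format one message as '<target> <feature>:<value> ...'."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    label = "+1" if has_question else "-1"
    # SVMlight expects feature ids in increasing order.
    feats = " ".join(f"{idx}:{cnt}" for idx, cnt in sorted(counts.items()))
    return f"{label} {feats}".rstrip()

docs = [tokenize("Does anyone know?"), tokenize("For sale: anyone")]
vocab = build_vocabulary(docs)
lines = [to_svmlight_line(docs[0], True, vocab),
         to_svmlight_line(docs[1], False, vocab)]
```

Here the first message becomes `+1 1:1 2:1 3:1 4:1` and the second becomes `-1 2:1 5:1 6:1`, since the two share only the token "anyone".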

Results

As suggested above, automatically detecting whether a message contains a question is fairly complicated. Words such as "who, what, where, when, how" can appear in sentences that are not questions. Moreover, people who post to message groups often use inconsistent punctuation. Nonetheless, we were extremely pleased to find that we could make progress at discriminating messages with questions from those without. Of the 3,650 messages that were manually labeled by students, 55% were labeled as "non-questions," meaning that they contained no question. An accuracy rate of 55% is therefore the basic hurdle that we had to beat in order to show that our classifiers were doing something better than guessing "nothing is a question" for every message.
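The baseline reasoning can be written out as a short calculation. The sketch below uses a toy label list with the same 55/45 split, not our real data, and the function name is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy labels: False = no question, True = contains a question (11 vs. 9 of 20).
toy_labels = [False] * 11 + [True] * 9
baseline = majority_baseline_accuracy(toy_labels)  # 11 / 20 = 0.55
```

Any classifier worth keeping has to score above this number on held-out messages.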

All our results are averages measured on "test sets" produced by "tenfold cross validation." This is the standard methodology for measuring the success of machine learning techniques. Our most promising results came from a support vector machine classifier, which produced a test set accuracy of 66.5%, compared with the 55% baseline. The support vector machine classifier also identified sets of words that contribute to whether a message is considered to contain a question. Specifically, it identified the words "anyone, does, any, what, thanks, how, help, know, there, do, question" as indicative that the message might contain a question, and the words "re, sale, m, references, not, your" as indicative that the message might not contain a question.
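For readers unfamiliar with tenfold cross validation, the following minimal sketch shows the idea: split the labeled messages into ten folds, train on nine, test on the held-out fold, and average the ten test accuracies. It is written from scratch for illustration; the function name, the `train` callback, and the fixed random seed are assumptions, and SVMlight and WEKA have their own evaluation procedures.

```python
import random

def k_fold_accuracy(examples, labels, train, k=10, seed=0):
    """Average held-out accuracy over k folds.

    train(train_examples, train_labels) must return a function that maps
    one example to a predicted label. Requires len(examples) >= k.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(model(examples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k
```

Because every message is used for testing exactly once and never while it is part of the training data, the averaged figure estimates how the classifier would perform on messages it has not seen.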

We tried two other algorithms as well. The WEKA software suite provides a technique called "1R", which bases its decision on one feature only. It chose to classify as containing a question any document that contained a question mark. This provided a useful sanity check. WEKA also provides a decision tree classifier, which suggested that the question mark and the words "advance, windows, anyone, able, looking" were the symbol and words that provided the most useful information in deciding whether a document contained a question. The first two words here seem unusual. On further examination, we discovered that "advance" is useful because it usually appears in the phrase "Thanks in advance." "Windows" seems unusual as well, but less so when one considers that two of our newsgroups, "comp.os.ms-windows.misc" and "comp.windows.x", have proportionately more questions than many of the other groups.
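The idea behind 1R can be sketched in a few lines: for each feature, build a rule that predicts the majority label for each value of that feature, then keep the single feature whose rule makes the fewest training errors. The code below is a simplified illustration on made-up data, not WEKA's implementation, and the function and feature names are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """rows: list of dicts mapping feature name -> value (missing means False).

    Returns (best_feature, rule), where rule maps a feature value to a label.
    """
    best = None
    for feature in sorted({f for row in rows for f in row}):
        # Count labels separately for each value this feature takes.
        by_value = defaultdict(Counter)
        for row, y in zip(rows, labels):
            by_value[row.get(feature, False)][y] += 1
        # The rule predicts the majority label for each value.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[row.get(feature, False)] != y
                     for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, feature, rule)
    return best[1], best[2]

# Made-up training data: does the message contain "?" or the word "sale"?
rows = [{"?": True, "sale": False}, {"?": True, "sale": False},
        {"?": False, "sale": True}, {"?": False, "sale": False}]
labels = [True, True, False, False]
feature, rule = one_r(rows, labels)
```

On this toy data the chosen feature is the question mark, with the rule "question mark present means the message contains a question," which mirrors the rule that 1R selected on our data.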

Conclusions and The Experience

During the course of the last year, we made substantial progress in determining whether or not message group postings contain questions. We learned how to use a variety of new tools, such as Perl, MySQL, SVMlight, and WEKA. Finally, we gained significant experience in collaborating with each other and in working independently. Our work has been accepted as a poster to be shown at the Grace Hopper Celebration of Women in Computing, and we look forward to showing it off there. There is clearly much more work that could be done with this project, such as trying fancier natural language processing ideas and more complex text mining techniques. In that sense, we have built a research project that will benefit future students.