Project: Extracting Questions from Discussion Groups via Text Mining
Student Researchers: Alina Badus, Marie Joiner, Rachel Kirby (unsupported, Lillian Kittredge)
Advisors: Dave Musicant, Jeff Ondich
Institution: Carleton College






Introduction
The Frequently Asked Question lists (FAQs) associated with on-line discussion groups constitute a marvelous informational resource for the on-line community. Each list is useful on its own, and they are also usable in aggregate as a searchable information source. However, it is hard work to build a useful FAQ. Someone has to read hundreds of discussion group messages, identify the common questions and themes, rephrase those questions into some canonical form, and then provide sensible answers.

We began our research with the idea that some of the labor involved in FAQ construction could be automated. In particular, we would like to develop tools for the automatic identification of frequently asked questions in newsgroup archives. Such a system could be used, for example, by a newsgroup administrator to create a first draft list of questions, or to monitor recent postings for possible additions to an existing FAQ.

This project was started in 2001-2002 as a CREW project. By the end of last year, the group had constructed a database that could be used for various inquiries and had succeeded in identifying messages with questions. Two of the continuing members of the group presented these findings at the Grace Hopper Celebration of Women in Computing in Vancouver, BC, in the fall of 2002.

What we did this year
We spent the first part of the year training the new members of the group to use Perl and MySQL and familiarizing them with the project.

One of the ideas we pursued throughout the year was the use of parsing information to identify questions. Writing a parser was beyond the scope of our project, so we looked for an existing parser that would suit our needs and chose the Link Grammar parser, written by D. Temperley, D. Sleator and J. Lafferty at Carnegie Mellon University. We used the Lingua Link version of the Link Grammar parser. A convenient feature of this parser is that its parsing information can be stored in an array, which is easy to save and to use in machine-learning techniques.

As we were trying to separate the messages into sentences for parsing, we noticed that, because of the way marked messages had been stored, parts of long sentences had been lost. We decided that a complete re-marking was needed to get valid results, so we redesigned the message-reading survey. We split the messages into sentences before the readers saw them, and then stored not the sentence itself but the sentenceid. In this way we were able to retrieve an entire marked sentence without any lost information. As in the previous year, we had fellow student volunteers, who read and marked about 3500 of the 20000 messages.
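The idea of splitting first and storing only an identifier can be sketched as follows. This is a minimal illustration, not the survey code we used: the regular-expression splitter, the function names, and the sample message are all assumptions made for the example.

```python
import re

def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def build_sentence_table(messages):
    """Map a running sentence id to (message id, full sentence text)."""
    table = {}
    next_id = 1
    for message_id, text in messages.items():
        for sentence in split_sentences(text):
            table[next_id] = (message_id, sentence)
            next_id += 1
    return table

# Hypothetical message; readers see the sentences already split.
messages = {101: "My modem keeps dropping. Does anyone know why? Thanks!"}
table = build_sentence_table(messages)

# A reader's mark stores only the sentence id, never a copy of the text.
marks = {2: True}  # hypothetical: sentence 2 was marked as a question

# The complete sentence is recovered from the table by its id.
marked_text = [table[sid][1] for sid, is_question in marks.items() if is_question]
```

Because a mark holds only an id, the length of a sentence cannot affect what is stored, which is what prevents the truncation described above.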

We then parsed the sentences that the human volunteers had marked as questions or non-questions. Half of the sentences parsed successfully, which is to be expected given the casual language of newsgroup messages.

To increase the amount of data for machine learning, we also calculated the term frequency of each word in each sentence, as well as the term frequency-inverse sentence frequency, which is intended to provide a better measure of the relative importance of the words in a sentence. As it turned out, plain term frequency gave better results when used in the machine learning algorithms.
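The two measures can be sketched as follows. The exact weighting we used is not spelled out above, so this example assumes the usual analogue of TF-IDF at the sentence level: raw word count times log(N / sf), where N is the number of sentences and sf is the number of sentences containing the word. The whitespace tokenization and the function names are also assumptions.

```python
import math
from collections import Counter

def term_frequencies(sentences):
    """Return one Counter of raw word counts per sentence."""
    return [Counter(s.lower().split()) for s in sentences]

def tf_isf(sentences):
    """Weight each word count by log(N / sentence frequency) (assumed formula)."""
    tfs = term_frequencies(sentences)
    n = len(sentences)
    # Sentence frequency: the number of sentences in which each word occurs.
    sf = Counter()
    for tf in tfs:
        sf.update(tf.keys())
    return [
        {word: count * math.log(n / sf[word]) for word, count in tf.items()}
        for tf in tfs
    ]

sentences = [
    "how do i install perl",
    "i install perl daily",
    "where is the faq",
]
weights = tf_isf(sentences)
```

Under this weighting, a word that occurs in every sentence receives a weight of zero, while a word confined to a single sentence receives the largest weight.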

Once we had parsing and term frequency information for every marked sentence, we could use this information, in various combinations, as input to machine learning techniques. More specifically, we worked with SVMLight, an implementation of a support vector machine, and with the Weka package. Because questions are such a rare class, the first run of the algorithms classified everything as a non-question. Although this might give a high accuracy, it is not helpful for constructing an FAQ, so we decided to focus our efforts on getting better precision and recall.
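A small sketch shows why accuracy alone is misleading for a rare class. The labels below are made up for illustration (5 questions among 100 sentences) and are not our data; the function name is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # questions found
    fp = sum(1 for t, p in pairs if not t and p)      # false alarms
    fn = sum(1 for t, p in pairs if t and not p)      # questions missed
    tn = sum(1 for t, p in pairs if not t and not p)  # correct rejections
    accuracy = (tp + tn) / len(pairs)
    # Define precision and recall as 0 when their denominators are 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels: 5 questions among 100 sentences.
y_true = [True] * 5 + [False] * 95

# A classifier that labels every sentence as a non-question.
all_negative = [False] * 100

print(evaluate(y_true, all_negative))  # prints (0.95, 0.0, 0.0)
```

The classifier is right 95% of the time and still finds no questions at all, which is why precision and recall are the more useful targets here.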

Results and Conclusions
We used the f-measure to find the best parameters for our algorithms. Our best results, obtained using SVMLight on the term frequency data, give an f-measure of 49.39% and an accuracy of 96%. Classifying every sentence with a question mark as a question gives an f-measure of 45.2% and an accuracy of 93.3%. Thus our results are better than those obtained by simply searching for question marks, although there is still room for improvement.
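The question-mark baseline and its evaluation can be sketched as follows. The sketch assumes the balanced f-measure (the harmonic mean of precision and recall), which is the usual meaning of the term; the four labeled sentences are invented for illustration and are not from our data set.

```python
def f_measure(precision, recall):
    """Balanced f-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def question_mark_baseline(sentence):
    """Classify a sentence as a question if it contains a question mark."""
    return "?" in sentence

# Invented examples: (sentence, marked as a question by a reader).
labeled = [
    ("Does anyone know a good Perl book?", True),
    ("I wonder how to install MySQL on my machine.", True),
    ("Thanks for the help!", False),
    ("What?! That can't be right.", False),
]

predictions = [question_mark_baseline(s) for s, _ in labeled]
truth = [is_question for _, is_question in labeled]

tp = sum(1 for t, p in zip(truth, predictions) if t and p)
fp = sum(1 for t, p in zip(truth, predictions) if not t and p)
fn = sum(1 for t, p in zip(truth, predictions) if t and not p)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
score = f_measure(precision, recall)
```

The invented examples show the two ways the baseline goes wrong: it misses questions phrased without a question mark, and it flags sentences that contain one without asking anything.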

Although we did not develop a full-fledged FAQ extractor, we learned a lot about group research. We became much more comfortable with the tools we were using, especially Perl and MySQL, and we feel more confident about learning how to use new tools in the future.