Introduction
The Frequently Asked Question lists (FAQs) associated with on-line
discussion groups constitute a marvelous informational resource for
the on-line community. Each list is useful on its own, and they are
also usable in aggregate as a searchable information source. However,
it is hard work to build a useful FAQ. Someone has to read hundreds of
discussion group messages, identify the common questions and themes,
rephrase those questions into some canonical form, and then provide
sensible answers.
We began our research with the idea that some of the labor involved in
FAQ construction could be automated. In particular, we would like to
develop tools for the automatic identification of frequently asked
questions in newsgroup archives. Such a system could be used, for
example, by a newsgroup administrator to create a first draft list of
questions, or to monitor recent postings for possible additions to an
existing FAQ.
This project was started in 2001-2002 as a CREW project. By the end of
last year, the group had constructed a database that could be used for
various inquiries and had succeeded in identifying messages with
questions. Two of the continuing members of the group presented these
findings at the Grace Hopper Celebration of Women in Computing in the
fall of 2002, in Vancouver, BC.
What we did this year
We spent the first part of the year training the new members of the
group to use Perl and MySQL and familiarizing them with the project.
One of the ideas we pursued throughout the year was the use of parsing
information to identify questions. Writing a parser was beyond the
scope of our project, so we looked for an existing parser that would
suit our needs and chose the Link Grammar parser written by
D. Temperley, D. Sleator and J. Lafferty at Carnegie Mellon
University. We used the Lingua Link version of the Link Grammar
Parser. A convenient property of this parser is that the parsing
information can be stored in an array, which is easy to save and to
use as input to machine-learning techniques.
As we were trying to separate the messages into sentences for parsing,
we noticed that, because of the way marked messages had been stored,
parts of long sentences had been lost. We decided that a complete
re-marking was needed to get valid results, so we redesigned the
message-reading survey. We split the messages into sentences before
the readers saw them and then stored not the sentence itself but the
sentenceid. In this way we were able to retrieve an entire marked
sentence without any lost information. As in the previous year, fellow
student volunteers read and marked about 3500 of the 20000 messages.
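
The storage idea described above can be sketched roughly as follows.
This is only an illustration in Python, not the Perl and MySQL code
used in the project: the naive regular-expression splitter, the
in-memory dictionaries standing in for database tables, and the names
split_sentences, store_message, sentences and marks are all
assumptions made for the example.

```python
import re

def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Hypothetical stand-ins for database tables.
sentences = {}  # sentence id -> (message id, full sentence text)
marks = {}      # sentence id -> True if a reader marked it as a question

def store_message(message_id, text):
    """Split a message up front and store every sentence under its own id."""
    ids = []
    for sentence in split_sentences(text):
        sentence_id = len(sentences) + 1
        sentences[sentence_id] = (message_id, sentence)
        ids.append(sentence_id)
    return ids

# A reader's mark refers to the sentence id, never to a copy of the text,
# so the complete sentence can always be looked up again later.
ids = store_message(
    1, "I upgraded last week. Does anyone know why the build fails now? Thanks!"
)
marks[ids[1]] = True
marked_text = sentences[ids[1]][1]
```

Because the mark holds only an id, the length of the sentence no
longer matters when the mark is saved.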
We then parsed the sentences that the human volunteers had marked as
questions or non-questions. Half of the sentences parsed successfully,
which is to be expected given the casual language of newsgroup
messages.
To increase the amount of data for machine learning, we also
calculated the term frequency of each word in each sentence, as well
as the term frequency-inverse sentence frequency, which is intended to
provide a better measure of the relative importance of the words in a
sentence. As it turned out, plain term frequency gave better results
when used in the machine learning algorithms.
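
To make the two measures concrete, here is a minimal Python sketch.
The report does not give the exact formulas we used, so the choices
below are assumptions: term frequency is taken relative to sentence
length, and inverse sentence frequency is the natural logarithm of
the number of sentences divided by the number of sentences containing
the word. The function names and the three sample sentences are
invented for the example.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Relative frequency of each word within a single sentence."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

def tf_isf(sentences):
    """Weight each word's term frequency by its inverse sentence frequency."""
    n_sentences = len(sentences)
    # Count, for every word, the number of sentences it appears in.
    sentence_freq = Counter()
    for tokens in sentences:
        sentence_freq.update(set(tokens))
    weighted = []
    for tokens in sentences:
        tf = term_frequencies(tokens)
        weighted.append({
            word: freq * math.log(n_sentences / sentence_freq[word])
            for word, freq in tf.items()
        })
    return weighted

# Invented sample data, already lower-cased and tokenized.
sample = [
    "how do i install perl".split(),
    "i use perl every day".split(),
    "thanks for the help".split(),
]
weights = tf_isf(sample)
```

In this sample, "how" occurs in one sentence and "perl" in two, so
"how" receives the larger weight in the first sentence even though
both words have the same term frequency there.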
Once we had parsing and term frequency information for every marked
sentence, we could use this information in various combinations in
some machine learning techniques. More specifically, we worked with
SVMLight, an implementation of a support vector machine, and with the
Weka package. Because questions are such a rare class, the first run
of the algorithms classified everything as a non-question. Although
this gives a high accuracy, it is not helpful for constructing an
FAQ, so we decided to focus our efforts on getting better precision
and recall.
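
The following Python sketch shows why accuracy alone is misleading
for a rare class. The counts are invented for illustration (50
questions among 1000 sentences) and are not our data; the function
names are likewise assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for boolean or 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def precision_recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    # Define both measures as 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: 50 questions, 950 non-questions.
y_true = [1] * 50 + [0] * 950
# A classifier that labels every sentence as a non-question.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
prec, rec = precision_recall(y_true, y_pred)
```

With these invented counts the accuracy is 0.95, yet precision and
recall are both 0.0, because not a single question is found.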
Results and Conclusions
We used the f-measure to find the best parameters for our
algorithms. Our best results give an f-measure of 49.39% and an
accuracy of 96%; they were obtained using SVMLight on the term
frequency data. By comparison, classifying every sentence with a
question mark as a question gives an accuracy of 93.3% and an
f-measure of 45.2%. Thus our results are better than those obtained
by simply searching for question marks, although there is still room
for improvement.
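
For reference, the question-mark baseline and the f-measure can be
sketched in Python as below. The balanced f-measure (the harmonic
mean of precision and recall) is assumed here, since the report does
not say which weighting we used. The five labeled sentences are
invented and are unrelated to the figures reported above.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure; beta=1.0 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def question_mark_baseline(sentence):
    """Call a sentence a question if it contains a question mark."""
    return "?" in sentence

# Invented examples: (sentence, marked as a question by a reader).
labeled = [
    ("How do I install this module?", True),
    ("Does anyone know why this fails", True),
    ("What?! That cannot be right.", False),
    ("Thanks for the help.", False),
    ("I tried that already.", False),
]

predictions = [question_mark_baseline(text) for text, _ in labeled]
truth = [is_question for _, is_question in labeled]

tp = sum(1 for t, p in zip(truth, predictions) if t and p)
fp = sum(1 for t, p in zip(truth, predictions) if not t and p)
fn = sum(1 for t, p in zip(truth, predictions) if t and not p)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
baseline_f = f_measure(precision, recall)
```

The invented sentences include the two kinds of error the baseline
makes: a question typed without a question mark is missed, and an
exclamation containing a question mark is wrongly accepted.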
Although we did not develop a full-fledged FAQ extractor, we learned a
lot about group research. We became much more comfortable with the
tools we were using, especially Perl and MySQL, and we feel more
confident about learning how to use new tools in the future.