Introduction
The Frequently Asked Question lists (FAQs) associated with on-line
discussion groups constitute a marvelous informational resource for
the on-line community. Each list is useful on its own, and they are
also usable in aggregate as a searchable information source. However,
it is hard work to build a useful FAQ. Someone has to read hundreds of
discussion group messages, identify the common questions and themes,
rephrase those questions into some canonical form, and then provide
sensible answers.
We began our research with the idea that some of the labor involved in
FAQ construction could be automated. In particular, we would like to
develop tools for the automatic identification of frequently asked
questions in newsgroup archives. Such a system could be used, for
example, by a newsgroup administrator to create a first draft list of
questions, or to monitor recent postings for possible additions to an
existing FAQ.
Our research so far has laid some groundwork for our ultimate goal of
frequent question extraction. We began by obtaining an archive of
20,000 messages from 20 Usenet newsgroups. Then, we designed a MySQL
database to facilitate analysis of the messages, and developed tools
for loading the messy archived messages into our database. Creating
this database involved, among other things, deciding how to store
discussion thread pointers and how to deal with missing discussion
thread roots.
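One way to store thread pointers and cope with missing roots can be sketched as follows. This is only an illustration, not our actual schema: the table layout, the column names, and the helper functions make_db, load_message, and thread_root are all hypothetical, and Python's built-in sqlite3 module stands in for MySQL so that the sketch is self-contained. Each message keeps a pointer to its parent, and a reply whose parent is absent from the archive causes an empty placeholder row to be created for that parent.

```python
import sqlite3

def make_db():
    """Create an in-memory database with one row per message (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE messages (
            message_id     TEXT PRIMARY KEY,
            newsgroup      TEXT,
            body           TEXT,
            parent_id      TEXT,                 -- NULL for a thread root
            is_placeholder INTEGER NOT NULL DEFAULT 0
        )""")
    return conn

def load_message(conn, message_id, newsgroup, body, parent_id=None):
    """Insert a message, creating a placeholder for a parent missing from the archive."""
    if parent_id is not None:
        conn.execute(
            "INSERT OR IGNORE INTO messages VALUES (?, ?, '', NULL, 1)",
            (parent_id, newsgroup))
    # If a placeholder already exists for this message, fill it in.
    cur = conn.execute(
        "UPDATE messages SET newsgroup = ?, body = ?, parent_id = ?, "
        "is_placeholder = 0 WHERE message_id = ? AND is_placeholder = 1",
        (newsgroup, body, parent_id, message_id))
    if cur.rowcount == 0:
        conn.execute(
            "INSERT OR IGNORE INTO messages VALUES (?, ?, ?, ?, 0)",
            (message_id, newsgroup, body, parent_id))

def thread_root(conn, message_id):
    """Follow parent pointers until reaching a message with no parent."""
    current, seen = message_id, set()
    while True:
        row = conn.execute(
            "SELECT parent_id FROM messages WHERE message_id = ?",
            (current,)).fetchone()
        if row is None or row[0] is None or row[0] in seen:
            return current
        seen.add(current)
        current = row[0]
```

With this layout, a thread whose first message was never archived still has a well-defined root, namely the placeholder row, and that row is filled in if the real message turns up later.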
Once our database was built, we constructed a training set to be used
for supervised learning. We randomly selected 3,650 of the 20,000
messages and manually identified all the questions they contained.
About half of these messages contained at least one question. During
this stage, Carleton student Janet Campbell became involved as an
unsupported "fifth member" of the CREW project and played a
significant role in constructing this training set.
Identifying questions turned out to be much more difficult than
searching for "who, what, where, when, how" and question marks.
Indeed, the most important part of this phase of our project was
making decisions about what constitutes a question. We did not want
to include rhetorical questions, but we did want to include many of
the "questions" that message authors phrase as declarative sentences.
Once our training set was ready, we developed tools to prepare the
data for use in the supervised learning packages SVMlight and WEKA.
Our first supervised learning goal was to automatically identify those
messages that contain at least one question, so we prepared the data
with this goal in mind.
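To make the data preparation step concrete, here is a minimal sketch of turning a labeled message into one line of SVMlight's sparse input format, which is a target value followed by feature:value pairs in increasing feature order. The bag-of-words features, the tokenizer, and the function names tokenize, build_vocabulary, and to_svmlight_line are assumptions made for illustration; they are not necessarily the features or tools we actually used.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase words, plus the question mark as its own token.
TOKEN_RE = re.compile(r"[a-z]+|\?")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def build_vocabulary(messages):
    """Assign each distinct token an integer feature id, numbered from 1."""
    vocab = {}
    for text in messages:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def to_svmlight_line(text, has_question, vocab):
    """Format one message as '<target> <feature>:<value> ...'."""
    counts = Counter(vocab[t] for t in tokenize(text) if t in vocab)
    label = "+1" if has_question else "-1"
    features = " ".join(f"{fid}:{n}" for fid, n in sorted(counts.items()))
    return f"{label} {features}".rstrip()
```

For example, with a vocabulary built from the two messages "Does anyone know how?" and "For sale: one modem", the first message labeled as containing a question becomes the line `+1 1:1 2:1 3:1 4:1 5:1`.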
Results
As suggested above, automatically detecting whether a message contains
a question is fairly complicated. Words such as "who, what, where,
when, how" can appear in sentences that are not questions. Moreover,
people who post to message groups often use inconsistent punctuation.
Nonetheless, we were extremely pleased to find that we could make
progress at discriminating messages with questions from those without.
Of the 3,650 messages that were manually labeled by students, 55%
were classified as "non-questions." This means that an accuracy
rate of 55% is the baseline that we had to beat in order to show
that we were doing something better than guessing "everything is
not a question."
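The baseline reasoning above can be written as a short calculation. This is a sketch only: the function name majority_baseline_accuracy is hypothetical, and the label list in the usage note is illustrative rather than our actual data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)
```

For a list of 100 labels in which 55 are "no question" and 45 are "question", the function returns 0.55, which is the figure a real classifier has to exceed.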
All our results are averages measured on "test sets" produced by
"tenfold cross validation." This is the standard methodology for
measuring the success of machine learning techniques. Our most
promising results came from a support vector machine classifier, which
produced a test set accuracy of 66.5%. The support vector machine
classifier also identified sets of words that contribute to whether a
document is considered to contain a question. Specifically, it identified
the words "anyone, does, any, what, thanks, how, help, know, there,
do, question" as indicative that the message might contain a question,
and the words "re, sale, m, references, not, your" as indicative that
the message might not contain a question.
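For readers unfamiliar with tenfold cross validation, the procedure can be sketched as follows. The data is split into ten folds, each fold serves once as the test set while the other nine are used for training, and the ten test accuracies are averaged. The function names k_fold_splits and cross_validated_accuracy, and the train_fn argument (any function that takes training examples and labels and returns a predictor), are hypothetical. The packages we used handle this step internally, so this is an illustration of the idea rather than our code, and it assumes there are at least as many examples as folds.

```python
import random

def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validated_accuracy(examples, labels, train_fn, k=10, seed=0):
    """Average test-set accuracy over k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_splits(len(examples), k, seed):
        predict = train_fn([examples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Because every message appears in exactly one test fold, the averaged accuracy reflects performance on messages the classifier did not see during training.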
We tried two other algorithms as well. The WEKA software suite
provides a technique called "1R", which bases its decision on one
feature only. It chose to classify as containing a question any
document that contained a question mark. This provided a useful sanity
check. WEKA also provides a decision tree classifier, which suggested
that the question mark and the words "advance, windows, anyone, able,
looking" were the features that provided the most useful
information in deciding whether a document contained a question. The
first two words here seem unusual. On further examination, we
discovered that "advance" is useful because it usually appears in the
phrase "Thanks in advance." "Windows" seems unusual as well, but less
so when one considers the fact that two of our newsgroups,
"comp.os.ms-windows.misc" and "comp.windows.x", have proportionately
more questions than many of the other groups.
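The idea behind the 1R technique mentioned above can be sketched in a few lines. For each feature, build a rule that maps each feature value to the most common label among training examples with that value, then keep the single feature whose rule makes the fewest training errors. This is a simplified illustration for presence/absence features, not WEKA's implementation, and the function name one_r and the input layout are assumptions.

```python
from collections import Counter, defaultdict

def one_r(examples, labels):
    """examples: list of dicts mapping feature name -> value.

    Returns (feature, rule, training_errors) for the best single feature,
    where rule maps each observed feature value to a predicted label.
    """
    best = None
    features = sorted({name for ex in examples for name in ex})
    for feature in features:
        # Count labels separately for each value the feature takes.
        by_value = defaultdict(Counter)
        for ex, label in zip(examples, labels):
            by_value[ex.get(feature, False)][label] += 1
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in by_value.items()}
        errors = sum(rule[ex.get(feature, False)] != label
                     for ex, label in zip(examples, labels))
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best
```

On a toy training set in which every message with a question mark is labeled as containing a question and every message without one is not, the selected feature is the question mark, which mirrors the rule that 1R chose on our data.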
Conclusions and The Experience
During the course of the last year, we made considerable progress in
determining whether or not message group postings
contain questions. We learned how to use a variety of new tools, such
as Perl, MySQL, SVMlight, and WEKA. Finally, we gained significant
experience in collaborating with each other and in working
independently. Our work has been accepted as a poster to be shown at
the Grace Hopper Celebration of Women in Computing, and we look
forward to showing it off there. There is clearly much more work that
could be done with this project, such as trying fancier natural
language processing ideas and more complex machine learning and text
mining techniques. In that sense, we have built a research project that will
benefit future students.