|
|
Project: Information Retrieval Based on Large Text Collections:
Design and Development of a Word Conflation Module
Student Researchers: Emily Gibson, Christina Grape
Advisor: Dr. Miroslav Martinovic
Institution: The College of New Jersey
Background
An algorithm for word conflation introduced by M.F. Porter in 1980 has
long been recognized as a rather simple, computationally inexpensive and
successful technique to bring together the words conveying the same or
similar meaning and treat them as the same content contributors. The algorithm
comprises of five simple modules each dedicated to handling certain kinds
of word transformations. These modules are applied to a given word sequentially,
producing their own simplified versions of the word (i.e. for the word
RADICALLIZATIONS, the following words are produced by the five modules,
respectively: RADICALLIZATION, RADICALLIZE, RADICALLIZE, RADICALL, and
RADICAL - the final product). It has been observe, however, that in some
of the cases the algorithm did not conflate related words into a same
common stem word ( i.e. DEEPENINGS conflated to DEEPEN, while DEEP stayed
DEEP. Also, RELATEDNESS conflated to RELATED, while RELATED transformed
into RELAT).
Purpose
The purpose of our research was to redesign Porter's algorithm to overcome
its present deficiencies in reducing similar stems. We introduced and
evaluated the idea that sets of all five words (outputs of each of Porter
algorithm's individual steps) rather than the final word (output of the
last [fifth] step) be used as a representative stem. We therefore developed
a theoretical description and justification of our concept and followed
it by building an independent software module that implemented and tested
our idea. Ideally, this algorithm could be applied to any word stemming
algorithm, not just Porters. Upon full implementation, our goal
was to incorporate this module into the larger text corpora IR system.
Process
The first step in our research was to prove our algorithm was in fact
feasible. The primary concern was whether the relation on sets of Porter
words (a set of five words, where each word in the set is the output of
one step in Porters algorithm) formed an equivalence relation. Eventually
we were able to define an equivalence relation on five words sets that
correlated two sets containing at least one common word. Our next step
was to define our minimal stemming algorithm using this equivalence relation.
Once our algorithm was designed, we were able to sketch out pseudo code
to implement our general algorithm, not just for Porters stemming
algorithm but for Paices as well. The final step was to fully implement
the minimal stemming module for as many stemming algorithms as time permitted
and then to integrate this module into the larger QASTIIR (Question Answering
System Through Intelligent Information Retrieval) system.
Conclusions
The primary result of this research was the development of a minimal stemming
algorithm that is compatible with almost any stemming algorithm. The minimal
stem module can be applied to a stemming algorithm and maximize its efficiency.
The minimal stem algorithm was fully sketched out in Java pseudo code
and partially implemented. Due to time constraints, we were not able to
fully implement the algorithm or integrate it with the complex modules
of the QASTIIR system. However, Emily Gibson will be continuing her work
with the minimal stem algorithm and integrating a completed word conflation
module into the QASTIIR system through independent research with Dr. Martinovic
this coming fall.
Links
CREW Web page:
www.tcnj.edu/~grape2/CREW/CREW.html
Poster presented at the Sigma XI conference on this research: www.tcnj.edu/~gibson2/crew/WC.ppt
M.F. Porters Paper
http://www.tcnj.edu/~mmmartin/Porter.html
|