Project: Information Retrieval Based on Large Text Collections: Design and Development of a Word Conflation Module

Student Researchers: Emily Gibson, Christina Grape
Advisor: Dr. Miroslav Martinovic
Institution: The College of New Jersey




Background

An algorithm for word conflation introduced by M.F. Porter in 1980 has long been recognized as a rather simple, computationally inexpensive and successful technique to bring together the words conveying the same or similar meaning and treat them as the same content contributors. The algorithm comprises of five simple modules each dedicated to handling certain kinds of word transformations. These modules are applied to a given word sequentially, producing their own simplified versions of the word (i.e. for the word RADICALLIZATIONS, the following words are produced by the five modules, respectively: RADICALLIZATION, RADICALLIZE, RADICALLIZE, RADICALL, and RADICAL - the final product). It has been observe, however, that in some of the cases the algorithm did not conflate related words into a same common stem word ( i.e. DEEPENINGS conflated to DEEPEN, while DEEP stayed DEEP. Also, RELATEDNESS conflated to RELATED, while RELATED transformed into RELAT).

Purpose

The purpose of our research was to redesign Porter's algorithm to overcome its present deficiencies in reducing similar stems. We introduced and evaluated the idea that sets of all five words (outputs of each of Porter algorithm's individual steps) rather than the final word (output of the last [fifth] step) be used as a representative stem. We therefore developed a theoretical description and justification of our concept and followed it by building an independent software module that implemented and tested our idea. Ideally, this algorithm could be applied to any word stemming algorithm, not just Porter’s. Upon full implementation, our goal was to incorporate this module into the larger text corpora IR system.

Process

The first step in our research was to prove our algorithm was in fact feasible. The primary concern was whether the relation on sets of Porter words (a set of five words, where each word in the set is the output of one step in Porter’s algorithm) formed an equivalence relation. Eventually we were able to define an equivalence relation on five words sets that correlated two sets containing at least one common word. Our next step was to define our minimal stemming algorithm using this equivalence relation. Once our algorithm was designed, we were able to sketch out pseudo code to implement our general algorithm, not just for Porter’s stemming algorithm but for Paice’s as well. The final step was to fully implement the minimal stemming module for as many stemming algorithms as time permitted and then to integrate this module into the larger QASTIIR (Question Answering System Through Intelligent Information Retrieval) system.

Conclusions

The primary result of this research was the development of a minimal stemming algorithm that is compatible with almost any stemming algorithm. The minimal stem module can be applied to a stemming algorithm and maximize its efficiency. The minimal stem algorithm was fully sketched out in Java pseudo code and partially implemented. Due to time constraints, we were not able to fully implement the algorithm or integrate it with the complex modules of the QASTIIR system. However, Emily Gibson will be continuing her work with the minimal stem algorithm and integrating a completed word conflation module into the QASTIIR system through independent research with Dr. Martinovic this coming fall.

Links

CREW Web page:
www.tcnj.edu/~grape2/CREW/CREW.html

Poster presented at the Sigma XI conference on this research:
www.tcnj.edu/~gibson2/crew/WC.ppt

M.F. Porter’s Paper
http://www.tcnj.edu/~mmmartin/Porter.html