
|
Project: Information Retrieval Based on Large Text Collections:
Design and Development of a Word Conflation Module
An algorithm for word conflation introduced by M.F. Porter in 1980 (http://open.muscat.com/stemming)
has long been recognized as a rather simple, computationally inexpensive,
and successful technique for bringing together words that convey the same
or similar meaning and treating them as the same content contributors. The
algorithm comprises five simple modules, each dedicated to handling certain
kinds of word transformations. These modules are applied to a given word
sequentially, each producing its own simplified version of the word (e.g.,
for the word RADICALLIZATIONS, the five modules produce, respectively,
RADICALLIZATION, RADICALLIZE, RADICALLIZE, RADICALL,
and RADICAL, the final product). It has been observed, though, that in some
cases the algorithm does not conflate related words to a common
stem (e.g., DEEPENINGS is conflated to DEEPEN, while DEEP stays DEEP;
likewise, RELATEDNESS is conflated to RELATED, while RELATED is itself reduced to RELAT). |
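
The sequential application of the five modules can be illustrated with a minimal sketch in Python. It is not Porter's full rule set: each step carries only a small, hand-picked subset of suffix rules (an assumption made for brevity), the letter Y is always treated as a consonant, and the names measure, apply_rules, step1 to step5, and conflate are hypothetical. The subset is just large enough to reproduce the RADICALLIZATIONS trace and the DEEPENINGS/DEEP mismatch described above.

```python
import re

VOWELS = "aeiou"  # simplification: 'y' is always treated as a consonant


def measure(stem):
    """Count vowel-consonant sequences in a stem (Porter's measure m)."""
    pattern = "".join("V" if ch in VOWELS else "C" for ch in stem)
    return len(re.findall(r"V+C+", pattern))


def apply_rules(word, rules, min_measure):
    """Apply the first matching suffix rule if the remaining stem has m > min_measure."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > min_measure:
                return stem + replacement
            return word  # suffix matched but condition failed: leave unchanged
    return word


def step1(word):
    """Plurals and -ing (subset: no -ed rule, no post-stripping cleanup)."""
    if word.endswith("sses") or word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    if word.endswith("ing") and any(ch in VOWELS for ch in word[:-3]):
        word = word[:-3]
    return word


def step2(word):
    return apply_rules(word, [("ization", "ize"), ("ational", "ate"), ("fulness", "ful")], 0)


def step3(word):
    return apply_rules(word, [("icate", "ic"), ("alize", "al"), ("ness", "")], 0)


def step4(word):
    return apply_rules(word, [("ize", ""), ("ate", ""), ("ous", ""), ("ive", "")], 1)


def step5(word):
    """Tidy up: drop a final 'e', then reduce a final double 'l' (both need m > 1)."""
    if word.endswith("e") and measure(word[:-1]) > 1:
        word = word[:-1]
    if word.endswith("ll") and measure(word) > 1:
        word = word[:-1]
    return word


def conflate(word):
    """Run the five steps in order and return the output of each step."""
    word = word.lower()
    trace = []
    for step in (step1, step2, step3, step4, step5):
        word = step(word)
        trace.append(word)
    return trace


if __name__ == "__main__":
    for w in ("RADICALLIZATIONS", "DEEPENINGS", "DEEP"):
        print(w, "->", [s.upper() for s in conflate(w)])
```

With this reduced rule set, the final entries of the traces for DEEPENINGS and DEEP differ (DEEPEN versus DEEP), which is the kind of missed conflation the project is concerned with.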