Like other stemmers, UEA-Lite applies a set of rules as a sequence of steps. The rules fall into two groups: the first cleans the tokens, and the second alters suffixes.

The first group of rules begins by skipping a short list of six frequent problem words. The stemmer could be improved by extending this list with other problem words that the second rule set cannot handle. Next, possessive apostrophes are removed and contractions are expanded. All hyphens are removed, and tokens containing digits are left untouched. Strings made up entirely of upper-case letters and digits are also left untouched unless they end in a lower-case 's', in which case the 's' is dropped (i.e. plural forms of acronyms are transformed to singular forms).

Proper nouns should not usually be stemmed, except to remove possessives; our implementation respects PoS tags if they are present. If the text is untagged, the stemmer uses the simple heuristic that any capitalized token not preceded by sentence-breaking punctuation is a proper noun.
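
A minimal sketch of this proper-noun check is given below. It rests on several assumptions that are not taken from the implementation: the function name is_proper_noun, the use of Penn Treebank tags (NNP, NNPS) for tagged text, the set of sentence-breaking punctuation marks, and the decision to treat the first token of a text as a sentence start.

```python
# Assumed set of sentence-breaking punctuation marks.
SENTENCE_BREAKS = frozenset({".", "!", "?"})

# Assumed tag set: Penn Treebank proper-noun tags.
PROPER_NOUN_TAGS = frozenset({"NNP", "NNPS"})


def is_proper_noun(tokens, i, tags=None):
    """Decide whether tokens[i] should be treated as a proper noun."""
    if tags is not None:
        # Tagged text: trust the PoS tag.
        return tags[i] in PROPER_NOUN_TAGS
    if not tokens[i][:1].isupper():
        return False
    if i == 0:
        # Assumption: the first token starts a sentence, so its capital is uninformative.
        return False
    # Capitalized and not directly after sentence-breaking punctuation.
    return tokens[i - 1] not in SENTENCE_BREAKS
```

For the token list ["The", "stemmer", "was", "built", "in", "Norwich", ".", "It", "works"], the untagged heuristic marks "Norwich" as a proper noun but not "The" or "It".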

Many texts, particularly scientific papers, contain sequences of digits, single letters, and other non-word tokens. Our implementation ignores tokens containing digits, single-letter tokens, and tokens with embedded punctuation.
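
The filter described above might look like the following sketch. The name should_skip is hypothetical, and the definition of "embedded punctuation" is an assumption: here it means any punctuation character strictly inside the token, other than hyphens and apostrophes, which the cleaning rules deal with separately.

```python
import string

# Assumption: hyphens and apostrophes are handled by the cleaning rules,
# so they do not count as embedded punctuation here.
EMBEDDED_PUNCTUATION = frozenset(string.punctuation) - {"-", "'"}


def should_skip(token):
    """Return True if the token should be passed through without stemming."""
    if len(token) == 1:
        return True  # single-letter token
    if any(ch.isdigit() for ch in token):
        return True  # token containing digits
    # Punctuation strictly inside the token (first and last characters excluded).
    return any(ch in EMBEDDED_PUNCTUATION for ch in token[1:-1])
```

Under these assumptions, "a", "H2O" and "user@example" are skipped, while "stemming" and "e-mail" are passed on to the rules.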

The second group of rules contains 139 suffix rules, each testing for a specific type of suffix. The rules are ordered so that the longest applicable suffix is used rather than a shorter one, which could produce nonsense words and leave more words not fully stemmed to their root form.
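
The effect of rule ordering can be illustrated with a toy rule list. The four rules below are hypothetical examples chosen for the illustration, not any of the 139 actual rules, and apply_suffix_rules is an invented name; the sketch simply applies the first rule whose suffix matches.

```python
# Hypothetical (suffix, replacement) rules, ordered longest suffix first.
RULES = [
    ("nesses", ""),  # kindnesses -> kind
    ("ness", ""),    # kindness -> kind
    ("ies", "y"),    # ponies -> pony
    ("s", ""),       # cats -> cat
]


def apply_suffix_rules(word, rules=RULES):
    """Apply the first matching rule; leave the word unchanged if none match."""
    for suffix, replacement in rules:
        # Require at least one character of stem before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word
```

With the rules in this order, "kindnesses" is reduced to "kind". If the same list is reversed so that the short "s" rule is tried first, the result is the nonsense word "kindnesse", which is the failure the ordering is meant to avoid.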

References

Jenkins, Marie-Claire, Smith, Dan, Conservative stemming for search and indexing, 2005 (127 KB, PDF)

Download

UEA-lite stemmer (Perl) (24 KB)
UEA-lite stemmer (Java) (4 KB, ZIP)

Links

Martin Porter's page - Contains links to versions of the Porter algorithm in many languages.
Paice/Husk stemmer - Official website for the stemmer which references many relevant resources and implementations.
Lovins stemmer - Java, Perl and C implementations.
Edward D. Loper, Steven Bird, Natural Language Toolkit - A Python library containing stemming, PoS, parsing and other tools.

Research Team

Dr. Dan Smith, Marie-Claire Jenkins