Human languages are notoriously ambiguous. Computer languages are notoriously unambiguous. Humans are (mostly) comfortable with uncertainty. Computers don't even believe uncertainty is possible. That is why we created unambiguous languages specifically for computers.
“One morning, I shot an elephant in my pajamas. How he got in my pajamas I don’t know.”
Groucho Marx
Natural Language Processing is asking a computer to analyze human language, which really shouldn’t be possible. To do this, we have to provide the computer with a set of rules, as well as a set of exceptions to each rule. It also assumes some level of inaccuracy.
Stemming is a tool used in Natural Language Processing. It is a method of reducing words with similar meanings to a common root, so the analysis of a corpus (body of text) is more accurate. Like the rest of NLP, it makes assumptions and is often inaccurate.
For example, consider walking, walked, walk, and walker. In a story about hiking, these four words should be counted as four occurrences of the same word, rather than one occurrence each of four different words. We could then conclude the article is about a slow hike, rather than running or biking down the trail.
Here’s the R code…
> stemDocument(c("walking", "walked", "walk", "walker"))
[1] "walk" "walk" "walk" "walker"
>
Now consider the word “cheaply.” I ran a poll on LinkedIn and Mastodon about a natural language processing function called stemDocument(). It’s part of the R tm package. Here’s the poll…
Go ahead. Answer the question…
…I’ll wait.
…Still waiting.
Time’s up.
On LinkedIn and Mastodon, 100% of the answers were “cheap.”
But that’s the wrong answer! It’s “cheapli.”
Don’t believe me? Here’s some R code…
> library(tm)
Loading required package: NLP
> stemDocument("cheaply")
[1] "cheapli"
In what world would converting cheaply to cheapli be the correct answer? Let me answer this question by showing the algorithm for this change, and then a reason the algorithm exists.
cheaply = cheapli
The tm package in R relies on the Porter stemming algorithm. The decision tree is described in “The English (Porter2) stemming algorithm.”
Step 1c of the algorithm is responsible for our confusion: “replace suffix y or Y by i if preceded by a non-vowel which is not the first letter of the word (so cry->cri, by->by, say->say)”
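That one rule can be sketched in a few lines of R. This is a hedged reimplementation of step 1c on its own, not how the tm package actually does it internally, and it ignores the full algorithm’s subtler treatment of “y” as a sometimes-vowel:

```r
# A minimal sketch of Porter2 step 1c alone (an assumption: the real
# implementation lives inside the Snowball stemmer, not in code like this).
step1c <- function(word) {
  n <- nchar(word)
  # The preceding letter must exist and must not be the first letter of
  # the word, so anything shorter than three characters passes through.
  if (n < 3) return(word)
  last <- substr(word, n, n)
  prev <- substr(word, n - 1, n - 1)
  prev_is_vowel <- prev %in% c("a", "e", "i", "o", "u")
  if (last %in% c("y", "Y") && !prev_is_vowel) {
    return(paste0(substr(word, 1, n - 1), "i"))
  }
  word
}

step1c("cry")     # "cri"
step1c("by")      # "by"  -- the y follows the first letter
step1c("say")     # "say" -- the y is preceded by a vowel
step1c("cheaply") # "cheapli"
```

Note that step 1c by itself would also turn “possibly” into “possibli” — it is the algorithm’s later suffix rules that rescue words ending in “…bly.”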
Great – so “possibly” should become “possibli,” right? Nope. Here’s the code…
> stemDocument(c("possibly","possible"))
[1] "possibl" "possibl"
It turns out that “…bly” is also a special case.
But why?
There is no winning, but there is compromise. Consider this code example:
> stemDocument(c("many","man", "manly"))
[1] "mani" "man" "man"
Although “many,” “man,” and “manly” are almost the same word, they have different meanings and shouldn’t be lumped together when analyzing a corpus. So rather than just cutting off the “…y” suffix, the algorithm replaces it with “…i,” keeping “many” (“mani”) distinct from “man.” Which brings us around to “cheaply” again. The internals of the algorithm are confusing and frustrating – but then so is human language.
Learn More
I’ve produced several courses on stemming in particular and natural language processing in general. Here are links to two of them.
“Performing Natural Language Processing with R” on Educative.