
A Guide to Natural Language Processing with R

Let’s assume you’re reading this for one of three reasons:

  • You have experience with R, but not NLP
  • You have experience with NLP, but not R
  • You have no idea what this is all about, but someone said you need this for some reason. (Perhaps a thesis advisor? A data scientist? A trendy article?)

Let’s start with two brief explanations you can use to orient yourself in this new world.

R and NLP

First, R is a programming language, just like Python or Java or C++ or a thousand other languages. R is tailored for statistical analysis and for working with large data sets. Where Python tends to think about values one at a time, R thinks in vectors (known as arrays in most other computer languages), matrices (rows and columns), and arrays (three or more dimensions: rows, columns, pages, and so on).
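A quick illustration of that vectorized mindset, using made-up word counts (the variable names here are my own):

```r
# In R, arithmetic applies to every element of a vector at once --
# no explicit loop needed.
wordCounts <- c(120, 95, 210)   # word counts for three documents
doubled    <- wordCounts * 2    # multiplies each element
total      <- sum(wordCounts)   # reduces the vector to a single value

print(doubled)   # 240 190 420
print(total)     # 425
```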

A basic NLP concept is the Document-Term Matrix (aka DTM). A DTM represents a collection of documents (a corpus) as a grid: one row per document, one column per term (word), with each cell counting how many times that term appears in that document. R is optimized to process matrices, so manipulating and processing a DTM is a simple process with R.
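To make that concrete, here is a tiny DTM built by hand in base R (the document names and terms are invented for illustration):

```r
# A toy document-term matrix: one row per document, one column per term,
# each cell holding how often that term appears in that document.
dtm <- matrix(c(2, 0, 1,
                1, 3, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"),
                              c("raven", "tyger", "jabberwock")))

print(dtm)

# Column sums give corpus-wide term frequencies:
print(colSums(dtm))   # raven 3, tyger 3, jabberwock 1
```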

Next, Natural Language Processing (aka “NLP”) uses a computer to analyze writing. NLP is accomplished with a set of tools that implement its concepts. As a simple example, many word processors provide word and character counts – basic NLP tools. Grammar checkers understand nouns, verbs, adverbs, and other components of a sentence; part-of-speech (POS) tagging is also an NLP concept.

You Have Experience with R

Maybe you have a little – or a lot of – experience with the R programming language. You understand the following simple R code:

someText <- c("twas brillig","the slithey","toves")
print(sample(someText,1))

This code prints a random choice from the text in someText. That’s nothing special, but if you understand this bit of code, then at least you understand how R syntax works and how R stores multiple values in a single variable (a vector).

Natural Language Processing with R is best accomplished by adding a library such as tm, tidytext, or quanteda. These libraries provide shortcuts for NLP tasks such as breaking a corpus into a DTM and then finding the most frequent terms. For example, this R code creates a vector topPoetryTerms holding the ten most frequent terms in a directory full of poetry:

library(tm)
topPoetryTerms <- Corpus(DirSource(directory = "poetry",
                                   pattern = "\\.txt$")) |>  # pattern is a regular expression
  DocumentTermMatrix(control = list(tolower = TRUE,
                                    removePunctuation = TRUE,
                                    stopwords = TRUE,
                                    removeNumbers = TRUE)) |>
  removeSparseTerms(sparse = .1) |>
  as.matrix() |>
  colSums() |>
  sort(decreasing = TRUE) |>
  head(n = 10)

Notice I’m using the pipe-forward (|>) operator – similar to the tidyverse’s %>%, but |> has been part of base R since version 4.1.
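If the pipe is new to you, it simply passes the left-hand value as the first argument of the right-hand call. These two versions compute the same thing (toy data of my own):

```r
# Nested calls read inside-out...
nested <- head(sort(colSums(matrix(1:6, nrow = 2)), decreasing = TRUE), n = 2)

# ...while the base-R pipe (|>) reads left to right, one step per line:
piped <- matrix(1:6, nrow = 2) |>
  colSums() |>
  sort(decreasing = TRUE) |>
  head(n = 2)

identical(nested, piped)   # TRUE
```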

tm, quanteda, and tidytext provide collections of tools like DocumentTermMatrix(), such as tf-idf weighting, findFreqTerms(), findAssocs(), and stop-word removal. Each package has its strengths and provides a different way of looking at NLP.
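tf-idf, for instance, weights a term’s count in a document against how many documents contain it: a term that appears everywhere carries little information. A minimal sketch of that weighting in base R (hand-rolled here for illustration, with an invented two-document corpus; tm and quanteda provide their own implementations):

```r
# Term counts for a toy two-document corpus.
dtm <- matrix(c(2, 0, 1,
                0, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"),
                              c("raven", "tyger", "the")))

n_docs   <- nrow(dtm)
doc_freq <- colSums(dtm > 0)        # how many documents contain each term
idf      <- log(n_docs / doc_freq)  # rarer terms get a bigger weight

# Scale each column (term) of the DTM by its idf weight.
tfidf <- sweep(dtm, 2, idf, "*")

print(round(tfidf, 3))
# "the" appears in every document, so its idf -- and its tf-idf -- is zero.
```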

You Have Experience with NLP

Natural Language Processing can be performed in almost any computer language: Python, Java, JavaScript, and Julia, for starters. But many NLP libraries were developed in or for R, and support for NLP with R continues to be ahead of the curve.

Many NLP libraries and toolkits are cross-platform, so the concepts remain the same even though the languages differ: Weka, NLTK, word2vec, spacyr, and TensorFlow all have counterparts or interfaces in R. Anything you have learned about NLP using these tools ports easily to R, and in many cases the R version provides extra capabilities.

In any case, consider that you will want to publish your work, and the community of researchers you publish to will have standards for the languages, tools, and formats used in the vetting and publication process. For academics, R is a standard language.

You Have No Idea

Research is a messy business; data is often stored in a multitude of document formats: Microsoft Word, WordPerfect, PDF, or scanned paper printouts. It would be convenient if all data were stored in SQL databases or Excel spreadsheets – but it’s not. So now what will you do?

Research often depends on analyzing unformatted information – which is exactly what Natural Language Processing is for. Most likely, an advisor or a friend pointed you at NLP as a solution. Learning both R and NLP simultaneously is a steep climb, but the alternative is a lifetime of tedious hand-formatting.

Fortunately, there is a wealth of instructional material available for learning these tools. A warning: some materials are good, and some are not worth your time. Look for recommendations before investing time in a confusing course.

My Humble Plug

I’ve released several courses on Natural Language Processing with R. My most recent, available via Educative, is titled Performing Natural Language Processing with R. It covers R and the three top NLP libraries: tm, quanteda, and tidytext. The course will get you up to speed on practical NLP without getting bogged down in the deep statistics and theory used in research.

