Kasper Welbers, Wouter van Atteveldt & Philipp Masur 2022-01
- Introduction
- Step 1: Importing text and creating a quanteda corpus
- Step 2: Creating the DTM (or DFM)
- Step 3: Analysis
In this tutorial you will learn how to perform text analysis using the quanteda package. In the R Basics: getting started tutorial we introduced some of the techniques from this tutorial as a light introduction to R. In this and the following tutorials, the goal is to get more understanding of what actually happens ‘under the hood’ and which choices can be made, and to become more confident and proficient in using quanteda for text analysis.
The quanteda package is an extensive text analysis suite for R. It covers everything you need to perform a variety of automatic text analysis techniques, and features clear and extensive documentation. Here we’ll focus on the main preparatory steps for text analysis, and on learning how to browse the quanteda documentation. The documentation for each function can also be found on the quanteda website.
For a more detailed explanation of the steps discussed here, you can read the paper Text Analysis in R (Welbers, van Atteveldt & Benoit, 2017).
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
The first step is getting text into R in a proper format. Texts can be stored in a variety of formats, from plain text and CSV files to HTML and PDF, and with different ‘encodings’. There are various packages for reading these file formats, and there is also the convenient readtext package, which is specialized in reading texts from a variety of formats.
For this tutorial, we will be importing text from a csv. For convenience, we’re using a csv that’s available online, but the process is the same for a csv file on your own computer. The data consists of the State of the Union speeches of US presidents, with each document (i.e. row in the csv) being a paragraph. The data will be imported as a data.frame.
library(tidyverse)
url <- 'https://bit.ly/2QoqUQS'
d <- read_csv(url)
head(d) ## view first 6 rows
We can now create a quanteda corpus with the corpus() function. If you want to learn more about this function, recall that you can use the question mark to look at the documentation.
?corpus
Here you see that for a data.frame, we need to specify which column contains the text field. Also, the text column must be a character vector.
corp <- corpus(d, text_field = 'text') ## create the corpus
corp
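If you want a quick overview of the new corpus, you could for instance look at a summary of the first few documents and at the document variables (the metadata columns from the csv). This is an optional extra, using quanteda’s summary() and docvars() functions:
summary(corp, n = 3)  ## summary of the first 3 documents
head(docvars(corp))   ## the document variables (metadata) imported from the csv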
Rather than a csv file, your texts might be stored as separate files, e.g. as .txt, .pdf, or .docx files. You can quite easily read these as well with the readtext function from the readtext package. You might have to install that package first with:
install.packages("readtext")
You can then call the readtext function on a particular file, or on a folder or zip archive of files directly.
library(readtext)
url <- "https://github.com/ccs-amsterdam/r-course-material/blob/master/data/files.zip?raw=true"
texts <- readtext(url)
texts
As you can see, it automatically downloaded and unzipped the files, and converted the MS Word and PDF files into plain text.
I read them from an online source here, but you can also read them from your hard drive by specifying the path:
texts <- readtext("c:/path/to/files")
texts <- readtext("/Users/me/Documents/files")
You can convert the texts directly into a corpus object as above:
corp2 <- corpus(texts)
corp2
Many text analysis techniques only use the frequencies of words in documents. This is also called the bag-of-words assumption, because texts are then treated as bags of individual words. Despite ignoring much relevant information in the order of words and syntax, this approach has proven to be very powerful and efficient.
The standard format for representing a bag-of-words is as a document-term matrix (DTM). This is a matrix in which rows are documents, columns are terms, and cells indicate how often each term occurred in each document. We’ll first create a small example DTM from a few lines of text. Here we use quanteda’s dfm() function, which stands for document-feature matrix (DFM), a more general form of a DTM.
# An example data set
text <- c(d1 = "Cats are awesome!",
d2 = "We need more cats!",
d3 = "This is a soliloquy about a cat.")
# Tokenise text
text2 <- tokens(text)
text2
# Construct the document-feature matrix based on the tokenised text
dtm <- dfm(text2)
dtm
Here you see, for instance, that the word are only occurs in the first document. In this matrix format, we can perform calculations with texts, like analyzing different sentiments or frames regarding cats, or computing the similarity between the third sentence and the first two sentences.
However, directly converting a text to a DTM is a bit crude. Note, for instance, that the words Cats, cats, and cat are given different columns. In this DTM, “Cats” and “awesome” are as different as “Cats” and “cats”, but for many types of analysis we would be more interested in the fact that both texts are about felines, and not in the specific word that is used. Also, for performance it can be useful (or even necessary) to use fewer columns, and to ignore less interesting words such as is or very rare words such as soliloquy.
This can be achieved by using additional preprocessing steps. In the next example, we’ll again create the DTM, but this time we make all text lowercase, ignore stopwords and punctuation, and perform stemming. Simply put, stemming removes some parts at the ends of words to ignore different forms of the same word, such as singular versus plural (“gun” or “gun-s”) and different verb forms (“walk”, “walk-ing”, “walk-s”).
text2 <- text |>
tokens(remove_punct = T, remove_numbers = T, remove_symbols = T) |> ## tokenize, removing unnecessary noise
tokens_tolower() |> ## normalize
tokens_remove(stopwords('en')) |> ## remove stopwords (English)
tokens_wordstem() ## stemming
text2
dtm <- dfm(text2)
dtm
By now you should be able to understand better how the functions in this preprocessing pipeline work. The tokens() function splits each text into individual tokens, with arguments such as remove_punct that determine whether punctuation is (TRUE) or isn’t (FALSE) removed. tokens_tolower() converts all tokens to lowercase, and tokens_wordstem() performs stemming. The tokens_remove() function is a bit more tricky. If you look at its documentation (?tokens_remove) you’ll see that it takes a pattern of user-supplied tokens to remove. In this case, we actually used another function, stopwords(), to get a list of English stopwords. You can see for yourself.
stopwords('en')
This list of words is thus passed to tokens_remove() to ignore these words. If you are using texts in another language, make sure to specify the language, such as stopwords('nl') for Dutch or stopwords('de') for German.
There are various alternative preprocessing techniques, including more advanced techniques that are not implemented in quanteda. Whether, when and how to use these techniques is a broad topic that we won’t cover today. For more details about preprocessing you can read the Text Analysis in R paper cited above.
For this tutorial, we’ll use the State of the Union speeches. We already created the corpus above. We can now pass this corpus through the same preprocessing pipeline, and then to the dfm() function to create the DTM.
dtm <- corp |>
tokens(remove_punct = T, remove_numbers = T, remove_symbols = T) |>
tokens_tolower() |>
tokens_remove(stopwords('en')) |>
tokens_wordstem() |>
dfm()
dtm
This dtm has 23,469 documents and 20,429 features (i.e. terms), and no longer shows the actual matrix because it simply wouldn’t fit. Depending on the type of analysis that you want to conduct, we might not need this many words, or might actually run into computational limitations.
Luckily, many of these 20K features are not that informative. The distribution of term frequencies tends to have a very long tail, with many words occurring only once or a few times in our corpus. For many types of bag-of-words analysis it would not harm to remove these words, and it might actually improve results.
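If you want to see this long tail for yourself, you could for instance inspect the feature frequencies with quanteda’s topfeatures() and featfreq() functions (an optional sketch; the exact numbers depend on your preprocessing choices):
topfeatures(dtm, 10)    ## the 10 most frequent features
freqs <- featfreq(dtm)  ## the total frequency of each feature
sum(freqs < 10)         ## how many features occur fewer than 10 times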
We can use the dfm_trim() function to remove columns based on criteria specified in the arguments. Here we say that we want to remove all terms for which the frequency (i.e. the sum value of the column in the DTM) is below 10.
dtm <- dfm_trim(dtm, min_termfreq = 10)
dtm
Now we have about 5000 features left. See ?dfm_trim for more options.
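Besides a minimum term frequency, you could for instance also trim on document frequency, e.g. keeping only terms that occur in a minimum proportion of all documents (a hedged example of these arguments; check ?dfm_trim to see which options fit your analysis):
dfm_trim(dtm, min_docfreq = 0.001, docfreq_type = 'prop')  ## keep terms occurring in at least 0.1% of documents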
Using the dtm we can now employ various techniques. You’ve already seen some of them in the introduction tutorial, but by now you should be able to understand more about the R syntax, and understand how to tinker with different parameters.
Get most frequent words in corpus.
textplot_wordcloud(dtm, max_words = 50) ## top 50 (most frequent) words
textplot_wordcloud(dtm, max_words = 50, color = c('blue','red')) ## change colors
textstat_frequency(dtm, n = 10) ## view the frequencies
You can also inspect a subcorpus. For example, looking only at Obama speeches. To subset the DTM we can use quanteda’s dfm_subset(), but we can also use the more general R subsetting techniques (as discussed last week). Here we’ll use the latter for illustration.
With docvars(dtm) we get a data.frame with the document variables. With docvars(dtm)$President, we get the character vector with president names. Thus, with docvars(dtm)$President == 'Barack Obama' we look for all documents where the president was Obama. To make this more explicit, we store the logical vector, which shows which documents are TRUE, as is_obama. We then use this to select these rows from the DTM.
is_obama <- docvars(dtm)$President == 'Barack Obama'
obama_dtm <- dtm[is_obama,]
textplot_wordcloud(obama_dtm, max_words = 25)
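The same subset can also be made with quanteda’s dfm_subset() function, which lets you refer to document variables directly and should give the same result as the base R approach above:
obama_dtm <- dfm_subset(dtm, President == 'Barack Obama')
textplot_wordcloud(obama_dtm, max_words = 25)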
Compare word frequencies between two subcorpora. Here we (again) first use a comparison to get the is_obama vector. We then use this in the textstat_keyness() function to indicate that we want to compare the Obama documents (where is_obama is TRUE) to all other documents (where is_obama is FALSE).
is_obama <- docvars(dtm)$President == 'Barack Obama'
ts <- textstat_keyness(dtm, is_obama)
head(ts, 20) ## view first 20 results
We can visualize these results, stored under the name ts, by using the textplot_keyness() function.
textplot_keyness(ts)
As seen in the first tutorial, a keyword-in-context listing shows a given keyword in the context of its use. This is a good help for interpreting words from a wordcloud or keyness plot.
Since a DTM only knows word frequencies, the kwic() function requires a tokenized corpus object as input.
k <- kwic(tokens(corp), 'freedom', window = 7)
head(k, 10) ## only view first 10 results
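A kwic search can also match multi-word expressions if you wrap them in phrase(). For example (assuming this phrase occurs in the speeches, which is not guaranteed for every corpus):
k2 <- kwic(tokens(corp), phrase('middle east'), window = 7)  ## matches the two-word phrase, case-insensitive by default
head(k2, 10)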
The kwic() function can also be used to focus an analysis on a specific search term. You can use the output of the kwic function to create a new DTM, in which only the words within the shown window are included. With the following code, a DTM is created that only contains words that occur within 10 words of terror* (terrorism, terrorist, terror, etc.).
terror <- kwic(tokens(corp), 'terror*', window = 10)
terror_corp <- corpus(terror)
terror_dtm <- terror_corp |>
tokens(remove_punct = T, remove_numbers = T, remove_symbols = T) |>
tokens_tolower() |>
tokens_remove(stopwords('en')) |>
tokens_wordstem() |>
dfm()
Now you can focus an analysis on whether and how Presidents talk about terror*.
textplot_wordcloud(terror_dtm, max_words = 50) ## top 50 (most frequent) words
You can perform a basic dictionary search. In terms of query options this is less advanced than AmCAT, but quanteda offers more ways to analyse the dictionary results. Also, it supports the use of existing dictionaries, for instance for sentiment analysis (but mostly for English dictionaries).
A convenient way of using dictionaries is to make a DTM in which the columns represent dictionary terms.
dict <- dictionary(list(terrorism = 'terror*',
economy = c('econom*', 'tax*', 'job*'),
military = c('army','navy','military','airforce','soldier'),
freedom = c('freedom','liberty')))
dict_dtm <- dfm_lookup(dtm, dict, exclusive=TRUE)
dict_dtm
The “4 features” are the four entries in our dictionary. Now you can perform all of the analyses shown above with this dictionary DTM.
textplot_wordcloud(dict_dtm)
tk <- textstat_keyness(dict_dtm, docvars(dict_dtm)$President == 'Barack Obama')
textplot_keyness(tk)
You can also convert the DTM to a data frame to get counts of each concept per document (which you can then match with e.g. survey data).
df <- convert(dict_dtm, to="data.frame")
head(df)
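If you want counts per president rather than per paragraph, you could for instance add the President document variable to this data frame and aggregate it with dplyr (an optional sketch, assuming the tidyverse loaded above):
df$President <- docvars(dict_dtm)$President  ## add the president of each document
df |>
  group_by(President) |>
  summarize(across(terrorism:freedom, sum)) |>  ## sum the concept counts per president
  head()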
A good dictionary means that all documents that match the dictionary are indeed about or contain the desired concept, and that all documents that contain the concept are matched.
To check this, you can manually annotate or code a sample of documents and compare the score with the dictionary hits.
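As a simple starting point, you could for instance draw a random sample of documents that matched one of the dictionary entries and read them yourself (a rough sketch, assuming the df and corp objects created above and at least 20 matching documents):
set.seed(1)
hits <- which(df$terrorism > 0)  ## documents with at least one terrorism match
sample_docs <- sample(hits, 20)  ## draw 20 random matched documents
as.character(corp)[sample_docs]  ## read the full texts to check whether the matches make sense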
You can also apply the keyword-in-context function to a dictionary to quickly check a set of matches and see if they make sense:
kwic(tokens(corp), dict$terrorism)