In this tutorial, I will go over how to make categorical (indicator) variables in a data frame based on whether a piece of text is present in another variable. To do this I will be using grepl and dplyr.

Loading Packages

The first step is to install dplyr from your console with install.packages("dplyr") and then load it into your session with the library function.

library(dplyr)

Creating a Data Frame

Next, I will create a data frame to use in this example. It starts with only one variable called text, but we will add more variables later on in the tutorial. To do this I will make a vector and then turn it into a data frame using the data.frame function.

text <- c("hello world I am cool", "hEllo there you are cool", "you are the coolest person I know")
df <- data.frame(text)
text
hello world I am cool
hEllo there you are cool
you are the coolest person I know
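
One thing to watch for: in R versions before 4.0, data.frame converts character vectors to factors by default. If you would rather keep text as plain character strings in any R version, a minimal sketch (using the same vector as above) is:

```r
# Same vector as in the tutorial
text <- c("hello world I am cool", "hEllo there you are cool", "you are the coolest person I know")

# stringsAsFactors = FALSE keeps the text column as character
df <- data.frame(text, stringsAsFactors = FALSE)

# Inspect the column type
str(df)
```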

Using grep and dplyr to Make New Variables

The function mutate adds new variables to a given data frame, and the function grepl returns a logical vector that is TRUE for each element where a given pattern is found and FALSE otherwise.
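
To see what grepl produces on its own before it goes inside mutate, here is a small sketch on the same text vector. The pattern "know" and the name has_know are only for illustration:

```r
text <- c("hello world I am cool", "hEllo there you are cool", "you are the coolest person I know")

# grepl returns one TRUE/FALSE per element of the input vector
has_know <- grepl("know", text)
has_know              # FALSE FALSE TRUE

# as.integer turns TRUE/FALSE into 1/0, which is the indicator we want
as.integer(has_know)  # 0 0 1
```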

Finding if Keywords are Present

Let’s say we want to know if the word cool is in the text.

If we only want text that ends in cool, and not matches like coolest, we can search for “cool$”. The $ anchors the pattern to the end of the string, so it matches only when the text ends with the letters cool.

The following code is used for the word “cool”:

df <- df %>% mutate(cool = as.integer(grepl("cool$", text)))
text                               cool
hello world I am cool              1
hEllo there you are cool           1
you are the coolest person I know  0

The result is a new variable called cool in the data frame that is 1 if the text ends with the word cool and 0 if it does not. Since we used “cool$”, the word “coolest” in the third row does not count, so the value for that row is 0.
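
Because $ anchors to the end of the string, “cool$” would also miss a standalone “cool” that appears earlier in the text. If the goal is to match cool as a whole word wherever it occurs, one option is the word-boundary escape \b, written "\\b" inside an R string. A minimal sketch, using a made-up vector x that is not part of the tutorial's data:

```r
# Hypothetical example strings (not from the tutorial's data frame)
x <- c("you are cool", "cool people are here", "the coolest person")

# "cool$" matches only when the string ends in "cool"
ends_with_cool <- grepl("cool$", x)
ends_with_cool    # TRUE FALSE FALSE

# "\\bcool\\b" matches "cool" as a whole word anywhere in the string
whole_word_cool <- grepl("\\bcool\\b", x)
whole_word_cool   # TRUE TRUE FALSE
```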

The following code is used if you want to know whether “cool” comes up anywhere in the text:

df <- df %>% mutate(any_cool = as.integer(grepl("cool", text)))
text                               cool  any_cool
hello world I am cool              1     1
hEllo there you are cool           1     1
you are the coolest person I know  0     1

The result this time is another variable called any_cool that is 1 if the letters cool show up anywhere in the text, even as part of a longer word such as coolest, and 0 if they are not present at all. As you can see, the value is 1 for every row.

Capitalization of Words

Text often mixes capitalized and lowercase words. By default, grepl is case-sensitive. To match regardless of case, we can use the ignore.case argument in grepl.

First, we are interested in seeing if the word “hello” is in the text.

The following code is used:

df <- df %>% mutate(hello = as.integer(grepl("hello", text)))
text                               cool  any_cool  hello
hello world I am cool              1     1         1
hEllo there you are cool           1     1         0
you are the coolest person I know  0     1         0

As you can see above, the value for the new variable hello is 1 for the first row because the word is present, and 0 for the third row because it is not present. The value for the second row is also 0, because the word is written as “hEllo” with a capital E, which does not match the lowercase pattern. What if we want to have a 1 for that row?

This is where we use the ignore.case argument.

df <- df %>% mutate(all_hello = as.integer(grepl("hello", text, ignore.case = TRUE)))
text                               cool  any_cool  hello  all_hello
hello world I am cool              1     1         0+1    1
hEllo there you are cool           1     1         0      1
you are the coolest person I know  0     1         0      0

Now, the second row has a 1, because setting ignore.case to TRUE allows any combination of capital and lowercase letters to count as a match.
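
Another way to get the same effect, if you prefer not to pass an extra argument, is to lowercase the text before searching. This is only a sketch of an alternative, shown on the same text vector; the names with_flag and with_lower are made up for the comparison:

```r
text <- c("hello world I am cool", "hEllo there you are cool", "you are the coolest person I know")

# Approach from the tutorial: let grepl ignore case
with_flag <- grepl("hello", text, ignore.case = TRUE)

# Alternative: convert the text to lowercase first, then search as usual
with_lower <- grepl("hello", tolower(text))

# Both approaches flag the same rows
identical(with_flag, with_lower)
```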