In this tutorial, I will go over how to make categorical (indicator) variables in a data frame based on whether some text is present in another variable. To do this, I will be using `grepl` and dplyr.
The first step is to install dplyr from your console using `install.packages("dplyr")` and then use the `library` function to load it into your environment.
library(dplyr)
Once that is done, I will create a data frame to use in this example. It will have only one variable, called `text`, but we will add more variables later in the tutorial. To do this, I will make a vector and then build a data frame from it using the `data.frame` function.
text <- c("hello world I am cool", "hEllo there you are cool", "you are the coolest person I know")
df <- data.frame(text)
text |
---|
hello world I am cool |
hEllo there you are cool |
you are the coolest person I know |
The function `mutate` allows you to add new variables to a given data frame, and the function `grepl` returns a logical value (TRUE or FALSE) for each element, indicating whether a given pattern is present.
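To see what `grepl` and the conversion to 0/1 do on their own before combining them with `mutate`, here is a minimal base R sketch. The vector `x` and its contents are made up for illustration and are not part of the tutorial's data.

```r
# A small illustrative vector (hypothetical, not the tutorial data)
x <- c("apple pie", "banana split", "apple tart")

# grepl() returns one TRUE/FALSE per element of x
found <- grepl("apple", x)

# as.integer() turns TRUE/FALSE into 1/0, which is the indicator form
indicator <- as.integer(found)
```

Wrapping `grepl` in `as.integer` is what turns the logical result into the 0/1 columns used in the rest of this tutorial.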
Let’s say we want to know if the word cool is in the text.
If we only want the word cool and not coolest, we can search for `"cool$"`. The `$` anchors the match to the end of the string, so the text has to end with the letters cool. That works here because cool is the last word in the rows we want to match.
The following code is used for the word “cool”:
df <- df %>% mutate(cool = as.integer(grepl("cool$", text)))
text | cool |
---|---|
hello world I am cool | 1 |
hEllo there you are cool | 1 |
you are the coolest person I know | 0 |
The result is a new variable called `cool` in the data frame that is 1 if the text ends with the word cool and 0 if it does not. Since we used `"cool$"`, the word coolest does not count, so the value for the third row is 0.
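Because `$` anchors to the end of the whole string, `"cool$"` would miss a text where cool appears as a separate word in the middle of a sentence. If that matters for your data, one option is the word-boundary marker `\\b`. This is an alternative, not what this tutorial uses. A minimal base R sketch, with a fourth sentence added purely for illustration:

```r
# The tutorial's sentences plus one hypothetical extra row
sentences <- c("hello world I am cool",
               "hEllo there you are cool",
               "you are the coolest person I know",
               "cool things happen here")

# "cool$": the string must end with "cool"
ends_with_cool <- as.integer(grepl("cool$", sentences))

# "\\bcool\\b": "cool" as a whole word anywhere in the string
whole_word_cool <- as.integer(grepl("\\bcool\\b", sentences))
```

The two results differ only on the extra row, where cool is a whole word but not the last one.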
The following code is used if you want to know whether “cool” appears anywhere in the text:
df <- df %>% mutate(any_cool = as.integer(grepl("cool", text)))
text | cool | any_cool |
---|---|---|
hello world I am cool | 1 | 1 |
hEllo there you are cool | 1 | 1 |
you are the coolest person I know | 0 | 1 |
This time the result is another variable called `any_cool` that has a value of 1 if cool shows up anywhere in the text, even as part of a longer word, and 0 if it is not present at all. As you can see, the value is 1 for every row.
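Both indicators can also be created in a single `mutate` call. The sketch below assumes dplyr is installed, and it refers to the column simply as `text`, which works because `mutate` evaluates its expressions inside the data frame. The name `df2` is used only here so that the tutorial's `df` is left untouched.

```r
library(dplyr)

text <- c("hello world I am cool",
          "hEllo there you are cool",
          "you are the coolest person I know")

# Build the data frame and add both indicator columns in one step
df2 <- data.frame(text) %>%
  mutate(
    cool = as.integer(grepl("cool$", text)),     # text ends with "cool"
    any_cool = as.integer(grepl("cool", text))   # "cool" appears anywhere
  )
```

Adding related columns in one call keeps the patterns side by side, which makes it easier to compare what each one matches.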
Words in text may or may not be capitalized, and by default `grepl` is case-sensitive. To handle this, we can use the `ignore.case` argument of `grepl`.
First, we are interested in seeing if the word “hello” is in the text.
The following code is used:
df <- df %>% mutate(hello = as.integer(grepl("hello", text)))
text | cool | any_cool | hello |
---|---|---|---|
hello world I am cool | 1 | 1 | 1 |
hEllo there you are cool | 1 | 1 | 0 |
you are the coolest person I know | 0 | 1 | 0 |
As you can see above, the new variable `hello` is 1 for the first row because the word is present and 0 for the third row because it is not. The second row is also 0, because the word is typed as hEllo and the lowercase pattern does not match the capital E. What if we want a 1 for that row as well?
This is where we use the `ignore.case` argument.
df <- df %>% mutate(all_hello = as.integer(grepl("hello", text, ignore.case = TRUE)))
text | cool | any_cool | hello | all_hello |
---|---|---|---|---|
hello world I am cool | 1 | 1 | 1 | 1 |
hEllo there you are cool | 1 | 1 | 0 | 1 |
you are the coolest person I know | 0 | 1 | 0 | 0 |
Now the second row has a 1, because setting `ignore.case` to TRUE lets any combination of uppercase and lowercase letters count as a match.
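If you would rather not pass an extra argument, lowercasing the text before matching is another common way to get the same effect. The base R sketch below compares the two approaches on the tutorial's sentences. The variable names are illustrative only.

```r
sentences <- c("hello world I am cool",
               "hEllo there you are cool",
               "you are the coolest person I know")

# Option 1: let grepl() ignore case
with_flag <- as.integer(grepl("hello", sentences, ignore.case = TRUE))

# Option 2: lowercase the text first, then match a lowercase pattern
with_lower <- as.integer(grepl("hello", tolower(sentences)))
```

Both options give the same indicator here. The `ignore.case` version leaves the original text unchanged and also covers patterns that themselves contain capital letters.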