Overview

This document shows you how to explore some of the variables in the sample Camp Fire data set of tweets. See campfire data info for more information on what other data is available.

Data Import

The data is saved as an RDS file, a format specific to R, similar to a .sav file for SPSS or a .dta file for Stata.

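library(dplyr); library(ggplot2)        # data wrangling (%>%, filter) and plotting
library(rtweet); library(summarytools)  # ts_plot() and freq()
setwd("../")
cf <- readRDS("data/campfire-tweets-2020-04-27.Rds")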
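
If you have not worked with RDS files before, the following minimal sketch shows the round trip in base R. The `toy` data frame and its values are hypothetical stand-ins, not part of the campfire data. The point is that `saveRDS()` and `readRDS()` store and restore a single R object with its column classes intact.

```r
# Toy data frame standing in for the tweet data (hypothetical values)
toy <- data.frame(
  screen_name = c("user_a", "user_b"),
  created_at  = as.POSIXct(c("2018-11-08 15:00:00", "2018-11-09 09:30:00"),
                           tz = "UTC"),
  stringsAsFactors = FALSE
)

# Write the object to a temporary .Rds file, then read it back
tmp <- tempfile(fileext = ".Rds")
saveRDS(toy, tmp)
toy2 <- readRDS(tmp)

# Column classes, including the date-time class, survive the round trip
str(toy2)
```

Unlike a CSV file, no re-parsing of dates is needed after reading the file back in.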

Here are some things you may want to do to explore the data.

Frequency of tweets

Total count over time

We could plot the total number of tweets as a time series. This is somewhat helpful because we collected all of the tweets until noon on the first day, then only random samples of 500 here and there until the end of the year.

ts_plot(cf)

Okay.. the plot stretches back that far because we pulled full timelines, going back to about 2009, for the Town of Paradise and Chico FD. Let’s look at the post-fire period. The full timelines matter because I’m interested in how often frequent users like these tweeted before, during, and after the fire.

This means you will have to filter your dates before analysis to get tweets about the actual Camp Fire!

cf_real <- cf %>% filter(created_at > lubridate::ymd_hm("2018-11-08-13-50"))
ts_plot(cf_real)
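
The same date filter can be sketched in base R without dplyr or lubridate. The `toy` data frame below is a hypothetical stand-in for `cf`, and the sketch assumes `created_at` is stored as a POSIXct date-time in UTC (the default time zone of `lubridate::ymd_hm()`).

```r
# Hypothetical stand-in for cf: one old timeline tweet and two post-fire tweets
toy <- data.frame(
  text = c("old timeline tweet", "tweet during the fire", "tweet after the fire"),
  created_at = as.POSIXct(c("2012-05-01 10:00:00",
                            "2018-11-08 18:00:00",
                            "2018-11-20 08:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Same cutoff as in the filter() call above, written as a POSIXct value
cutoff <- as.POSIXct("2018-11-08 13:50:00", tz = "UTC")

# Keep only rows created after the cutoff
toy_real <- toy[toy$created_at > cutoff, ]

# Check the date range that remains
range(toy_real$created_at)
```

Checking `range()` after filtering is a quick way to confirm that the old timeline tweets are gone.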

Count of tweets per hour

First we use group_by and summarize to count the number of tweets per hour, then plot the counts as a bar chart.

cf_real %>% 
  group_by(tweet_hour) %>% 
  summarize(tweet_count=n()) %>% 
  ggplot(aes(x=tweet_hour, y=tweet_count)) + geom_col()

The large gaps between the bars appear because this is a sample of data that does not cover the entire time frame shown on the axis.
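
This document does not show how `tweet_hour` was created. One way an hourly variable like it could be derived, and how hours with no tweets could be made explicit, is sketched below in base R. The timestamps are hypothetical and the derivation is an assumption, not necessarily how the variable in `cf` was built.

```r
# Hypothetical timestamps: two tweets in the 15:00 hour, none at 16:00, one at 17:00
created_at <- as.POSIXct(c("2018-11-08 15:05:00",
                           "2018-11-08 15:40:00",
                           "2018-11-08 17:10:00"), tz = "UTC")

# Round each timestamp down to the start of its hour
tweet_hour <- as.POSIXct(trunc(created_at, units = "hours"))

# Count tweets per hour; hours with no tweets are simply absent
counts <- as.data.frame(
  table(tweet_hour = format(tweet_hour, "%Y-%m-%d %H:00")),
  responseName = "tweet_count", stringsAsFactors = FALSE
)

# Build the full sequence of hours and fill the missing ones with 0
all_hours <- seq(min(tweet_hour), max(tweet_hour), by = "hour")
full <- data.frame(tweet_hour = format(all_hours, "%Y-%m-%d %H:00"),
                   stringsAsFactors = FALSE)
full <- merge(full, counts, all.x = TRUE)
full$tweet_count[is.na(full$tweet_count)] <- 0
full
```

In the real data, a zero filled in this way does not distinguish between an hour with no tweets and an hour that was never sampled, so interpret the gaps with the sampling scheme in mind.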

Text contents

The content of a tweet is stored in the text variable. Here are the first 6 tweets in our time frame (11/7 at 6am):

head(cf$text)

## [1] "Is #fire a solid, a liquid, or a gas? - Elizabeth Cox https://t.co/yxZtJ0R0cv"                                                                                                                                 
## [2] "Northern California back into #fire #weather, high winds, low humidity https://t.co/g4Mwc0Vrac - today through Friday morning https://t.co/C3HD4oBopA"                                                         
## [3] "KEEP STREAMING!!! ON SPOTIFY TOO <U+0001F5E3><U+0001F5E3><U+0001F5E3><U+0001F5E3>  #757  #fire #va https://t.co/nejQho8rc6"                                                                                    
## [4] "Eagle is burning in the eyes of the eagle <U+0001F525> \nThis is his battle <U+0001F608><U+2764>\n\n#NandhaGopalaKumaran !! <U+0001F60E>\n\n #NGKFeverBegins<U+270C>\n\n #SuriyaSivakumarAnnan<U+0001F49B>\n\n #CultClassic <U+0001F525><U+0001F6A9> \n\n#NGK #FIRE <U+0001F4AF><U+0001F525> https://t.co/5V9pDXd4h0"
## [5] "We have a GoFundMe page set up here: https://t.co/zDaNtual5d\n\n#HudsonValley #fire #newyork #pleasantvalley #library"                                                                                         
## [6] "RT @LS_Health: Did you know there are 1,786 #Fire-related #recalls available at https://t.co/b3djLvfhqP this November?"

And here are the last 6 tweets in our time frame.

tail(cf$text)

## [1] "RT @grizatlcp: The #Hillfire has an insane growth rate - 10,000 acres for a fire less than 2 hours old. In Ventura County in Southern Cali.…"
## [2] "RT @RealJamesWoods: Followers: please retweet this number 530-538-7911 for rescue emergencies #CampFire 911 operators overwhelmed https://t…"
## [3] "RT @RealJamesWoods: Pet owners needing shelter for small animals <U+0001F447>#CampFire https://t.co/1WlDf3EPts"                              
## [4] "@_Dgirl7 Dark Paradise"                                                                                                                      
## [5] "RT @RealJamesWoods: Missing: 80 year old Peggy Mccrea #Paradise off Pearson RD. Call 707-845-2590 #CampFireJamesWoods #CampFire https://t.c…"
## [6] "RT @guardian: California: tens of thousands evacuated as wildfire explodes in size https://t.co/qNTp8mt5Ij"

We’ll come back to this.

Extract keywords from the tweet

Example: tweets that mention Sheriff Honea.

Honea.idx <- grep("honea", cf$text, ignore.case=TRUE) 
Only_honea_data <- cf[Honea.idx, ]
Only_honea_data$text[1]

## [1] "“It’s a very dangerous and very serious situation,” Butte County Sheriff Kory Honea told The Associated Press. “I’m driving through fire as we speak.\"  https://t.co/zhRarnMFME"
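
The same idea extends to several keywords at once. The sketch below uses a hypothetical vector of tweet texts and a hypothetical keyword list. It joins the keywords into one regular expression with `|` and uses `grepl()`, which returns a logical vector rather than the row indices that `grep()` returns.

```r
# Hypothetical tweet texts standing in for cf$text
text <- c("Butte County Sheriff Kory Honea gave an update",
          "Evacuation orders issued for Paradise",
          "Unrelated #fire tweet",
          "sheriff HONEA press conference")

# Hypothetical keywords of interest
keywords <- c("honea", "paradise")

# One pattern that matches any keyword, ignoring case
pattern <- paste(keywords, collapse = "|")
hits <- grepl(pattern, text, ignore.case = TRUE)
text[hits]

# Number of tweets mentioning each keyword separately
per_keyword <- sapply(keywords, function(k) sum(grepl(k, text, ignore.case = TRUE)))
per_keyword
```

A logical vector such as `hits` can also be stored as a new column, which makes it easy to compare tweets that do and do not mention a keyword.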

Hashtags

Multiple hashtags can occur in a single tweet, so each element of the hashtags variable is a vector of hashtags.

head(cf$hashtags)

## [[1]]
## [1] "fire"
## 
## [[2]]
## [1] "fire"    "weather"
## 
## [[3]]
## [1] "fire" "va"  
## 
## [[4]]
## [1] "NandhaGopalaKumaran" "NGKFeverBegins"     
## 
## [[5]]
## [1] "HudsonValley"   "fire"           "newyork"        "pleasantvalley"
## [5] "library"       
## 
## [[6]]
## [1] "Fire"    "recalls"

We need to do a little data wrangling first to get these hashtags into their own data frame. (This code takes a few seconds to complete.)

cf.tags <- unlist(strsplit(as.character(unlist(cf$hashtags)),'^c\\(|,|"|\\)'))
cf.tags <- tolower(trimws(cf.tags) )
cf.tags[cf.tags==""] <- NA
cf.tags <- na.omit(cf.tags)

top.20.tags <- as_tibble(table(cf.tags)) %>% arrange(desc(n)) %>% slice(1:20)

ggplot(top.20.tags, aes(x = reorder(cf.tags, -n), y=n)) +
  geom_bar(stat="identity", fill="darkslategray")+
  theme_minimal() + coord_flip() + 
  xlab("#Hashtags") + ylab("Count")
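
Because the hashtags variable is a list of character vectors, the counting step can also be sketched with `unlist()` and `table()` alone. The toy list below reuses a few of the hashtags printed above. The `NA` entry is an assumption about how a tweet without hashtags might be stored.

```r
# Toy list column: one character vector of hashtags per tweet
hashtags <- list("fire",
                 c("fire", "weather"),
                 c("Fire", "recalls"),
                 NA_character_)   # assumed representation of "no hashtags"

# Flatten to one vector, lowercase so "Fire" and "fire" are counted together
tags <- tolower(unlist(hashtags))
tags <- tags[!is.na(tags)]

# Frequency table sorted from most to least common
tag_counts <- sort(table(tags), decreasing = TRUE)
head(tag_counts, 20)
```

Lowercasing before counting matters here, since otherwise "Fire" and "fire" would appear as separate hashtags.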

Users

Who are the top tweeting accounts?

top.20.users <- cf %>% group_by(screen_name) %>% summarise(n=n()) %>% arrange(desc(n)) %>% slice(1:20)

ggplot(top.20.users, aes(x = reorder(screen_name, -n), y=n)) +
  geom_bar(stat="identity", fill="darkslategray")+
  theme_minimal() + coord_flip() + 
  xlab("Screen name") + ylab("Count")

Actions taken on tweets

Quotes and retweets

How many of our tweets are quotes or retweets? We may want to analyze only original tweets.

# this freq() function comes from the summarytools package
freq(cf$is_quote)

## Frequencies  
## cf$is_quote  
## Type: Logical  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##       FALSE   60184     82.73          82.73     82.73          82.73
##        TRUE   12564     17.27         100.00     17.27         100.00
##        <NA>       0                               0.00         100.00
##       Total   72748    100.00         100.00    100.00         100.00

freq(cf$is_retweet)

## Frequencies  
## cf$is_retweet  
## Type: Logical  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##       FALSE   17871     24.57          24.57     24.57          24.57
##        TRUE   54877     75.43         100.00     75.43         100.00
##        <NA>       0                               0.00         100.00
##       Total   72748    100.00         100.00    100.00         100.00
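
If you do decide to analyze only original tweets, one possible way to subset on the two logical flags is sketched below in base R. The `toy` data frame is hypothetical. Whether quote tweets should count as original is a judgment call, and this sketch excludes them.

```r
# Hypothetical stand-in for cf with the two logical flags
toy <- data.frame(
  text       = c("original", "a quote", "a retweet", "another original"),
  is_quote   = c(FALSE, TRUE, FALSE, FALSE),
  is_retweet = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Cross-tabulate the flags to see how they overlap
table(quote = toy$is_quote, retweet = toy$is_retweet)

# Keep tweets that are neither quotes nor retweets
original <- toy[!toy$is_quote & !toy$is_retweet, ]
original$text
```

The cross-tabulation is worth checking on the real data first, since a tweet can be flagged as both a quote and a retweet.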

Marked as favorite

summary(cf$favorite_count)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00     2.82     0.00 15767.00

The mean number of times a tweet is marked as favorite is 2.82 and the median is 0, yet one tweet was marked as favorite over 15k times.

plot.fav  <- cf %>% filter(favorite_count>1) %>% ggplot(aes(x=favorite_count)) + geom_histogram()
plot.rt  <- cf %>% filter(retweet_count>1) %>% ggplot(aes(x=retweet_count)) + geom_histogram()
plot.quo  <- cf %>% filter(quote_count>1) %>% ggplot(aes(x=quote_count)) + geom_histogram()
plot.rply  <- cf %>% filter(reply_count>1) %>% ggplot(aes(x=reply_count)) + geom_histogram()

gridExtra::grid.arrange(plot.fav, plot.rt, plot.quo, plot.rply, nrow=2)

Most tweets have 0 of each action, while a few have a very large number. We will need to decide how to handle this skew before analysis.
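
Two common ways to summarize counts this skewed are to report the share of zeros alongside a few quantiles, or to group the counts into bins. The sketch below illustrates both on a hypothetical vector of favorite counts. The bin boundaries are arbitrary choices for illustration, not a recommendation.

```r
# Hypothetical favorite counts: mostly zeros with one very large value
favorite_count <- c(0, 0, 0, 0, 1, 2, 5, 40, 1500)

# Share of tweets with no favorites at all
share_zero <- mean(favorite_count == 0)
share_zero

# Quantiles describe the bulk of the data better than the mean does
quantile(favorite_count, probs = c(0.5, 0.9))

# Group the counts into ordered bins (boundaries chosen for illustration)
bins <- table(cut(favorite_count,
                  breaks = c(-Inf, 0, 1, 10, 100, Inf),
                  labels = c("0", "1", "2-10", "11-100", ">100")))
bins
```

A log transform such as `log1p()` is another option when the counts are used as a continuous variable in a plot or model.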

Geotagging

Create a simple map where one point represents a tweet.

library(maps)
USA <- map_data("state") 

library(ggrepel)
ggplot() +
  geom_polygon(data = USA, aes(x=long, y = lat, group = group), fill="grey", alpha=0.2) +
  ylim(20,50) + xlim(-125, -60)+ coord_map() + theme_void() + borders("state") + 
  geom_point(data=cf, aes(x=lng, y=lat))

## Warning: Removed 71423 rows containing missing values (geom_point).

Note that this map does not show the volume of tweets per location. If there are multiple tweets from the same location, the points will be plotted on top of each other. We can modify that later with a better plot.
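
One possible first step toward showing volume is to count the tweets at each distinct coordinate pair, so that the count can later be mapped to point size. The sketch below does this in base R on hypothetical coordinates. The `pts` data frame is a stand-in for the `lng` and `lat` columns of `cf`.

```r
# Hypothetical coordinates: three tweets from one location, one from another,
# and one tweet with no geotag
pts <- data.frame(lng = c(-121.6, -121.6, -121.6, -118.2, NA),
                  lat = c(39.76, 39.76, 39.76, 34.05, NA))

# Drop tweets without coordinates before counting
pts <- pts[complete.cases(pts), ]

# Count tweets per distinct (lng, lat) pair
pts$n <- 1
loc_counts <- aggregate(n ~ lng + lat, data = pts, FUN = sum)
loc_counts

# The counts could then be drawn with point size proportional to volume, e.g.
# geom_point(data = loc_counts, aes(x = lng, y = lat, size = n))
```

Dropping the missing coordinates explicitly also avoids the "Removed ... rows" warning shown above.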

sessionInfo()

## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggrepel_0.8.1      summarytools_0.9.6 maps_3.3.0        
## [4] rtweet_0.7.0       ggplot2_3.2.1      dplyr_0.8.99.9002 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.0.0   xfun_0.9           purrr_0.3.3       
##  [4] pander_0.6.3       tcltk_3.6.2        colorspace_1.4-1  
##  [7] vctrs_0.2.99.9011  generics_0.0.2     htmltools_0.3.6   
## [10] yaml_2.2.1         base64enc_0.1-3    rlang_0.4.5.9000  
## [13] pillar_1.4.3       glue_1.4.0         withr_2.1.2       
## [16] pryr_0.1.4         matrixStats_0.55.0 lifecycle_0.2.0   
## [19] plyr_1.8.4         stringr_1.4.0      munsell_0.5.0     
## [22] gtable_0.3.0       mapproj_1.2.7      codetools_0.2-16  
## [25] evaluate_0.14      labeling_0.3       knitr_1.24        
## [28] fansi_0.4.1        Rcpp_1.0.4         scales_1.0.0      
## [31] backports_1.1.5    checkmate_1.9.4    magick_2.3        
## [34] jsonlite_1.6.1     rapportools_1.0    gridExtra_2.3     
## [37] digest_0.6.25      stringi_1.4.6      grid_3.6.2        
## [40] cli_2.0.2          tools_3.6.2        magrittr_1.5      
## [43] lazyeval_0.2.2     tibble_3.0.0       crayon_1.3.4      
## [46] tidyr_1.0.2        pkgconfig_2.0.3    ellipsis_0.3.0    
## [49] lubridate_1.7.4    assertthat_0.2.1   rmarkdown_1.15    
## [52] httr_1.4.1         R6_2.4.1           compiler_3.6.2

Example EDA for Campfire Tweet data