This document shows you how to explore some variables within the sample campfire data set of tweets. See campfire data info for more information on what other data is available.
The data is saved as an RDS file, which is a file format specific to R, similar to a .sav file for SPSS or a .dta file for Stata.
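As a minimal sketch of loading an RDS file, the block below writes a tiny stand-in data frame to a temporary file and reads it back with readRDS(). The file name campfire_sample.rds and the toy columns are assumptions for illustration only; point rds_path at wherever your copy of the data actually lives.

```r
# Toy stand-in for the tweet data (the real file has many more rows and columns).
toy <- data.frame(
  text = c("Example tweet about #CampFire", "Another example tweet"),
  is_retweet = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical path: replace with the location of your own .rds file.
rds_path <- file.path(tempdir(), "campfire_sample.rds")
saveRDS(toy, rds_path)

# readRDS() returns the stored object, so assign it to a name of your choice.
cf <- readRDS(rds_path)

dim(cf)    # number of rows and columns
names(cf)  # variable names
```

Unlike load(), readRDS() does not restore the original object name, which is why the result is assigned explicitly.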
Here are some things you may want to do to explore the data.
We could plot the total number of tweets as a time series, which is somewhat helpful since we have gotten all of the tweets until noon on the first day, then only random samples of 500 here and there until the end of the year.
Okay, that’s because we pulled full timelines, going back to around 2009, for the Town of Paradise and Chico FD. Let’s look at post-fire. This is important because I’m interested in how frequent users like these tweeted before, during, and after the fire.
This means you will have to filter your dates before analysis to get tweets about the actual CampFire!
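Here is one way the date filter could look, sketched on toy data in base R. It assumes the timestamp lives in a column named created_at (the name rtweet uses), that times are stored in UTC, and that the window starts on 2018-11-07 at 6am; adjust all three to match your data.

```r
# Toy data: one old timeline tweet and three tweets around the start of the window.
cf_toy <- data.frame(
  created_at = as.POSIXct(
    c("2009-05-01 12:00:00", "2018-11-07 05:59:00",
      "2018-11-07 06:00:00", "2018-11-08 14:30:00"),
    tz = "UTC"
  ),
  text = c("old timeline tweet", "just before the window",
           "start of the window", "during the fire"),
  stringsAsFactors = FALSE
)

# Assumed start of the time frame; the time zone is also an assumption.
window_start <- as.POSIXct("2018-11-07 06:00:00", tz = "UTC")

# Keep only tweets at or after the start of the window.
cf_real <- cf_toy[cf_toy$created_at >= window_start, ]
nrow(cf_real)
```

An upper bound can be added the same way with a second condition joined by `&`.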
First we use group_by and summarize to count the number of tweets per hour.
cf_real %>% group_by(tweet_hour) %>%
  summarize(tweet_count=n()) %>%
  ggplot(aes(x=tweet_hour, y=tweet_count)) + geom_col()
The large gaps between these blocks appear because this is a sample of the data that does not cover the entire time frame shown.
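The tweet_hour variable is not created in the code shown here. One plausible way to derive it, sketched in base R on a toy vector, is to truncate each timestamp to the hour and then count. The created_at name and the UTC time zone are assumptions.

```r
# Toy timestamps: two tweets in the 14:00 hour, none at 15:00, one at 16:00.
created_at <- as.POSIXct(
  c("2018-11-08 14:05:00", "2018-11-08 14:55:00", "2018-11-08 16:10:00"),
  tz = "UTC"
)

# Truncate each timestamp down to the start of its hour.
tweet_hour <- as.POSIXct(trunc(created_at, units = "hours"))

# Count tweets per hour; hours with no tweets do not get a row at all.
hour_counts <- as.data.frame(
  table(tweet_hour = format(tweet_hour, "%Y-%m-%d %H:00")),
  responseName = "tweet_count",
  stringsAsFactors = FALSE
)
hour_counts
```

Because hours without tweets produce no row, they show up as empty space in a bar chart with a time axis.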
The content of a tweet is contained in the text variable. Here are the first 6 tweets in our time frame (11/7 at 6am):
## [1] "Is #fire a solid, a liquid, or a gas? - Elizabeth Cox https://t.co/yxZtJ0R0cv"
## [2] "Northern California back into #fire #weather, high winds, low humidity https://t.co/g4Mwc0Vrac - today through Friday morning https://t.co/C3HD4oBopA"
## [3] "KEEP STREAMING!!! ON SPOTIFY TOO <U+0001F5E3><U+0001F5E3><U+0001F5E3><U+0001F5E3> #757 #fire #va https://t.co/nejQho8rc6"
## [4] "Eagle is burning in the eyes of the eagle <U+0001F525> \nThis is his battle <U+0001F608><U+2764>\n\n#NandhaGopalaKumaran !! <U+0001F60E>\n\n #NGKFeverBegins<U+270C>\n\n #SuriyaSivakumarAnnan<U+0001F49B>\n\n #CultClassic <U+0001F525><U+0001F6A9> \n\n#NGK #FIRE <U+0001F4AF><U+0001F525> https://t.co/5V9pDXd4h0"
## [5] "We have a GoFundMe page set up here: https://t.co/zDaNtual5d\n\n#HudsonValley #fire #newyork #pleasantvalley #library"
## [6] "RT @LS_Health: Did you know there are 1,786 #Fire-related #recalls available at https://t.co/b3djLvfhqP this November?"
And here are the last 6 tweets in our time frame.
## [1] "RT @grizatlcp: The #Hillfire has an insane growth rate - 10,000 acres for a fire less than 2 hours old. In Ventura County in Southern Cali.…"
## [2] "RT @RealJamesWoods: Followers: please retweet this number 530-538-7911 for rescue emergencies #CampFire 911 operators overwhelmed https://t…"
## [3] "RT @RealJamesWoods: Pet owners needing shelter for small animals <U+0001F447>#CampFire https://t.co/1WlDf3EPts"
## [4] "@_Dgirl7 Dark Paradise"
## [5] "RT @RealJamesWoods: Missing: 80 year old Peggy Mccrea #Paradise off Pearson RD. Call 707-845-2590 #CampFireJamesWoods #CampFire https://t.c…"
## [6] "RT @guardian: California: tens of thousands evacuated as wildfire explodes in size https://t.co/qNTp8mt5Ij"
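Listings like these can be produced with head() and tail(). The sketch below assumes the data should first be sorted by a created_at timestamp column, which may or may not match how the listings above were generated; the toy data is made up.

```r
# Toy data with deliberately shuffled timestamps, one tweet per hour.
hour_offsets <- c(5, 1, 7, 3, 0, 6, 2, 4)
cf_toy <- data.frame(
  created_at = as.POSIXct("2018-11-07 06:00:00", tz = "UTC") + 3600 * hour_offsets,
  text = paste("toy tweet", hour_offsets),
  stringsAsFactors = FALSE
)

# Sort by time so that "first" and "last" refer to the time frame.
cf_sorted <- cf_toy[order(cf_toy$created_at), ]

first_six <- head(cf_sorted$text, n = 6)
last_six  <- tail(cf_sorted$text, n = 6)
first_six
last_six
```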
We’ll come back to this.
Example - tweets that mention Sheriff Honea.
Honea.idx <- grep("honea", cf$text, ignore.case=TRUE)
Only_honea_data <- cf[Honea.idx, ]
Only_honea_data$text[1]
## [1] "“It’s a very dangerous and very serious situation,” Butte County Sheriff Kory Honea told The Associated Press. “I’m driving through fire as we speak.\" https://t.co/zhRarnMFME"
Who are the top tweeting accounts?
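A minimal sketch of one way to answer this, assuming the account handle is stored in a column named screen_name (the name rtweet uses). The handles below are invented.

```r
# Toy vector of account handles, standing in for cf$screen_name.
screen_name <- c("acct_a", "acct_b", "acct_a", "acct_c", "acct_a", "acct_b")

# Count tweets per account and sort from most to least active.
account_counts <- sort(table(screen_name), decreasing = TRUE)

# Keep the top n accounts (n = 2 here; use a larger n on the real data).
top_accounts <- head(account_counts, n = 2)
top_accounts
```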
How many tweets do we have that are quotes, or retweets? We may want to only analyze original tweets.
## Frequencies
## cf$is_quote
## Type: Logical
##
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##       FALSE   60184     82.73          82.73     82.73          82.73
##        TRUE   12564     17.27         100.00     17.27         100.00
##        <NA>       0                               0.00         100.00
##       Total   72748    100.00         100.00    100.00         100.00
## Frequencies
## cf$is_retweet
## Type: Logical
##
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##       FALSE   17871     24.57          24.57     24.57          24.57
##        TRUE   54877     75.43         100.00     75.43         100.00
##        <NA>       0                               0.00         100.00
##       Total   72748    100.00         100.00    100.00         100.00
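If we do decide to analyze only original tweets, a base R sketch of the filter might look like the following. It uses the is_quote and is_retweet variables shown above on a made-up four-row data frame, and it assumes neither flag contains missing values (true for the counts above, but worth checking on other data).

```r
# Toy data: two originals, one retweet, one quote tweet.
cf_toy <- data.frame(
  text = c("original", "a retweet", "a quote", "another original"),
  is_retweet = c(FALSE, TRUE, FALSE, FALSE),
  is_quote = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Quick counts, including NA if any are present.
table(cf_toy$is_retweet, useNA = "ifany")
table(cf_toy$is_quote, useNA = "ifany")

# Keep tweets that are neither retweets nor quotes.
originals <- cf_toy[!cf_toy$is_retweet & !cf_toy$is_quote, ]
nrow(originals)
```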
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##     0.00     0.00     0.00     2.82     0.00 15767.00
The average number of times a tweet is marked as a favorite is about 2.8, yet the median is 0 and one tweet was marked as a favorite over 15k times?
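To see which tweet is behind a maximum like that, one option is which.max(). The sketch below uses an invented five-row data frame; favorite_count comes from the text above, while screen_name is an assumed column name.

```r
# Toy data: mostly zeros, with one heavily favorited tweet.
cf_toy <- data.frame(
  screen_name = c("acct_a", "acct_b", "acct_c", "acct_d", "acct_e"),
  text = c("tweet 1", "tweet 2", "tweet 3", "tweet 4", "viral tweet"),
  favorite_count = c(0, 0, 2, 0, 250),
  stringsAsFactors = FALSE
)

# Five-number summary plus the mean.
summary(cf_toy$favorite_count)

# Row with the largest favorite count (the first one, if there are ties).
top_row <- cf_toy[which.max(cf_toy$favorite_count), ]
top_row[, c("screen_name", "text", "favorite_count")]
```

A single extreme value pulls the mean far above the median, which is the same pattern the summary above shows.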
plot.fav <- cf %>% filter(favorite_count>1) %>% ggplot(aes(x=favorite_count)) + geom_histogram()
plot.rt <- cf %>% filter(retweet_count>1) %>% ggplot(aes(x=retweet_count)) + geom_histogram()
plot.quo <- cf %>% filter(quote_count>1) %>% ggplot(aes(x=quote_count)) + geom_histogram()
plot.rply <- cf %>% filter(reply_count>1) %>% ggplot(aes(x=reply_count)) + geom_histogram()
gridExtra::grid.arrange(plot.fav, plot.rt, plot.quo, plot.rply, nrow=2)
Most tweets have 0 actions of each type, while a few have a very large number. We will need to decide how to handle this skew before analysis.
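Two common options for heavily skewed counts, offered here only as possibilities and not as the approach this analysis settles on, are to collapse the count into an any/none indicator or to work on a log scale. A base R sketch on made-up counts:

```r
# Toy counts: mostly zeros, with a long right tail.
counts <- c(0, 0, 0, 0, 0, 0, 1, 2, 15, 4000)

# Share of tweets with no actions at all.
share_zero <- mean(counts == 0)

# Option 1: a binary indicator for "received any action".
any_action <- counts > 0

# Option 2: log(1 + x), which keeps zeros at zero and compresses large values.
log_counts <- log1p(counts)

share_zero
table(any_action)
round(log_counts, 2)
```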
Create a simple map where one point represents a tweet.
library(maps)              # state outlines used by map_data() and borders()
USA <- map_data("state")   # map_data() comes from ggplot2
library(ggrepel)
ggplot() +
  geom_polygon(data = USA, aes(x=long, y = lat, group = group), fill="grey", alpha=0.2) +
  ylim(20,50) + xlim(-125, -60) + coord_map() + theme_void() + borders("state") +
  geom_point(data=cf, aes(x=lng, y=lat))   # one point per tweet that has coordinates
## Warning: Removed 71423 rows containing missing values (geom_point).
Note that this map does not show the volume of tweets per location. If there are multiple tweets from the same location, the points will be plotted on top of each other. We can modify that later with a better plot.
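As a starting point for that better plot, the sketch below counts tweets per unique coordinate pair in base R, using the lng and lat variables from the map code on invented coordinates. The n_tweets column name is made up for this example.

```r
# Toy coordinates: three tweets at one spot, one at another, two without a full location.
cf_toy <- data.frame(
  lng = c(-121.6, -121.6, -121.8, NA, -121.6, NA),
  lat = c(39.76, 39.76, 39.73, NA, 39.76, 39.73)
)

# Drop tweets with a missing coordinate (this is what triggers the ggplot warning).
geo <- cf_toy[!is.na(cf_toy$lng) & !is.na(cf_toy$lat), ]

# Count tweets at each unique (lng, lat) pair.
geo$n_tweets <- 1L
loc_counts <- aggregate(n_tweets ~ lng + lat, data = geo, FUN = sum)
loc_counts

# The counts could then drive point size, for example:
# geom_point(data = loc_counts, aes(x = lng, y = lat, size = n_tweets))
```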
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggrepel_0.8.1 summarytools_0.9.6 maps_3.3.0
## [4] rtweet_0.7.0 ggplot2_3.2.1 dplyr_0.8.99.9002
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.0.0 xfun_0.9 purrr_0.3.3
## [4] pander_0.6.3 tcltk_3.6.2 colorspace_1.4-1
## [7] vctrs_0.2.99.9011 generics_0.0.2 htmltools_0.3.6
## [10] yaml_2.2.1 base64enc_0.1-3 rlang_0.4.5.9000
## [13] pillar_1.4.3 glue_1.4.0 withr_2.1.2
## [16] pryr_0.1.4 matrixStats_0.55.0 lifecycle_0.2.0
## [19] plyr_1.8.4 stringr_1.4.0 munsell_0.5.0
## [22] gtable_0.3.0 mapproj_1.2.7 codetools_0.2-16
## [25] evaluate_0.14 labeling_0.3 knitr_1.24
## [28] fansi_0.4.1 Rcpp_1.0.4 scales_1.0.0
## [31] backports_1.1.5 checkmate_1.9.4 magick_2.3
## [34] jsonlite_1.6.1 rapportools_1.0 gridExtra_2.3
## [37] digest_0.6.25 stringi_1.4.6 grid_3.6.2
## [40] cli_2.0.2 tools_3.6.2 magrittr_1.5
## [43] lazyeval_0.2.2 tibble_3.0.0 crayon_1.3.4
## [46] tidyr_1.0.2 pkgconfig_2.0.3 ellipsis_0.3.0
## [49] lubridate_1.7.4 assertthat_0.2.1 rmarkdown_1.15
## [52] httr_1.4.1 R6_2.4.1 compiler_3.6.2