This document shows you what data is available from a search of historical tweets. See demo tweet EDA for more demonstrations of how you can explore and visualize this data.
The search query used was "lang:en #CampFire OR #campfire OR #fire OR paradise OR @CALFIRE_ButteCo OR concow OR Pulga OR Magalia"
That is: only English-language tweets containing any of the listed hashtags or keywords, or mentioning the CAL FIRE Butte County account (@CALFIRE_ButteCo).
Let’s read in the sample data.
We currently have 72748 sampled tweets posted between 2013-11-07 15:31:57 and 2018-12-31 21:12:06.
The range starts that early because we pulled full timelines, going back to around 2009, for the Town of Paradise and Chico FD accounts. Let’s look at the post-fire period. The full timelines matter because I’m interested in how frequent users like these tweeted before, during, and after the fire.
This means you will have to filter your dates before analysis to get tweets about the actual CampFire!
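A minimal sketch of that date filter, using base R on a small made-up data frame that stands in for the sample. The object names `cf_toy` and `fire_start` are hypothetical, and the cutoff of 2018-11-08 (the day the Camp Fire started) is an assumption; adjust it to the window you want to study.

```r
# Toy stand-in for the tweet data; the real data has a created_at_pst column.
cf_toy <- data.frame(
  status_id = c("a", "b", "c", "d"),
  created_at_pst = as.POSIXct(
    c("2013-11-07 15:31:57", "2018-11-07 06:58:30",
      "2018-11-08 09:15:00", "2018-12-31 21:12:06"),
    tz = "America/Los_Angeles"
  ),
  stringsAsFactors = FALSE
)

# Assumed cutoff: midnight Pacific time on the day the Camp Fire started.
fire_start <- as.POSIXct("2018-11-08 00:00:00", tz = "America/Los_Angeles")

# Keep only tweets posted on or after the cutoff.
post_fire <- cf_toy[cf_toy$created_at_pst >= fire_start, ]
post_fire$status_id
```

The same comparison works inside dplyr's filter() if you prefer the pipe style used elsewhere in this document.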
This sample contains tweets from the following time ranges:
Plus a random sample of 22 hours between 11/09/18 and 12/31/18 (more to come).
> Disclaimer: The hour blocks were randomly sampled. Not all tweets were obtained from that hour.
> e.g. 11/18/18 16:54 - 16:59, 12/29 10:39 - 10:59.
We can use the glimpse() function from dplyr to see the variable names, data types, and examples of data values.
## Rows: 72,748
## Columns: 97
## $ user_id <chr> "141875161", "779816688644526080", "22...
## $ status_id <chr> "1060184727515226112", "10601844918035...
## $ created_at <dttm> 2018-11-07 14:58:30, 2018-11-07 14:57...
## $ screen_name <chr> "munashe12", "VisionsOfNapa", "jaying7...
## $ text <chr> "Is #fire a solid, a liquid, or a gas?...
## $ source <chr> "Twitter for Android", "Twitter Web Cl...
## $ reply_to_status_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ reply_to_user_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ reply_to_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ is_quote <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ is_retweet <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, ...
## $ favorite_count <int> 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 4, 2, 0...
## $ retweet_count <int> 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 2, 4, 0,...
## $ quote_count <int> 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ reply_count <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hashtags <list> ["fire", <"fire", "weather">, <"fire"...
## $ symbols <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ urls_url <list> ["youtu.be/YV8TT9LRBrY", "weather.gov...
## $ urls_t.co <list> ["https://t.co/yxZtJ0R0cv", "https://...
## $ urls_expanded_url <list> ["https://youtu.be/YV8TT9LRBrY", "htt...
## $ media_url <list> [NA, NA, "http://pbs.twimg.com/media/...
## $ media_t.co <list> [NA, NA, "https://t.co/nejQho8rc6", N...
## $ media_expanded_url <list> [NA, NA, "https://twitter.com/KiingDG...
## $ media_type <list> [NA, NA, "photo", NA, NA, NA, NA, "ph...
## $ ext_media_url <list> [NA, NA, "http://pbs.twimg.com/media/...
## $ ext_media_t.co <list> [NA, NA, "https://t.co/nejQho8rc6", N...
## $ ext_media_expanded_url <list> [NA, NA, "https://twitter.com/KiingDG...
## $ ext_media_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mentions_user_id <list> ["1435717328", "910623276778405888", ...
## $ mentions_screen_name <list> ["Txtxndx", "ai6yrham", "KiingDG_", "...
## $ quoted_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_text <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_created_at <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_source <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_favorite_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_retweet_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_user_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_followers_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_friends_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_statuses_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_verified <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ retweet_status_id <chr> "1060095946887979008", "10601825853246...
## $ retweet_text <chr> "Is #fire a solid, a liquid, or a gas?...
## $ retweet_created_at <dttm> 2018-11-07 09:05:44, 2018-11-07 14:50...
## $ retweet_source <chr> "Twitter Web Client", "Twitter Web Cli...
## $ retweet_favorite_count <int> 0, 2, 4, 135, NA, NA, NA, NA, 8, 117, ...
## $ retweet_retweet_count <int> 1, 4, 2, 63, NA, NA, NA, NA, 20, 85, N...
## $ retweet_user_id <chr> "1435717328", "910623276778405888", "2...
## $ retweet_screen_name <chr> "Txtxndx", "ai6yrham", "KiingDG_", "sr...
## $ retweet_name <chr> "Tatenda", "AI6YR", "DG \U0001f60e\U00...
## $ retweet_followers_count <int> 815, 13312, 886, 1722, NA, NA, NA, NA,...
## $ retweet_friends_count <int> 5002, 837, 963, 201, NA, NA, NA, NA, 5...
## $ retweet_statuses_count <int> 3584, 43535, 7346, 20370, NA, NA, NA, ...
## $ retweet_location <chr> "Lille, France", "DM04", "Chesapeake, ...
## $ retweet_description <chr> "\"Music is Life. That’s why our heart...
## $ retweet_verified <lgl> FALSE, FALSE, FALSE, FALSE, NA, NA, NA...
## $ place_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ coords_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, N...
## $ bbox_coords <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <N...
## $ status_url <chr> "https://twitter.com/munashe12/status/...
## $ name <chr> "munashe musoni", "Visions of Napa", "...
## $ location <chr> "Zimbabwe", "Napa, California", "Somew...
## $ description <chr> "Doctor of pharmacy(pharmd), MBA Marke...
## $ url <chr> NA, NA, NA, NA, NA, "http://www.Living...
## $ protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ followers_count <int> 105, 603, 707, 163, 279, 293, 397, 39,...
## $ friends_count <int> 139, 598, 633, 292, 254, 7, 488, 319, ...
## $ listed_count <int> 1, 2, 1, 0, 3, 253, 1, 0, 5, 52, 1, 41...
## $ statuses_count <int> 667, 7135, 1874, 23455, 334, 131307, 4...
## $ favourites_count <int> 507, 16213, 1189, 71897, 875, 0, 41, 1...
## $ account_created_at <dttm> 2010-05-09 08:25:26, 2016-09-24 22:56...
## $ verified <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ profile_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ profile_expanded_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ account_lang <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ profile_banner_url <chr> NA, "https://pbs.twimg.com/profile_ban...
## $ profile_background_url <chr> "http://abs.twimg.com/images/themes/th...
## $ profile_image_url <chr> "http://pbs.twimg.com/profile_images/6...
## $ lat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ lng <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ created_at_pst <dttm> 2018-11-07 06:58:30, 2018-11-07 06:57...
## $ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 20...
## $ month <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11...
## $ day <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,...
## $ hour <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
## $ minute <int> 58, 57, 57, 56, 56, 55, 55, 54, 53, 52...
## $ tweet_min <dttm> 2018-11-07 06:59:00, 2018-11-07 06:58...
## $ tweet_hour <dttm> 2018-11-07 07:00:00, 2018-11-07 07:00...
Below is a list of the data types contained in this data set, with example variables of each type.
* <chr> character variables, or strings: screen_name and source
* <dttm> date-time: created_at and retweet_created_at
* <dbl>, <int> numeric (numbers): favorite_count, retweet_count
* <lgl> logical (TRUE/FALSE): is_retweet, verified
* <list> list: hashtags, coords_coords
Lists are special data types that can contain multiple entries per record. Here is what the hashtags variable looks like for the first 6 records.
## [[1]]
## [1] "fire"
##
## [[2]]
## [1] "fire" "weather"
##
## [[3]]
## [1] "fire" "va"
##
## [[4]]
## [1] "NandhaGopalaKumaran" "NGKFeverBegins"
##
## [[5]]
## [1] "HudsonValley" "fire" "newyork" "pleasantvalley"
## [5] "library"
##
## [[6]]
## [1] "Fire" "recalls"
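Because each record holds a vector of hashtags, ordinary column operations need a small detour. The sketch below rebuilds the six records shown above as a plain list and uses base R to count hashtags per tweet, flag tweets tagged "fire", and tally hashtags overall. The object names are made up for illustration; in the real data you would start from the hashtags column, which also contains NA entries that this toy example does not cover.

```r
# Toy list column shaped like the hashtags output above.
hashtags <- list(
  "fire",
  c("fire", "weather"),
  c("fire", "va"),
  c("NandhaGopalaKumaran", "NGKFeverBegins"),
  c("HudsonValley", "fire", "newyork", "pleasantvalley", "library"),
  c("Fire", "recalls")
)

# Number of hashtags attached to each tweet.
n_tags <- lengths(hashtags)

# Does each tweet carry the "fire" hashtag, ignoring case?
has_fire <- vapply(hashtags, function(tags) "fire" %in% tolower(tags), logical(1))

# Flatten the list and count how often each lower-cased hashtag appears.
tag_counts <- sort(table(tolower(unlist(hashtags))), decreasing = TRUE)
head(tag_counts, 3)
```

Lower-casing first means "Fire" and "fire" are counted as the same hashtag, which is usually what you want for this kind of tally.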
Date-time measures <dttm> can be tricky to work with. I have already created a few simpler date/time variables for you. First, note that Twitter stores all dates and times in Coordinated Universal Time (UTC, equivalent to GMT for our purposes).
All times are stored as 24 hour (military) time. Midnight is 00:00, Noon is 12:00, your class runs from 19:00 to 22:00.
## # A tibble: 6 x 8
## created_at created_at_pst month day hour minute
## <dttm> <dttm> <dbl> <int> <int> <int>
## 1 2013-11-07 23:31:57 2013-11-07 15:31:57 11 7 15 31
## 2 2013-11-08 01:29:53 2013-11-07 17:29:53 11 7 17 29
## 3 2013-11-08 03:09:23 2013-11-07 19:09:23 11 7 19 9
## 4 2013-11-09 16:53:45 2013-11-09 08:53:45 11 9 8 53
## 5 2013-11-19 23:20:40 2013-11-19 15:20:40 11 19 15 20
## 6 2013-11-26 22:40:05 2013-11-26 14:40:05 11 26 14 40
## # ... with 2 more variables: tweet_min <dttm>, tweet_hour <dttm>
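For reference, here is one way the relationship between created_at and created_at_pst in the table above could be reproduced in base R. This is a sketch and not necessarily how the variables in this data set were built; the time zone name "America/Los_Angeles" is an assumption about what "pst" means here.

```r
# A timestamp as Twitter stores it (UTC), taken from the glimpse() output.
created_at <- as.POSIXct("2018-11-07 14:58:30", tz = "UTC")

# Same instant, displayed in Pacific time. The underlying moment is unchanged;
# only the time zone used for printing differs.
created_at_pst <- created_at
attr(created_at_pst, "tzone") <- "America/Los_Angeles"
format(created_at_pst, "%Y-%m-%d %H:%M:%S")

# Pull out separate components of the local time.
hour   <- as.integer(format(created_at_pst, "%H"))
minute <- as.integer(format(created_at_pst, "%M"))

# Round to the nearest minute (30 seconds rounds up).
tweet_min <- round(created_at_pst, "mins")
format(tweet_min, "%Y-%m-%d %H:%M:%S")
```

Using a named time zone instead of a fixed 8-hour offset lets R apply daylight saving time automatically. lubridate's with_tz() performs the same conversion if you prefer that package.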
* created_at: This is in UTC. 15:00 (3pm) UTC is 07:00 (7am) Pacific time.
* created_at_pst: This is our local time. This is the one we want to use. We let the lubridate package work the appropriate magic for now, and will check later when we get tweets that span the time change.
* year/month/day/hour/minute: separated components of the date and time.
* tweet_min: Tweets are recorded to the second. This variable rounds the tweet to the nearest minute.

The location of the tweet is stored in the variables lat and lng. (To do: a better plot.)
How much location data do we even have?
##
## FALSE TRUE
## 1501 71247
Very little. Only 1501 records out of the 72748 (about 2%) contain location data. This is important to remember when we think about generalizability. Not everyone has their location tracking turned on.
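A small sketch of how a count like this can be computed, assuming the table above tabulates whether lat is missing. The vector below is made-up data; on the real sample you would use the lat column instead.

```r
# Toy latitude column: most tweets have no coordinates (NA).
lat <- c(NA, NA, 39.76, NA, NA, NA, 39.73, NA, NA, NA)

# TRUE where location is missing, FALSE where it is present.
table(is.na(lat))

# Count and percentage of tweets that do carry a location.
n_located   <- sum(!is.na(lat))
pct_located <- round(100 * mean(!is.na(lat)), 2)

# The same arithmetic applied to the counts reported for the full sample.
pct_sample <- round(100 * 1501 / 72748, 2)
```

Counting the non-missing values directly avoids having to remember which column of the TRUE/FALSE table means "has a location".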
The media variables let you look at any videos, pictures or gifs that are attached to tweets. It’s important to note that not all of the tweets have media information.
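# TRUE when a tweet has media attached (media_url is NA otherwise)
has_media <- !is.na(cf$media_url)
# Percentage of tweets with media, rounded to 2 decimal places
percentWithMedia <- round(mean(has_media) * 100, digits = 2)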
For this particular version of the dataset, we have 72748 tweets and only 34.84% have media data.
The following examples use one media story in particular that I found interesting.
* media_url: contains the URL for the media (picture or video). It can be viewed online.
index <- which(grepl("http://pbs.twimg.com/media/Driv9H9U4AAMdoV.jpg", cf$media_url))
(url <- cf$media_url[index][[1]])
## [1] "http://pbs.twimg.com/media/Driv9H9U4AAMdoV.jpg"
http://pbs.twimg.com/media/Driv9H9U4AAMdoV.jpg
# The extended media URL for the same tweet (same index as above)
(url <- cf$ext_media_url[index][[1]])
## [1] "http://pbs.twimg.com/media/Driv9H9U4AAMdoV.jpg"
http://pbs.twimg.com/media/Driv9H9U4AAMdoV.jpg
## [1] "https://t.co/OdPj8DGgs8"
# The shortened t.co link from the extended media fields, for the same tweet
(url <- cf$ext_media_t.co[index][[1]])
## [1] "https://t.co/OdPj8DGgs8"
# The expanded URL, which points to the media on twitter.com
(url <- cf$media_expanded_url[index][[1]])
## [1] "https://twitter.com/abcWNN/status/1060791248095600640/video/1"
https://twitter.com/abcWNN/status/1060791248095600640/video/1
# The expanded URL from the extended media fields, for the same tweet
(url <- cf$ext_media_expanded_url[index][[1]])
## [1] "https://twitter.com/abcWNN/status/1060791248095600640/video/1"
https://twitter.com/abcWNN/status/1060791248095600640/video/1
## [1] "photo"
The verified variable flags accounts of public interest, whose authenticity is denoted by a blue checkmark or “badge”. Typically, these accounts are maintained by users in music, acting, fashion, government, politics, religion, journalism, media, sports, business, and other key interest areas.
##
## FALSE TRUE
## 67802 4946
rtweet itself does not return any field indicating whether an account is an official news organization. To identify news organizations we can consider how many followers an account has, whether it is verified, and whether its description and/or username indicates that it is a news outlet. This gives decent results, but there is room for error: we may miss some news organizations and include some accounts that are not news organizations.
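library(dplyr)    # %>%, distinct(), filter(), arrange()
library(stringr)  # str_detect()
library(rtweet)   # users_data()

# Verified accounts whose screen name or description mentions "news"
news_orgs <- cf %>%
  users_data() %>%
  distinct(screen_name, .keep_all = TRUE) %>%
  filter(str_detect(screen_name, "news|News") | str_detect(description, "news|News")) %>%
  filter(verified) %>%  # verified is already logical, so no comparison to "TRUE" is needed
  arrange(desc(followers_count))
head(news_orgs)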
## # A tibble: 6 x 20
## user_id screen_name name location description url protected
## <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 759251 CNN CNN <NA> "It’s our ~ http~ FALSE
## 2 807095 nytimes The ~ New Yor~ "News tips~ http~ FALSE
## 3 287854~ ABC ABC ~ New Yor~ "All the n~ http~ FALSE
## 4 487118~ XHNews Chin~ Headqua~ "China ins~ <NA> FALSE
## 5 145119~ HuffPost Huff~ <NA> "At HuffPo~ http~ FALSE
## 6 6017542 BreakingNe~ Brea~ NYC, LA~ <NA> http~ FALSE
## # ... with 13 more variables: followers_count <int>, friends_count <int>,
## # listed_count <int>, statuses_count <int>, favourites_count <int>,
## # account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## # profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>
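The filter above uses the verified flag and the word "news", but not the follower count mentioned earlier. The sketch below shows one way all three criteria could be combined, using base R on a made-up account table. The accounts, the `min_followers` cutoff, and the object names are hypothetical and only illustrate the idea.

```r
# Toy account table; column names follow the users data shown above.
accounts <- data.frame(
  screen_name     = c("BigNewsDesk", "local_news_fan", "ChefAnna", "CountyNews"),
  description     = c("Breaking news, 24/7", "I read the news", "Recipes", NA),
  verified        = c(TRUE, FALSE, TRUE, TRUE),
  followers_count = c(2500000L, 120L, 90000L, 8000L),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff for "large enough to be a news outlet"; tune as needed.
min_followers <- 10000

# Case-insensitive match on "news" in the screen name or description.
# grepl() returns FALSE for NA, so an account with no description is still
# kept when its screen name matches.
mentions_news <- grepl("news", accounts$screen_name, ignore.case = TRUE) |
  grepl("news", accounts$description, ignore.case = TRUE)

# Apply all three criteria, then sort by follower count, largest first.
likely_news <- accounts[
  mentions_news & accounts$verified & accounts$followers_count >= min_followers, ]
likely_news <- likely_news[order(-likely_news$followers_count), ]
likely_news$screen_name
```

A follower cutoff trades one kind of error for another: it removes small accounts that merely mention news, but it also drops small local outlets such as the made-up CountyNews account here.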
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.2.1 stringr_1.4.0 rtweet_0.7.0 dplyr_0.8.99.9002
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4 pillar_1.4.3 compiler_3.6.2
## [4] tools_3.6.2 digest_0.6.25 lubridate_1.7.4
## [7] jsonlite_1.6.1 evaluate_0.14 lifecycle_0.2.0
## [10] tibble_3.0.0 gtable_0.3.0 pkgconfig_2.0.3
## [13] rlang_0.4.5.9000 cli_2.0.2 yaml_2.2.1
## [16] xfun_0.9 withr_2.1.2 httr_1.4.1
## [19] knitr_1.24 generics_0.0.2 vctrs_0.2.99.9011
## [22] grid_3.6.2 tidyselect_1.0.0 glue_1.4.0
## [25] R6_2.4.1 fansi_0.4.1 rmarkdown_1.15
## [28] purrr_0.3.3 magrittr_1.5 scales_1.0.0
## [31] ellipsis_0.3.0 htmltools_0.3.6 assertthat_0.2.1
## [34] colorspace_1.4-1 labeling_0.3 utf8_1.1.4
## [37] stringi_1.4.6 lazyeval_0.2.2 munsell_0.5.0
## [40] crayon_1.3.4