COVID-19 data was collected by combining two approaches: pulling tweets directly with the rtweet package, and hydrating previously identified tweets relating to the pandemic. See this tweet sampler file for information on how IDs were randomly selected. The hydrated data comes from this repository by Emily Chen and Dr. Ferrara at USC. This database starts on 01-28-2020.
Load libraries and set options.
Set the working directory, then read in the current data and examine the structure of a tweet about COVID19.
There are 410564 tweets in this data set across the following languages:
##
##      am     ar     bg     bn     bo     ca    ckb     cs     cy     da
##       4   3004      6     30      1   1423      4    154     65    140
##      de     dv     el     en     es     et     eu     fa     fi     fr
##    2638      7    176 262818  39814    299     69    334    195   9933
##      gu     hi     ht     hu     hy     in     is     it     iw     ja
##      36   2926    241     32      1  13127     18   5013     37  16414
##      km     kn     ko     lo     lt     lv     ml     mr     my     ne
##       2     33   2221      1    250     54     35    102      5     49
##      nl     no     or     pa     pl     ps     pt     ro     ru     sd
##    1089     66     14      5    435      6  11225    211    698      7
##      si     sl     sr     sv     ta     te     th     tl     tr     uk
##      14    103     28    221    413     89  11544   2503   4022     73
##     und     ur     vi     zh
##   13941    358    161   1627
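The code that produced this table is not shown. A minimal sketch, assuming the counts come from tabulating the lang variable, might look like the following. The toy vector lang stands in for cv$lang and its values are purely illustrative.

```r
# Toy stand-in for cv$lang (hypothetical values, not the real data)
lang <- c("en", "es", "en", "und", "fr", "en", "es")

# Count tweets per language code, as in the table above
lang_counts <- table(lang)

# Sort from most to least common and convert to percentages
lang_sorted <- sort(lang_counts, decreasing = TRUE)
lang_pct <- round(100 * prop.table(lang_sorted), 1)
print(lang_pct)
```

In the real data, English accounts for 262818 of the 410564 tweets, or about 64%. The code und is what Twitter assigns when it cannot determine a tweet's language.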
Here is the current timeframe of the collected tweets, and the number of tweets collected. Some groups specifically collected tweets from @WHO and @CDCgov going back to December 2019.
n.days <- cv %>% group_by(month, day) %>% count()
n.hours <- cv %>% group_by(month, day, hour) %>% count()
We have approximately 174 tweets per hour, and 2975 per day.
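The calculation behind these averages is not shown. One plausible reading, sketched here as an assumption, is the mean of the n column in each of the two count tables. The data frame cv_toy is a hypothetical stand-in for cv with only the time components.

```r
library(dplyr)

# Toy stand-in for cv with only the time components (hypothetical values)
cv_toy <- data.frame(
  month = c(2, 2, 2, 3),
  day   = c(1, 1, 2, 1),
  hour  = c(10, 11, 10, 9)
)

# Same grouping as above: one row per day, and one row per hour
n.days  <- cv_toy %>% group_by(month, day) %>% count()
n.hours <- cv_toy %>% group_by(month, day, hour) %>% count()

# Average number of tweets per observed day and per observed hour
avg_per_day  <- mean(n.days$n)
avg_per_hour <- mean(n.hours$n)
```

Note that hours with no tweets never appear as rows in n.hours, so under this approach the hourly average is taken over hours that had at least one tweet.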
Counts of tweets per day since the bulk of the data collection began on Jan 28.
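# Keep February through May; month > 1 also drops the tweets from Jan 28-31
recent <- cv %>% filter(month>1, month<6)
# One bar per day, with an axis break every two weeks
ggplot(recent, aes(x=tweet_day)) + geom_bar() +
scale_x_datetime(date_breaks = "2 weeks")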
Twitter stores all dates and times in Coordinated Universal Time (UTC, which for our purposes is the same as GMT).
All times are stored in 24-hour (military) time. Midnight is 00:00, noon is 12:00, and your class runs from 19:00 to 22:00.
created_at_UTC: The time the tweet was created, in UTC. 15:00 (3pm) UTC is 07:00 (7am) Pacific Standard Time, or 08:00 (8am) during daylight saving time.
year, month, day, hour, minute: Separated components of the date and time.
tweet_min: Tweets are recorded to the second. This variable rounds the tweet time to the nearest minute.
We can use the glimpse() function from dplyr to see the variable names, data types, and examples of data values.
## Rows: 410,564
## Columns: 101
## $ user_id <chr> "146569971", "146569971", "146569971",...
## $ tweet_id <dbl> 1.201605e+18, 1.201903e+18, 1.201922e+...
## $ created_at_UTC <dttm> 2019-12-02 20:50:15, 2019-12-03 16:37...
## $ screen_name <chr> "CDCgov", "CDCgov", "CDCgov", "CDCgov"...
## $ text <chr> "@ErickLeeOrtiz Wash hands for 20 seco...
## $ source <chr> "Twitter Web App", "Sprout Social", "T...
## $ display_text_width <dbl> 125, 140, NA, 140, NA, NA, NA, 140, 14...
## $ reply_to_status_id <dbl> 1.200502e+18, NA, NA, NA, NA, NA, NA, ...
## $ reply_to_user_id <dbl> 1700262326, NA, NA, NA, NA, NA, NA, NA...
## $ reply_to_screen_name <chr> "ErickLeeOrtiz", NA, NA, NA, NA, NA, N...
## $ is_quote <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALS...
## $ is_retweet <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE,...
## $ favorite_count <int> 1, 50, 0, 137, 0, 0, 0, 20, 99, 0, 129...
## $ retweet_count <int> 8, 43, 0, 90, 0, 0, 0, 28, 48, 8, 120,...
## $ quote_count <int> 0, 1, 0, 17, 0, 0, 0, 1, 12, 0, 23, 0,...
## $ reply_count <int> 0, 0, 0, 7, 0, 0, 0, 2, 4, 0, 12, 0, 0...
## $ hashtags <chr> "", "", "HealthforAll", "EndHIVEpidemi...
## $ symbols <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ urls_url <list> ["twitter.com/i/web/status/1…", "twit...
## $ urls_t.co <list> ["https://t.co/0eCIf2sWRP", "https://...
## $ urls_expanded_url <list> ["https://twitter.com/i/web/status/12...
## $ media_t.co <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ media_expanded_url <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ media_type <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ext_media_url <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ext_media_t.co <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ext_media_expanded_url <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ext_media_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mentions_user_id <list> ["1700262326", NA, "14499829", NA, "1...
## $ mentions_screen_name <list> ["ErickLeeOrtiz", NA, "WHO", NA, "CDC...
## $ lang <chr> "en", "en", "en", "en", "en", "en", "e...
## $ quoted_status_id <chr> NA, "1201572495603712003", NA, NA, NA,...
## $ quoted_text <chr> NA, "Each year, millions of children g...
## $ quoted_created_at <dttm> NA, 2019-12-02 18:43:01, NA, NA, NA, ...
## $ quoted_source <chr> NA, "Sprout Social", NA, NA, NA, NA, N...
## $ quoted_favorite_count <int> NA, 9, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_retweet_count <int> NA, 11, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_user_id <chr> NA, "16616061", NA, NA, NA, NA, NA, NA...
## $ quoted_screen_name <chr> NA, "CDCFlu", NA, NA, NA, NA, NA, NA, ...
## $ quoted_name <chr> NA, "CDC Flu", NA, NA, NA, NA, NA, NA,...
## $ quoted_followers_count <int> NA, 891446, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_friends_count <int> NA, 152, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_statuses_count <int> NA, 7046, NA, NA, NA, NA, NA, NA, NA, ...
## $ quoted_location <chr> NA, "Atlanta, GA", NA, NA, NA, NA, NA,...
## $ quoted_description <chr> NA, "Flu-related updates from the Cent...
## $ quoted_verified <lgl> NA, TRUE, NA, NA, NA, NA, NA, NA, NA, ...
## $ retweet_status_id <dbl> NA, NA, 1.192365e+18, NA, 1.201935e+18...
## $ retweet_text <chr> NA, NA, "We want YOU!\nCalling all fil...
## $ retweet_created_at <dttm> NA, NA, 2019-11-07 08:55:51, NA, 2019...
## $ retweet_source <chr> NA, NA, "Twitter Web App", NA, "Twitte...
## $ retweet_favorite_count <int> NA, NA, 465, NA, 39, 60, 23, NA, NA, N...
## $ retweet_retweet_count <int> NA, NA, 294, NA, 28, 48, 33, NA, NA, N...
## $ retweet_user_id <chr> NA, NA, "14499829", NA, "135265904", "...
## $ retweet_screen_name <chr> NA, NA, "WHO", NA, "CDCTobaccoFree", "...
## $ retweet_name <chr> NA, NA, "World Health Organization (WH...
## $ retweet_followers_count <int> NA, NA, 7388912, NA, 35652, 8724, 1026...
## $ retweet_friends_count <int> NA, NA, 1718, NA, 179, 543, 248, NA, N...
## $ retweet_statuses_count <int> NA, NA, 50118, NA, 7291, 1816, 3955, N...
## $ retweet_location <chr> NA, NA, "Geneva, Switzerland", NA, "At...
## $ retweet_description <chr> NA, NA, "We are the #UnitedNations’ he...
## $ retweet_verified <lgl> NA, NA, TRUE, NA, TRUE, TRUE, FALSE, N...
## $ place_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ coords_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, N...
## $ bbox_coords <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <N...
## $ tweet_url <chr> "https://twitter.com/CDCgov/status/120...
## $ name <chr> "CDC", "CDC", "CDC", "CDC", "CDC", "CD...
## $ user_location <chr> "Atlanta, GA", "Atlanta, GA", "Atlanta...
## $ user_description <chr> "CDC's official Twitter source for dai...
## $ url <chr> "http://www.cdc.gov", "http://www.cdc....
## $ protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ user_followers_count <int> 2549814, 2549814, 2549814, 2549814, 25...
## $ user_friends_count <int> 266, 266, 266, 266, 266, 266, 266, 266...
## $ user_listed_count <int> 17162, 17162, 17162, 17162, 17162, 171...
## $ user_statuses_count <int> 26566, 26566, 26566, 26566, 26566, 265...
## $ user_favourites_count <int> 522, 522, 522, 522, 522, 522, 522, 522...
## $ account_created_at <dttm> 2010-05-21 19:40:40, 2010-05-21 19:40...
## $ verified <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
## $ profile_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ profile_expanded_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ account_lang <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ profile_banner_url <chr> "https://pbs.twimg.com/profile_banners...
## $ profile_background_url <chr> "http://abs.twimg.com/images/themes/th...
## $ profile_image_url <chr> "http://pbs.twimg.com/profile_images/8...
## $ lat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ lng <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ urls <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ possibly_sensitive <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ user_time_zone <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 20...
## $ month <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12...
## $ day <int> 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4,...
## $ hour <int> 20, 16, 17, 18, 20, 20, 20, 20, 20, 21...
## $ minute <int> 50, 37, 51, 3, 30, 31, 31, 40, 47, 0, ...
## $ tweet_min <dttm> 2019-12-02 20:50:00, 2019-12-03 16:37...
## $ tweet_hour <dttm> 2019-12-02 21:00:00, 2019-12-03 17:00...
## $ tweet_day <dttm> 2019-12-03, 2019-12-04, 2019-12-04, 2...
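The code that created the time variables is not part of this document. As a hedged sketch, the following base R code reproduces the first row of the output above from its created_at_UTC value. Base R is used here only so the example runs without extra packages; the original may well have used a different approach.

```r
# First timestamp from the glimpse() output above, stored in UTC
created_at_UTC <- as.POSIXct("2019-12-02 20:50:15", tz = "UTC")

# Separated components, read off in UTC
hour   <- as.integer(format(created_at_UTC, "%H", tz = "UTC"))
minute <- as.integer(format(created_at_UTC, "%M", tz = "UTC"))

# Rounded versions; round() on a date-time returns POSIXlt, so convert back
tweet_min  <- as.POSIXct(round(created_at_UTC, "mins"))
tweet_hour <- as.POSIXct(round(created_at_UTC, "hours"))
tweet_day  <- as.POSIXct(round(created_at_UTC, "days"))

# The same instant shown on a Pacific clock
pacific <- format(created_at_UTC, "%Y-%m-%d %H:%M", tz = "America/Los_Angeles")
```

The results match the first values of tweet_min, tweet_hour, and tweet_day shown above (20:50:00, 21:00:00, and 2019-12-03). This suggests that tweet_hour and tweet_day are rounded to the nearest unit rather than truncated, so a tweet sent in the evening UTC can be assigned to the following day.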
Below is a list of data types contained in this data set, with example variables of each type.
<chr>: character variables, or strings, such as screen_name and source
<dttm>: date-time, such as created_at_UTC
<dbl>, <int>: numeric (numbers), such as favorite_count and retweet_count
The hashtags in this data set are stored in their own variable, hashtags, with multiple hashtags separated by spaces. Notice that not all tweets contain hashtags.
## [1] "" ""
## [3] "HealthforAll" "EndHIVEpidemic HIV VitalSigns"
## [5] "" "flu"
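Because several hashtags share one string, counting individual hashtags requires splitting on the space first. A minimal sketch, using the six values printed above in place of cv$hashtags:

```r
# First six values of the hashtags variable, as printed above
hashtags <- c("", "", "HealthforAll", "EndHIVEpidemic HIV VitalSigns", "", "flu")

# Tweets with no hashtags are stored as empty strings
has_tag <- hashtags != ""

# Split the space-separated strings into one hashtag per element
tags <- unlist(strsplit(hashtags[has_tag], " ", fixed = TRUE))

# Count each hashtag, ignoring capitalization
tag_counts <- sort(table(tolower(tags)), decreasing = TRUE)
```

Lowercasing before counting is a design choice rather than a requirement; it treats differently capitalized spellings of the same hashtag as one.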
The location of the tweet is stored in the variables lat and lng.
How much location data do we even have?
##
## FALSE TRUE
## 375 410189
Almost none.
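The table above is consistent with tabulating whether lat is missing, with TRUE meaning no location was recorded. A sketch of that check, assuming this is how the table was made and using a hypothetical toy vector in place of cv$lat:

```r
# Toy stand-in for cv$lat (hypothetical values); NA means no location recorded
lat <- c(NA, NA, 33.7, NA, NA, NA, 47.6, NA)

# TRUE = latitude missing, FALSE = latitude present
missing_lat <- table(is.na(lat))

# Share of tweets that carry a latitude, in percent
pct_located <- 100 * mean(!is.na(lat))
```

In the real data, 375 of 410564 tweets have a latitude, which is about 0.09%.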
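library(maps) # supplies the map data used by map_data()
library(ggrepel)
# Country outlines as a data frame; note this one names longitude "long"
world <- map_data("world")
ggplot() +
geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.2) +
coord_map() + theme_void() +
# Tweet locations; in cv the longitude variable is "lng"
geom_point(data=cv, aes(x=lng, y=lat))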
And the data that we do have doesn’t seem to be accurate.
The media variables let you look at any videos, pictures, or GIFs that are attached to tweets. Note that not all tweets have media information.
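In the output above, media_type is a list column whose entries are NA when a tweet has no media. One way to flag tweets with media, sketched under that assumption with a hypothetical toy list in place of cv$media_type:

```r
# Toy stand-in for cv$media_type, a list column (hypothetical values)
media_type <- list(NA, "photo", NA, c("photo", "photo"), NA)

# A tweet has media when its list entry holds something other than NA
has_media <- vapply(media_type, function(x) !all(is.na(x)), logical(1))

# Number of tweets with and without media
media_counts <- table(has_media)
```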
The verified variable flags accounts of public interest, whose authenticity is denoted by a blue checkmark or “badge”. Typically, these accounts are maintained by users in music, acting, fashion, government, politics, religion, journalism, media, sports, business, and other key interest areas.
##
## FALSE TRUE
## 397624 12939
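The two counts above sum to 410563, one fewer than the 410564 rows in the data set, which would be consistent with a single missing value. By default table() drops missing values, so a sketch like the following can make them visible. The toy vector is a hypothetical stand-in for cv$verified.

```r
# Toy stand-in for cv$verified (hypothetical values), including one missing value
verified <- c(FALSE, TRUE, FALSE, FALSE, NA, FALSE)

# table() silently drops NA by default
tab_default <- table(verified)

# useNA = "ifany" adds a count of missing values when there are any
tab_with_na <- table(verified, useNA = "ifany")
print(tab_with_na)
```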
Last run on 2020-04-30 10:23:33
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggrepel_0.8.1 maps_3.3.0 ggplot2_3.2.1 dplyr_0.8.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 pillar_1.4.3 compiler_3.6.1 tools_3.6.1
## [5] digest_0.6.21 rtweet_0.7.0 jsonlite_1.6.1 evaluate_0.14
## [9] lifecycle_0.2.0 tibble_3.0.0 gtable_0.3.0 pkgconfig_2.0.3
## [13] rlang_0.4.5 cli_2.0.1 mapproj_1.2.7 yaml_2.2.1
## [17] xfun_0.10 withr_2.1.2 stringr_1.4.0 httr_1.4.1
## [21] knitr_1.25 vctrs_0.2.4 grid_3.6.1 tidyselect_0.2.5
## [25] glue_1.3.1 R6_2.4.0 fansi_0.4.0 rmarkdown_1.18
## [29] purrr_0.3.3 magrittr_1.5 scales_1.0.0 ellipsis_0.3.0
## [33] htmltools_0.4.0 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
## [37] utf8_1.1.4 stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0
## [41] crayon_1.3.4