COVID-19 data were collected in two ways: by pulling tweets directly with the rtweet package, and by hydrating previously identified tweet IDs relating to the pandemic. See this tweet sampler file for information on how IDs were randomly selected. The hydrated data come from this repository by Emily Chen and Dr. Ferrara at USC, whose collection starts on 01-28-2020.

Search queries used (when not using the hydrator)

Recent tweets (3/30/20 - 4/7/20) “fema AND (covid19 OR coronavirus OR covid OR covid19 OR covd OR pandemic OR outbreak OR covid-19)”
Full timelines for @WHO and @CDCgov since December.
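
For reference, here is a rough sketch of how queries like these might be issued with rtweet. The calls are commented out because they need Twitter API credentials, and the `n` values are assumptions rather than the settings actually used for this data set.

```r
# Search string copied from the query listed above
q <- "fema AND (covid19 OR coronavirus OR covid OR covid19 OR covd OR pandemic OR outbreak OR covid-19)"

# Not run: both calls require an authenticated Twitter API token.
# n = 18000 and n = 3200 are illustrative values, not the original settings.
# fema_tweets <- rtweet::search_tweets(q, n = 18000, include_rts = TRUE)
# timelines   <- rtweet::get_timeline(c("WHO", "CDCgov"), n = 3200)
```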

Load libraries and set options.

library(dplyr); library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Set the working directory, then read in the current data and examine the structure of a tweet about COVID19.

# will have to change this
#wd <- "G:/My Drive/Teaching/POLS 624 S20/Twitter project docs/data" 
#cv <- readRDS(file.path(wd, "covid-tweets-2020-04-06.Rds"))
cv <- readRDS("../data/covid-tweets-2020-04-30.Rds")

Overall info

There are 410564 tweets in this data set across the following languages:

table(cv$lang)

## 
##     am     ar     bg     bn     bo     ca    ckb     cs     cy     da 
##      4   3004      6     30      1   1423      4    154     65    140 
##     de     dv     el     en     es     et     eu     fa     fi     fr 
##   2638      7    176 262818  39814    299     69    334    195   9933 
##     gu     hi     ht     hu     hy     in     is     it     iw     ja 
##     36   2926    241     32      1  13127     18   5013     37  16414 
##     km     kn     ko     lo     lt     lv     ml     mr     my     ne 
##      2     33   2221      1    250     54     35    102      5     49 
##     nl     no     or     pa     pl     ps     pt     ro     ru     sd 
##   1089     66     14      5    435      6  11225    211    698      7 
##     si     sl     sr     sv     ta     te     th     tl     tr     uk 
##     14    103     28    221    413     89  11544   2503   4022     73 
##    und     ur     vi     zh 
##  13941    358    161   1627
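
The table is easier to read when sorted by frequency. In Twitter's language codes, `und` marks tweets whose language could not be determined, and `in` and `iw` are older codes for Indonesian and Hebrew. The sketch below uses a small made-up vector in place of `cv$lang`; the English share at the end uses the counts printed above.

```r
# Toy stand-in for cv$lang (made-up values)
lang <- c("en", "en", "es", "ja", "en", "und")

# Most common languages first; keep the top three
top.lang <- head(sort(table(lang), decreasing = TRUE), 3)
top.lang

# Share of English tweets, using the counts from the output above
pct.en <- round(100 * 262818 / 410564)
pct.en
```

With the real data, `sort(table(cv$lang), decreasing = TRUE)` gives the same kind of ranking.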

Dates and Times

Here is the current timeframe of collected tweets, and the number of tweets collected. Some groups specifically collected tweets from @WHO and @CDCgov going back to December.

rtweet::ts_plot(cv)

n.days <- cv %>% group_by(month, day) %>% count()
n.hours <- cv %>% group_by(month, day, hour) %>% count()

We have approximately 174 tweets per hour, and 2975 per day.
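
One plausible way to get averages like these is to take the mean of the `n` column in each count table. This is a sketch on a made-up data frame, not necessarily the exact calculation behind the numbers above. Note that hours and days with no tweets do not appear in the count tables, so they are not part of the average.

```r
library(dplyr)

# Toy stand-in for cv: six tweets spread over two days (made-up values)
toy <- data.frame(
  month = c(4, 4, 4, 4, 4, 4),
  day   = c(1, 1, 1, 2, 2, 2),
  hour  = c(9, 9, 10, 9, 14, 14)
)

# Same grouping as n.days and n.hours above
toy.days  <- toy %>% group_by(month, day) %>% count()
toy.hours <- toy %>% group_by(month, day, hour) %>% count()

# Average tweets per day and per hour, over days/hours with at least one tweet
per.day  <- mean(toy.days$n)
per.hour <- mean(toy.hours$n)
```

On the real data the equivalent calls would be `mean(n.days$n)` and `mean(n.hours$n)`.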

Counts of tweets per day since the bulk of the data collection began on Jan 28

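# month > 1 and month < 6 keep February through May, so tweets from
# December and January (including Jan 28-31) are left out of this plot
recent <- cv %>% filter(month>1, month<6)
# one bar per day, with an x-axis label every two weeks
ggplot(recent, aes(x=tweet_day)) + geom_bar() + 
      scale_x_datetime(breaks = "2 weeks")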

Twitter stores all dates and times in Coordinated Universal Time (UTC), which for our purposes is the same as GMT.

All times are stored in 24-hour (military) time: midnight is 00:00, noon is 12:00, and your class runs from 19:00 to 22:00.

created_at_UTC: the time the tweet was posted, in UTC. 15:00 (3pm) UTC is 07:00 (7am) Pacific Standard Time, or 08:00 (8am) Pacific time once daylight saving time is in effect.
year/month/day/hour/minute: separated components of the date and time
tweet_min: tweets are recorded to the second; this variable rounds the tweet time to the nearest minute.
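
The conversion from UTC to Pacific time, and the rounding to the minute, can be sketched with base R. The two 15:00 timestamps below are made up for illustration; the 20:50:15 one is the first `created_at_UTC` value in the data. This is one way to do it, not necessarily how the variables in `cv` were built.

```r
# A timestamp stored in UTC, as in created_at_UTC (made-up value)
utc.jan <- as.POSIXct("2020-01-15 15:00:00", tz = "UTC")

# Same instant on the Pacific clock: standard time in January (UTC-8)
pacific.jan <- format(utc.jan, "%H:%M", tz = "America/Los_Angeles")

# Daylight saving time began on 2020-03-08, so in April the offset is UTC-7
utc.apr <- as.POSIXct("2020-04-15 15:00:00", tz = "UTC")
pacific.apr <- format(utc.apr, "%H:%M", tz = "America/Los_Angeles")

# Round a tweet time to the nearest minute, as tweet_min is described as doing
created <- as.POSIXct("2019-12-02 20:50:15", tz = "UTC")
rounded <- as.POSIXct(round(created, units = "mins"))
format(rounded, "%Y-%m-%d %H:%M:%S")
```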

Full variable list

We can use the glimpse() function from dplyr to see the variable names, data types, and examples of data values.

glimpse(cv)

## Rows: 410,564
## Columns: 101
## $ user_id                 <chr> "146569971", "146569971", "146569971",...
## $ tweet_id                <dbl> 1.201605e+18, 1.201903e+18, 1.201922e+...
## $ created_at_UTC          <dttm> 2019-12-02 20:50:15, 2019-12-03 16:37...
## $ screen_name             <chr> "CDCgov", "CDCgov", "CDCgov", "CDCgov"...
## $ text                    <chr> "@ErickLeeOrtiz Wash hands for 20 seco...
## $ source                  <chr> "Twitter Web App", "Sprout Social", "T...
## $ display_text_width      <dbl> 125, 140, NA, 140, NA, NA, NA, 140, 14...
## $ reply_to_status_id      <dbl> 1.200502e+18, NA, NA, NA, NA, NA, NA, ...
## $ reply_to_user_id        <dbl> 1700262326, NA, NA, NA, NA, NA, NA, NA...
## $ reply_to_screen_name    <chr> "ErickLeeOrtiz", NA, NA, NA, NA, NA, N...
## $ is_quote                <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALS...
## $ is_retweet              <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, TRUE,...
## $ favorite_count          <int> 1, 50, 0, 137, 0, 0, 0, 20, 99, 0, 129...
## $ retweet_count           <int> 8, 43, 0, 90, 0, 0, 0, 28, 48, 8, 120,...
## $ quote_count             <int> 0, 1, 0, 17, 0, 0, 0, 1, 12, 0, 23, 0,...
## $ reply_count             <int> 0, 0, 0, 7, 0, 0, 0, 2, 4, 0, 12, 0, 0...
## $ hashtags                <chr> "", "", "HealthforAll", "EndHIVEpidemi...
## $ symbols                 <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ urls_url                <list> ["twitter.com/i/web/status/1…", "twit...
## $ urls_t.co               <list> ["https://t.co/0eCIf2sWRP", "https://...
## $ urls_expanded_url       <list> ["https://twitter.com/i/web/status/12...
## $ media_t.co              <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ media_expanded_url      <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ media_type              <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ext_media_url           <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ext_media_t.co          <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ext_media_expanded_url  <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mentions_user_id        <list> ["1700262326", NA, "14499829", NA, "1...
## $ mentions_screen_name    <list> ["ErickLeeOrtiz", NA, "WHO", NA, "CDC...
## $ lang                    <chr> "en", "en", "en", "en", "en", "en", "e...
## $ quoted_status_id        <chr> NA, "1201572495603712003", NA, NA, NA,...
## $ quoted_text             <chr> NA, "Each year, millions of children g...
## $ quoted_created_at       <dttm> NA, 2019-12-02 18:43:01, NA, NA, NA, ...
## $ quoted_source           <chr> NA, "Sprout Social", NA, NA, NA, NA, N...
## $ quoted_favorite_count   <int> NA, 9, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_retweet_count    <int> NA, 11, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_user_id          <chr> NA, "16616061", NA, NA, NA, NA, NA, NA...
## $ quoted_screen_name      <chr> NA, "CDCFlu", NA, NA, NA, NA, NA, NA, ...
## $ quoted_name             <chr> NA, "CDC Flu", NA, NA, NA, NA, NA, NA,...
## $ quoted_followers_count  <int> NA, 891446, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_friends_count    <int> NA, 152, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_statuses_count   <int> NA, 7046, NA, NA, NA, NA, NA, NA, NA, ...
## $ quoted_location         <chr> NA, "Atlanta, GA", NA, NA, NA, NA, NA,...
## $ quoted_description      <chr> NA, "Flu-related updates from the Cent...
## $ quoted_verified         <lgl> NA, TRUE, NA, NA, NA, NA, NA, NA, NA, ...
## $ retweet_status_id       <dbl> NA, NA, 1.192365e+18, NA, 1.201935e+18...
## $ retweet_text            <chr> NA, NA, "We want YOU!\nCalling all fil...
## $ retweet_created_at      <dttm> NA, NA, 2019-11-07 08:55:51, NA, 2019...
## $ retweet_source          <chr> NA, NA, "Twitter Web App", NA, "Twitte...
## $ retweet_favorite_count  <int> NA, NA, 465, NA, 39, 60, 23, NA, NA, N...
## $ retweet_retweet_count   <int> NA, NA, 294, NA, 28, 48, 33, NA, NA, N...
## $ retweet_user_id         <chr> NA, NA, "14499829", NA, "135265904", "...
## $ retweet_screen_name     <chr> NA, NA, "WHO", NA, "CDCTobaccoFree", "...
## $ retweet_name            <chr> NA, NA, "World Health Organization (WH...
## $ retweet_followers_count <int> NA, NA, 7388912, NA, 35652, 8724, 1026...
## $ retweet_friends_count   <int> NA, NA, 1718, NA, 179, 543, 248, NA, N...
## $ retweet_statuses_count  <int> NA, NA, 50118, NA, 7291, 1816, 3955, N...
## $ retweet_location        <chr> NA, NA, "Geneva, Switzerland", NA, "At...
## $ retweet_description     <chr> NA, NA, "We are the #UnitedNations’ he...
## $ retweet_verified        <lgl> NA, NA, TRUE, NA, TRUE, TRUE, FALSE, N...
## $ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ coords_coords           <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, N...
## $ bbox_coords             <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <N...
## $ tweet_url               <chr> "https://twitter.com/CDCgov/status/120...
## $ name                    <chr> "CDC", "CDC", "CDC", "CDC", "CDC", "CD...
## $ user_location           <chr> "Atlanta, GA", "Atlanta, GA", "Atlanta...
## $ user_description        <chr> "CDC's official Twitter source for dai...
## $ url                     <chr> "http://www.cdc.gov", "http://www.cdc....
## $ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ user_followers_count    <int> 2549814, 2549814, 2549814, 2549814, 25...
## $ user_friends_count      <int> 266, 266, 266, 266, 266, 266, 266, 266...
## $ user_listed_count       <int> 17162, 17162, 17162, 17162, 17162, 171...
## $ user_statuses_count     <int> 26566, 26566, 26566, 26566, 26566, 265...
## $ user_favourites_count   <int> 522, 522, 522, 522, 522, 522, 522, 522...
## $ account_created_at      <dttm> 2010-05-21 19:40:40, 2010-05-21 19:40...
## $ verified                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
## $ profile_url             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ profile_expanded_url    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ account_lang            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners...
## $ profile_background_url  <chr> "http://abs.twimg.com/images/themes/th...
## $ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/8...
## $ lat                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ lng                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ urls                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ possibly_sensitive      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ user_time_zone          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ year                    <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 20...
## $ month                   <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12...
## $ day                     <int> 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4,...
## $ hour                    <int> 20, 16, 17, 18, 20, 20, 20, 20, 20, 21...
## $ minute                  <int> 50, 37, 51, 3, 30, 31, 31, 40, 47, 0, ...
## $ tweet_min               <dttm> 2019-12-02 20:50:00, 2019-12-03 16:37...
## $ tweet_hour              <dttm> 2019-12-02 21:00:00, 2019-12-03 17:00...
## $ tweet_day               <dttm> 2019-12-03, 2019-12-04, 2019-12-04, 2...
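
One thing to watch in the output above: tweet_id is stored as `<dbl>`, while quoted_status_id is stored as `<chr>`. Tweet IDs are around 1.2e+18, which is larger than 2^53, the point beyond which a double can no longer represent every whole number. The short sketch below shows the general effect. Whether it matters for this data set depends on how the IDs are used, so treat it as a caution rather than a known problem here.

```r
# Beyond 2^53, adjacent whole numbers collapse to the same double
big <- 2^53
same <- (big + 1) == big
same

# Storing IDs as character strings keeps every digit
# (this ID is the quoted_status_id value shown in the glimpse output)
id.chr <- "1201572495603712003"
nchar(id.chr)
```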

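Below is a list of data types contained in this data set, with example variables of each type.

<chr> character variables, or strings: screen_name and source
<dttm> date-time: created_at_UTC
<dbl>, <int> numeric (numbers): favorite_count, retweet_count
<lgl> logical (TRUE/FALSE): is_retweet, verified
<list> list columns, which can hold several values per tweet: urls_url, mentions_screen_name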

The hashtags in this data set are in their own variable, each separated by a space. Notice not all tweets contain hashtags.

head(cv$hashtags)

## [1] ""                              ""                             
## [3] "HealthforAll"                  "EndHIVEpidemic HIV VitalSigns"
## [5] ""                              "flu"
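
Because several hashtags share one string, they need to be split apart before they can be counted. A minimal sketch using base R, with the six values printed above:

```r
# Values copied from the head(cv$hashtags) output above
tags <- c("", "", "HealthforAll", "EndHIVEpidemic HIV VitalSigns", "", "flu")

# Split each string on spaces; tweets without hashtags give empty vectors
tag.list <- strsplit(tags, " ", fixed = TRUE)

# Number of hashtags in each tweet
n.tags <- lengths(tag.list)
n.tags

# Frequency of each individual hashtag, ignoring upper/lower case
tag.counts <- sort(table(tolower(unlist(tag.list))), decreasing = TRUE)
tag.counts
```

Lowercasing before counting is a design choice: it treats `#Flu` and `#flu` as the same hashtag.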

Geotagging

The location of the tweet is stored in the variables lat and lng.

How much location data do we even have?

table(is.na(cv$lng))

## 
##  FALSE   TRUE 
##    375 410189

Almost none: only 375 of the 410,564 tweets (about 0.09%) have coordinates.
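
The share of geotagged tweets can be computed directly, and the rows that have coordinates can be pulled out for mapping. The sketch below uses a made-up four-row data frame in place of `cv`, then repeats the calculation with the counts printed above.

```r
# Toy stand-in for cv with made-up coordinates
toy <- data.frame(
  lat = c(NA, 34.05, NA, NA),
  lng = c(NA, -118.24, NA, NA)
)

# Proportion of rows with a longitude
share.toy <- mean(!is.na(toy$lng))

# Keep only the rows that can actually be placed on a map
geo <- toy[!is.na(toy$lat) & !is.na(toy$lng), ]

# Same calculation using the counts from the table above
pct.cv <- round(100 * 375 / (375 + 410189), 2)
pct.cv
```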

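library(maps)      # supplies the map outlines that map_data() converts
world <- map_data("world") 

library(ggrepel)
ggplot() +
  # grey world outline as the base layer
  geom_polygon(data = world, aes(x=long, y = lat, group = group), fill="grey", alpha=0.2) +
  # coord_map() needs the mapproj package to be installed
  coord_map() + theme_void() + 
  # one point per geotagged tweet; rows with missing lat/lng are dropped
  geom_point(data=cv, aes(x=lng, y=lat))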

And the data that we do have doesn’t seem to be accurate.

Media Variables

The media variables let you look at any videos, pictures or gifs that are attached to tweets. It’s important to note that not all of the tweets have media information.
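
To see how many tweets carry media, one option is to check which entries of a media list column are not missing. This is a sketch on a made-up list: the structure (one list element per tweet, `NA` when there is no media) follows the glimpse output above, but the value `"photo"` is an assumption about what the column contains.

```r
# Toy stand-in for cv$media_type: one element per tweet (made-up values)
media_type <- list(NA, "photo", NA, c("photo", "photo"))

# TRUE when a tweet has at least one non-missing media entry
has.media <- vapply(media_type, function(x) !all(is.na(x)), logical(1))

# Number of tweets with media attached
n.media <- sum(has.media)
n.media
```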

Verified accounts

Verified accounts are accounts of public interest whose authenticity is denoted by a blue checkmark or “badge”. Typically, these accounts are maintained by users in music, acting, fashion, government, politics, religion, journalism, media, sports, business, and other key interest areas.

table(cv$verified)

## 
##  FALSE   TRUE 
## 397624  12939
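
Expressed as percentages, roughly 3% of the tweets come from verified accounts. The two counts above add up to 410,563, one fewer than the number of rows, which suggests a single missing value: `table()` leaves out `NA` unless asked to show it. The sketch below uses the printed counts, plus a made-up vector to show the `useNA` option.

```r
# Counts copied from the table(cv$verified) output above
counts <- c("FALSE" = 397624, "TRUE" = 12939)

# Percent of tweets from unverified and verified accounts
pct <- round(100 * prop.table(counts), 1)
pct

# table() drops NA by default; useNA = "ifany" makes missing values visible
toy <- c(TRUE, FALSE, FALSE, NA)
tab <- table(toy, useNA = "ifany")
tab
```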

Last run on 2020-04-30 10:23:33

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggrepel_0.8.1 maps_3.3.0    ggplot2_3.2.1 dplyr_0.8.4  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3       pillar_1.4.3     compiler_3.6.1   tools_3.6.1     
##  [5] digest_0.6.21    rtweet_0.7.0     jsonlite_1.6.1   evaluate_0.14   
##  [9] lifecycle_0.2.0  tibble_3.0.0     gtable_0.3.0     pkgconfig_2.0.3 
## [13] rlang_0.4.5      cli_2.0.1        mapproj_1.2.7    yaml_2.2.1      
## [17] xfun_0.10        withr_2.1.2      stringr_1.4.0    httr_1.4.1      
## [21] knitr_1.25       vctrs_0.2.4      grid_3.6.1       tidyselect_0.2.5
## [25] glue_1.3.1       R6_2.4.0         fansi_0.4.0      rmarkdown_1.18  
## [29] purrr_0.3.3      magrittr_1.5     scales_1.0.0     ellipsis_0.3.0  
## [33] htmltools_0.4.0  assertthat_0.2.1 colorspace_1.4-1 labeling_0.3    
## [37] utf8_1.1.4       stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0   
## [41] crayon_1.3.4