COVID19 data was collected in two ways: by querying Twitter directly with the rtweet package, and by hydrating previously identified tweet IDs relating to the pandemic. See this tweet sampler file for information on how IDs were randomly selected. The hydrated data comes from this repository by Emily Chen and Dr. Ferrara at USC. That database starts on 01-28-2020.
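As a rough illustration of the random selection step, here is a minimal sketch in base R. The ID values and the sample size are made up, and the actual procedure is the one in the tweet sampler file, not this one. The hydration call is left as a comment because it needs Twitter API credentials; `lookup_statuses()` is the rtweet function for looking up tweets by ID in the package versions current when this report was run.

```r
# Hypothetical tweet IDs. They are kept as character strings because
# tweet IDs are larger than R can represent exactly as a double.
tweet_ids <- c("1222222222222222201", "1222222222222222202",
               "1222222222222222203", "1222222222222222204",
               "1222222222222222205", "1222222222222222206")

set.seed(2020)                               # makes the draw reproducible
sampled_ids <- sample(tweet_ids, size = 3)   # sample without replacement

# Hydration step (requires API credentials, so it is not run here):
# hydrated <- rtweet::lookup_statuses(sampled_ids)
```

Setting a seed means the same IDs are drawn each time the file is knit.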

Search queries used (when not using the hydrator)

  • Recent tweets (3/30/20 - 4/7/20) “fema AND (covid19 OR coronavirus OR covid OR covid19 OR covd OR pandemic OR outbreak OR covid-19)”
  • Full timelines for @WHO and @CDCgov since December.
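The queries above could be sent with rtweet roughly as follows. This is a sketch, not the code that was actually run: the function name `collect_tweets` and the `n` values are assumptions (18000 is the standard search limit per 15 minutes, and 3200 is the most a timeline request returns per account). The function is only defined here, not called, because calling it needs Twitter API credentials.

```r
# Search string copied as recorded in the list above.
fema_query <- paste0(
  "fema AND (covid19 OR coronavirus OR covid OR covid19 OR covd ",
  "OR pandemic OR outbreak OR covid-19)"
)

# Hypothetical wrapper; the n values are assumptions, not the settings used.
collect_tweets <- function(n_search = 18000, n_timeline = 3200) {
  list(
    fema     = rtweet::search_tweets(q = fema_query, n = n_search),
    agencies = rtweet::get_timeline(c("WHO", "CDCgov"), n = n_timeline)
  )
}
```

Note that the standard search endpoint only reaches back about a week, which is why the search results cover such a short window compared with the timelines.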

Load libraries and set options.

Set the working directory, then read in the current data and examine the structure of a tweet about COVID19.

Overall info

There are 410564 tweets in this data set across the following languages:
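A count like this can be produced with a couple of lines of base R. The sketch below uses a tiny made-up data frame and assumes the language code is stored in a column called `lang`, which is the name rtweet uses.

```r
# Toy stand-in for the tweet data; values are made up.
tweets <- data.frame(
  status_id = c("1", "2", "3", "4", "5"),
  lang      = c("en", "es", "en", "und", "en"),
  stringsAsFactors = FALSE
)

n_tweets    <- nrow(tweets)                              # total tweets
lang_counts <- sort(table(tweets$lang), decreasing = TRUE)  # tweets per language
print(lang_counts)
```

Twitter uses the code `und` when it could not determine the language of a tweet.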

Dates and Times

Here is the current timeframe of collected tweets, and the number of tweets collected. Some groups specifically collected tweets from @WHO and @CDCgov going back to December.

We have approximately 174 tweets per hour, and 2975 per day.
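One simple way to get rates like these is to divide the number of tweets by the length of the collection window. The sketch below does that on made-up timestamps. The figures reported above may have been computed differently (for example, by averaging counts over the hours and days that actually contain tweets), so the two rates in this report need not differ by exactly a factor of 24.

```r
# Made-up timestamps standing in for created_at_UTC.
created_at_UTC <- as.POSIXct(
  c("2020-04-01 00:00:00", "2020-04-01 06:30:00", "2020-04-01 18:00:00",
    "2020-04-02 12:00:00", "2020-04-03 00:00:00"),
  tz = "UTC"
)

# Length of the collection window in hours.
span_hours <- as.numeric(
  difftime(max(created_at_UTC), min(created_at_UTC), units = "hours")
)

tweets_per_hour <- length(created_at_UTC) / span_hours
tweets_per_day  <- tweets_per_hour * 24
```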

Counts of tweets per day since the bulk of the data collection began on Jan 28
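Daily counts of this kind can be built by dropping the time of day and tabulating the dates. This is a base R sketch on made-up timestamps; the column names `day` and `n_tweets` are chosen here for illustration.

```r
# Made-up timestamps standing in for created_at_UTC.
created_at_UTC <- as.POSIXct(
  c("2020-01-27 23:59:00", "2020-01-28 00:01:00",
    "2020-01-28 15:00:00", "2020-01-29 08:00:00"),
  tz = "UTC"
)

# Keep only the calendar date (in UTC), then count tweets per date.
tweet_day    <- as.Date(created_at_UTC, tz = "UTC")
daily_counts <- as.data.frame(table(day = tweet_day),
                              responseName = "n_tweets",
                              stringsAsFactors = FALSE)
daily_counts$day <- as.Date(daily_counts$day)

# Restrict to the period when the bulk of the collection began.
bulk <- daily_counts[daily_counts$day >= as.Date("2020-01-28"), ]
print(bulk)
```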

Twitter stores all dates and times in Coordinated Universal Time (UTC), which for practical purposes is the same as GMT.

All times are stored in 24-hour (military) time: midnight is 00:00, noon is 12:00, and your class runs from 19:00 to 22:00.

  • created_at_UTC: the time the tweet was created, in UTC. 15:00 (3pm) UTC is 07:00 (7am) Pacific Standard Time, or 08:00 (8am) Pacific time while daylight saving time is in effect.
  • year/month/day/hour/minute: separated components of the date and time
  • tweet_min: tweets are recorded to the second; this variable rounds the tweet time to the nearest minute.
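To make the time variables above concrete, here is a minimal base R sketch. It converts a UTC time to Pacific time, pulls out the separate components, and rounds to the nearest minute. The two timestamps are made up, and this is only one way such variables could be derived, not necessarily how they were created for this data set.

```r
# One made-up time during standard time and one during daylight saving time.
winter_tweet <- as.POSIXct("2020-02-01 15:00:29", tz = "UTC")
spring_tweet <- as.POSIXct("2020-04-01 15:00:31", tz = "UTC")

# The same instants shown on a Pacific clock.
pacific_standard <- format(winter_tweet, "%H:%M", tz = "America/Los_Angeles")
pacific_daylight <- format(spring_tweet, "%H:%M", tz = "America/Los_Angeles")

# Separate components of the date and time (still in UTC).
year  <- as.integer(format(winter_tweet, "%Y", tz = "UTC"))
month <- as.integer(format(winter_tweet, "%m", tz = "UTC"))
hour  <- as.integer(format(winter_tweet, "%H", tz = "UTC"))

# Round to the nearest minute: 29 seconds rounds down, 31 seconds rounds up.
winter_min <- as.POSIXct(round(winter_tweet, "mins"))
spring_min <- as.POSIXct(round(spring_tweet, "mins"))
```

The same UTC time lands on a different Pacific clock time depending on the date, which is worth remembering when comparing tweets from January with tweets from April.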

Full variable list

We can use the glimpse() function from dplyr to see the variable names, data types, and examples of data values.

Below is a list of data types contained in this data set, and example variables that are that type.

  • <chr> character variables, or strings: screen_name and source
  • <dttm> date-time: created_at_UTC
  • <dbl>, <int> numeric (numbers): favorite_count, retweet_count
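The sketch below builds a two-row toy data frame with one column of each type listed above, so you can see how the type labels line up with R classes. The values are made up. It calls `glimpse()` when dplyr is installed and falls back to base R's `str()` otherwise.

```r
# Toy data with one column per data type; values are made up.
tweets <- data.frame(
  screen_name    = c("WHO", "CDCgov"),                      # <chr>
  created_at_UTC = as.POSIXct(c("2020-04-01 15:00:00",
                                "2020-04-02 09:30:00"),
                              tz = "UTC"),                  # <dttm>
  favorite_count = c(10L, 3L),                              # <int>
  retweet_count  = c(25, 7),                                # <dbl>
  stringsAsFactors = FALSE
)

# First class of each column: character, POSIXct, integer, numeric.
col_types <- sapply(tweets, function(col) class(col)[1])
print(col_types)

if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(tweets)
} else {
  str(tweets)
}
```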

The hashtags in this data set are in their own variable, each separated by a space. Notice not all tweets contain hashtags.
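Because the hashtags sit in one space-separated string per tweet, they need to be split apart before they can be counted. Here is a minimal base R sketch on made-up values. Whether tweets without hashtags are stored as NA or as an empty string is an assumption, so the code handles both.

```r
# Made-up hashtag strings; NA and "" both stand for "no hashtags".
hashtags <- c("COVID19 coronavirus", NA, "covid19", "", "StayHome COVID19")

# Which tweets contain at least one hashtag?
has_hashtag <- !is.na(hashtags) & nzchar(hashtags)

# Split on spaces, then count, ignoring upper/lower case differences.
tags       <- unlist(strsplit(hashtags[has_hashtag], " ", fixed = TRUE))
tag_counts <- sort(table(tolower(tags)), decreasing = TRUE)
print(tag_counts)
```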

The location of the tweet is stored in the variables lat and long.

How much location data do we even have?

Almost none.

And the data that we do have doesn’t seem to be accurate.
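Both questions can be checked with a few lines of base R. The sketch below uses made-up coordinates. It computes the share of tweets that have both lat and long, then applies a crude range check. Passing the range check only means the coordinates are possible, not that they are where the user actually was.

```r
# Made-up coordinates; most tweets have no location at all.
lat  <- c(NA, 34.05, NA, 123.4, NA)
long <- c(NA, -118.24, NA, 50, 10)

# A tweet has a usable location only when both values are present.
has_location   <- !is.na(lat) & !is.na(long)
share_located  <- mean(has_location)

# Latitude must be within -90..90 and longitude within -180..180.
plausible <- has_location & abs(lat) <= 90 & abs(long) <= 180
```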

Media Variables

The media variables let you look at any videos, pictures or gifs that are attached to tweets. It’s important to note that not all of the tweets have media information.
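To see how many tweets carry media, you can count the non-missing values. This sketch assumes a column named `media_type` that is NA when a tweet has no media; both the column name and the values shown are assumptions for illustration, so check the variable list for the real names.

```r
# Hypothetical media_type values; NA means no media attached.
media_type <- c("photo", NA, NA, "video", "photo", NA)

has_media   <- !is.na(media_type)
share_media <- mean(has_media)                      # proportion with media
media_counts <- table(media_type, useNA = "ifany")  # includes the NA group
print(media_counts)
```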

Verified accounts

Verified accounts are accounts of public interest whose authenticity is denoted by a blue checkmark or “badge”. Typically, these accounts are maintained by users in music, acting, fashion, government, politics, religion, journalism, media, sports, business, and other key interest areas.
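A quick comparison of verified and unverified accounts might look like the sketch below. It assumes a logical column called `verified`, which is the name rtweet uses, and the values are made up.

```r
# Toy data; values are made up.
tweets <- data.frame(
  screen_name   = c("WHO", "CDCgov", "user_a", "user_b", "user_c"),
  verified      = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  retweet_count = c(500, 300, 2, 0, 4),
  stringsAsFactors = FALSE
)

# Share of tweets that come from verified accounts.
share_verified <- mean(tweets$verified)

# Median retweets for unverified (FALSE) and verified (TRUE) accounts.
median_retweets <- tapply(tweets$retweet_count, tweets$verified, median)
print(median_retweets)
```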

Last run on 2020-04-30 10:23:33

