knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
library(lubridate)

Goal: I want all the Tweets!

And that, dear readers, is what this post is all about.

Data

The data comes from the COVID-19-TweetIDs repository (echen102/COVID-19-TweetIDs on GitHub) by Emily Chen and Dr. Ferrara at USC.

Time frame

The files are organized by hour, starting on 2020-01-28.

Use the lubridate package to create a sequence of hourly time stamps covering the period of interest. As of 2020-03-27, the collected tweets only go through 2020-03-12 17:00 UTC.

start <- ymd_hm("2020-01-28 0:00")  # first hour of collection in the repository
end   <- ymd_hm("2020-03-12 17:00") # last hour collected as of this writing

hour_list <- seq(start, end, by="hour") # one time stamp per hourly file
head(hour_list)
## [1] "2020-01-28 00:00:00 UTC" "2020-01-28 01:00:00 UTC"
## [3] "2020-01-28 02:00:00 UTC" "2020-01-28 03:00:00 UTC"
## [5] "2020-01-28 04:00:00 UTC" "2020-01-28 05:00:00 UTC"
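
That gives one candidate time stamp per hourly file. As a quick check on the size of the pool we will sample from (the count follows from the 44-day, 17-hour span defined above):

length(hour_list)
## [1] 1074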

Random sampling

Each sandbox account gets 50 requests per month, and each request can gather up to 100 tweets. So in production we will want to choose 50 time periods; for this demo, we are going to choose one. We also need to set a seed to ensure the sampling is reproducible.

set.seed(10242011) # The date I rescued Riley :) 
(x <- sample(hour_list, 1))
## [1] "2020-02-05 11:00:00 UTC"
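
In production, the same call scales to the full monthly budget. A minimal sketch, assuming we spend all 50 sandbox requests on 50 distinct hours (prod.hours is just an illustrative name):

set.seed(10242011)
prod.hours <- sample(hour_list, 50) # one sampled hour per monthly request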

Download files

The data files on GitHub are organized by year-month-day-hour. The URLs look like https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-02/coronavirus-tweet-id-2020-02-01-00.txt. So we need to build a URL like this for our sampled hour.

stem <- "https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/" # All files share this stem
mo <- format(x, "%Y-%m") # Extract the year and month from the sampled date, e.g. 2020-02
hr <- format(x, "%Y-%m-%d-%H") # Format the date to match the URL structure
get.url <- paste0(stem, mo, "/coronavirus-tweet-id-", hr, ".txt")
get.url
## [1] "https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-02/coronavirus-tweet-id-2020-02-05-11.txt"
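
If we later scale up to 50 sampled hours, the same logic is easier to reuse as a function. A minimal sketch (make.url is a hypothetical helper, not part of the repository):

make.url <- function(x) { # map a sampled hour to its raw GitHub URL
  paste0(stem, format(x, "%Y-%m"), "/coronavirus-tweet-id-",
         format(x, "%Y-%m-%d-%H"), ".txt")
}
make.url(x) # returns the same URL as above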

Great! Right format. Now let’s see if we can dynamically download it.

twt.ids <- read.table(get.url, colClasses = "character") # read the IDs as character: they are too big for R's doubles
head(twt.ids$V1)
## [1] "1225011182517379072" "1225011182399918080" "1225011182643355648"
## [4] "1225011182639222784" "1225011182710480896" "1225011182819520512"
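
A note on colClasses = "character": 64-bit tweet IDs are larger than 2^53, the point past which R's doubles can no longer represent every integer exactly, so reading them as numbers can silently corrupt them. A quick illustration:

2^53 == 2^53 + 1 # TRUE: doubles can no longer tell consecutive integers apart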

We still have the rate-limiting problem: there are 27,465 tweets in this randomly chosen hour.

dim(twt.ids)
## [1] 27465     1
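
To put the rate limit in perspective, exhaustively hydrating even this one hour would blow well past a sandbox month:

ceiling(nrow(twt.ids) / 100) # 275 requests to cover one hour at 100 IDs each, vs. 50 per month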

Our max is 100 tweets per call, so let's randomly select 100 IDs from this list. Every time we use random sampling, we need to set a seed for reproducibility.

set.seed(10242011)
samp.twts <- sample(twt.ids$V1, 100, replace = FALSE) # 100 IDs = one request's worth

Save list to hydrate later

This list needs to be saved as a .txt file so it can be read into a hydrator; actually hydrating the tweets is covered in another tutorial. I will save the text file in the data folder, named with the hour that was sampled.

filename <- paste0("../data/sample-tweets-from-", hr, ".txt") # e.g. ../data/sample-tweets-from-2020-02-05-11.txt
write.table(samp.twts, filename, row.names = FALSE, col.names = FALSE, quote = FALSE) # quote = FALSE writes bare IDs the hydrator can read
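
As an optional sanity check, the file can be read back to confirm the IDs survived the round trip unchanged:

check <- read.table(filename, colClasses = "character") # re-read the saved IDs
all(check$V1 == samp.twts) # should be TRUE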