knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
library(lubridate)
And that, dear readers, is what this post is all about.
The data comes from this repository by Emily Chen and Dr. Ferrara at USC.
Each file is organized by the hour, starting on 2020-01-28. Use the lubridate
package to create a sequence of hourly timestamps for the time period of interest. As of 2020-03-27, the collected tweets only go through 2020-03-12 17:00 UTC.
start <- ymd_hm("2020-01-28 0:00")
end <- ymd_hm("2020-03-12 17:00")
hour_list <- seq(start, end, by="hour")
head(hour_list)
## [1] "2020-01-28 00:00:00 UTC" "2020-01-28 01:00:00 UTC"
## [3] "2020-01-28 02:00:00 UTC" "2020-01-28 03:00:00 UTC"
## [5] "2020-01-28 04:00:00 UTC" "2020-01-28 05:00:00 UTC"
Each sandbox account gets 50 requests per month, and each request can gather 100 tweets, so in production we will want to choose 50 time periods. For this demo we will choose just one. We also need to set a seed so the sample is reproducible.
set.seed(10242011) # The date I rescued Riley :)
(x <- sample(hour_list, 1))
## [1] "2020-02-05 11:00:00 UTC"
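In production, all 50 hours could be drawn in one call against the same grid. A minimal sketch using base R's `seq` on `POSIXct` (equivalent to the lubridate sequence above; `hours_50` is a name introduced here for illustration):

```r
# Rebuild the hourly grid with base R (same range as the lubridate version)
start <- as.POSIXct("2020-01-28 00:00", tz = "UTC")
end   <- as.POSIXct("2020-03-12 17:00", tz = "UTC")
hour_list <- seq(start, end, by = "hour")

set.seed(10242011)                 # same seed for reproducibility
hours_50 <- sample(hour_list, 50)  # one distinct hour per sandbox request
length(unique(hours_50))           # 50: sampling is without replacement by default
```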
The data files on GitHub are organized by year-month-day-hour. The URLs look like https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-02/coronavirus-tweet-id-2020-02-01-00.txt, so we need to build a URL like this for our sampled hour.
stem <- "https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/" # All files will have this part
mo <- format(x, "%Y-%m") # Extract the year and month from the sampled date: 2020-02
hr <- format(x, "%Y-%m-%d-%H") # Format the date to match the URL structure.
get.url <- paste0(stem, mo, "/coronavirus-tweet-id-", hr, ".txt")
get.url
## [1] "https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-02/coronavirus-tweet-id-2020-02-05-11.txt"
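To scale this up to the 50-hour production run, the URL construction can be wrapped in a small helper. `make_url` is a hypothetical name introduced here; it uses only the same `format`/`paste0` pattern shown above:

```r
# Build the raw.githubusercontent.com URL for any sampled hour
make_url <- function(ts) {
  stem <- "https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/"
  paste0(stem,
         format(ts, "%Y-%m", tz = "UTC"),                    # month folder
         "/coronavirus-tweet-id-",
         format(ts, "%Y-%m-%d-%H", tz = "UTC"), ".txt")      # hourly file
}
make_url(as.POSIXct("2020-02-05 11:00", tz = "UTC"))  # reproduces the URL above
```

With this helper, `sapply(hours_sampled, make_url)` would yield one URL per sampled hour.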
Great! Right format. Now let’s see if we can dynamically download it.
twt.ids <- read.table(get.url, colClasses = "character") # read IDs as character: 19-digit IDs lose precision as doubles
head(twt.ids$V1)
## [1] "1225011182517379072" "1225011182399918080" "1225011182643355648"
## [4] "1225011182639222784" "1225011182710480896" "1225011182819520512"
We still have the rate-limiting problem: there are 27,465 tweets in this randomly chosen hour.
dim(twt.ids)
## [1] 27465 1
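A caveat worth demonstrating: 19-digit tweet IDs exceed the 53 bits of precision in an R double, which is why they should be handled as character strings rather than numbers. A quick sketch with a hypothetical ID (not one from the data):

```r
# An odd 19-digit ID cannot be represented exactly as a double
id_chr <- "1225011182517379071"
id_num <- as.numeric(id_chr)                 # silently rounds to the nearest double
identical(sprintf("%.0f", id_num), id_chr)   # FALSE: the last digits have changed
.Machine$double.digits                       # 53 bits, roughly 16 decimal digits
```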
Our max is 100 tweets per call, so let's randomly select 100 IDs from this list. Every time we use random sampling, we need to set a seed for reproducibility.
set.seed(10242011)
samp.twts <- sample(twt.ids$V1, 100, replace = FALSE)
This list needs to be saved as a .txt
file so it can be read into a hydrator; actually hydrating the tweets is covered in another tutorial. I will save the text file in the data
folder, named for the tweet-ID hour that was chosen.
filename <- paste0("../data/sample-tweets-from-", hr, ".txt")
write.table(samp.twts, filename, row.names = FALSE, col.names = FALSE, quote = FALSE) # quote = FALSE so the hydrator sees bare IDs
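As a quick sanity check, here is a hypothetical round trip using a temp file and two toy IDs in place of the real path and sample:

```r
# Toy IDs stored as character so all 19 digits survive the round trip
samp_demo <- c("1225011182517379072", "1225011182399918080")
tmp <- tempfile(fileext = ".txt")
write.table(samp_demo, tmp, row.names = FALSE, col.names = FALSE, quote = FALSE)
readLines(tmp)  # one bare ID per line, the format a hydrator expects
```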