This is a record of how the COVID-19 tweets were selected. See this tutorial for the motivation and an explanation of the sampling scheme.

These code chunks are not executed when the document is knit. This file must be run manually.

knitr::opts_chunk$set(echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE)
library(lubridate)
library(foreach)
library(httr)
wd <- "G:/My Drive/Teaching/POLS 624 S20/Twitter project docs/data/covid-raw/" # will have to change this

Define dates

Define the time frame for which data are available in the repository. Split it into hour-long blocks, and separate out the hours that fall within the past 30 days.

# Repository coverage: first and last available hours
start <- ymd_hm("2020-01-28 0:00")
end   <- ymd_hm("2020-03-12 17:00")
hour_list <- seq(start, end, by = "hour")

# Keep only the hours that fall within the past 30 days
past30 <- today() - days(30)
past30_list <- hour_list[hour_list > past30]

Get prior data

Read in the lists of IDs that have already been hydrated.

id.files <- list.files(path = paste0(wd, "have_been_hydrated"), pattern = '\\.txt$')

# Stack every previously hydrated ID list into one data frame
all.ids <- foreach(j = seq_along(id.files), .combine = rbind) %do% {
  read.table(file.path(paste0(wd, "have_been_hydrated"), id.files[j]))
}

Create sampling frame of times

Write a function that goes to the GitHub repo and builds the URL for each of num.time.blocks randomly sampled hour-long time frames; IDs are then sampled from each block in the next step.

stem <- "https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/" 

create.time.block.url <- function(seed=12345, num.time.blocks=1){
  set.seed(seed)
  x <- sample(past30_list, num.time.blocks)
  mo <- format(x, "%Y-%m") 
  hr <- format(x, "%Y-%m-%d-%H") 
  get.url <- paste0(stem, mo, "/coronavirus-tweet-id-", hr, ".txt")  
  return(get.url)
}
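As a quick illustration (the seed and block count below are arbitrary), each call returns one raw-file URL per sampled hour:

create.time.block.url(seed = 1, num.time.blocks = 2)
# Returns URLs of the form:
# "https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-03/coronavirus-tweet-id-2020-03-05-14.txt"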

Sample IDs

This is where you change (1) the seed, (2) the number of time blocks, and (3) the number of IDs sampled per block. With 200 blocks and 100 IDs per block, the draw below yields up to 20,000 IDs, fewer if some URLs return 404.

sampled.hours <- create.time.block.url(seed=2020, num.time.blocks = 200)

get.ids <- foreach(i = seq_along(sampled.hours), .combine=c) %do% {
  # Check that the file exists before trying to read it
  bad.url <- identical(status_code(HEAD(sampled.hours[i])), 404L)
  print(paste0("File ", i, ". Bad url ", bad.url))
  if(!bad.url){
    sample(read.table(sampled.hours[i])$V1, size=100, replace=FALSE)  # size of sample per block
  }
}
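A quick check (a minimal sketch, not part of the pipeline) of how much of the target came back:

# The target is 200 * 100 = 20,000 IDs; any shortfall is hour blocks
# whose URLs returned 404 (100 IDs lost per missing file)
length(get.ids)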

Exclude duplicate IDs

# Drop IDs that were already hydrated. Logical indexing avoids the
# get.ids[-integer(0)] pitfall, which returns an empty vector when
# there are no duplicates to exclude.
get.nodup.ids <- get.ids[!(get.ids %in% all.ids$V1)]
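A one-line check (a minimal sketch) that the exclusion worked:

# Should pass silently: no sampled ID remains in the already-hydrated set
stopifnot(!any(get.nodup.ids %in% all.ids$V1))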

Write out to a folder to be hydrated.

write.table(get.nodup.ids, paste0(wd,"to_hydrate/rad_0328.txt"), row.names = FALSE, col.names = FALSE)
  • If the list is too big, split it into separate, more manageable files to clean.
# Pad with NAs to fill a 2,000 x 10 matrix rather than hard-coding the padding,
# then write one file per column
n.pad <- 2000 * 10 - length(get.nodup.ids)
separate.lists <- matrix(c(get.nodup.ids, rep(NA, n.pad)), nrow=2000, ncol=10, byrow=TRUE)
for(j in 1:NCOL(separate.lists)){
  write.table(na.omit(separate.lists[,j]), paste0(wd,"to_hydrate/list_", j, ".txt"),
              row.names = FALSE, col.names = FALSE)
}
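An alternative sketch that avoids NA padding entirely, using base R's split() (the chunk size of 2,000 matches the matrix above):

# Break the ID vector into chunks of at most 2,000 and write one file per chunk
chunks <- split(get.nodup.ids, ceiling(seq_along(get.nodup.ids) / 2000))
for (j in seq_along(chunks)) {
  write.table(chunks[[j]], paste0(wd, "to_hydrate/list_", j, ".txt"),
              row.names = FALSE, col.names = FALSE)
}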

Important pre-hydrating note

Annoyingly, for some reason write.table renders some IDs in scientific notation (1.2E+18), and those can't be hydrated. If I format the IDs, they turn into character strings, which also can't be read. If I send them to CSV, they get truncated. So after writing to .txt, I must manually go through and remove the E+18 entries. 😭
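The likely culprit: tweet IDs are 64-bit integers, which exceed the precision of R's doubles (2^53), so read.table risks precision loss on read and scientific notation on write. A possible workaround, sketched here as an assumption rather than what this pipeline actually did: keep the IDs as character from read to write, and write them unquoted so the hydrator sees plain digit strings.

# Read IDs as character so they are never coerced to doubles
ids <- read.table(sampled.hours[i], colClasses = "character")$V1

# quote = FALSE writes bare digit strings rather than quoted text;
# the output file name here is illustrative
write.table(ids, paste0(wd, "to_hydrate/ids_chr.txt"),
            quote = FALSE, row.names = FALSE, col.names = FALSE)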

Hydrate Tweets

Hydrating doesn't seem to be pinging my 30-day request list... weird, but I'm not complaining. To run again, regenerate the ID list using the code chunk above.


Samples with seeds obtained so far