This is a record of how the COVID-19 tweets were selected. See this tutorial for the motivation and an explanation of the sampling scheme.

These code chunks are not executed when the document is knit. This file must be run manually.

knitr::opts_chunk$set(echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE)
library(lubridate)
library(foreach)
library(httr)
wd <- "G:/My Drive/Teaching/POLS 624 S20/Twitter project docs/data/covid-raw/" # will have to change this

Define dates

Define the time frame for which data are available in the repository. Split it into hour-long blocks, and separate out the hours that fall within the past 30 days.

# Repository coverage: first and last available hours
start <- ymd_hm("2020-01-28 0:00")
end   <- ymd_hm("2020-03-12 17:00")
hour_list <- seq(start, end, by = "hour")

# Keep only the hours that fall within the past 30 days
past30 <- today() - days(30)
past30_list <- hour_list[hour_list > past30]

Get prior data

Read in the lists of IDs that have already been hydrated.

id.files <- list.files(path = paste0(wd, "have_been_hydrated"), pattern = '\\.txt$')

# Stack every previously hydrated ID list into one data frame
all.ids <- foreach(j = seq_along(id.files), .combine = rbind) %do% {
  read.table(file.path(paste0(wd, "have_been_hydrated"), id.files[j]))
}

Create sampling frame of times

Write a function that goes to the GitHub repo and builds the URL for each of num.time.blocks randomly sampled hour-long time frames; IDs are then sampled from each block in the next step.

stem <- "https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/" 

create.time.block.url <- function(seed=12345, num.time.blocks=1){
  set.seed(seed)
  x <- sample(past30_list, num.time.blocks)
  mo <- format(x, "%Y-%m") 
  hr <- format(x, "%Y-%m-%d-%H") 
  get.url <- paste0(stem, mo, "/coronavirus-tweet-id-", hr, ".txt")  
  return(get.url)
}
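As a quick illustration (the seed and block count below are arbitrary), each call returns one raw-file URL per sampled hour:

create.time.block.url(seed = 1, num.time.blocks = 2)
# Returns URLs of the form:
# "https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-03/coronavirus-tweet-id-2020-03-05-14.txt"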

Sample IDs

This is where you change (1) the seed, (2) the number of time blocks, and (3) the number of IDs sampled per block. With 200 blocks and 100 IDs per block, the draw below yields up to 20,000 IDs, fewer if some URLs return 404.

sampled.hours <- create.time.block.url(seed=2020, num.time.blocks = 200)

get.ids <- foreach(i = seq_along(sampled.hours), .combine=c) %do% {
  # Check that the file exists before trying to read it
  bad.url <- identical(status_code(HEAD(sampled.hours[i])), 404L)
  print(paste0("File ", i, ". Bad url ", bad.url))
  if(!bad.url){
    sample(read.table(sampled.hours[i])$V1, size=100, replace=FALSE)  # size of sample per block
  }
}
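A quick check (a minimal sketch, not part of the pipeline) of how much of the target came back:

# The target is 200 * 100 = 20,000 IDs; any shortfall is hour blocks
# whose URLs returned 404 (100 IDs lost per missing file)
length(get.ids)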

Exclude duplicate IDs

# Drop IDs that were already hydrated. Logical indexing avoids the
# get.ids[-integer(0)] pitfall, which returns an empty vector when
# there are no duplicates to exclude.
get.nodup.ids <- get.ids[!(get.ids %in% all.ids$V1)]
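A one-line check (a minimal sketch) that the exclusion worked:

# Should pass silently: no sampled ID remains in the already-hydrated set
stopifnot(!any(get.nodup.ids %in% all.ids$V1))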

Write out to a folder to be hydrated.

write.table(get.nodup.ids, paste0(wd,"to_hydrate/rad_0328.txt"), row.names = FALSE, col.names = FALSE)
  • If the list is too big, split it into separate, more manageable files to clean.
# Pad with NAs to fill a 2,000 x 10 matrix rather than hard-coding the padding,
# then write one file per column
n.pad <- 2000 * 10 - length(get.nodup.ids)
separate.lists <- matrix(c(get.nodup.ids, rep(NA, n.pad)), nrow=2000, ncol=10, byrow=TRUE)
for(j in 1:NCOL(separate.lists)){
  write.table(na.omit(separate.lists[,j]), paste0(wd,"to_hydrate/list_", j, ".txt"),
              row.names = FALSE, col.names = FALSE)
}
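An alternative sketch that avoids NA padding entirely, using base R's split() (the chunk size of 2,000 matches the matrix above):

# Break the ID vector into chunks of at most 2,000 and write one file per chunk
chunks <- split(get.nodup.ids, ceiling(seq_along(get.nodup.ids) / 2000))
for (j in seq_along(chunks)) {
  write.table(chunks[[j]], paste0(wd, "to_hydrate/list_", j, ".txt"),
              row.names = FALSE, col.names = FALSE)
}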

Important pre-hydrating note

Annoyingly, for some reason write.table renders some IDs in scientific notation (1.2E+18), and those can't be hydrated. If I format the IDs, they turn into character strings, which also can't be read. If I send them to CSV, they get truncated. So after writing to .txt, I must manually go through and remove the E+18 entries. 😭
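The likely culprit: tweet IDs are 64-bit integers, which exceed the precision of R's doubles (2^53), so read.table risks precision loss on read and scientific notation on write. A possible workaround, sketched here as an assumption rather than what this pipeline actually did: keep the IDs as character from read to write, and write them unquoted so the hydrator sees plain digit strings.

# Read IDs as character so they are never coerced to doubles
ids <- read.table(sampled.hours[i], colClasses = "character")$V1

# quote = FALSE writes bare digit strings rather than quoted text;
# the output file name here is illustrative
write.table(ids, paste0(wd, "to_hydrate/ids_chr.txt"),
            quote = FALSE, row.names = FALSE, col.names = FALSE)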

Hydrate Tweets

Hydrating doesn't seem to be pinging my 30-day request list... weird, but I'm not complaining. To run again, regenerate the ID list using the code chunk above.


Samples with seeds obtained so far