Download csv file from webpage after submitting form from dropdown using rvest package in R

I am working on a webscraping project to download various csv files from this webpage: https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link

I would like to be able to programmatically choose the various reported quarters on the drop down list, hit submit (note that the URL for the page doesnt change for each different quarter) and then "Download CSV" for each of the quarters.

As a disclaimer, I am a novice to rvest and below is my attempt at the solution:

  1. I first checked this site and found a relevant post Using r to navigate and scrape a webpage with drop down html forms

  2. It looks like they use the following code to get a form for what the inputs need to be to refresh an HTML table:

    pgsession <- html_session(url)
    pgform <-html_form(pgsession)[[3]]
    filled_form <-set_values(pgform,
            "team" = "ALL",
            "week" = "1",
            "pos"  = "ALL",
            "year" = "2015"      
     )
    
     submit_form(session=pgsession,form=filled_form, POST=url)
    
  3. I tried doing that for the site above and I get the following instead

     > html_form(html_session("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link"))
     [[1]]
     <form> '<unnamed>' (GET )
      <input text> '': 
      <select> '' [1/7]
    
     [[2]]
     <form> 'frm_registration' (POST /filer/registration)
      <input hidden> 'permalink': blue-harbour-group-lp
      <input hidden> 'registration_type': register
      <input text> 'user_email': 
    
     [[3]]
     <form> 'frm-report-error' (POST /filer/report_error)
      <input hidden> 'permalink': blue-harbour-group-lp
      <input text> 'user_name': 
      <input text> 'user_email': 
      <textarea> 'comments' [0 char]
      <textarea> 'g-recaptcha-response' [0 char]
    
  4. I dont quite see the same set up and the only form it seems with a drop down option is [1] with [1/7] options, but I dont know what that is referring to.

  5. Comparing the source code for both sites, it seems like I have a "form-control" class that I should be extracting? How do I do that?

  6. Finally, after refreshing the table with a different quarter, how do I download the CSV? Is it possible to read the csv from the website directly without downloading the file?

Thanks!

Asked By: qwer
||

Answer #1:

+1 for using Developer Tools. That tool/skill will serve you well.

You should seriously consider using the API. But you can use httr and rvest together for this (and I verified it's not against the site rules):

library(rvest)
library(httr)
library(tidyverse)

We'll get the page first since we need to scrape the popup menu data:

pg <- read_html("https://whalewisdom.com/filer/blue-harbour-group-lp#/tabholdings_tab_link")

qtr_nodes <- html_nodes(pg, "select[id='quarter_one'] option")

data_frame(
  qtr = html_text(qtr_nodes),
  value = html_attr(qtr_nodes, "value")
) %>% 
  filter(!grepl("ubscri", qtr)) -> qtrs

qtrs
## # A tibble: 10 x 2
##                           qtr value
##                         <chr> <chr>
##  1 Current Combined 13F/13D/G    -1
##  2       Q3 2017 13F Filings     67
##  3       Q2 2017 13F Filings     66
##  4       Q1 2017 13F Filings     65
##  5       Q4 2016 13F Filings     64
##  6       Q3 2016 13F Filings     63
##  7       Q2 2016 13F Filings     62
##  8       Q1 2016 13F Filings     61
##  9       Q4 2015 13F Filings     60
## 10       Q3 2015 13F Filings     59

^^ is a translation table from pretty name to the popup value. The value is necessary for submitting the XHR request that happens behind the scenes.

Let's make a function to simulate that XHR request:

get_qtr <- function(qtr) {

  GET(
    url = "https://whalewisdom.com/filer/holdings", 
    httr::add_headers(
      Host = "whalewisdom.com", 
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:58.0) Gecko/20100101 Firefox/58.0", 
      Accept = "application/json, text/javascript, */*; q=0.01", 
      `Accept-Language` = "en-US,en;q=0.5", 
      Referer = "https://whalewisdom.com/filer/blue-harbour-group-lp", 
      `X-Requested-With` = "XMLHttpRequest", 
      Connection = "keep-alive"
    ),
    query = list(
      q1 = qtr, 
      id = "384", type_filter = "1,2,3,4", symbol = "", 
      change_filter = "1,2,3,4,5", minimum_ranking = "", minimum_shares = "", 
      is_etf = "0", sc = "true", `_search` = "false", rows = "25", 
      page = "1", sidx = "current_ranking", sord = "asc"
    )
  ) -> res

  stop_for_status(res)

  res <- content(res)

  map_df(res$rows, ~map(.x, ~ifelse(is.null(.x), NA, .x)))

}

We're just passing in value to the qtr parameter, but you could add params for the other bits, too.

Now, use the translation table above to get a randomly chosen set of data:

qtr_65 <- get_qtr(65)

glimpse(qtr_65)
## Observations: 19
## Variables: 24
## $ id                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ name                          <chr> "Investors Bancorp Inc", "Xilinx, Inc", "BWX TECHNOLOGIES INC", "AGCO Corp", "A...
## $ symbol                        <chr> "ISBC", "XLNX", "BWXT", "AGCO", "AVT", "WBMD", "AKAM", "RDC", "FFIV", "ADNT", "...
## $ permalink                     <chr> "isbc", "xlnx", "bwxt", "agco", "avt", "wbmd", "akam", "rdc", "ffiv", "adnt", "...
## $ security_type                 <chr> "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "SH", "...
## $ stock_id                      <int> 5284, 930, 7803, 600, 375, 838, 3527, 3658, 26, 198034, 5045, 72934, 812, 4116,...
## $ source_date                   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
## $ source_type                   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""
## $ sector                        <chr> "FINANCE", "INFORMATION TECHNOLOGY", "INDUSTRIALS", "INDUSTRIALS", "INFORMATION...
## $ industry                      <chr> "TRUSTS & THRIFTS", "SEMICONDUCTORS", "ELECTRICAL EQUIPMENT", "MACHINERY", "ELE...
## $ current_shares                <int> 29582428, 6058693, 5287927, 3813700, 4363874, 3361336, 2855493, 10542812, 93835...
## $ previous_shares               <int> 29582428, 7514437, 10561086, 6835700, 5415074, 1795914, 2474193, 10542812, 8599...
## $ shares_change                 <int> 0, -1455744, -5273159, -3022000, -1051200, 1565422, 381300, 0, 78373, 1675570, ...
## $ position_change_type          <chr> NA, "reduction", "reduction", "reduction", "reduction", "addition", "addition",...
## $ percent_shares_change         <chr> "0.0", "-19.3726", "-49.9301", "-44.2091", "-19.4125", "87.1658", "15.4111", "0...
## $ current_ranking               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 999999, 999999, 999999
## $ previous_ranking              <int> 3, 1, 2, 4, 5, 11, 7, 6, 8, 999999, 10, 9, 999999, 12, 14, 13, 15, 16, 17
## $ current_percent_of_portfolio  <dbl> 15.3264, 12.6366, 9.0686, 8.2689, 7.1946, 6.3798, 6.1419, 5.9180, 4.8200, 4.387...
## $ previous_percent_of_portfolio <dbl> 13.7987, 15.1687, 14.0194, 13.2249, 8.6205, 2.9767, 5.5165, 6.6592, 4.1615, NA,...
## $ current_mv                    <chr> "425395000.0", "350738000.0", "251705000.0", "229508000.0", "199691000.0", "177...
## $ previous_mv                   <chr> "412675000.0", "453647000.0", "419275000.0", "395514000.0", "257812000.0", "890...
## $ percent_ownership             <chr> "26.3298285", "2.4338473", "5.3287767", "4.5501506", "3.3856140", "8.6721677", ...
## $ quarter_first_owned           <chr> "Q1 2014", "Q1 2015", "Q4 2013", "Q2 2014", "Q2 2015", "Q4 2016", "Q3 2016", "Q...
## $ quarter_id_owned              <int> 53, 57, 52, 54, 58, 64, 63, 53, 61, 65, 47, 63, 65, 64, 64, 64, 64, 60, 64

I have no idea if ^^ is what's in the CSV since I'm not registering for an account, but you can verify and hopefully modify.

Answered By: hrbrmstr
The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .



# More Articles