In-brief: splashr update + High Performance Scraping with splashr, furrr & TeamHG-Memex’s Aquarium
This article is originally published at https://rud.is/b
The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running and it’s best to use it in a Docker container.
If you have a large number of sites to scrape, working with splashr and Splash “as-is” can be a bit frustrating since there’s a limit to what a single instance can handle. Sure, it’s possible to set up your own highly available, multi-instance Splash cluster and use it, but that’s work. Thankfully, the folks behind TeamHG-Memex created Aquarium, which uses docker and docker-compose to stand up multiple Splash instances behind a pre-configured HAProxy instance so you can take advantage of parallel requests to the Splash API. As long as you have docker and docker-compose handy (and Python), following the steps on the aforelinked GitHub page should have you up and running with Aquarium in minutes. You use the same default port (8050) to access the Splash API and you get a bonus port of 8036 to watch in your browser (the HAProxy stats page).
This works well when combined with furrr, which is an R package that makes parallel operations very tidy.
One way to use furrr, splashr and Aquarium might look like this:
library(splashr)
library(HARtools)
library(urltools)
library(furrr)
library(tidyverse)

sites <- c("http://...", "http://...", ...) # your list of URLs to scrape

make_a_splash <- function(org_url) {
  splash(
    host = "ip/name of system you started aquarium on",
    user = "your splash api username",
    pass = "your splash api password"
  ) %>%
    splash_response_body(TRUE) %>% # we want to get all the content
    splash_user_agent(ua_win10_ie11) %>% # splashr has many pre-configured user agents to choose from
    splash_go(org_url) %>%
    splash_wait(5) %>% # pick a reasonable timeout; modern web sites with javascript are bloated
    splash_har()
}

safe_splash <- safely(make_a_splash) # splashr/Splash work well but can throw errors. Let's be safe

# don't overwhelm the default setup or your internet connection
# (older versions of the future package used `multiprocess` here)
plan(multisession, workers = 5)

future_map(sites, ~{
  org <- safe_splash(.x) # go get it!
  if (is.null(org$result)) {
    sprintf("Error retrieving %s (%s)", .x, org$error$message) # this gives us good error messages
  } else {
    HARtools::writeHAR( # HAR format saves *everything*. the files are YUGE
      har = org$result,
      file = file.path("/place/to/store/stuff", sprintf("%s.har", domain(.x))) # saved with the base domain; you may want to use a UUID via uuid::UUIDgenerate()
    )
    sprintf("Successfully retrieved %s", .x)
  }
}) -> results
(Those with a keen eye will grok why splashr supports Splash API basic authentication now.)
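The example above names each HAR file after the base domain, so two URLs on the same domain would overwrite each other. A minimal sketch of the UUID idea mentioned in the code comment follows. The helper name har_path() and the domain-plus-UUID naming scheme are assumptions made for illustration; they are not part of splashr or HARtools.

```r
library(urltools)  # domain()
library(uuid)      # UUIDgenerate()

# Hypothetical helper: build a HAR file path that cannot collide
# even when several URLs share the same domain.
har_path <- function(url, dir = "/place/to/store/stuff") {
  file.path(dir, sprintf("%s_%s.har", domain(url), UUIDgenerate()))
}

p1 <- har_path("http://www.example.com/a")
p2 <- har_path("http://www.example.com/b")
p1 != p2  # same domain, different files
```

The result of har_path(.x) could be passed as the file argument of HARtools::writeHAR(). Keeping the domain in the name keeps the files easy to browse, at the cost of needing a lookup if you later want to map a file back to its exact URL.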
The parallel iterator will return a list we can flatten to a character vector (I don’t do that by default since it’s safer to get a list back as it can hold anything and map_chr() likes to check for proper objects) to check for errors with something like:
flatten_chr(results) %>%
  keep(str_detect, "Error")
## [1] "Error retrieving www.1.example.com (Service Unavailable (HTTP 503).)"
## [2] "Error retrieving www.100.example.com (Gateway Timeout (HTTP 504).)"
## [3] "Error retrieving www.3000.example.com (Bad Gateway (HTTP 502).)"
## [4] "Error retrieving www.a.example.com (Bad Gateway (HTTP 502).)"
## [5] "Error retrieving www.z.examples.com (Gateway Timeout (HTTP 504).)"
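Since the messages follow a fixed pattern, the failed URLs and their reasons can be pulled back out for a second pass. This is a rough sketch in base R, assuming the messages keep the exact "Error retrieving <url> (<reason>)" shape produced by the sprintf() call earlier. The results list here is made-up sample data standing in for the real output.

```r
# Made-up sample standing in for the list returned by future_map()
results <- list(
  "Successfully retrieved http://www.ok.example.com",
  "Error retrieving www.1.example.com (Service Unavailable (HTTP 503).)",
  "Error retrieving www.100.example.com (Gateway Timeout (HTTP 504).)"
)

msgs <- unlist(results)
errs <- msgs[startsWith(msgs, "Error retrieving ")]

# Capture group 1 is the URL, group 2 is everything inside the outer parentheses
pattern <- "^Error retrieving ([^ ]+) \\((.*)\\)$"

failed <- data.frame(
  url    = sub(pattern, "\\1", errs),
  reason = sub(pattern, "\\2", errs),
  stringsAsFactors = FALSE
)

table(failed$reason)  # how many failures of each kind
```

failed$url is then a plain character vector that can be fed back into the scraping function. A sturdier design would have the parallel step return a small list or data frame per URL instead of a formatted string, so nothing has to be parsed afterwards.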
Timeouts would suggest you may need to up the timeout parameter in your Splash call. Service unavailable or bad gateway errors may suggest you need to tweak the Aquarium configuration to add more Splash workers or reduce the number of workers in your plan(…). It’s not unusual to have to create a scraping process that accounts for errors and retries a certain number of times.
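One possible shape for such a retry loop is sketched below in base R. The with_retries() name, its arguments and the linear backoff are all assumptions for illustration, not something splashr provides. It returns a list with result and error elements, the same shape safely() gives you, plus the number of attempts used.

```r
# Hypothetical helper: call f(x) up to `max_tries` times,
# sleeping a little longer after each failure.
with_retries <- function(f, x, max_tries = 3, pause = 1) {
  for (attempt in seq_len(max_tries)) {
    out <- tryCatch(
      list(result = f(x), error = NULL),
      error = function(e) list(result = NULL, error = e)
    )
    if (is.null(out$error)) break                          # success, stop retrying
    if (attempt < max_tries) Sys.sleep(pause * attempt)    # linear backoff
  }
  out$attempts <- attempt
  out
}

# A stand-in for a flaky scrape: fails twice, then succeeds
calls <- 0
flaky <- function(url) {
  calls <<- calls + 1
  if (calls < 3) stop("Gateway Timeout (HTTP 504).")
  paste("fetched", url)
}

res <- with_retries(flaky, "http://www.example.com", max_tries = 5, pause = 0)
res$result    # "fetched http://www.example.com"
res$attempts  # 3
```

In the scraping code, with_retries(make_a_splash, .x) could take the place of safe_splash(.x), since the returned list can be checked the same way. Keep max_tries small: every retry is another full page render on the Splash side.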
If you were stuck in the splashr/Splash slow-lane before, give this a try to help save you some time and frustration.