Webscraping Starter Kit

Dan | Oct 10, 2022

I often want to scrape a website, but I don’t know up front which packages will be required to scrape it. Some websites require a headless browser for part of the webscraping process, often to acquire a cookie, token, or header that is rendered by JavaScript or produced by network requests that are difficult to mock with plain requests. I hate when sites require a headless browser because it means introducing additional large dependencies, like a chromedriver. This post intends to make it easier to get set up for this type of development, regardless of whether a website requires just requests or a headless browser.

A couple of basic files will get you set up with a development environment for scraping websites, whether they are simple sites or more complex ones that require a headless browser. While I always prefer to use plain requests, this setup saves you from hunting down fresh webdriver boilerplate every time.

Install:

python3 -m pip install pipenv &&
python3 -m pipenv install

# Pipfile

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
requestium = "*"
requests = "*"
webdriver-manager = "*"
bs4 = "*"
loguru = "*"
# requestium pulls in selenium; pin below Selenium 4 to match the
# webdriver.Chrome(...) call used later in this post
selenium = "<4"

[dev-packages]

[requires]
python_version = "3.9"
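
With the Pipfile in place, the install command above will resolve these dependencies, and the script (named scraper.py here only as a placeholder) runs inside the pipenv environment:

python3 -m pipenv run python scraper.py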

With the dependencies installed, you can get a functional webscraper set up with just a few lines of code. The requestium Session class is a drop-in replacement for the requests Session class, with the added ability to drive a Selenium webdriver.

# driver setup for Selenium 3; webdriver-manager downloads a matching chromedriver
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from requestium import Session
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(ChromeDriverManager().install())
s = Session(driver=driver)

# hybrid navigation: use the browser to complete the CAPTCHA, then pass its
# cookies to the requests session so the rest of the scrape runs as requests
base_url = 'https://eaccess.dccourts.gov/eaccess/search.page.3.3'

s.driver.get(base_url)

# block until the user completes the CAPTCHA; the element with id "caseDscr"
# only appears on the page behind it
WebDriverWait(s.driver, 25).until(
    EC.presence_of_element_located((By.ID, "caseDscr"))
)

# copy the browser's cookies into the underlying requests session
s.transfer_driver_cookies_to_session()
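
Note that the driver construction above uses the Selenium 3 call signature, matching the pin in the Pipfile. If you are on Selenium 4 instead, the equivalent setup passes the driver path through a Service object; a minimal sketch:

# for selenium 4
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))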

The requestium Session subclasses the requests Session and attaches a Selenium webdriver to it, so a single object exposes both. As an example, we can scrape the DC Superior Court’s website, which, as of October 2022, presents a CAPTCHA that must be completed before you can access court records; once it is solved, access is controlled by cookies held by the browser session.

In the example above, we navigate to the site using the webdriver and then wait for the user to complete the CAPTCHA. Once the CAPTCHA is complete, a specific element becomes visible (here identified by the id “caseDscr”) and the script continues execution. transfer_driver_cookies_to_session() then copies the cookies from the webdriver into the requests session, and you can execute requests against the website without further webdriver activity. In this particular case you could even close the Chrome browser, because the cookies continue to function on their own.
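
From here, the rest of the scrape can run as plain requests. As a minimal sketch of the post-CAPTCHA portion (the parsing below just grabs the page title for illustration; the real targets depend on the site’s markup):

from bs4 import BeautifulSoup
from loguru import logger

# this request runs over plain HTTP with the transferred cookies; no webdriver involved
response = s.get(base_url)
logger.info('fetched {} ({} bytes)', response.url, len(response.content))

soup = BeautifulSoup(response.text, 'html.parser')
logger.info('page title: {}', soup.title.string if soup.title else 'n/a')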