Web Scraping with Ruby: The Complete Guide

14 min read

Ruby has a mature, elegant scraping ecosystem. Nokogiri handles HTML parsing with CSS selectors and XPath, HTTParty makes HTTP requests dead simple, and Mechanize simulates browser sessions for authenticated scraping. This guide covers everything from your first Nokogiri scraper to production pipelines with SnapRender.

What you will learn

1. Nokogiri + HTTParty basics
2. CSS selectors and XPath
3. Form submission with Mechanize
4. Rate limiting and retries
5. Data storage (CSV, JSON, SQLite)
6. Handling JS-rendered pages
7. Anti-bot bypass with SnapRender
8. Structured data extraction

1. Your first scraper: Nokogiri + HTTParty

The standard Ruby scraping stack uses two gems: nokogiri for HTML parsing and httparty for HTTP requests. Install them:

terminal
gem install nokogiri httparty

Here is a complete scraper that extracts book titles and prices from a practice website:

scraper.rb
require 'httparty'
require 'nokogiri'

# Fetch the page
response = HTTParty.get(
  "https://books.toscrape.com/",
  headers: { "User-Agent" => "Mozilla/5.0 (compatible; MyBot/1.0)" }
)

# Parse the HTML
doc = Nokogiri::HTML(response.body)

# Extract all book titles and prices
books = doc.css("article.product_pod").map do |article|
  {
    title: article.at_css("h3 a")["title"],
    price: article.at_css(".price_color").text.strip
  }
end

books.first(5).each do |book|
  puts "#{book[:title]}: #{book[:price]}"
end

Nokogiri uses the same CSS selectors you already know from front-end development. The css() method returns all matches, while at_css() returns the first match (like querySelector in JavaScript).

2. Setting proper headers

Ruby's default user-agent is a dead giveaway: unless you override it, requests go out identifying themselves as plain `Ruby`, and many servers block that immediately. Always set realistic browser headers:

headers.rb
require 'httparty'

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
    "AppleWebKit/537.36 (KHTML, like Gecko) " \
    "Chrome/124.0.0.0 Safari/537.36",
  "Accept" => "text/html,application/xhtml+xml",
  "Accept-Language" => "en-US,en;q=0.9",
  "Referer" => "https://www.google.com/"
}

response = HTTParty.get(
  "https://example.com/products",
  headers: headers,
  timeout: 10
)
puts response.code

Pro tip

Keep a YAML file of 20+ real browser user-agents and rotate them randomly. Update quarterly — outdated Chrome versions are easily detected by anti-bot systems.
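That rotation might look like the sketch below. The YAML is inlined here so the example is self-contained; in practice you would keep it in a file (e.g. a hypothetical `user_agents.yml`) and load it with `YAML.safe_load(File.read(...))`:

```ruby
require 'yaml'

# Inlined stand-in for a user_agents.yml file containing a plain list of strings
yaml = <<~YAML
  - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
  - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15
YAML

USER_AGENTS = YAML.safe_load(yaml)

# Pick a random user-agent for each request
def random_headers
  { "User-Agent" => USER_AGENTS.sample }
end
```

Passing `headers: random_headers` to each `HTTParty.get` call then varies the fingerprint per request.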

3. Authenticated scraping with Mechanize

When you need to log in, submit forms, or maintain session cookies, the mechanize gem simulates a full browser session:

terminal
gem install mechanize
mechanize_scraper.rb
require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = "Windows Chrome"

# Log in to a site
page = agent.get("https://example.com/login")
form = page.form_with(id: "login-form")
form.field_with(name: "email").value = "user@example.com"
form.field_with(name: "password").value = "password123"
result = agent.submit(form)

# Now scrape authenticated pages
dashboard = agent.get("https://example.com/dashboard")
data = dashboard.search(".data-row").map do |row|
  {
    name: row.at_css(".name").text.strip,
    value: row.at_css(".value").text.strip
  }
end

puts data.inspect

4. Rate limiting and retries

Scraping without rate limiting gets your IP banned fast. Always add delays between requests and handle 429 responses:

rate_limit.rb
require 'httparty'
require 'nokogiri'

headers = { "User-Agent" => "Mozilla/5.0 (compatible; MyBot/1.0)" }
urls = (1..50).map { |i| "https://example.com/page/#{i}" }
results = []

urls.each_with_index do |url, idx|
  begin
    response = HTTParty.get(url, headers: headers, timeout: 10)

    if response.code == 429
      puts "Rate limited on #{url}, backing off..."
      sleep(30)
      response = HTTParty.get(url, headers: headers, timeout: 10)
    end

    if response.code == 200
      doc = Nokogiri::HTML(response.body)
      # ... extract whatever you need from doc, e.g.:
      results << { url: url, title: doc.at_css("title")&.text }
    end

  rescue StandardError => e
    puts "Error on #{url}: #{e.message}"
  end

  # Random delay between requests
  sleep(rand(1.0..3.0))
  puts "Progress: #{idx + 1}/#{urls.length}"
end

puts "Scraped #{results.length} pages"
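A fixed 30-second sleep works, but exponential backoff (doubling the wait after each failure) recovers faster from brief hiccups while still surviving longer outages. Here is a minimal sketch; `with_backoff` is a hypothetical helper name, wrapping any block that raises on failure:

```ruby
# Retry a block with exponential backoff: wait base_delay, then
# base_delay * 2, base_delay * 4, ... between attempts, and re-raise
# the last error once max_attempts is exhausted.
def with_backoff(max_attempts: 4, base_delay: 2)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    raise if attempts >= max_attempts
    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Usage with HTTParty: raise on anything but a 200 so the helper retries.
# html = with_backoff do
#   response = HTTParty.get(url, headers: headers, timeout: 10)
#   raise "HTTP #{response.code}" unless response.code == 200
#   response.body
# end
```

The helper is transport-agnostic: anything the block raises triggers a retry, so it covers timeouts and connection resets as well as bad status codes.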

5. Storing scraped data

Ruby has built-in CSV and JSON support, plus excellent SQLite bindings:

storage.rb
require 'csv'
require 'json'

# Save to CSV
CSV.open("products.csv", "w") do |csv|
  csv << ["name", "price", "url"]
  products.each do |product|
    csv << [product[:name], product[:price], product[:url]]
  end
end

# Save to JSON
File.write("products.json", JSON.pretty_generate(products))

# Save to SQLite
require 'sqlite3'
db = SQLite3::Database.new("products.db")
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS products (
    name TEXT, price TEXT, url TEXT
  )
SQL
products.each do |product|
  db.execute(
    "INSERT INTO products VALUES (?, ?, ?)",
    [product[:name], product[:price], product[:url]]
  )
end

6. Handling JavaScript pages with SnapRender

Nokogiri and HTTParty cannot execute JavaScript. React, Vue, and Angular apps return empty shells. Instead of managing headless browsers locally, use SnapRender to get fully-rendered content via a simple API call:

Render as markdown

Get any JavaScript-rendered page as clean markdown. Perfect for LLM pipelines or content extraction.

render.rb
require 'httparty'
require 'json'

# Render any page as clean markdown (handles JS)
response = HTTParty.post(
  "https://api.snaprender.dev/v1/render",
  headers: {
    "x-api-key" => "sr_live_YOUR_KEY",
    "Content-Type" => "application/json"
  },
  body: {
    url: "https://example.com/spa-page",
    format: "markdown"
  }.to_json
)

data = JSON.parse(response.body)
puts data["data"]["markdown"]

Extract structured data

Use CSS selectors to pull specific fields. Returns clean JSON — no parsing needed.

extract.rb
require 'httparty'
require 'json'

# Extract structured data with CSS selectors
response = HTTParty.post(
  "https://api.snaprender.dev/v1/extract",
  headers: {
    "x-api-key" => "sr_live_YOUR_KEY",
    "Content-Type" => "application/json"
  },
  body: {
    url: "https://example.com/products/widget-pro",
    selectors: {
      name: "h1.product-title",
      price: ".price-current",
      rating: ".star-rating",
      description: ".product-description p",
      in_stock: ".availability-status"
    }
  }.to_json
)

puts JSON.pretty_generate(JSON.parse(response.body))

Bypass anti-bot protection

For Cloudflare, DataDome, or similar protections, add the use_flaresolverr flag:

bypass.rb
require 'httparty'
require 'json'

# Bypass Cloudflare / anti-bot protection
response = HTTParty.post(
  "https://api.snaprender.dev/v1/render",
  headers: {
    "x-api-key" => "sr_live_YOUR_KEY",
    "Content-Type" => "application/json"
  },
  body: {
    url: "https://protected-site.com/data",
    format: "markdown",
    use_flaresolverr: true
  }.to_json
)

data = JSON.parse(response.body)
puts data["data"]["markdown"]

Comparison: when to use what

| Approach | Best for | Limitation |
| --- | --- | --- |
| Nokogiri + HTTParty | Static HTML pages | No JS rendering |
| Mechanize | Forms, login sessions | No JS rendering |
| Ferrum (headless Chrome) | JS pages, local control | High RAM, maintenance |
| SnapRender API | JS + anti-bot + scale | API cost at high volume |

Skip the browser infrastructure

SnapRender handles JavaScript rendering, anti-bot bypass, and data extraction. Just send a URL from your Ruby script and get results back as JSON.

Frequently asked questions

What is the best Ruby gem for web scraping?

Nokogiri is the gold standard for HTML/XML parsing in Ruby. It is fast (built on libxml2), has excellent CSS selector and XPath support, and handles malformed HTML gracefully. For static pages, Nokogiri + HTTParty is the most popular combination in the Ruby ecosystem.

Can Ruby scrape JavaScript-rendered pages?

Standard Ruby HTTP libraries (HTTParty, Net::HTTP, Faraday) cannot execute JavaScript. You need either a headless browser gem like Ferrum or Watir, or an API like SnapRender that renders JavaScript server-side and returns the fully-rendered HTML or markdown.

What is the difference between Nokogiri and Mechanize?

Nokogiri is a parser — it takes HTML and lets you query it with CSS selectors or XPath. Mechanize is a browser simulator that can follow links, submit forms, and maintain cookies. Use Nokogiri for simple scraping, Mechanize when you need to interact with pages (login, pagination via forms).

How do I avoid getting blocked when scraping with Ruby?

Add sleep() calls between requests (1-3 seconds is a good baseline), implement exponential backoff on 429 responses, rotate User-Agent strings, and consider using a proxy pool for large-scale scraping. The Typhoeus gem supports concurrent requests with configurable concurrency limits.

Is Ruby as fast as Python for web scraping?

For HTTP requests, the language barely matters — network I/O is the bottleneck. Ruby's HTTParty and Faraday are comparable in speed to Python's Requests. For parsing, Nokogiri (C-backed) is actually faster than BeautifulSoup. For concurrent scraping, Ruby's Typhoeus or async gems perform well.