1. Your first scraper: Nokogiri + HTTParty
The standard Ruby scraping stack uses two gems: nokogiri for HTML parsing and httparty for HTTP requests. Install them:
gem install nokogiri httparty
Here is a complete scraper that extracts book titles and prices from a practice website:
require 'httparty'
require 'nokogiri'

# Fetch the page
response = HTTParty.get(
  "https://books.toscrape.com/",
  headers: { "User-Agent" => "Mozilla/5.0 (compatible; MyBot/1.0)" }
)

# Parse the HTML
doc = Nokogiri::HTML(response.body)

# Extract all book titles and prices
books = doc.css("article.product_pod").map do |article|
  {
    title: article.at_css("h3 a")["title"],
    price: article.css(".price_color").text.strip
  }
end

books.first(5).each do |book|
  puts "#{book[:title]}: #{book[:price]}"
end
Nokogiri uses the same CSS selectors you already know from front-end development. The css() method returns all matches, while at_css() returns the first match (like querySelector in JavaScript).
2. Setting proper headers
Ruby's default user-agent is a dead giveaway: with no header set, your requests go out identifying themselves as plain Ruby, and many servers block that on sight. Always set realistic browser headers:
require 'httparty'

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
                  "AppleWebKit/537.36 (KHTML, like Gecko) " \
                  "Chrome/124.0.0.0 Safari/537.36",
  "Accept" => "text/html,application/xhtml+xml",
  "Accept-Language" => "en-US,en;q=0.9",
  "Referer" => "https://www.google.com/"
}

response = HTTParty.get(
  "https://example.com/products",
  headers: headers,
  timeout: 10
)

puts response.code
Pro tip
Keep a YAML file of 20+ real browser user-agents and rotate them randomly. Update quarterly — outdated Chrome versions are easily detected by anti-bot systems.
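The rotation tip above can be sketched with the stdlib yaml module. In production the strings would live in a user_agents.yml file loaded with YAML.load_file; here the YAML is inlined so the sketch is self-contained, and the two UA strings and the rotating_headers name are illustrative:

```ruby
require 'yaml'

# In production: USER_AGENTS = YAML.load_file("user_agents.yml")
USER_AGENTS = YAML.load(<<~YAML)
  - "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
  - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15"
YAML

# Build fresh headers for every request
def rotating_headers
  { "User-Agent" => USER_AGENTS.sample }
end
```

Pass rotating_headers as the headers: option on each HTTParty.get call so successive requests present different browsers.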
3. Authenticated scraping with Mechanize
When you need to log in, submit forms, or maintain session cookies, the mechanize gem simulates a full browser session:
gem install mechanize
require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = "Windows Chrome"

# Log in to a site
page = agent.get("https://example.com/login")
form = page.form_with(id: "login-form")
form.field_with(name: "email").value = "user@example.com"
form.field_with(name: "password").value = "password123"
result = agent.submit(form)

# Now scrape authenticated pages
dashboard = agent.get("https://example.com/dashboard")
data = dashboard.search(".data-row").map do |row|
  {
    name: row.at_css(".name").text.strip,
    value: row.at_css(".value").text.strip
  }
end

puts data.inspect
4. Rate limiting and retries
Scraping without rate limiting gets your IP banned fast. Always add delays between requests and handle 429 responses:
require 'httparty'
require 'nokogiri'

headers = { "User-Agent" => "Mozilla/5.0 (compatible; MyBot/1.0)" }
urls = (1..50).map { |i| "https://example.com/page/#{i}" }
results = []

urls.each_with_index do |url, idx|
  begin
    response = HTTParty.get(url, headers: headers, timeout: 10)
    if response.code == 429
      puts "Rate limited on #{url}, backing off..."
      sleep(30)
      response = HTTParty.get(url, headers: headers, timeout: 10)
    end
    if response.code == 200
      doc = Nokogiri::HTML(response.body)
      data = doc # ... extract the fields you need from doc
      results << data
    end
  rescue StandardError => e
    puts "Error on #{url}: #{e.message}"
  end

  # Random delay between requests
  sleep(rand(1.0..3.0))
  puts "Progress: #{idx + 1}/#{urls.length}"
end

puts "Scraped #{results.length} pages"
5. Storing scraped data
Ruby has built-in CSV and JSON support, plus excellent SQLite bindings:
require 'csv'
require 'json'

# Save to CSV
CSV.open("products.csv", "w") do |csv|
  csv << ["name", "price", "url"]
  products.each do |product|
    csv << [product[:name], product[:price], product[:url]]
  end
end

# Save to JSON
File.write("products.json", JSON.pretty_generate(products))

# Save to SQLite
require 'sqlite3'

db = SQLite3::Database.new("products.db")
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS products (
    name TEXT, price TEXT, url TEXT
  )
SQL

products.each do |product|
  db.execute(
    "INSERT INTO products VALUES (?, ?, ?)",
    [product[:name], product[:price], product[:url]]
  )
end
6. Handling JavaScript pages with SnapRender
Nokogiri and HTTParty cannot execute JavaScript. React, Vue, and Angular apps return empty shells. Instead of managing headless browsers locally, use SnapRender to get fully-rendered content via a simple API call:
Render as markdown
Get any JavaScript-rendered page as clean markdown. Perfect for LLM pipelines or content extraction.
require 'httparty'
require 'json'

# Render any page as clean markdown (handles JS)
response = HTTParty.post(
  "https://api.snaprender.dev/v1/render",
  headers: {
    "x-api-key" => "sr_live_YOUR_KEY",
    "Content-Type" => "application/json"
  },
  body: {
    url: "https://example.com/spa-page",
    format: "markdown"
  }.to_json
)

data = JSON.parse(response.body)
puts data["data"]["markdown"]
Extract structured data
Use CSS selectors to pull specific fields. Returns clean JSON — no HTML parsing needed.
require 'httparty'
require 'json'

# Extract structured data with CSS selectors
response = HTTParty.post(
  "https://api.snaprender.dev/v1/extract",
  headers: {
    "x-api-key" => "sr_live_YOUR_KEY",
    "Content-Type" => "application/json"
  },
  body: {
    url: "https://example.com/products/widget-pro",
    selectors: {
      name: "h1.product-title",
      price: ".price-current",
      rating: ".star-rating",
      description: ".product-description p",
      in_stock: ".availability-status"
    }
  }.to_json
)

puts JSON.pretty_generate(JSON.parse(response.body))
Bypass anti-bot protection
For Cloudflare, DataDome, or similar protections, add the use_flaresolverr flag:
require 'httparty'
require 'json'

# Bypass Cloudflare / anti-bot protection
response = HTTParty.post(
  "https://api.snaprender.dev/v1/render",
  headers: {
    "x-api-key" => "sr_live_YOUR_KEY",
    "Content-Type" => "application/json"
  },
  body: {
    url: "https://protected-site.com/data",
    format: "markdown",
    use_flaresolverr: true
  }.to_json
)

data = JSON.parse(response.body)
puts data["data"]["markdown"]
Comparison: when to use what
| Approach | Best for | Limitation |
|---|---|---|
| Nokogiri + HTTParty | Static HTML pages | No JS rendering |
| Mechanize | Forms, login sessions | No JS rendering |
| Ferrum (headless Chrome) | JS pages, local control | High RAM, maintenance |
| SnapRender API | JS + anti-bot + scale | API cost at high volume |
Skip the browser infrastructure
SnapRender handles JavaScript rendering, anti-bot bypass, and data extraction. Just send a URL from your Ruby script and get results back as JSON.
Frequently asked questions
What is the best Ruby library for web scraping?
Nokogiri is the gold standard for HTML/XML parsing in Ruby. It is fast (built on libxml2), has excellent CSS selector and XPath support, and handles malformed HTML gracefully. For static pages, Nokogiri + HTTParty is the most popular combination in the Ruby ecosystem.
How do I scrape JavaScript-rendered pages in Ruby?
Standard Ruby HTTP libraries (HTTParty, Net::HTTP, Faraday) cannot execute JavaScript. You need either a headless browser gem like Ferrum or Watir, or an API like SnapRender that renders JavaScript server-side and returns the fully-rendered HTML or markdown.
What is the difference between Nokogiri and Mechanize?
Nokogiri is a parser — it takes HTML and lets you query it with CSS selectors or XPath. Mechanize is a browser simulator that can follow links, submit forms, and maintain cookies. Use Nokogiri for simple scraping, and Mechanize when you need to interact with pages (login, pagination via forms).
How do I avoid getting blocked while scraping in Ruby?
Add sleep() calls between requests (1-3 seconds is a good baseline), implement exponential backoff on 429 responses, rotate User-Agent strings, and consider using a proxy pool for large-scale scraping. The Typhoeus gem supports concurrent requests with configurable concurrency limits.
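The exponential-backoff advice can be sketched as a small retry helper. The with_backoff name and structure are my own, not from any gem; the block stands in for a real HTTParty call:

```ruby
# Retry the given block, doubling the delay after each failure.
def with_backoff(max_retries: 4, base_delay: 1.0)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    raise if attempts > max_retries      # give up after max_retries failures
    sleep(base_delay * (2**(attempts - 1)))  # 1s, 2s, 4s, 8s, ...
    retry
  end
end
```

To use it, raise inside the block whenever response.code == 429, so the helper sleeps and retries instead of hammering the server.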
Is Ruby fast enough for web scraping?
For HTTP requests, the language barely matters — network I/O is the bottleneck. Ruby's HTTParty and Faraday are comparable in speed to Python's Requests. For parsing, Nokogiri (C-backed) is actually faster than BeautifulSoup. For concurrent scraping, Ruby's Typhoeus or async gems perform well.
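For a gem-free flavor of that concurrency, here is a minimal thread-pool sketch using only the stdlib Queue: a fixed set of worker threads drains a queue of URLs. The fetch_concurrently name is illustrative, and the block stands in for a real HTTP fetch; Typhoeus's Hydra gives you the same pattern with an event-driven request queue:

```ruby
# Run the given block once per URL across a fixed pool of worker threads.
def fetch_concurrently(urls, workers: 5)
  jobs = Queue.new
  urls.each { |u| jobs << u }
  results = Queue.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        url = begin
          jobs.pop(true)   # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << [url, yield(url)]
      end
    end
  end
  threads.each(&:join)

  pairs = []
  pairs << results.pop until results.empty?
  pairs.to_h
end
```

Swapping the block for an HTTParty.get (plus a per-thread delay) turns this into a bounded-concurrency scraper without any external gems.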