1. Setting up your Rust scraper
Add these dependencies to your Cargo.toml:
```toml
[dependencies]
reqwest = { version = "0.12", features = ["json"] }
scraper = "0.20"
tokio = { version = "1", features = ["full"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
csv = "1" # used by the CSV export example in section 4
```

Here is a complete scraper that extracts book titles and prices:
```rust
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::builder()
        .user_agent("Mozilla/5.0 (compatible; MyBot/1.0)")
        .build()?;

    let body = client
        .get("https://books.toscrape.com/")
        .send()
        .await?
        .text()
        .await?;

    let document = Html::parse_document(&body);
    let book_sel = Selector::parse("article.product_pod").unwrap();
    let title_sel = Selector::parse("h3 a").unwrap();
    let price_sel = Selector::parse(".price_color").unwrap();

    for book in document.select(&book_sel) {
        let title = book
            .select(&title_sel)
            .next()
            .and_then(|el| el.value().attr("title"))
            .unwrap_or("N/A");
        let price = book
            .select(&price_sel)
            .next()
            .map(|el| el.text().collect::<String>())
            .unwrap_or_default();
        println!("{}: {}", title, price.trim());
    }
    Ok(())
}
```

The scraper crate uses the same CSS selectors you know from front-end development. select() returns an iterator over all matches; chain .next() to take the first one.
2. Async concurrent scraping
Rust's async runtime lets you scrape hundreds of pages concurrently on a single thread. Use JoinSet to manage concurrent tasks:
```rust
use reqwest::Client;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::builder()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
        .build()?;

    let urls: Vec<String> = (1..=20)
        .map(|i| format!("https://example.com/page/{}", i))
        .collect();

    let mut set = JoinSet::new();
    for url in urls {
        let client = client.clone();
        set.spawn(async move {
            let resp = client.get(&url).send().await?;
            let body = resp.text().await?;
            // Parse with scraper::Html::parse_document(&body) and extract data here.
            Ok::<_, reqwest::Error>((url, body.len()))
        });
    }

    while let Some(result) = set.join_next().await {
        match result? {
            Ok((url, size)) => println!("{}: {} bytes", url, size),
            Err(e) => eprintln!("Error: {}", e),
        }
    }
    Ok(())
}
```

Pro tip
Use a semaphore (tokio::sync::Semaphore) to cap concurrency at 10-20 requests. This prevents overwhelming target servers and avoids IP bans.
3. Error handling and retries
Production scrapers need exponential backoff and proper error handling. Rust's type system makes error handling explicit:
```rust
use std::time::Duration;
use tokio::time::sleep;

async fn fetch_with_retry(
    client: &reqwest::Client,
    url: &str,
    max_retries: u32,
) -> Result<String, reqwest::Error> {
    let mut attempts = 0;
    loop {
        match client.get(url).send().await {
            Ok(resp) if resp.status().is_success() => {
                return resp.text().await;
            }
            Ok(resp) if resp.status() == 429 => {
                attempts += 1;
                if attempts >= max_retries {
                    // Surface the 429 as an error instead of returning its body.
                    return resp.error_for_status()?.text().await;
                }
                let wait = Duration::from_secs(2u64.pow(attempts));
                eprintln!("Rate limited on {}, waiting {:?}", url, wait);
                sleep(wait).await;
            }
            Ok(resp) => {
                eprintln!("HTTP {} on {}", resp.status(), url);
                attempts += 1;
                if attempts >= max_retries {
                    return resp.error_for_status()?.text().await;
                }
                sleep(Duration::from_secs(1)).await;
            }
            Err(e) => {
                attempts += 1;
                if attempts >= max_retries {
                    return Err(e);
                }
                sleep(Duration::from_secs(1)).await;
            }
        }
    }
}
```

4. Storing scraped data
Use serde for serialization. Rust's derive macros make JSON and CSV export trivial:
```rust
use serde::{Deserialize, Serialize};
use std::fs;

#[derive(Debug, Serialize, Deserialize)]
struct Product {
    name: String,
    price: String,
    url: String,
}

fn save_to_json(products: &[Product], path: &str) {
    let json = serde_json::to_string_pretty(products).unwrap();
    fs::write(path, json).unwrap();
    println!("Saved {} products to {}", products.len(), path);
}

// Requires the csv crate (csv = "1") in Cargo.toml.
fn save_to_csv(products: &[Product], path: &str) {
    let mut wtr = csv::Writer::from_path(path).unwrap();
    for product in products {
        wtr.serialize(product).unwrap();
    }
    wtr.flush().unwrap();
    println!("Saved {} products to {}", products.len(), path);
}
```

5. Handling JavaScript pages with SnapRender
reqwest and scraper cannot execute JavaScript. React, Vue, and Angular apps return empty shells. Instead of managing headless browsers, use SnapRender to get fully-rendered content via a simple HTTP call:
Render as markdown
Get any JavaScript-rendered page as clean markdown. Perfect for LLM pipelines or content extraction.
```rust
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    // Render any JS-heavy page as clean markdown
    let resp = client
        .post("https://api.snaprender.dev/v1/render")
        .header("x-api-key", "sr_live_YOUR_KEY")
        .json(&json!({
            "url": "https://example.com/spa-page",
            "format": "markdown"
        }))
        .send()
        .await?;
    let data: serde_json::Value = resp.json().await?;
    println!("{}", data["data"]["markdown"]);
    Ok(())
}
```

Extract structured data
Use CSS selectors to pull specific fields. Returns clean JSON — no parsing needed.
```rust
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    // Extract structured data with CSS selectors
    let resp = client
        .post("https://api.snaprender.dev/v1/extract")
        .header("x-api-key", "sr_live_YOUR_KEY")
        .json(&json!({
            "url": "https://example.com/products/widget-pro",
            "selectors": {
                "name": "h1.product-title",
                "price": ".price-current",
                "rating": ".star-rating",
                "description": ".product-description p",
                "in_stock": ".availability-status"
            }
        }))
        .send()
        .await?;
    let data: serde_json::Value = resp.json().await?;
    println!("{}", serde_json::to_string_pretty(&data)?);
    Ok(())
}
```

Comparison: when to use what
| Approach | Best for | Limitation |
|---|---|---|
| reqwest + scraper | Static HTML, high perf | No JS rendering |
| headless_chrome crate | JS pages, local control | High RAM, complex setup |
| fantoccini (WebDriver) | Browser automation | Slow, resource-heavy |
| SnapRender API | JS + anti-bot + scale | API cost at high volume |
Skip the browser infrastructure
SnapRender handles JavaScript rendering, anti-bot bypass, and data extraction. Just send a URL from your Rust program and get results back as JSON.
Frequently asked questions
Is Rust good for web scraping?
Rust is excellent for web scraping when performance matters. The reqwest + scraper crate combo is fast, memory-safe, and compiles to a single binary with no runtime dependencies. For high-volume scraping (millions of pages), Rust significantly outperforms Python and Node.js.
Can Rust scrape JavaScript-rendered pages?
Standard Rust HTTP crates (reqwest, ureq) cannot execute JavaScript. You need either a headless browser binding like headless_chrome or fantoccini, or an API like SnapRender that renders JavaScript server-side and returns the fully rendered HTML or markdown.
What is Rust's equivalent of BeautifulSoup?
The scraper crate is Rust's equivalent of BeautifulSoup or Nokogiri. It parses HTML into a tree structure and lets you query it using CSS selectors. It is built on html5ever, the spec-compliant parser from Mozilla's Servo project, making it both fast and correct on real-world HTML.
How does reqwest compare to Python's requests?
reqwest is Rust's most popular HTTP client, similar to Python's requests library. It supports async/await natively and handles cookies, redirects, and TLS out of the box. Paired with the tokio runtime, it enables massive concurrency on a single process without a GIL-style bottleneck.
Should I use async or blocking requests?
Use async (reqwest with tokio) for any scraper that hits multiple URLs. Async lets you run hundreds of concurrent requests on a single thread, dramatically reducing total scrape time. Use blocking only for simple one-off scripts where adding tokio feels like overkill.