Web Scraping with Rust: The Complete Guide

Rust gives you C-level performance with memory safety guarantees. The reqwest crate handles HTTP, scraper parses HTML with CSS selectors, and tokio enables massive async concurrency. This guide covers everything from your first scraper to production pipelines with SnapRender.

What you will learn

1. reqwest + scraper basics
2. CSS selectors in Rust
3. Async concurrent scraping
4. Error handling and retries
5. Data storage (JSON, CSV)
6. Handling JS-rendered pages
7. Anti-bot bypass with SnapRender
8. Structured data extraction

1. Setting up your Rust scraper

Add these dependencies to your Cargo.toml:

Cargo.toml
[dependencies]
reqwest = { version = "0.12", features = ["json"] }
scraper = "0.20"
tokio = { version = "1", features = ["full"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
csv = "1"

Here is a complete scraper that extracts book titles and prices:

src/main.rs
use reqwest;
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::builder()
        .user_agent("Mozilla/5.0 (compatible; MyBot/1.0)")
        .build()?;

    let body = client
        .get("https://books.toscrape.com/")
        .send()
        .await?
        .text()
        .await?;

    let document = Html::parse_document(&body);
    let book_sel = Selector::parse("article.product_pod").unwrap();
    let title_sel = Selector::parse("h3 a").unwrap();
    let price_sel = Selector::parse(".price_color").unwrap();

    for book in document.select(&book_sel) {
        let title = book
            .select(&title_sel)
            .next()
            .and_then(|el| el.value().attr("title"))
            .unwrap_or("N/A");

        let price = book
            .select(&price_sel)
            .next()
            .map(|el| el.text().collect::<String>())
            .unwrap_or_default();

        println!("{}: {}", title, price.trim());
    }

    Ok(())
}

The scraper crate uses the same CSS selectors you know from front-end development. select() returns an iterator over all matches; call .next() to take the first one.

2. Async concurrent scraping

Rust's async runtime lets you scrape hundreds of pages concurrently on a small pool of threads. Use JoinSet to manage the concurrent tasks:

concurrent.rs
use reqwest::Client;
use scraper::{Html, Selector};
use tokio::task::JoinSet;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::builder()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
        .build()?;

    let urls: Vec<String> = (1..=20)
        .map(|i| format!("https://example.com/page/{}", i))
        .collect();

    let mut set = JoinSet::new();

    for url in urls {
        let client = client.clone();
        set.spawn(async move {
            let resp = client.get(&url).send().await?;
            let body = resp.text().await?;
            let doc = Html::parse_document(&body);
            // ... extract data
            Ok::<_, reqwest::Error>((url, doc.html().len()))
        });
    }

    while let Some(result) = set.join_next().await {
        match result? {
            Ok((url, size)) => println!("{}: {} bytes", url, size),
            Err(e) => eprintln!("Error: {}", e),
        }
    }

    Ok(())
}

Pro tip

Use a semaphore (tokio::sync::Semaphore) to cap concurrency at 10-20 requests. This prevents overwhelming target servers and avoids IP bans.

3. Error handling and retries

Production scrapers need exponential backoff and proper error handling. Rust's type system makes error handling explicit:

retry.rs
use std::time::Duration;
use tokio::time::sleep;

async fn fetch_with_retry(
    client: &reqwest::Client,
    url: &str,
    max_retries: u32,
) -> Result<String, reqwest::Error> {
    let mut attempts = 0;

    loop {
        match client.get(url).send().await {
            Ok(resp) if resp.status().is_success() => {
                return resp.text().await;
            }
            Ok(resp) if resp.status() == 429 => {
                attempts += 1;
                if attempts >= max_retries {
                    // Out of retries: surface the 429 as an error instead of
                    // returning the rate-limit page as if it were real data.
                    return resp.error_for_status()?.text().await;
                }
                let wait = Duration::from_secs(2u64.pow(attempts));
                eprintln!("Rate limited on {}, waiting {:?}", url, wait);
                sleep(wait).await;
            }
            Ok(resp) => {
                eprintln!("HTTP {} on {}", resp.status(), url);
                attempts += 1;
                if attempts >= max_retries {
                    // Out of retries: convert the error status into an Err.
                    return resp.error_for_status()?.text().await;
                }
                sleep(Duration::from_secs(1)).await;
            }
            Err(e) => {
                attempts += 1;
                if attempts >= max_retries {
                    return Err(e);
                }
                sleep(Duration::from_secs(1)).await;
            }
        }
    }
}

4. Storing scraped data

Use serde for serialization. Rust's derive macros make JSON and CSV export trivial (the CSV path relies on the csv crate):

storage.rs
use serde::{Deserialize, Serialize};
use std::fs;

#[derive(Debug, Serialize, Deserialize)]
struct Product {
    name: String,
    price: String,
    url: String,
}

fn save_to_json(products: &[Product], path: &str) {
    let json = serde_json::to_string_pretty(products).unwrap();
    fs::write(path, json).unwrap();
    println!("Saved {} products to {}", products.len(), path);
}

fn save_to_csv(products: &[Product], path: &str) {
    let mut wtr = csv::Writer::from_path(path).unwrap();
    for product in products {
        wtr.serialize(product).unwrap();
    }
    wtr.flush().unwrap();
    println!("Saved {} products to {}", products.len(), path);
}

5. Handling JavaScript pages with SnapRender

reqwest and scraper cannot execute JavaScript, so single-page apps built with React, Vue, or Angular return empty HTML shells. Instead of managing headless browsers yourself, use SnapRender to get fully rendered content via a simple HTTP call:

Render as markdown

Get any JavaScript-rendered page as clean markdown. Perfect for LLM pipelines or content extraction.

render.rs
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // Render any JS-heavy page as clean markdown
    let resp = client
        .post("https://api.snaprender.dev/v1/render")
        .header("x-api-key", "sr_live_YOUR_KEY")
        .json(&json!({
            "url": "https://example.com/spa-page",
            "format": "markdown"
        }))
        .send()
        .await?;

    let data: serde_json::Value = resp.json().await?;
    println!("{}", data["data"]["markdown"]);

    Ok(())
}

Extract structured data

Use CSS selectors to pull specific fields. Returns clean JSON — no parsing needed.

extract.rs
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // Extract structured data with CSS selectors
    let resp = client
        .post("https://api.snaprender.dev/v1/extract")
        .header("x-api-key", "sr_live_YOUR_KEY")
        .json(&json!({
            "url": "https://example.com/products/widget-pro",
            "selectors": {
                "name": "h1.product-title",
                "price": ".price-current",
                "rating": ".star-rating",
                "description": ".product-description p",
                "in_stock": ".availability-status"
            }
        }))
        .send()
        .await?;

    let data: serde_json::Value = resp.json().await?;
    println!("{}", serde_json::to_string_pretty(&data)?);

    Ok(())
}

Comparison: when to use what

Approach                 Best for                    Limitation
reqwest + scraper        Static HTML, high perf      No JS rendering
headless_chrome crate    JS pages, local control     High RAM, complex setup
fantoccini (WebDriver)   Browser automation          Slow, resource-heavy
SnapRender API           JS + anti-bot + scale       API cost at high volume

Skip the browser infrastructure

SnapRender handles JavaScript rendering, anti-bot bypass, and data extraction. Just send a URL from your Rust program and get results back as JSON.

Frequently asked questions

Is Rust a good language for web scraping?

Rust is excellent for web scraping when performance matters. The reqwest + scraper combo is fast, memory-safe, and compiles to a single binary with no runtime dependencies. For high-volume scraping (millions of pages), Rust significantly outperforms Python and Node.js.

Can a Rust scraper execute JavaScript?

Standard Rust HTTP crates (reqwest, ureq) cannot execute JavaScript. You need either a headless browser binding like headless_chrome or fantoccini, or an API like SnapRender that renders JavaScript server-side and returns the fully rendered HTML or markdown.

What is the scraper crate?

The scraper crate is Rust's equivalent of BeautifulSoup or Nokogiri. It parses HTML into a tree structure and lets you query it using CSS selectors. It is built on html5ever, the parser developed for Mozilla's Servo engine, making it fast and spec-compliant.

How does reqwest compare to Python's requests?

reqwest is Rust's most popular HTTP client, similar in spirit to Python's requests library. It supports async/await natively and handles cookies, redirects, and TLS. Combined with the tokio runtime, it enables massive concurrency with none of Python's GIL bottleneck.

Should I use async or blocking requests?

Use async (reqwest with tokio) for any scraper that hits multiple URLs: it lets you run hundreds of concurrent requests on a handful of threads, dramatically reducing total scrape time. Use the blocking API only for simple one-off scripts where adding tokio feels like overkill.