
Web Scraping with Java: The Complete Guide


Java is the enterprise choice for web scraping. Jsoup provides fast, reliable HTML parsing with CSS selectors, the built-in HttpClient handles HTTP/2 and async requests, and Selenium WebDriver offers full browser automation. This guide covers everything from basic Jsoup parsing to production-grade scraping pipelines with SnapRender.

What you will learn

1. Jsoup HTML parsing basics
2. HttpClient custom requests
3. Selenium WebDriver automation
4. CSS selectors and DOM traversal
5. Rate limiting and retries
6. Handling JS-rendered pages
7. Anti-bot bypass with SnapRender
8. Structured data extraction

1. HTML parsing with Jsoup

Jsoup handles both HTTP fetching and HTML parsing. Add it to your Maven or Gradle project:

pom.xml
<!-- Maven -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
Scraper.java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Scraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse in one step
        Document doc = Jsoup.connect("https://books.toscrape.com/")
            .userAgent("Mozilla/5.0 (compatible; MyBot/1.0)")
            .timeout(10000)
            .get();

        // Extract book titles and prices with CSS selectors
        Elements articles = doc.select("article.product_pod");

        for (Element article : articles) {
            String title = article.select("h3 a").attr("title");
            String price = article.select(".price_color").text();
            System.out.printf("%s: %s%n", title, price);
        }
    }
}

Jsoup's CSS selector engine supports the same selectors you use in front-end development. The select() method returns every matching element as an Elements collection, while selectFirst() returns only the first match, or null if nothing matches.
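You can experiment with selectors without any network call by parsing a static HTML string. A minimal sketch (the markup and class names below are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Parse a fixed snippet; selectFirst() returns null when nothing matches
    static String firstTitle(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst("li.book > span.title");
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        String html = """
            <ul>
              <li class="book"><span class="title">Dune</span><span class="price">£9.50</span></li>
              <li class="book"><span class="title">Hyperion</span><span class="price">£7.99</span></li>
            </ul>
            """;
        System.out.println(firstTitle(html)); // → Dune
    }
}
```

Testing selectors against canned HTML like this is also a handy way to unit-test a scraper without hitting the live site.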

2. Custom requests with HttpClient

Java 11+ includes a modern HTTP client with HTTP/2 support, async requests, and full header control:

CustomScraper.java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CustomScraper {
    public static void main(String[] args) throws Exception {
        // Java 11+ HttpClient with custom headers
        HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://example.com/products"))
            .header("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                + "Chrome/124.0.0.0 Safari/537.36")
            .header("Accept", "text/html,application/xhtml+xml")
            .header("Accept-Language", "en-US,en;q=0.9")
            .header("Referer", "https://www.google.com/")
            .GET()
            .build();

        HttpResponse<String> response = client.send(
            request,
            HttpResponse.BodyHandlers.ofString()
        );

        System.out.println("Status: " + response.statusCode());

        // Parse the response with Jsoup
        Document doc = Jsoup.parse(response.body());
        // ... extract data
    }
}
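The async side of HttpClient is worth a sketch too: sendAsync() returns a CompletableFuture, so several pages can be fetched concurrently from one client. The URLs below are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AsyncFetch {
    static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Build a GET request with a browser-like User-Agent
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("User-Agent", "Mozilla/5.0 (compatible; MyBot/1.0)")
            .GET()
            .build();
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "https://example.com/page/1",
            "https://example.com/page/2");

        // sendAsync returns immediately; the requests run concurrently
        List<CompletableFuture<String>> futures = urls.stream()
            .map(u -> CLIENT.sendAsync(buildRequest(u),
                        HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body))
            .toList();

        // join() blocks until each response arrives
        futures.forEach(f -> System.out.println(f.join().length()));
    }
}
```

Each body string can then be handed to Jsoup.parse() exactly as in the synchronous version.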

3. JavaScript rendering with Selenium

When you need to scrape JavaScript-rendered SPAs, Selenium WebDriver automates a real browser:

SeleniumScraper.java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
import java.time.Duration;
import java.util.List;

public class SeleniumScraper {
    public static void main(String[] args) {
        // Configure headless Chrome
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        options.addArguments("--disable-gpu");
        options.addArguments("--no-sandbox");

        WebDriver driver = new ChromeDriver(options);

        try {
            // Navigate and wait for JS to render
            driver.get("https://example.com/spa-page");

            // Explicit wait: poll until the cards exist (up to 10 s)
            new WebDriverWait(driver, Duration.ofSeconds(10))
                .until(ExpectedConditions.presenceOfElementLocated(
                    By.cssSelector(".product-card")));

            // Extract data from the rendered DOM
            List<WebElement> cards = driver.findElements(
                By.cssSelector(".product-card")
            );

            for (WebElement card : cards) {
                String name = card.findElement(
                    By.cssSelector(".name")).getText();
                String price = card.findElement(
                    By.cssSelector(".price")).getText();
                System.out.printf("%s: %s%n", name, price);
            }
        } finally {
            driver.quit();
        }
    }
}

Selenium pain points

  • Each Chrome instance uses 200-400 MB RAM — scaling requires significant infrastructure
  • ChromeDriver versions must match Chrome versions exactly — breaks on auto-updates
  • Anti-bot systems detect Selenium's browser fingerprint (navigator.webdriver flag)
  • Startup time is 2-5 seconds per browser instance — slow for batch scraping

4. Rate limiting and retries

Without rate limiting, servers will ban your IP. Add delays and handle 429 responses:

RateLimitedScraper.java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import org.jsoup.Jsoup;

public class RateLimitedScraper {
    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= 50; i++) {
            urls.add("https://example.com/page/" + i);
        }

        List<String> results = new ArrayList<>();

        for (int i = 0; i < urls.size(); i++) {
            String url = urls.get(i);
            try {
                var response = Jsoup.connect(url)
                    .userAgent(getRandomUserAgent())
                    .timeout(10000)
                    .ignoreHttpErrors(true) // keep 429s instead of throwing
                    .execute();

                if (response.statusCode() == 429) {
                    System.out.println("Rate limited, backing off...");
                    Thread.sleep(30000);
                    response = Jsoup.connect(url)
                        .userAgent(getRandomUserAgent())
                        .ignoreHttpErrors(true)
                        .execute();
                }

                if (response.statusCode() == 200) {
                    var doc = response.parse();
                    // ... extract data, e.g. the page title
                    results.add(doc.title());
                }

            } catch (Exception e) {
                System.err.printf("Error on %s: %s%n", url, e.getMessage());
            }

            // Random delay: 1-3 seconds
            long delay = ThreadLocalRandom.current().nextLong(1000, 3000);
            Thread.sleep(delay);
            System.out.printf("Progress: %d/%d%n", i + 1, urls.size());
        }

        System.out.printf("Scraped %d pages%n", results.size());
    }

    private static String getRandomUserAgent() {
        String[] agents = {
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15"
        };
        return agents[ThreadLocalRandom.current().nextInt(agents.length)];
    }
}
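The fixed 30-second back-off above is a blunt instrument. A common refinement is exponential backoff with jitter: double the delay on each retry, cap it, and add random noise so concurrent scrapers don't retry in lockstep. A minimal sketch (the base and cap values are arbitrary):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    // Delay = min(cap, base * 2^attempt) plus 0-1 s of random jitter
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long exp = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 20)));
        return exp + ThreadLocalRandom.current().nextLong(0, 1000);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            long d = delayMillis(attempt, 1000, 30_000);
            System.out.printf("attempt %d: would sleep %d ms%n", attempt, d);
            // Thread.sleep(d); // uncomment inside a real retry loop
        }
    }
}
```

Call delayMillis() with the current retry count before each reattempt; after a handful of failures, give up and log the URL for a later pass.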

5. The easier way: SnapRender API

Skip Selenium entirely. SnapRender renders JavaScript pages server-side and returns the result via a simple HttpClient call:

Render as markdown

SnapRenderExample.java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SnapRenderExample {
    public static void main(String[] args) throws Exception {
        // Render any page as clean markdown (handles JS)
        HttpClient client = HttpClient.newHttpClient();

        String json = """
            {
                "url": "https://example.com/spa-page",
                "format": "markdown"
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://api.snaprender.dev/v1/render"))
            .header("x-api-key", "sr_live_YOUR_KEY")
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();

        HttpResponse<String> response = client.send(
            request,
            HttpResponse.BodyHandlers.ofString()
        );

        System.out.println(response.body());
    }
}

Extract structured data

Extract.java
// Extract structured data with CSS selectors
String json = """
    {
        "url": "https://example.com/products/widget-pro",
        "selectors": {
            "name": "h1.product-title",
            "price": ".price-current",
            "rating": ".star-rating",
            "description": ".product-description p",
            "in_stock": ".availability-status"
        }
    }
    """;

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.snaprender.dev/v1/extract"))
    .header("x-api-key", "sr_live_YOUR_KEY")
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(json))
    .build();

HttpResponse<String> response = client.send(
    request,
    HttpResponse.BodyHandlers.ofString()
);
System.out.println(response.body());

Bypass anti-bot protection

Bypass.java
// Bypass Cloudflare / anti-bot protection
String json = """
    {
        "url": "https://protected-site.com/data",
        "format": "markdown",
        "use_flaresolverr": true
    }
    """;

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.snaprender.dev/v1/render"))
    .header("x-api-key", "sr_live_YOUR_KEY")
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(json))
    .build();

HttpResponse<String> response = client.send(
    request,
    HttpResponse.BodyHandlers.ofString()
);
// Returns fully rendered content even behind
// Cloudflare, DataDome, etc.
System.out.println(response.body());

Comparison: when to use what

Approach             | Best for                  | Limitation
Jsoup                | Static HTML, fast parsing | No JS rendering
HttpClient + Jsoup   | Custom headers, HTTP/2    | No JS rendering
Selenium WebDriver   | JS pages, full automation | High RAM, slow, fragile
SnapRender API       | JS + anti-bot + scale     | API cost at high volume

Skip the browser infrastructure

SnapRender handles JavaScript rendering, anti-bot bypass, and data extraction. One HttpClient call from your Java app — no Selenium, no ChromeDriver versioning headaches.

Get Your API Key — Free

Frequently asked questions

Is Java good for web scraping?

Java is excellent for enterprise-grade web scraping. Jsoup provides fast, reliable HTML parsing with CSS selectors, the built-in HttpClient (Java 11+) handles async requests efficiently, and Selenium WebDriver offers full browser automation. Java scrapers benefit from the JVM's mature concurrency model and extensive logging/monitoring ecosystem.

What is Jsoup used for?

Jsoup is the most popular Java library for HTML parsing and manipulation. It provides a clean API for fetching URLs, parsing HTML, extracting data with CSS selectors or DOM traversal, and cleaning user-submitted HTML. It handles malformed HTML gracefully and is used in production by thousands of companies.

Should I use Jsoup or Selenium?

Use Jsoup for static HTML pages — it is much faster and lighter. Use Selenium only when you need JavaScript execution (SPAs, dynamic content). For JS-rendered pages at scale, consider SnapRender's API instead of Selenium to avoid managing browser infrastructure.

How do I scrape JavaScript-rendered pages in Java?

Java's HttpClient and Jsoup cannot execute JavaScript. Options: (1) Selenium WebDriver with headless Chrome, (2) HtmlUnit (lightweight headless browser with limited JS support), or (3) SnapRender's API which renders JavaScript server-side and returns HTML/markdown via a simple HTTP call.

Is Java or Python better for web scraping?

Java is more verbose but offers better performance for large-scale scraping, stronger typing that catches bugs at compile time, and superior concurrency primitives (CompletableFuture, virtual threads in Java 21+). Python is faster to prototype and has more scraping libraries. For production enterprise scrapers, Java is often the better choice.
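The virtual threads mentioned above (Java 21+) map naturally onto scraping, where most time is spent blocked on I/O. A minimal sketch with a pluggable fetcher — in a real scraper the lambda would call Jsoup.connect(url).get() instead of returning a stub:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class VirtualThreadScraper {
    // One virtual thread per URL; cheap even for thousands of tasks
    static List<String> scrapeAll(List<String> urls,
                                  Function<String, String> fetcher) throws Exception {
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetcher.apply(url)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks only the virtual thread
            }
            return results;
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of("https://example.com/a", "https://example.com/b");
        List<String> pages = scrapeAll(urls, url -> "<html>stub for " + url + "</html>");
        System.out.println(pages.size()); // → 2
    }
}
```

Note that rate limiting still matters: a semaphore or per-host delay should gate the fetcher, or virtual threads will happily hammer the target with thousands of concurrent requests.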