1. HTML parsing with Jsoup
Jsoup handles both HTTP fetching and HTML parsing. Add it to your Maven or Gradle project:
<!-- Maven -->
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.17.2</version>
</dependency>
// Gradle
implementation("org.jsoup:jsoup:1.17.2")

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Scraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse in one step
        Document doc = Jsoup.connect("https://books.toscrape.com/")
                .userAgent("Mozilla/5.0 (compatible; MyBot/1.0)")
                .timeout(10000)
                .get();
        // Extract book titles and prices with CSS selectors
        Elements articles = doc.select("article.product_pod");
        for (Element article : articles) {
            String title = article.select("h3 a").attr("title");
            String price = article.select(".price_color").text();
            System.out.printf("%s: %s%n", title, price);
        }
    }
}
Jsoup's CSS selector engine supports the same selectors you use in front-end development. The select() method returns all matches, while selectFirst() returns the first match (or null if nothing matches).
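The select()/selectFirst() distinction is easy to see on an inline snippet; Jsoup.parse(String) works on any HTML string with no network call (the class name and markup here are just for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    public static void main(String[] args) {
        // Parse an HTML string directly -- no HTTP request involved
        Document doc = Jsoup.parse(
                "<ul><li class='a'>first</li><li class='a'>second</li></ul>");
        // select() returns every element matching the selector
        System.out.println(doc.select("li.a").size());
        // selectFirst() returns only the first match (or null)
        Element first = doc.selectFirst("li.a");
        System.out.println(first.text());
    }
}
```

This prints 2 and then first. Because selectFirst() can return null, guard against it when a selector might not match.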
2. Custom requests with HttpClient
Java 11+ includes a modern HTTP client with HTTP/2 support, async requests, and full header control:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CustomScraper {
    public static void main(String[] args) throws Exception {
        // Java 11+ HttpClient with custom headers
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/products"))
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                        + "Chrome/124.0.0.0 Safari/537.36")
                .header("Accept", "text/html,application/xhtml+xml")
                .header("Accept-Language", "en-US,en;q=0.9")
                .header("Referer", "https://www.google.com/")
                .GET()
                .build();
        HttpResponse<String> response = client.send(
                request,
                HttpResponse.BodyHandlers.ofString()
        );
        System.out.println("Status: " + response.statusCode());
        // Parse the response with Jsoup
        Document doc = Jsoup.parse(response.body());
        // ... extract data
    }
}
3. JavaScript rendering with Selenium
When you need to scrape JavaScript-rendered SPAs, Selenium WebDriver automates a real browser:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import java.util.List;

public class SeleniumScraper {
    public static void main(String[] args) throws InterruptedException {
        // Configure headless Chrome
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        options.addArguments("--disable-gpu");
        options.addArguments("--no-sandbox");
        WebDriver driver = new ChromeDriver(options);
        try {
            // Navigate and wait for JS to render
            driver.get("https://example.com/spa-page");
            // Wait for content (an explicit wait is recommended over a fixed sleep)
            Thread.sleep(3000);
            // Extract data from the rendered DOM
            List<WebElement> cards = driver.findElements(
                    By.cssSelector(".product-card")
            );
            for (WebElement card : cards) {
                String name = card.findElement(
                        By.cssSelector(".name")).getText();
                String price = card.findElement(
                        By.cssSelector(".price")).getText();
                System.out.printf("%s: %s%n", name, price);
            }
        } finally {
            driver.quit();
        }
    }
}
Selenium pain points
- Each Chrome instance uses 200-400 MB of RAM — scaling requires significant infrastructure
- ChromeDriver versions must match Chrome versions exactly — breaks on auto-updates
- Anti-bot systems detect Selenium's browser fingerprint (the navigator.webdriver flag)
- Startup time is 2-5 seconds per browser instance — slow for batch scraping
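The fixed Thread.sleep in the example above is one source of fragility: it always waits the full three seconds and still fails if rendering takes longer. Selenium 4's explicit waits poll for a condition instead. A minimal sketch of the replacement (same hypothetical .product-card selector as above; this fragment slots into the try block):

```java
// Additional imports:
import java.time.Duration;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// Inside the try block, instead of Thread.sleep(3000):
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
List<WebElement> cards = wait.until(
        ExpectedConditions.presenceOfAllElementsLocatedBy(
                By.cssSelector(".product-card")));
```

The wait returns as soon as the elements appear and throws TimeoutException only if ten seconds pass without a match.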
4. Rate limiting and retries
Without rate limiting, servers will ban your IP. Add delays and handle 429 responses:
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.ThreadLocalRandom;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RateLimitedScraper {
    private static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15"
    );

    private static String getRandomUserAgent() {
        return USER_AGENTS.get(
                ThreadLocalRandom.current().nextInt(USER_AGENTS.size()));
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= 50; i++) {
            urls.add("https://example.com/page/" + i);
        }
        List<Document> results = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            String url = urls.get(i);
            try {
                // ignoreHttpErrors lets us inspect a 429 instead of
                // having execute() throw an HttpStatusException
                var response = Jsoup.connect(url)
                        .userAgent(getRandomUserAgent())
                        .timeout(10000)
                        .ignoreHttpErrors(true)
                        .execute();
                if (response.statusCode() == 429) {
                    System.out.println("Rate limited, backing off...");
                    Thread.sleep(30000);
                    response = Jsoup.connect(url)
                            .userAgent(getRandomUserAgent())
                            .ignoreHttpErrors(true)
                            .execute();
                }
                if (response.statusCode() == 200) {
                    var doc = response.parse();
                    // ... extract data
                    results.add(doc);
                }
            } catch (Exception e) {
                System.err.printf("Error on %s: %s%n", url, e.getMessage());
            }
            // Random delay: 1-3 seconds
            long delay = ThreadLocalRandom.current().nextLong(1000, 3000);
            Thread.sleep(delay);
            System.out.printf("Progress: %d/%d%n", i + 1, urls.size());
        }
        System.out.printf("Scraped %d pages%n", results.size());
    }
}
5. The easier way: SnapRender API
Skip Selenium entirely. SnapRender renders JavaScript pages server-side and returns the result via a simple HttpClient call:
Render as markdown
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SnapRenderExample {
    public static void main(String[] args) throws Exception {
        // Render any page as clean markdown (handles JS)
        HttpClient client = HttpClient.newHttpClient();
        String json = """
                {
                  "url": "https://example.com/spa-page",
                  "format": "markdown"
                }
                """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.snaprender.dev/v1/render"))
                .header("x-api-key", "sr_live_YOUR_KEY")
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response = client.send(
                request,
                HttpResponse.BodyHandlers.ofString()
        );
        System.out.println(response.body());
    }
}
Extract structured data
// Extract structured data with CSS selectors
String json = """
        {
          "url": "https://example.com/products/widget-pro",
          "selectors": {
            "name": "h1.product-title",
            "price": ".price-current",
            "rating": ".star-rating",
            "description": ".product-description p",
            "in_stock": ".availability-status"
          }
        }
        """;
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.snaprender.dev/v1/extract"))
        .header("x-api-key", "sr_live_YOUR_KEY")
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build();
HttpResponse<String> response = client.send(
        request,
        HttpResponse.BodyHandlers.ofString()
);
System.out.println(response.body());
Bypass anti-bot protection
// Bypass Cloudflare / anti-bot protection
String json = """
        {
          "url": "https://protected-site.com/data",
          "format": "markdown",
          "use_flaresolverr": true
        }
        """;
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.snaprender.dev/v1/render"))
        .header("x-api-key", "sr_live_YOUR_KEY")
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build();
HttpResponse<String> response = client.send(
        request,
        HttpResponse.BodyHandlers.ofString()
);
// Returns fully rendered content even behind
// Cloudflare, DataDome, etc.
System.out.println(response.body());
Comparison: when to use what
| Approach | Best for | Limitation |
|---|---|---|
| Jsoup | Static HTML, fast parsing | No JS rendering |
| HttpClient + Jsoup | Custom headers, HTTP/2 | No JS rendering |
| Selenium WebDriver | JS pages, full automation | High RAM, slow, fragile |
| SnapRender API | JS + anti-bot + scale | API cost at high volume |
Skip the browser infrastructure
SnapRender handles JavaScript rendering, anti-bot bypass, and data extraction. One HttpClient call from your Java app — no Selenium, no ChromeDriver versioning headaches.
Get Your API Key — Free
Frequently asked questions
Is Java good for web scraping?
Java is excellent for enterprise-grade web scraping. Jsoup provides fast, reliable HTML parsing with CSS selectors, the built-in HttpClient (Java 11+) handles async requests efficiently, and Selenium WebDriver offers full browser automation. Java scrapers benefit from the JVM's mature concurrency model and extensive logging/monitoring ecosystem.
What is Jsoup?
Jsoup is the most popular Java library for HTML parsing and manipulation. It provides a clean API for fetching URLs, parsing HTML, extracting data with CSS selectors or DOM traversal, and cleaning user-submitted HTML. It handles malformed HTML gracefully and is used in production by thousands of companies.
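The HTML-cleaning capability mentioned above works through Jsoup's Safelist, which whitelists the tags allowed to survive sanitization. A small sketch (the input string is just an example):

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanDemo {
    public static void main(String[] args) {
        String dirty = "<p>Hello <script>alert('x')</script><b>world</b></p>";
        // Safelist.basic() permits simple text-formatting tags;
        // disallowed elements like <script> are stripped entirely
        String clean = Jsoup.clean(dirty, Safelist.basic());
        System.out.println(clean);
    }
}
```

The output keeps the paragraph and bold text but drops the script element and its contents, making this a common defense against stored XSS in user-submitted HTML.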
Should I use Jsoup or Selenium?
Use Jsoup for static HTML pages — it is much faster and lighter. Use Selenium only when you need JavaScript execution (SPAs, dynamic content). For JS-rendered pages at scale, consider SnapRender's API instead of Selenium to avoid managing browser infrastructure.
How do I scrape JavaScript-rendered pages in Java?
Java's HttpClient and Jsoup cannot execute JavaScript. Options: (1) Selenium WebDriver with headless Chrome, (2) HtmlUnit (lightweight headless browser with limited JS support), or (3) SnapRender's API which renders JavaScript server-side and returns HTML/markdown via a simple HTTP call.
Is Java or Python better for web scraping?
Java is more verbose but offers better performance for large-scale scraping, stronger typing that catches bugs at compile time, and superior concurrency primitives (CompletableFuture, virtual threads in Java 21+). Python is faster to prototype and has more scraping libraries. For production enterprise scrapers, Java is often the better choice.
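The CompletableFuture concurrency mentioned above maps naturally onto scraping: fire many fetches at once, then join the results. An offline sketch of the pattern (the fetch method here only simulates a request; a real scraper would call client.sendAsync(...) in its place, and the URLs are placeholders):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ConcurrentDemo {
    // Stand-in for a real fetch; a scraper would return
    // client.sendAsync(request, bodyHandler) here instead
    static CompletableFuture<String> fetch(String url) {
        return CompletableFuture.supplyAsync(() -> "fetched " + url);
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "https://example.com/a",
                "https://example.com/b");
        // Start all tasks concurrently; each future completes independently
        List<CompletableFuture<String>> futures =
                urls.stream().map(ConcurrentDemo::fetch).toList();
        // join() blocks until each future completes, preserving list order
        futures.forEach(f -> System.out.println(f.join()));
    }
}
```

Because the futures are joined in list order, output order is deterministic even though the tasks run concurrently.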