
Web Scraping with PHP: The Complete Guide

15 min read

PHP powers 77% of websites with a known server-side language, and its scraping capabilities are often overlooked. cURL is built right into the language, DOMDocument handles HTML parsing natively, and Goutte provides a high-level scraping API on top of Symfony components. This guide takes you from basic cURL requests to production-grade scraping with SnapRender.

What you will learn

1. cURL + DOMDocument basics
2. XPath queries for data extraction
3. Goutte for high-level scraping
4. Form submission and sessions
5. Rate limiting and retries
6. Handling JS-rendered pages
7. Anti-bot bypass with SnapRender
8. Structured data extraction API

1. Basic scraping with cURL + DOMDocument

PHP has cURL built in — no external dependencies needed for HTTP requests. Combine it with DOMDocument and DOMXPath for HTML parsing:

scraper.php
<?php
// Fetch the page with cURL
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://books.toscrape.com/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => [
        "User-Agent: Mozilla/5.0 (compatible; MyBot/1.0)"
    ]
]);
$html = curl_exec($ch);
curl_close($ch);

// Parse with DOMDocument
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Extract book titles and prices
$articles = $xpath->query("//article[contains(@class, 'product_pod')]");
$books = [];

foreach ($articles as $article) {
    $title = $xpath->query(".//h3/a/@title", $article)->item(0)->value;
    $price = $xpath->query(".//*[contains(@class, 'price_color')]", $article)
        ->item(0)->textContent;
    $books[] = ["title" => $title, "price" => trim($price)];
}

foreach (array_slice($books, 0, 5) as $book) {
    echo "{$book['title']}: {$book['price']}\n";
}

2. High-level scraping with Goutte

Goutte wraps Symfony's BrowserKit and DomCrawler into a clean scraping API with CSS selectors, link following, and form submission. Note that the fabpot/goutte package is now archived: its final release is a thin proxy over Symfony's HttpBrowser, so the same API lives on in symfony/browser-kit if you want an actively maintained dependency.

terminal
composer require fabpot/goutte
goutte_scraper.php
<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
// Goutte's Client extends Symfony's AbstractBrowser, so headers are set
// as server parameters: HTTP_ prefix plus the uppercased header name
$client->setServerParameter('HTTP_USER_AGENT',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0');

// Fetch and parse in one step
$crawler = $client->request('GET', 'https://books.toscrape.com/');

// Extract with CSS selectors (just like jQuery)
$books = $crawler->filter('article.product_pod')->each(function ($node) {
    return [
        'title' => $node->filter('h3 a')->attr('title'),
        'price' => trim($node->filter('.price_color')->text()),
    ];
});

foreach (array_slice($books, 0, 5) as $book) {
    echo "{$book['title']}: {$book['price']}\n";
}

// Follow links
$nextPage = $client->click($crawler->selectLink('next')->link());

// Submit forms
$crawler = $client->request('GET', 'https://example.com/login');
$form = $crawler->selectButton('Log in')->form([
    'email' => 'user@example.com',
    'password' => 'password123',
]);
$client->submit($form);

3. Rate limiting and retries

Without rate limiting, you will get IP-banned fast. Add delays and handle 429 responses:

rate_limit.php
<?php
$headers = ["User-Agent: Mozilla/5.0 (compatible; MyBot/1.0)"];

$urls = [];
for ($i = 1; $i <= 50; $i++) {
    $urls[] = "https://example.com/page/{$i}";
}

$results = [];

foreach ($urls as $index => $url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER => $headers,
        CURLOPT_TIMEOUT => 10,
    ]);

    $html = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($httpCode === 429) {
        echo "Rate limited on {$url}, backing off...\n";
        sleep(30);
        // Retry...
    }

    if ($httpCode === 200) {
        // Parse and extract data from $html here (DOMDocument, Goutte, ...)
        $results[] = $html;
    }

    // Random delay: 1-3 seconds
    usleep(random_int(1000000, 3000000));
    echo "Progress: " . ($index + 1) . "/" . count($urls) . "\n";
}

echo "Scraped " . count($results) . " pages\n";
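The back-off above is a fixed 30 seconds; exponential back-off is usually kinder to both sides and recovers faster from transient limits. A sketch of a retry helper, where the callable stands in for the cURL fetch shown above:

```php
<?php
// Retry a fetch with exponential back-off: 1s, 2s, 4s, ... between attempts.
// $fetch returns [httpCode, body]; anything but 200 or 429 is treated as fatal.
function fetchWithRetry(callable $fetch, int $maxAttempts = 4): ?string {
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        [$code, $body] = $fetch();
        if ($code === 200) {
            return $body;
        }
        if ($code !== 429) {
            return null; // non-retryable error
        }
        sleep(2 ** $attempt); // 1, 2, 4, 8 seconds
    }
    return null; // gave up
}

// Demo with a stub that rate-limits twice, then succeeds
$calls = 0;
$stub = function () use (&$calls) {
    $calls++;
    return $calls < 3 ? [429, ''] : [200, '<html>ok</html>'];
};
echo fetchWithRetry($stub) . "\n"; // <html>ok</html>
```

In the real loop you would wrap the curl_exec/curl_getinfo pair in the callable, so the retry policy stays separate from the transport code.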

4. JavaScript pages with SnapRender

cURL and DOMDocument cannot execute JavaScript. React, Vue, and Angular apps return empty HTML shells. SnapRender renders the page server-side and returns the result via a simple cURL call:

Render as markdown

render.php
<?php
// Render any page as clean markdown (handles JS)
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://api.snaprender.dev/v1/render",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        "x-api-key: sr_live_YOUR_KEY",
        "Content-Type: application/json"
    ],
    CURLOPT_POSTFIELDS => json_encode([
        "url" => "https://example.com/spa-page",
        "format" => "markdown"
    ])
]);

$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
echo $data["data"]["markdown"];
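Before echoing the result, it is worth guarding the decode step: curl_exec returns false on transport errors, and a non-200 body may not be JSON at all. A small helper sketch (the function name is my own, not part of any SnapRender client):

```php
<?php
// Defensively decode an API response; returns null on any failure.
// curl_exec() yields false on transport errors, and error bodies
// from an API are not guaranteed to be valid JSON.
function decodeApiResponse($response, int $httpCode): ?array {
    if ($response === false || $httpCode !== 200) {
        return null;
    }
    $data = json_decode($response, true);
    return is_array($data) ? $data : null;
}

$ok = decodeApiResponse('{"data":{"markdown":"# Hi"}}', 200);
echo $ok["data"]["markdown"] . "\n"; // # Hi

var_export(decodeApiResponse("not json", 200)); // NULL
```

In the script above you would grab the status with curl_getinfo($ch, CURLINFO_HTTP_CODE) before curl_close, then bail out early when the helper returns null.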

Extract structured data

extract.php
<?php
// Extract structured data with CSS selectors
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://api.snaprender.dev/v1/extract",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        "x-api-key: sr_live_YOUR_KEY",
        "Content-Type: application/json"
    ],
    CURLOPT_POSTFIELDS => json_encode([
        "url" => "https://example.com/products/widget-pro",
        "selectors" => [
            "name" => "h1.product-title",
            "price" => ".price-current",
            "rating" => ".star-rating",
            "description" => ".product-description p",
            "in_stock" => ".availability-status"
        ]
    ])
]);

$response = curl_exec($ch);
curl_close($ch);

print_r(json_decode($response, true));

Bypass anti-bot protection

bypass.php
<?php
// Bypass Cloudflare / anti-bot protection
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://api.snaprender.dev/v1/render",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        "x-api-key: sr_live_YOUR_KEY",
        "Content-Type: application/json"
    ],
    CURLOPT_POSTFIELDS => json_encode([
        "url" => "https://protected-site.com/data",
        "format" => "markdown",
        "use_flaresolverr" => true
    ])
]);

$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
echo $data["data"]["markdown"];

Comparison: when to use what

| Approach | Best for | Limitation |
| --- | --- | --- |
| cURL + DOMDocument | Static HTML, no dependencies | Verbose, no JS |
| Goutte | Forms, CSS selectors, sessions | No JS rendering |
| php-webdriver (Selenium) | JS pages, local control | High RAM, slow |
| SnapRender API | JS + anti-bot + scale | API cost at high volume |

Skip the browser infrastructure

SnapRender handles JavaScript rendering, anti-bot bypass, and data extraction. One cURL call from your PHP script — no Selenium, no browser binaries.

Get Your API Key — Free

Frequently asked questions

Is PHP good for web scraping?

PHP is excellent for web scraping, especially if your stack is already PHP-based. cURL is built in, DOMDocument handles HTML parsing natively, and Goutte provides a high-level scraping API. PHP scrapers integrate naturally into Laravel, Symfony, and WordPress projects.

What is Goutte?

Goutte is an HTTP scraping library built on Symfony components (BrowserKit + DomCrawler). It provides a clean API for making requests, following links, submitting forms, and extracting data with CSS selectors. Think of it as Python's BeautifulSoup + Requests combined into one package.

How do I scrape JavaScript-rendered pages in PHP?

Native PHP cannot execute JavaScript. You have three options: (1) use a headless browser via php-webdriver (Selenium), (2) find the underlying API endpoints that serve JSON data, or (3) use SnapRender's API, which handles rendering and returns fully rendered HTML or markdown via a simple cURL call.
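Option (2), hitting the JSON endpoints directly, often removes HTML parsing entirely: once you have the response body, it is a single json_decode call. A sketch with an invented payload:

```php
<?php
// Simulated response body from a hypothetical JSON endpoint
// (in practice: $body = curl_exec($ch) against e.g. /api/products?page=1)
$body = '{"products":[{"name":"Widget Pro","price":29.99},{"name":"Widget Lite","price":9.99}]}';

$data = json_decode($body, true);
if (json_last_error() !== JSON_ERROR_NONE) {
    die("Bad JSON: " . json_last_error_msg() . "\n");
}

foreach ($data['products'] as $p) {
    printf("%s: $%.2f\n", $p['name'], $p['price']);
}
```

You can usually find these endpoints in the browser's Network tab while the page loads; they are faster and more stable to scrape than the rendered HTML.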

How do I parse HTML in PHP?

PHP has built-in DOMDocument and DOMXPath classes. Load HTML with loadHTML(), then query with XPath expressions or getElementsByTagName(). For CSS selectors, use the Symfony DomCrawler component. DOMDocument handles malformed HTML reasonably well with libxml error suppression.

Can I use Laravel for web scraping?

Yes. Laravel's HTTP client (built on Guzzle) handles requests, and you can use Symfony DomCrawler or Goutte for parsing. Laravel's queue system (jobs + workers) is ideal for large-scale scraping pipelines with rate limiting, retries, and failure handling built in.