1. Basic scraping with cURL + DOMDocument
PHP has cURL built in — no external dependencies needed for HTTP requests. Combine it with DOMDocument and DOMXPath for HTML parsing:
```php
<?php

// Fetch the page with cURL
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://books.toscrape.com/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => [
        "User-Agent: Mozilla/5.0 (compatible; MyBot/1.0)"
    ]
]);
$html = curl_exec($ch);
curl_close($ch);

// Parse with DOMDocument
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Extract book titles and prices
$articles = $xpath->query("//article[contains(@class, 'product_pod')]");
$books = [];
foreach ($articles as $article) {
    $title = $xpath->query(".//h3/a/@title", $article)->item(0)->value;
    $price = $xpath->query(".//*[contains(@class, 'price_color')]", $article)
        ->item(0)->textContent;
    $books[] = ["title" => $title, "price" => trim($price)];
}

foreach (array_slice($books, 0, 5) as $book) {
    echo "{$book['title']}: {$book['price']}\n";
}
```

2. High-level scraping with Goutte
Goutte wraps Symfony's BrowserKit and DomCrawler into a clean scraping API. It supports CSS selectors, link following, and form submission. (Note that Goutte has since been deprecated in favor of Symfony's HttpBrowser from symfony/browser-kit, which offers a near-identical API.)
```bash
composer require fabpot/goutte
```

```php
<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$client->setHeader('User-Agent',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0');

// Fetch and parse in one step
$crawler = $client->request('GET', 'https://books.toscrape.com/');

// Extract with CSS selectors (just like jQuery)
$books = $crawler->filter('article.product_pod')->each(function ($node) {
    return [
        'title' => $node->filter('h3 a')->attr('title'),
        'price' => trim($node->filter('.price_color')->text()),
    ];
});

foreach (array_slice($books, 0, 5) as $book) {
    echo "{$book['title']}: {$book['price']}\n";
}

// Follow links
$nextPage = $client->click($crawler->selectLink('next')->link());

// Submit forms
$crawler = $client->request('GET', 'https://example.com/login');
$form = $crawler->selectButton('Log in')->form([
    'email' => 'user@example.com',
    'password' => 'password123',
]);
$client->submit($form);
```

3. Rate limiting and retries
Without rate limiting, you will get IP-banned fast. Add delays and handle 429 responses:
```php
<?php

$urls = [];
for ($i = 1; $i <= 50; $i++) {
    $urls[] = "https://example.com/page/{$i}";
}

$headers = ["User-Agent: Mozilla/5.0 (compatible; MyBot/1.0)"];
$results = [];

foreach ($urls as $index => $url) {
    $attempts = 0;
    do {
        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HTTPHEADER => $headers,
            CURLOPT_TIMEOUT => 10,
        ]);
        $html = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($httpCode === 429) {
            // Back off, then retry (up to 3 attempts per URL)
            echo "Rate limited on {$url}, backing off...\n";
            sleep(30);
        }
    } while ($httpCode === 429 && ++$attempts < 3);

    if ($httpCode === 200) {
        // Parse and extract data here, then store the result
        $results[] = $html;
    }

    // Random delay between requests: 1-3 seconds
    usleep(random_int(1000000, 3000000));
    echo "Progress: " . ($index + 1) . "/" . count($urls) . "\n";
}

echo "Scraped " . count($results) . " pages\n";
```

4. JavaScript pages with SnapRender
cURL and DOMDocument cannot execute JavaScript. React, Vue, and Angular apps return empty HTML shells. SnapRender renders the page server-side and returns the result via a simple cURL call:
Render as markdown
```php
<?php

// Render any page as clean markdown (handles JS)
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://api.snaprender.dev/v1/render",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        "x-api-key: sr_live_YOUR_KEY",
        "Content-Type: application/json"
    ],
    CURLOPT_POSTFIELDS => json_encode([
        "url" => "https://example.com/spa-page",
        "format" => "markdown"
    ])
]);
$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
echo $data["data"]["markdown"];
```

Extract structured data
```php
<?php

// Extract structured data with CSS selectors
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://api.snaprender.dev/v1/extract",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        "x-api-key: sr_live_YOUR_KEY",
        "Content-Type: application/json"
    ],
    CURLOPT_POSTFIELDS => json_encode([
        "url" => "https://example.com/products/widget-pro",
        "selectors" => [
            "name" => "h1.product-title",
            "price" => ".price-current",
            "rating" => ".star-rating",
            "description" => ".product-description p",
            "in_stock" => ".availability-status"
        ]
    ])
]);
$response = curl_exec($ch);
curl_close($ch);

print_r(json_decode($response, true));
```

Bypass anti-bot protection
```php
<?php

// Bypass Cloudflare / anti-bot protection
$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://api.snaprender.dev/v1/render",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        "x-api-key: sr_live_YOUR_KEY",
        "Content-Type: application/json"
    ],
    CURLOPT_POSTFIELDS => json_encode([
        "url" => "https://protected-site.com/data",
        "format" => "markdown",
        "use_flaresolverr" => true
    ])
]);
$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
echo $data["data"]["markdown"];
```

Comparison: when to use what
| Approach | Best for | Limitation |
|---|---|---|
| cURL + DOMDocument | Static HTML, no dependencies | Verbose, no JS |
| Goutte | Forms, CSS selectors, sessions | No JS rendering |
| php-webdriver (Selenium) | JS pages, local control | High RAM, slow |
| SnapRender API | JS + anti-bot + scale | API cost at high volume |
Skip the browser infrastructure
SnapRender handles JavaScript rendering, anti-bot bypass, and data extraction. One cURL call from your PHP script — no Selenium, no browser binaries.
Frequently asked questions
Is PHP good for web scraping?
PHP is excellent for web scraping, especially if your stack is already PHP-based. cURL is built in, DOMDocument handles HTML parsing natively, and Goutte provides a high-level scraping API. PHP scrapers integrate naturally into Laravel, Symfony, and WordPress projects.
What is Goutte?
Goutte is an HTTP scraping library built on Symfony components (BrowserKit + DomCrawler). It provides a clean API for making requests, following links, submitting forms, and extracting data with CSS selectors. Think of it as Python's BeautifulSoup + Requests combined into one package.
How do I scrape JavaScript-rendered pages in PHP?
Native PHP cannot execute JavaScript. You have three options: (1) use a headless browser via php-webdriver (Selenium), (2) find the underlying API endpoints that serve JSON data, or (3) use SnapRender's API, which handles rendering and returns fully rendered HTML or markdown via a simple cURL call.
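Option (2) is often the fastest path. Many JS frameworks ship their data as JSON inside the initial HTML (Next.js, for example, embeds it in a `<script id="__NEXT_DATA__">` tag), so you can decode that blob directly instead of rendering anything. A minimal sketch, using a made-up inline page in place of a real cURL response:

```php
<?php

// Hypothetical SPA shell: the visible DOM is empty, but the page data is
// embedded as JSON in a script tag. In real use, $html would come from cURL.
$html = <<<HTML
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"product":{"name":"Widget Pro","price":"19.99"}}}}
</script>
</body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// Grab the embedded JSON and decode it - no JS engine needed
$script = $doc->getElementById("__NEXT_DATA__");
$payload = json_decode($script->textContent, true);

$product = $payload["props"]["pageProps"]["product"];
echo "{$product['name']}: {$product['price']}\n"; // Widget Pro: 19.99
```

The payload path (`props.pageProps.product`) is specific to each site; inspect the page source in your browser to find where the data actually lives.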
How do I parse HTML in PHP?
PHP has built-in DOMDocument and DOMXPath classes. Load HTML with loadHTML(), then query with XPath expressions or getElementsByTagName(). For CSS selectors, use the Symfony DomCrawler component. DOMDocument handles malformed HTML reasonably well with libxml error suppression.
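In miniature, that workflow looks like this (the HTML snippet is made up for the demo):

```php
<?php

// Parse an inline HTML fragment: loadHTML(), then query it two ways
$html = '<ul><li class="item">Alpha</li><li class="item">Beta</li></ul>';

libxml_use_internal_errors(true); // suppress warnings on malformed HTML
$doc = new DOMDocument();
$doc->loadHTML($html);

// Tag-based access
$items = $doc->getElementsByTagName('li');
echo $items->length . " items\n"; // 2 items

// XPath for anything more precise
$xpath = new DOMXPath($doc);
$first = $xpath->query("//li[@class='item']")->item(0);
echo $first->textContent . "\n"; // Alpha
```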
Can I use Laravel for web scraping?
Yes. Laravel's HTTP client (built on Guzzle) handles requests, and you can use Symfony DomCrawler or Goutte for parsing. Laravel's queue system (jobs + workers) is ideal for large-scale scraping pipelines, with rate limiting, retries, and failure handling built in.