
The Complete Guide to Web Scraping in Python

18 min read

Python is the most popular language for web scraping — and for good reason. Libraries like Requests and BeautifulSoup make it easy to get started, while Playwright and SnapRender handle the hard parts: JavaScript rendering, anti-bot bypass, and structured data extraction. This guide covers everything from your first scraper to a production-grade pipeline.

What you will learn

1. Requests + BeautifulSoup basics
2. Setting proper headers
3. Handling JS-rendered pages
4. Rate limiting and retries
5. Data storage (CSV, JSON, SQLite)
6. Anti-bot bypass with SnapRender
7. CSS selector extraction
8. Legal considerations

1. Your first scraper: Requests + BeautifulSoup

The simplest web scraper in Python uses two libraries: requests to fetch pages and beautifulsoup4 to parse HTML. Install them with:

terminal
pip install requests beautifulsoup4

Here is a complete working scraper that extracts book titles and prices from a practice website:

scraper.py
import requests
from bs4 import BeautifulSoup

# Fetch the page
resp = requests.get(
    "https://books.toscrape.com/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}
)

# Parse the HTML
soup = BeautifulSoup(resp.content, "html.parser")

# Extract all book titles and prices
books = []
for article in soup.select("article.product_pod"):
    title = article.select_one("h3 a")["title"]
    price = article.select_one(".price_color").text.strip()
    books.append({"title": title, "price": price})

for book in books[:5]:
    print(f"{book['title']}: {book['price']}")

This approach works great for static HTML pages — sites where the content is in the initial HTML response. But many modern websites render content with JavaScript after the page loads, which means requests.get() returns an empty shell.
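A quick way to tell which case you are dealing with is to check whether the content you expect actually appears in the raw response. A minimal heuristic sketch — the marker string is whatever class name or text you expect to see in the rendered page:

```python
def looks_js_rendered(html: str, marker: str) -> bool:
    """Heuristic: if a marker we expect in the content is absent
    from the raw HTML, the page is probably rendered client-side."""
    return marker not in html

# A bare SPA shell: the root div is empty in the raw response
shell = '<html><body><div id="root"></div></body></html>'

# A static page: the content is already in the HTML
static = '<html><body><div class="product-card">Widget</div></body></html>'

print(looks_js_rendered(shell, "product-card"))   # True
print(looks_js_rendered(static, "product-card"))  # False
```

If this returns True for a page you fetched with requests, skip ahead to section 3.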

2. Setting proper headers

The single biggest mistake beginners make: sending requests with Python's default user-agent. Servers see python-requests/2.31.0 and immediately block you. Always set realistic headers:

headers.py
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
}

resp = requests.get(
    "https://example.com/products",
    headers=headers,
    timeout=10
)
print(resp.status_code)

Pro tip

Rotate user-agent strings from a list of 10-20 real browser user-agents. Update them quarterly — outdated Chrome versions are a dead giveaway. Also set the Referer header to make requests look like they came from Google.
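The rotation described above can be sketched as follows. The user-agent strings here are illustrative examples, not a maintained list — refresh them from real browsers periodically:

```python
import random

# Small pool of real desktop browser user-agents (examples only)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def random_headers() -> dict:
    """Build a fresh header set with a randomly chosen user-agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",
    }

# Usage: requests.get(url, headers=random_headers(), timeout=10)
```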

3. Handling JavaScript-rendered pages

React, Vue, Angular, and other SPA frameworks render content client-side. When you fetch these pages with requests.get(), you get an empty <div id="root"></div> instead of actual content. You need a browser to execute the JavaScript.

Option A: Playwright (local browser)

Playwright runs a real Chromium, Firefox, or WebKit browser and gives you full DOM access after JavaScript execution:

terminal
pip install playwright && playwright install
playwright_scraper.py
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and wait for content
    page.goto("https://example.com/spa-page")
    page.wait_for_selector(".product-card", timeout=10000)

    # Extract data from the rendered DOM
    products = page.evaluate("""
        () => Array.from(
            document.querySelectorAll('.product-card')
        ).map(el => ({
            name: el.querySelector('.name')?.innerText,
            price: el.querySelector('.price')?.innerText,
        }))
    """)

    for product in products:
        print(product)

    browser.close()

Playwright pain points

  • Each browser instance uses 200-400 MB of RAM — scaling is expensive
  • Browser binaries need to be installed on every machine (CI, servers)
  • Anti-bot systems detect headless Chromium fingerprints
  • You are now managing browser infrastructure instead of building your product

Option B: Find the API (the shortcut)

Before reaching for a headless browser, check if the site loads data from an API endpoint. Open DevTools → Network tab → filter by XHR/Fetch. Many SPAs fetch JSON from internal APIs that you can call directly with requests.get(). This is faster, cheaper, and more reliable than browser rendering.
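As a sketch, assuming a hypothetical /api/products endpoint discovered in the Network tab — the real path, query parameters, and JSON key names will differ per site and must be taken from what DevTools shows you:

```python
import requests

def extract_items(payload: dict) -> list:
    """Pull the record list out of an API payload. The 'items' key
    is an assumption -- inspect the real JSON in DevTools."""
    return payload.get("items", [])

def fetch_products(page: int = 1) -> list:
    """Call the site's own JSON endpoint instead of scraping HTML."""
    resp = requests.get(
        "https://example.com/api/products",  # hypothetical endpoint
        params={"page": page, "per_page": 50},
        headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"},
        timeout=10,
    )
    resp.raise_for_status()
    return extract_items(resp.json())
```

Because the endpoint returns structured JSON, there is no HTML parsing step at all — the main fragility left is the endpoint changing without notice.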

4. Rate limiting and retries

Scraping without rate limiting is the fastest way to get your IP banned. Always add delays between requests and handle HTTP 429 (Too Many Requests) responses gracefully:

rate_limit.py
import requests
import time
import random
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
    # ... more URLs
]

results = []
for url in urls:
    try:
        resp = requests.get(url, headers=headers, timeout=10)

        if resp.status_code == 429:
            # Rate limited — back off and retry
            print(f"Rate limited on {url}, waiting...")
            time.sleep(30)
            resp = requests.get(url, headers=headers, timeout=10)

        if resp.status_code == 200:
            soup = BeautifulSoup(resp.content, "html.parser")
            data = {"url": url}  # ... extract fields from soup here
            results.append(data)

    except requests.RequestException as e:
        print(f"Error on {url}: {e}")

    # Random delay between requests (1-3 seconds)
    time.sleep(random.uniform(1, 3))

print(f"Scraped {len(results)} pages")

For production scrapers, use exponential backoff: if you get rate-limited, wait 30 seconds, then 60, then 120. Also consider requests.Session() for connection reuse, which reduces overhead and looks more like a real browser.
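A minimal sketch of that backoff pattern, using a requests.Session and doubling the delay after each 429 (the jitter term keeps many scrapers from retrying in lockstep):

```python
import random
import time

import requests

def get_with_backoff(session, url, max_retries=4, base_delay=30.0):
    """GET a URL through a session, doubling the wait after each
    HTTP 429 response: 30s, then 60s, then 120s, plus small jitter."""
    delay = base_delay
    resp = None
    for _ in range(max_retries):
        resp = session.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        time.sleep(delay + random.uniform(0, delay * 0.1))
        delay *= 2
    return resp  # still rate-limited after all retries

# A Session reuses TCP connections and keeps cookies across requests
session = requests.Session()
session.headers.update(
    {"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}
)
# resp = get_with_backoff(session, "https://example.com/page/1")
```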

5. Storing scraped data

Python offers multiple storage options depending on your scale and use case:

storage.py
import csv
import json

# Save to CSV
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    for product in products:
        writer.writerow(product)

# Save to JSON
with open("products.json", "w") as f:
    json.dump(products, f, indent=2)

# Save to SQLite
import sqlite3
conn = sqlite3.connect("products.db")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS products (
        name TEXT, price TEXT, url TEXT
    )
""")
for product in products:
    cursor.execute(
        "INSERT INTO products VALUES (?, ?, ?)",
        (product["name"], product["price"], product["url"])
    )
conn.commit()
conn.close()
1. CSV / JSON: best for small datasets (< 10K rows). Easy to share, open in Excel, or import into other tools.

2. SQLite: best for medium datasets. Built into Python, no server needed. Supports SQL queries, deduplication, and indexing.

3. PostgreSQL: best for production pipelines. Full ACID compliance, concurrent writes, and integration with analytics tools.
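The deduplication SQLite supports can be done declaratively: put a UNIQUE constraint on the URL column and use INSERT OR IGNORE, so re-scraped pages are skipped automatically. A small self-contained sketch, using an in-memory database for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use "products.db" on disk
cur = conn.cursor()

# UNIQUE(url) lets SQLite deduplicate re-scraped pages for us
cur.execute("""
    CREATE TABLE IF NOT EXISTS products (
        name TEXT, price TEXT, url TEXT UNIQUE
    )
""")

rows = [
    ("Widget", "$9.99", "https://example.com/p/1"),
    ("Widget", "$9.99", "https://example.com/p/1"),  # duplicate URL
    ("Gadget", "$19.99", "https://example.com/p/2"),
]
# INSERT OR IGNORE silently skips rows hitting the UNIQUE constraint
cur.executemany("INSERT OR IGNORE INTO products VALUES (?, ?, ?)", rows)
conn.commit()

count = cur.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)  # 2 -- the duplicate URL was ignored
conn.close()
```

This makes re-running a scraper idempotent: you can crawl the same URL list repeatedly without accumulating duplicate rows.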

6. The easier way: SnapRender API

For JavaScript-rendered pages and sites with anti-bot protection, SnapRender eliminates the need for local browsers entirely. Send a URL, get back rendered HTML, markdown, or structured data. No Playwright, no browser binaries, no proxy management.

Render as markdown

Get any page as clean, LLM-ready markdown. JavaScript is fully executed before content is captured.

render.py
import requests

# Render any page as clean markdown (handles JS)
resp = requests.post(
    "https://api.snaprender.dev/v1/render",
    headers={"x-api-key": "sr_live_YOUR_KEY"},
    json={
        "url": "https://example.com/spa-page",
        "format": "markdown"
    }
)
print(resp.json()["data"]["markdown"])

Extract structured data

Use CSS selectors to pull exactly the fields you need. Returns clean JSON.

extract.py
import requests

# Extract structured data with CSS selectors
resp = requests.post(
    "https://api.snaprender.dev/v1/extract",
    headers={"x-api-key": "sr_live_YOUR_KEY"},
    json={
        "url": "https://example.com/products/widget-pro",
        "selectors": {
            "name": "h1.product-title",
            "price": ".price-current",
            "rating": ".star-rating",
            "description": ".product-description p",
            "in_stock": ".availability-status"
        }
    }
)
print(resp.json())

Bypass anti-bot protection

For sites behind Cloudflare, DataDome, or similar protections, add the use_flaresolverr flag. SnapRender routes the request through a real browser session that passes challenge pages automatically.

bypass.py
import requests

# Bypass Cloudflare / anti-bot protection
resp = requests.post(
    "https://api.snaprender.dev/v1/render",
    headers={"x-api-key": "sr_live_YOUR_KEY"},
    json={
        "url": "https://protected-site.com/data",
        "format": "markdown",
        "use_flaresolverr": True
    }
)
# Returns the fully rendered page content
# even behind Cloudflare, DataDome, etc.
print(resp.json()["data"]["markdown"])

Example response

response.json
{
  "status": "success",
  "data": {
    "name": "Widget Pro 3000",
    "price": "$49.99",
    "rating": "4.8 out of 5",
    "description": "The most advanced widget on the market...",
    "in_stock": "In Stock"
  },
  "url": "https://example.com/products/widget-pro",
  "elapsed_ms": 1840
}
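Before trusting fields in a response of this shape, it is worth validating the envelope. A small helper, assuming only the status/data layout shown in the example above:

```python
def parse_response(payload: dict) -> dict:
    """Validate a response payload of the shape shown above
    before trusting its fields."""
    if payload.get("status") != "success":
        raise ValueError(f"render failed: {payload.get('status')!r}")
    data = payload.get("data")
    if not isinstance(data, dict):
        raise ValueError("missing 'data' object in response")
    return data

sample = {
    "status": "success",
    "data": {"name": "Widget Pro 3000", "price": "$49.99"},
    "elapsed_ms": 1840,
}
print(parse_response(sample)["name"])  # Widget Pro 3000
```

Failing fast on a bad envelope keeps downstream code from silently storing partial records.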

Comparison: when to use what

Approach          Best for                   Limitation
Requests + BS4    Static HTML pages          No JS rendering
Playwright        JS pages, local control    High RAM, maintenance
API discovery     Sites with internal APIs   Not always available
SnapRender        JS + anti-bot + scale      API cost at high volume


Legal considerations

Web scraping is legal in many contexts, but boundaries exist:

  1. The hiQ v. LinkedIn (2022) ruling confirmed that scraping publicly available data does not violate the CFAA. This is a strong precedent in the US but does not apply globally.
  2. Always check robots.txt. While not legally binding everywhere, ignoring it weakens your legal position if challenged.
  3. Never scrape personal data for advertising, profiling, or resale. GDPR (EU), CCPA (California), and similar laws impose heavy penalties.
  4. Respect Terms of Service. While ToS violations are civil (not criminal) matters, they can lead to lawsuits and account bans.
  5. Rate-limit your requests. Scraping that degrades a site's performance could constitute a denial-of-service attack, which is criminal.
  6. When in doubt, consult a lawyer familiar with web scraping case law in your jurisdiction.
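The robots.txt check from point 2 can be automated with Python's built-in urllib.robotparser. A minimal sketch, parsing an inline example file — in practice you would point set_url() at the site's live robots.txt and call read():

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body; normally fetched via set_url() + read()
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
rp.modified()  # record that the rules have been loaded

print(rp.can_fetch("MyBot", "https://example.com/products"))   # True
print(rp.can_fetch("MyBot", "https://example.com/private/x"))  # False
print(rp.crawl_delay("MyBot"))                                 # 2
```

Honoring crawl_delay() in your request loop also covers point 5: the site itself is telling you how fast it is willing to be crawled.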

Start free — 100 requests/month

Get your API key in 30 seconds. Render JavaScript pages, extract structured data, and bypass anti-bot protection — all from your Python script.


Frequently asked questions

Is web scraping with Python legal?

The legality of web scraping depends on what you scrape, not how. Python is just the tool. Scraping publicly available data is generally legal in the US after the hiQ v. LinkedIn ruling (2022). However, always check the target site's Terms of Service and robots.txt. Never scrape personal data, copyrighted content for redistribution, or data behind login walls without permission.

What is the best Python library for web scraping?

For static HTML pages, BeautifulSoup with Requests is the gold standard — simple, fast, and well-documented. For JavaScript-rendered pages, you need a headless browser like Playwright or Selenium. For production pipelines that need to handle both cases plus anti-bot bypass, SnapRender's API eliminates the browser management overhead entirely.

How do I scrape JavaScript-rendered pages?

Standard HTTP libraries (Requests, httpx) cannot execute JavaScript. You have three options: (1) use Playwright/Selenium to run a local headless browser, (2) find the underlying API endpoints that serve the data as JSON (check the Network tab in DevTools), or (3) use SnapRender's API, which handles rendering and returns the fully rendered page as HTML, markdown, or extracted data.

How do I avoid getting blocked while scraping?

Rate-limit your requests (1-3 seconds between calls), rotate user-agent strings, respect robots.txt, use residential proxies for large-scale scraping, and handle CAPTCHAs gracefully. For sites with aggressive anti-bot measures (Cloudflare, DataDome), SnapRender's use_flaresolverr flag handles the bypass automatically.

How should I store scraped data?

For small datasets, CSV (using the csv module) or JSON files work fine. For structured data at scale, use SQLite (built into Python) or PostgreSQL. For analytics, export to Pandas DataFrames. For production pipelines, consider a message queue (Redis, RabbitMQ) to decouple scraping from processing.