Guide

How to Avoid Getting Blocked While Scraping

14 min read

Every scraper hits the same wall: blocks, CAPTCHAs, and empty responses. This guide covers the five layers of anti-detection — from basic headers to advanced fingerprint management — with working code for each. Or skip all of it with a managed API.

Why sites block scrapers

Websites block scrapers for legitimate reasons: protecting server resources, preventing competitive data theft, and stopping abuse. Understanding their detection methods helps you scrape responsibly.

1. Rate-based detection

Too many requests from one IP in a time window. The most basic and common block.

2. Header analysis

Missing or inconsistent HTTP headers (User-Agent, Accept, Referer). Default library headers scream "bot."

3. Browser fingerprinting

JavaScript checks for headless browser signals: navigator.webdriver, missing plugins, Chrome DevTools protocol.

4. Behavioral analysis

Real users scroll, pause, and click at irregular intervals. Bots request pages at perfect intervals with no interaction.

Layer 1: Use realistic headers

The default User-Agent for Python's requests library looks like python-requests/2.31.0 (the number tracks your installed version). Nearly every anti-bot system flags this immediately. Set realistic headers that match a real browser.

headers.py
import requests
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

resp = requests.get("https://example.com", headers=headers)
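One refinement worth noting (our suggestion, not part of the original snippet): a real browser keeps the same User-Agent for an entire visit, so pick one per session rather than per request. A minimal sketch:

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

def session_headers(rng=random):
    """Build one consistent header set to reuse for a whole scraping session."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
    }

headers = session_headers()
# Reuse `headers` (e.g. on a requests.Session) for every request in the run.
```

Rotating the User-Agent on every single request is itself a bot signal when the IP stays the same; consistency within a session, variety across sessions, is closer to real traffic.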

Layer 2: Rate limit your requests

The single most effective anti-blocking technique: slow down. Add random delays between requests to mimic human browsing patterns. Use exponential backoff when you hit 429 (Too Many Requests) responses.

rate_limit.py
import time
import random
import requests

def polite_scrape(urls, min_delay=1.0, max_delay=3.0):
    results = []
    for url in urls:
        resp = requests.get(url, headers=headers)  # headers from Layer 1
        results.append(resp.text)

        # Random delay between requests
        delay = random.uniform(min_delay, max_delay)
        time.sleep(delay)

    return results

# For large batches: exponential backoff on errors
def scrape_with_backoff(url, max_retries=3):
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers)
        if resp.status_code == 200:
            return resp.text
        if resp.status_code == 429:  # Too Many Requests
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s...")
            time.sleep(wait)
    return None  # All retries exhausted
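To make the retry math concrete: the wait before retry attempt n is 2**n seconds plus up to a second of random jitter, so three attempts wait roughly 1s, 2s, then 4s. A tiny helper (illustrative, not part of the original) that exposes the base schedule:

```python
def backoff_schedule(max_retries, base=2):
    """Base wait in seconds (before jitter) for each retry attempt."""
    return [base ** attempt for attempt in range(max_retries)]

print(backoff_schedule(3))  # [1, 2, 4]
print(backoff_schedule(5))  # [1, 2, 4, 8, 16]
```

The jitter matters: without it, a fleet of scrapers that got rate-limited together would all retry at the same instant and get limited again.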

Layer 3: Rotate IP addresses

Even with perfect headers and rate limiting, a single IP sending hundreds of requests daily will eventually get flagged. Proxy rotation distributes requests across many IPs. See our complete proxy guide for types, costs, and setup code.

Cost reality check

  • Datacenter proxies: $50-100/mo for a pool of 100 IPs
  • Residential proxies: $100-300/mo for 10-20 GB of traffic
  • Proxy management code: 2-4 hours of development + ongoing maintenance
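The proxy guide covers setup in depth; as a taste, the rotation itself is just cycling through a pool. A sketch with placeholder proxy URLs (substitute your provider's real endpoints):

```python
import itertools

# Placeholder endpoints -- replace with your proxy provider's URLs
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return a requests-style proxies dict using the next IP in the pool."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

# Usage: requests.get(url, headers=headers, proxies=next_proxies())
```

Round-robin is the simplest policy; production rotators also track per-proxy failure counts and drop IPs that start returning blocks.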

Layer 4: Manage browser fingerprints

Advanced anti-bot systems (Cloudflare, DataDome, PerimeterX) check JavaScript-level browser properties. You need to override headless browser detection signals and present a consistent fingerprint that matches a real browser.

stealth.py
from playwright.sync_api import sync_playwright

def stealth_scrape(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = context.new_page()

        # Override navigator.webdriver detection
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """)

        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content

This handles basic detection. But advanced systems check Canvas fingerprints, WebGL renderer strings, audio context properties, and dozens more signals. Staying ahead of these checks is a full-time maintenance burden.

Layer 5: Or just use SnapRender

All four layers above — headers, rate limiting, IP rotation, and fingerprint management — are built into SnapRender. One API call replaces hundreds of lines of anti-detection code and $50-300/mo in proxy costs.

scrape.py
import requests

# All anti-detection is handled for you
resp = requests.post(
    "https://api.snaprender.dev/v1/render",
    headers={"x-api-key": "sr_live_YOUR_KEY"},
    json={
        "url": "https://protected-site.com/page",
        "format": "markdown",
        "use_flaresolverr": True
    }
)

# Clean markdown output, no blocks, no CAPTCHAs
print(resp.json()["data"]["markdown"])

No proxy costs

IP management is handled. No monthly proxy subscription.

No fingerprint code

Real Chromium sessions pass all browser checks automatically.

Cloudflare bypass

FlareSolverr integration handles Cloudflare challenges at no extra cost.

Stop fighting blocks

100 free requests/month. All anti-detection is built in. URL in, content out. No headers to configure, no proxies to manage, no fingerprints to spoof.

Get Your API Key

Frequently asked questions

What is the most common reason scrapers get blocked?

Sending too many requests from the same IP address in a short time window. Sites detect the unnatural pattern and block the IP. Rate limiting your requests to 1-2 per second per IP is the single most effective thing you can do.

Can websites detect headless browsers?

Yes. Sites check for headless browser fingerprints: navigator.webdriver being true, missing plugins, specific Chrome DevTools protocol signals, and viewport/screen size mismatches. Modern anti-bot services like Cloudflare and DataDome are very good at this detection.

Is it legal to scrape sites that try to block you?

The legality depends on your jurisdiction and the specific circumstances. In the US, the hiQ v. LinkedIn ruling (2022) established that scraping publicly available data is generally not a CFAA violation. However, circumventing technical access controls may raise DMCA or contractual issues. Always scrape responsibly and consult legal counsel for your specific use case.

How does SnapRender avoid getting blocked?

SnapRender uses real Chromium browser sessions (not basic HTTP requests), manages a pool of clean IPs, implements proper request timing, and includes FlareSolverr integration for Cloudflare-protected sites. All the anti-detection techniques described in this article are built into the service.