Why sites block scrapers
Websites block scrapers for legitimate reasons: protecting server resources, preventing competitive data theft, and stopping abuse. Understanding their detection methods helps you scrape responsibly.
Rate-based detection
Too many requests from one IP in a time window. The most basic and common block.
Header analysis
Missing or inconsistent HTTP headers (User-Agent, Accept, Referer). Default library headers scream "bot."
Browser fingerprinting
JavaScript checks for headless browser signals: navigator.webdriver, missing plugins, Chrome DevTools protocol.
Behavioral analysis
Real users scroll, pause, and click randomly. Bots request pages at perfect intervals with no interaction.
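To make rate-based detection concrete, here is a minimal sketch of the sliding-window counter a server might run per IP. The 60-second window and 100-request threshold are illustrative values, not any specific vendor's defaults:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # how far back the server looks (illustrative)
MAX_REQUESTS = 100    # allowed hits per window per IP (illustrative)

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def is_rate_limited(ip, now=None):
    """Record one request from `ip` and report whether it exceeds the window."""
    now = time.monotonic() if now is None else now
    window = _hits[ip]
    # Evict timestamps that have aged out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    return len(window) > MAX_REQUESTS
```

Everything in the rest of this article is about staying under thresholds like this one, or spreading your traffic so no single counter trips.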
Layer 1: Use realistic headers
The default User-Agent for Python's requests library is python-requests/2.31.0 (or whatever version you have installed). Most anti-bot systems flag that signature immediately. Set realistic headers that match a real browser.
#E8A0BF">import requests
#E8A0BF">import random
USER_AGENTS = [
#A8D4A0">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
#A8D4A0">"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
#A8D4A0">"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
#A8D4A0">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
headers = {
#A8D4A0">"User-Agent": random.#87CEEB">choice(USER_AGENTS),
#A8D4A0">"Accept": #A8D4A0">"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
#A8D4A0">"Accept-Language": #A8D4A0">"en-US,en;q=0.9",
#A8D4A0">"Accept-Encoding": #A8D4A0">"gzip, deflate, br",
#A8D4A0">"DNT": #A8D4A0">"1",
#A8D4A0">"Connection": #A8D4A0">"keep-alive",
#A8D4A0">"Upgrade-Insecure-Requests": #A8D4A0">"1",
}
resp = requests.#87CEEB">get(#A8D4A0">"https://example.com", headers=headers)Layer 2: Rate limit your requests
The single most effective anti-blocking technique: slow down. Add random delays between requests to mimic human browsing patterns. Use exponential backoff when you hit 429 (Too Many Requests) responses.
#E8A0BF">import time
#E8A0BF">import random
#E8A0BF">def polite_scrape(urls, min_delay=1.0, max_delay=3.0):
results = []
#E8A0BF">for url in urls:
resp = requests.#87CEEB">get(url, headers=headers)
results.append(resp.#87CEEB">text)
# Random delay between requests
delay = random.#87CEEB">uniform(min_delay, max_delay)
time.#87CEEB">sleep(delay)
#E8A0BF">return results
# For large batches: exponential backoff on errors
#E8A0BF">def scrape_with_backoff(url, max_retries=3):
#E8A0BF">for attempt in range(max_retries):
resp = requests.#87CEEB">get(url, headers=headers)
#E8A0BF">if resp.status_code == 200:
#E8A0BF">return resp.#87CEEB">text
#E8A0BF">if resp.status_code == 429: # Too Many Requests
wait = (2 ** attempt) + random.#87CEEB">uniform(0, 1)
#E8A0BF">print(f#A8D4A0">"Rate limited. Waiting {wait:.1f}s...")
time.#87CEEB">sleep(wait)
#E8A0BF">return #E8A0BF">NoneLayer 3: Rotate IP addresses
Even with perfect headers and rate limiting, a single IP sending hundreds of requests daily will eventually get flagged. Proxy rotation distributes requests across many IPs. See our complete proxy guide for types, costs, and setup code.
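Rotation itself is only a few lines once you have a pool, using the proxies mapping that requests accepts. The pool entries below are hypothetical placeholders; substitute your provider's endpoints:

```python
import random

# Hypothetical proxy pool -- replace with your provider's endpoints
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def choose_proxy(pool):
    # Pick a random proxy and build the mapping requests expects,
    # routing both plain-HTTP and HTTPS traffic through it
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}

# Usage with the headers from Layer 1:
#   requests.get(url, headers=headers, proxies=choose_proxy(PROXY_POOL), timeout=15)
```

The hard part is not selection but upkeep: detecting dead or banned proxies and pruning them from the pool is where the ongoing maintenance cost in the list below comes from.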
Cost reality check
- Datacenter proxies: $50-100/mo for a pool of 100 IPs
- Residential proxies: $100-300/mo for 10-20 GB of traffic
- Proxy management code: 2-4 hours of development plus ongoing maintenance
Layer 4: Manage browser fingerprints
Advanced anti-bot systems (Cloudflare, DataDome, PerimeterX) check JavaScript-level browser properties. You need to override headless browser detection signals and present a consistent fingerprint that matches a real browser.
#E8A0BF">from playwright.sync_api #E8A0BF">import sync_playwright
#E8A0BF">def stealth_scrape(url):
#E8A0BF">with sync_playwright() #E8A0BF">as p:
browser = p.chromium.#87CEEB">launch(headless=#E8A0BF">True)
context = browser.new_context(
viewport={#A8D4A0">"width": 1920, #A8D4A0">"height": 1080},
user_agent=#A8D4A0">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
#A8D4A0">"AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
locale=#A8D4A0">"en-US",
timezone_id=#A8D4A0">"America/New_York",
)
page = context.new_page()
# Override navigator.webdriver detection
page.add_init_script(#A8D4A0">""#A8D4A0">"
Object.defineProperty(navigator, #A8D4A0">'webdriver', {
get: () => undefined
});
"#A8D4A0">"")
page.#87CEEB">goto(url, wait_until=#A8D4A0">"networkidle")
content = page.#87CEEB">content()
browser.#87CEEB">close()
#E8A0BF">return contentThis handles basic detection. But advanced systems check Canvas fingerprints, WebGL renderer strings, audio context properties, and dozens more signals. Staying ahead of these checks is a full-time maintenance burden.
Layer 5: Or just use SnapRender
All four layers above — headers, rate limiting, IP rotation, and fingerprint management — are built into SnapRender. One API call replaces hundreds of lines of anti-detection code and $50-300/mo in proxy costs.
#E8A0BF">import requests
# All anti#FFB347">-detection is handled #E8A0BF">for you
resp = requests.#87CEEB">post(
#A8D4A0">"https://api.snaprender.dev/v1/render",
headers={#A8D4A0">"x-api-key": #A8D4A0">"sr_live_YOUR_KEY"},
json={
#A8D4A0">"url": #A8D4A0">"https://protected-site.com/page",
#A8D4A0">"format": #A8D4A0">"markdown",
#A8D4A0">"use_flaresolverr": #E8A0BF">True
}
)
# Clean markdown output, no blocks, no CAPTCHAs
#E8A0BF">print(resp.#87CEEB">json()[#A8D4A0">"data"][#A8D4A0">"markdown"])No proxy costs
IP management is handled. No monthly proxy subscription.
No fingerprint code
Real Chromium sessions pass all browser checks automatically.
Cloudflare bypass
FlareSolverr integration handles Cloudflare challenges at no extra cost.
Stop fighting blocks
100 free requests/month. All anti-detection is built in. URL in, content out. No headers to configure, no proxies to manage, no fingerprints to spoof.
Get Your API Key

Frequently asked questions
What's the most common reason scrapers get blocked?
Sending too many requests from the same IP address in a short time window. Sites detect the unnatural pattern and block the IP. Rate limiting your requests to 1-2 per second per IP is the single most effective thing you can do.
Can websites detect headless browsers?
Yes. Sites check for headless browser fingerprints: navigator.webdriver being true, missing plugins, specific Chrome DevTools protocol signals, and viewport/screen size mismatches. Modern anti-bot services like Cloudflare and DataDome are very good at this detection.
Is web scraping legal?
The legality depends on your jurisdiction and the specific circumstances. In the US, the hiQ v. LinkedIn ruling (2022) established that scraping publicly available data is generally not a CFAA violation. However, circumventing technical access controls may raise DMCA or contractual issues. Always scrape responsibly and consult legal counsel for your specific use case.
How does SnapRender avoid getting blocked?
SnapRender uses real Chromium browser sessions (not basic HTTP requests), manages a pool of clean IPs, implements proper request timing, and includes FlareSolverr integration for Cloudflare-protected sites. All the anti-detection techniques described in this article are built into the service.