Guide

Web Scraping for SEO

14 min read

SEO professionals spend hundreds of dollars per month on tools like Ahrefs and SEMrush. Web scraping lets you build custom rank trackers, competitor analyzers, and content gap finders tailored to your exact needs -- at a fraction of the cost. This guide covers SERP monitoring, on-page analysis, and content gap detection.

What you will learn

1. SERP position tracking
2. Competitor on-page analysis
3. Content gap detection
4. Rank history and trends
5. Featured snippet monitoring
6. Meta tag extraction
7. Internal link analysis
8. Schema markup auditing

1. SERP position tracking

Track where your site ranks for target keywords, including featured snippets and People Also Ask boxes:

serp_tracker.py
import requests
from datetime import datetime

API_KEY = "sr_live_YOUR_KEY"

def track_serp_position(keyword, domain):
    """Track a keyword's SERP position for your domain"""
    query = keyword.replace(" ", "+")
    url = f"https://www.google.com/search?q={query}&num=50"

    resp = requests.post(
        "https://api.snaprender.dev/v1/extract",
        headers={
            "x-api-key": API_KEY,
            "Content-Type": "application/json"
        },
        json={
            "url": url,
            "selectors": {
                "titles": "#search .g h3",
                "urls": "#search .g a::attr(href)",
                "descriptions": "#search .g .VwiC3b",
                "featured_snippet": ".IZ6rdc",
                "people_also_ask": ".related-question-pair span"
            },
            "use_flaresolverr": True
        }
    )

    data = resp.json()["data"]
    urls = data.get("urls", [])
    titles = data.get("titles", [])

    # Find the first result whose URL contains our domain
    position = None
    for i, result_url in enumerate(urls):
        if domain in result_url:
            position = i + 1
            break

    return {
        "keyword": keyword,
        "position": position,
        "total_results": len(urls),
        "top_3": [
            {"title": titles[i], "url": urls[i]}
            for i in range(min(3, len(titles), len(urls)))
        ],
        "featured_snippet": data.get("featured_snippet", ""),
        "timestamp": datetime.now().isoformat()
    }

# Track multiple keywords
keywords = [
    "web scraping api",
    "headless browser api",
    "screenshot api",
    "html to pdf api",
]

results = []
for kw in keywords:
    result = track_serp_position(kw, "snaprender.dev")
    pos = result["position"] or "Not found"
    print(f"{kw}: Position {pos}")
    results.append(result)
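The `featured_snippet` field the tracker captures is also useful on its own: any keyword where a snippet exists but you don't hold position 1 is a snippet you could win. A minimal sketch, assuming the result shape returned by track_serp_position above (the sample data here is illustrative):

```python
# Sample results standing in for the tracker output above
sample_results = [
    {"keyword": "web scraping api", "position": 3, "featured_snippet": "Web scraping is..."},
    {"keyword": "screenshot api", "position": 1, "featured_snippet": ""},
    {"keyword": "html to pdf api", "position": 7, "featured_snippet": "Convert HTML..."},
]

def snippet_opportunities(results):
    """Keywords where a featured snippet exists but we don't hold position 1."""
    return [
        r["keyword"]
        for r in results
        if r.get("featured_snippet") and r.get("position") != 1
    ]

print(snippet_opportunities(sample_results))
# → ['web scraping api', 'html to pdf api']
```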

2. Competitor on-page analysis

Analyze competitor pages for SEO elements: title tags, headings, word count, internal links, and schema markup:

competitor_analysis.py
import time

def analyze_competitor_page(url):
    """Scrape on-page SEO elements from a competitor"""

    resp = requests.post(
        "https://api.snaprender.dev/v1/extract",
        headers={
            "x-api-key": API_KEY,
            "Content-Type": "application/json"
        },
        json={
            "url": url,
            "selectors": {
                "title": "title",
                "meta_description": "meta[name='description']::attr(content)",
                "h1": "h1",
                "h2s": "h2",
                "h3s": "h3",
                "internal_links": "a[href^='/']::attr(href)",
                "external_links": "a[href^='http']::attr(href)",
                "images": "img::attr(alt)",
                "schema": "script[type='application/ld+json']",
                "canonical": "link[rel='canonical']::attr(href)",
                "word_count": "body"
            },
            "use_flaresolverr": True
        }
    )

    data = resp.json()["data"]

    # Estimate word count from body text
    body_text = data.get("word_count", "")
    words = len(body_text.split()) if body_text else 0

    return {
        "url": url,
        "title": data.get("title", ""),
        "meta_description": data.get("meta_description", ""),
        "h1": data.get("h1", ""),
        "h2_count": len(data.get("h2s", [])),
        "h3_count": len(data.get("h3s", [])),
        "internal_links": len(data.get("internal_links", [])),
        "external_links": len(data.get("external_links", [])),
        "images_with_alt": len([a for a in data.get("images", []) if a]),
        "has_schema": bool(data.get("schema")),
        "word_count": words,
    }

# Analyze the top-ranking pages for a keyword
competitors = [
    "https://competitor1.com/web-scraping-guide",
    "https://competitor2.com/scraping-tutorial",
    "https://competitor3.com/web-scraping-api",
]

analyses = []
for url in competitors:
    analysis = analyze_competitor_page(url)
    analyses.append(analysis)
    print(f"{url}")
    print(f"  Title: {analysis['title'][:60]}...")
    print(f"  Words: {analysis['word_count']}, H2s: {analysis['h2_count']}")
    time.sleep(2)  # be polite between requests
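The `schema` selector above returns raw JSON-LD script contents, and the analysis only checks whether any exist. For an actual schema audit you can parse those blobs and list the schema.org types each page declares. A minimal sketch; the sample strings stand in for the scraped script contents:

```python
import json

# Sample JSON-LD blobs standing in for data.get("schema", [])
raw_schema = [
    '{"@context": "https://schema.org", "@type": "Article", "headline": "Guide"}',
    '{"@context": "https://schema.org", "@type": "FAQPage"}',
    'not valid json',  # scraped script tags can be malformed
]

def schema_types(raw_scripts):
    """Parse JSON-LD blobs and collect their @type values, skipping bad ones."""
    types = []
    for blob in raw_scripts:
        try:
            doc = json.loads(blob)
        except json.JSONDecodeError:
            continue
        # JSON-LD can be a single object or a list of objects
        for node in doc if isinstance(doc, list) else [doc]:
            t = node.get("@type")
            if t:
                types.append(t)
    return types

print(schema_types(raw_schema))
# → ['Article', 'FAQPage']
```

Comparing these type lists across competitors shows which structured data (FAQPage, HowTo, Product) the top-ranking pages use that yours lacks.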

3. Content gap analysis

Find keywords where competitors rank but you don't -- the biggest opportunities for new content:

content_gaps.py
import pandas as pd

def find_content_gaps(your_domain, competitor_domains, keywords):
    """Find keywords where competitors rank but you don't"""
    gaps = []

    for keyword in keywords:
        result = track_serp_position(keyword, your_domain)
        your_pos = result["position"]

        for comp in competitor_domains:
            comp_result = track_serp_position(keyword, comp)
            comp_pos = comp_result["position"]

            if comp_pos and (not your_pos or comp_pos < your_pos):
                gaps.append({
                    "keyword": keyword,
                    "your_position": your_pos or "Not ranking",
                    "competitor": comp,
                    "competitor_position": comp_pos,
                    "opportunity": (your_pos or 100) - comp_pos
                })

    df = pd.DataFrame(gaps)
    df = df.sort_values("opportunity", ascending=False)

    print("=== Content Gap Analysis ===")
    print(f"Keywords analyzed: {len(keywords)}")
    print(f"Gaps found:        {len(gaps)}")
    print("\n=== Top Opportunities ===")
    print(df.head(20).to_string(index=False))

    df.to_csv("content_gaps.csv", index=False)
    return df

# Run content gap analysis
gaps = find_content_gaps(
    "snaprender.dev",
    ["scraperapi.com", "scrapingbee.com"],
    keywords
)

Pro tip

Focus on content gaps where competitors rank in positions 4-10. These keywords have proven search intent but the existing content isn't dominant -- making them easier to win with high-quality content.
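The 4-10 heuristic is a one-line filter on the DataFrame that find_content_gaps returns. A minimal sketch; the sample rows below stand in for real gap data:

```python
import pandas as pd

# Sample rows standing in for the gaps DataFrame from find_content_gaps
gaps = pd.DataFrame([
    {"keyword": "web scraping api", "competitor_position": 2, "opportunity": 98},
    {"keyword": "headless browser api", "competitor_position": 6, "opportunity": 94},
    {"keyword": "screenshot api", "competitor_position": 9, "opportunity": 91},
])

# Keep only gaps where the competitor sits in positions 4-10 (inclusive)
winnable = gaps[gaps["competitor_position"].between(4, 10)]
print(winnable["keyword"].tolist())
# → ['headless browser api', 'screenshot api']
```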

4. Rank history and trends

Store ranking data over time and generate trend reports:

rank_history.py
import sqlite3
from datetime import datetime

import pandas as pd

# Store rank history in SQLite
conn = sqlite3.connect("seo_tracking.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS rankings (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        date TEXT,
        keyword TEXT,
        position INTEGER,
        url TEXT,
        featured_snippet INTEGER
    )
""")

def save_rankings(results):
    """Save SERP tracking results to the database"""
    date = datetime.now().strftime("%Y-%m-%d")
    for r in results:
        conn.execute(
            "INSERT INTO rankings (date, keyword, position, url, featured_snippet) VALUES (?, ?, ?, ?, ?)",
            (date, r["keyword"], r["position"],
             r["top_3"][0]["url"] if r["top_3"] else "",
             1 if r["featured_snippet"] else 0)
        )
    conn.commit()

def rank_trends_report():
    """Compare each keyword's two most recent recorded positions"""
    df = pd.read_sql_query("SELECT * FROM rankings", conn)

    print("=== Rank Trends (Last 4 Weeks) ===")
    for kw in df["keyword"].unique():
        kw_data = df[df["keyword"] == kw].sort_values("date")
        if len(kw_data) >= 2:
            latest = kw_data.iloc[-1]["position"]
            prev = kw_data.iloc[-2]["position"]
            if latest and prev:
                change = prev - latest  # positive = improved
                arrow = "UP" if change > 0 else "DOWN" if change < 0 else "="
                print(f"  {kw}: #{latest} ({arrow} {abs(change)})")
            else:
                print(f"  {kw}: #{latest or 'N/A'}")

save_rankings(results)
rank_trends_report()
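Once history accumulates, a pivot into a date-by-keyword grid makes longer trends easy to read and chart. A minimal sketch, assuming rows shaped like the rankings table above; the dates and positions are illustrative:

```python
import pandas as pd

# Sample rows standing in for pd.read_sql_query("SELECT * FROM rankings", conn)
rows = pd.DataFrame([
    {"date": "2024-05-06", "keyword": "web scraping api", "position": 8},
    {"date": "2024-05-13", "keyword": "web scraping api", "position": 5},
    {"date": "2024-05-06", "keyword": "screenshot api", "position": 12},
    {"date": "2024-05-13", "keyword": "screenshot api", "position": 11},
])

# One row per date, one column per keyword
grid = rows.pivot(index="date", columns="keyword", values="position")

# Net movement from the first snapshot to the latest (positive = improved)
change = grid.iloc[0] - grid.iloc[-1]
print(change.to_dict())
```

The same `grid` feeds straight into `grid.plot()` for a rank-over-time chart.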

Build custom SEO tools with SnapRender

SnapRender renders JavaScript-heavy pages, bypasses bot detection, and extracts structured data. Build rank trackers and competitor analyzers with a single API.

Get Your API Key — Free

Frequently asked questions

Is it legal to scrape data for SEO?

Scraping publicly available search engine results pages and competitor websites for SEO analysis is a common industry practice. Major SEO tools like Ahrefs and SEMrush do exactly this at massive scale. Use scraped data for internal analysis and strategy, not to republish competitor content.

What SEO data can you scrape?

SERP rankings and featured snippets, competitor title tags and meta descriptions, backlink profiles from link databases, content structure (headings, word count, internal links), technical SEO elements (schema markup, page speed signals), and review/rating data for local SEO.

How often should you scrape?

For rank tracking, weekly scraping provides reliable trend data. Daily tracking is useful for volatile keywords or during major algorithm updates. For content gap analysis, monthly scraping is sufficient since content strategies change slowly.
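A weekly cadence can be as simple as a crontab entry; the paths and filenames below are illustrative:

```shell
# Run the SERP tracker every Monday at 06:00, appending output to a log
0 6 * * 1 cd /home/you/seo && /usr/bin/python3 serp_tracker.py >> tracker.log 2>&1
```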

Does scraping Google violate its terms of service?

Google's ToS prohibits automated scraping, but SERP analysis is standard practice in the SEO industry. Use a rendering API like SnapRender to handle Google's bot detection, and keep request volumes reasonable. Many businesses scrape SERPs daily for rank tracking.

Should you build scrapers or pay for SEO tools?

Paid tools (Ahrefs, SEMrush) offer polished dashboards and historical data but cost $100-400/month. Scraping gives you raw, customizable data at a fraction of the cost. Many SEO professionals combine both: paid tools for broad analysis and custom scrapers for specific competitive intelligence.