Glossary

What is robots.txt?

robots.txt is a plain text file placed at the root of a website that tells web crawlers and bots which pages or sections they are allowed — or not allowed — to access. It follows the Robots Exclusion Protocol, first proposed in 1994 and standardized as RFC 9309 in 2022.

How robots.txt works

When a well-behaved crawler (like Googlebot) visits a website, it first checks https://example.com/robots.txt before requesting any other page. The file contains rules that specify which paths are allowed or disallowed for each bot.

Crawlers match their User-agent string against the rules in the file. If a rule set matches, the crawler follows the Allow and Disallow directives for that set. If no matching rule exists, the crawler assumes full access.

robots.txt is purely advisory. There is no technical enforcement — a bot can choose to ignore the file entirely. For actual access control, websites use authentication, IP blocking, or services like Cloudflare.
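The matching behavior described above can be sketched with Python's standard-library urllib.robotparser (the site and bot name are placeholders):

```python
from urllib import robotparser

# Rules equivalent to a minimal robots.txt for a placeholder site.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

# A path covered by a Disallow rule is blocked...
print(rp.can_fetch("MyBot", "https://example.com/admin/users"))  # False
# ...and a path with no matching rule is assumed fully accessible.
print(rp.can_fetch("MyBot", "https://example.com/blog/post"))    # True
```

Note that this only tells a bot what the site owner asked for; nothing stops a crawler that skips the check.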

robots.txt syntax and rules

User-agent: *

Applies to all crawlers. Use a specific bot name (e.g., Googlebot, Bingbot) to target one crawler.

Disallow: /admin/

Blocks crawlers from accessing any URL starting with /admin/. The path is case-sensitive.

Allow: /admin/public/

Overrides a Disallow for a more specific path. Useful for whitelisting subpaths within a blocked directory.

Crawl-delay: 10

Asks crawlers to wait 10 seconds between requests. Not all crawlers support this (Google ignores it; use Search Console instead).

Sitemap: https://example.com/sitemap.xml

Points crawlers to the XML sitemap for efficient page discovery. The value should be an absolute URL, and unlike the other directives it is not tied to a User-agent group.

Example robots.txt

# Allow all crawlers, block admin and API
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/

# Slow down aggressive bots
User-agent: AhrefsBot
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
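A file like this can be exercised with Python's standard-library parser. One caveat: urllib.robotparser applies rules in file order (first match wins), whereas Google uses the most specific matching rule, so the Allow line is placed before the broader Disallow in this sketch:

```python
from urllib import robotparser

# Same rules as the example above, with Allow first so the
# stdlib parser's first-match-wins ordering gives the intended result.
rules = """\
User-agent: *
Allow: /api/public/
Disallow: /admin/
Disallow: /api/

User-agent: AhrefsBot
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/api/public/docs"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/api/private"))      # False
print(rp.crawl_delay("AhrefsBot"))   # 10
print(rp.site_maps())                # ['https://example.com/sitemap.xml']
```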

Major crawlers and their user-agents

Googlebot

Google's primary web crawler. Respects robots.txt, supports Allow/Disallow but ignores Crawl-delay. Configure crawl rate in Google Search Console.

Bingbot

Microsoft Bing's crawler. Respects robots.txt including Crawl-delay. Crawl rate also configurable in Bing Webmaster Tools.

GPTBot

OpenAI's crawler for training data. Many sites now explicitly block GPTBot in robots.txt to prevent AI training on their content.

CCBot

Common Crawl's bot that builds open web archives. Frequently blocked by sites that want to limit data use for AI training datasets.

AhrefsBot / SemrushBot

SEO tool crawlers that index backlinks and site structure. Often rate-limited with Crawl-delay to reduce server load.

Custom scrapers

Most custom web scrapers do not identify themselves by user-agent and may not check robots.txt at all. This is why robots.txt is not a security measure.

robots.txt and ethical scraping

Respecting robots.txt is the baseline for ethical web scraping. While the file has no legal force on its own, courts have considered robots.txt compliance when evaluating scraping disputes. The 2022 hiQ v. LinkedIn ruling noted that scraping publicly accessible data is generally permissible, but deliberately circumventing stated restrictions weakens your legal position.

Best practices for ethical scraping: check robots.txt before crawling, honor Crawl-delay directives, identify your bot with a descriptive User-agent string, avoid overwhelming servers with too many requests, and do not scrape behind authentication or paywalls without permission.
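The checklist above can be sketched as a small polite fetcher using Python's standard library; the bot name and contact URL are placeholders for your own:

```python
import time
import urllib.request
from urllib import robotparser
from urllib.parse import urlsplit

# Placeholder identity -- use your own bot name and a contact URL.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"

def check_robots(rp: robotparser.RobotFileParser, url: str):
    """Return (allowed, delay_seconds) for url under already-parsed rules."""
    allowed = rp.can_fetch(USER_AGENT, url)
    delay = rp.crawl_delay(USER_AGENT) or 0
    return allowed, delay

def polite_fetch(url: str):
    """Fetch url only if robots.txt allows it, honoring Crawl-delay."""
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # download and parse the site's robots.txt

    allowed, delay = check_robots(rp, url)
    if not allowed:
        return None  # the site asks us not to crawl this path
    if delay:
        time.sleep(delay)  # honor the requested pacing between requests

    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

A production crawler would also cache the parsed robots.txt per host rather than re-fetching it on every request.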

SnapRender handles this for you — the API checks robots.txt by default and implements polite crawling with automatic rate limiting and retry backoff.

Frequently asked questions

Is it legal to ignore robots.txt?

Technically, robots.txt is advisory — not legally binding. However, ignoring it can get your IP blocked, trigger legal action (some courts consider robots.txt in scraping lawsuits), and harm your reputation. Ethical scrapers respect robots.txt unless they have explicit permission from the site owner.

Can robots.txt actually block bots from my site?

No. robots.txt only works for bots that choose to obey it. Malicious bots and most web scrapers can still access disallowed paths. It is a polite request, not a security measure. For actual access control, use authentication, firewalls, or rate limiting.

Does robots.txt keep pages out of Google search results?

robots.txt can prevent Googlebot from crawling a page, but if other sites link to that page, Google may still index the URL (showing it in results without a snippet). To fully remove a page from search results, use a "noindex" meta tag or X-Robots-Tag HTTP header instead.

Where does the robots.txt file go?

robots.txt must be at the root of your domain: https://example.com/robots.txt. It only applies to that specific domain and protocol. A robots.txt on example.com does not affect subdomain.example.com — each needs its own file.
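That scoping rule can be expressed with the standard library's URL helpers (the function name here is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """robots.txt is looked up at the root of the exact scheme + host."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post?id=1"))
# https://example.com/robots.txt
print(robots_url("https://sub.example.com/page"))
# https://sub.example.com/robots.txt  (the subdomain needs its own file)
```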

Does SnapRender respect robots.txt?

robots.txt tells scrapers which paths the site owner prefers not to be crawled. Ethical scrapers check robots.txt before requesting pages. SnapRender respects robots.txt by default but lets you override this when you have permission to scrape restricted paths.

Scrape ethically with one API call.

SnapRender respects robots.txt, manages rate limits, and handles JavaScript rendering. Start free.

Start Free — 100 requests/month