
Should You Block AI Crawlers in Robots.txt? How to Decide Without Losing AI Visibility

Should you block AI crawlers in robots.txt? Learn how to decide which bots to allow, which to restrict, and how to avoid losing AI search visibility.

Brittany Jiao · Crawler Guides

TL;DR

  • Do not use one blanket rule for every AI crawler. Separate crawlers by purpose: search, live retrieval, training, ads, and unknown scraping.
  • robots.txt is a policy signal, not a security control. You still need to check CDN, WAF, rate limits, and server responses.
  • The practical workflow is: audit crawler traffic -> classify bots -> decide what each bot can access -> update robots.txt/CDN rules -> verify with logs.

Contents

  1. The actual decision: block, allow, or limit
  2. Why Google I/O made this question more urgent
  3. Step 1: list the AI crawlers hitting your site
  4. Step 2: classify each crawler by purpose
  5. Step 3: check which URLs crawlers are requesting
  6. Step 4: check whether crawlers are getting blocked by your CDN or WAF
  7. Step 5: choose a crawler access policy
  8. Step 6: write robots.txt rules without blocking the wrong thing
  9. Step 7: verify the change after publishing
  10. Common mistakes to avoid
  11. Final checklist

The actual decision: block, allow, or limit

The question is usually phrased as:

Should we block AI crawlers in robots.txt?

That framing is too broad.

A better question is:

Which crawlers should be allowed to see which pages, and which crawlers should be restricted from low-value or expensive paths?

There are legitimate reasons to block or limit AI crawlers. Some bots hit parameter-heavy URLs, old pages, internal search pages, faceted navigation, and duplicate paths. Some create bandwidth or server cost issues. Some user agents are poorly documented or suspicious.

There are also legitimate reasons not to block every AI crawler. If you care about AI search visibility, answer engines, agentic discovery, or landing page validation, blocking the wrong crawler can make important pages harder for AI systems to access.

So the useful answer is not "allow all" or "block all."

The useful answer is a policy like this:

  • Allow trusted search/retrieval crawlers on important public content.
  • Block or restrict crawlers on low-value URL patterns.
  • Rate-limit abusive traffic.
  • Block unknown crawlers that ignore rules or create operational issues.
  • Verify the actual response crawlers receive after every change.

You can use a crawler reference like the CrawlConsole Web Crawlers directory to look up crawler identities, but the key work is still the audit. You need to see what is happening on your own site.

Why Google I/O made this question more urgent

The AI crawler debate became more practical after Google I/O because Google is making Search more agentic.

Google's I/O recap described a new intelligent Search box that can use text, images, files, videos, and Chrome tabs as inputs. It also described information agents in Search that can work in the background, reason across the web, and keep users updated with links to explore further.

That changes the website owner's question.

The old question was:

Can Google index this page?

The newer question is:

Can search and AI systems discover, fetch, understand, and revisit the pages that matter?

That is why the Reddit reaction makes sense. Publishers, SEOs, and site owners are worried about a web where AI answers more questions before users click. Some people are asking whether they should block crawlers. Others are asking why anyone should keep publishing fresh information if AI systems summarize it.

The operational answer is not to panic-block every bot.

It is to decide which pages still deserve machine access and then verify what happens.

For example:

  • A generic informational article may need a different policy than a product page.
  • A pricing page may need to be accessible to search and AI retrieval systems.
  • A docs page may be valuable if AI agents use it to understand your product.
  • An ad landing page may need access for validation crawlers like OAI-AdsBot.
  • Internal search, duplicate filters, and low-quality generated pages may deserve tighter restrictions.

Google I/O did not make crawler policy simpler. It made crawler policy more important.

If Search becomes more conversational, multimodal, and agent-driven, then the practical advantage goes to teams that know which automated systems can reach their content, what those systems saw, and where access should be limited.

Step 1: list the AI crawlers hitting your site

Start with logs, not opinions.

Use whatever source you have available:

  • server access logs
  • CDN logs
  • Cloudflare logs
  • Fastly logs
  • Vercel logs
  • nginx or Apache logs
  • bot management reports
  • crawler analytics

Export at least these fields:

  • timestamp
  • user agent
  • URL path
  • query string
  • status code
  • method
  • IP or ASN if available
  • cache status if available
  • country or region if available
  • referrer if available

Then filter for known crawler names.

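Common examples, all of which come up later in this guide, include:

  • Googlebot
  • GPTBot
  • OAI-SearchBot
  • OAI-AdsBot
  • ClaudeBot
  • Claude-User
  • PerplexityBot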

Do not stop at a simple user-agent match. User agents can be spoofed. If traffic volume is high or the crawler is making expensive requests, verify it through IP ranges, reverse DNS, CDN bot labels, or the crawler provider's documentation where possible.
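One way to go beyond the user-agent string is a forward-confirmed reverse DNS check. The sketch below is a minimal illustration, not a complete verifier: the `TRUSTED_SUFFIXES` table and the function names are assumptions made for this example, and the hostname suffixes should be confirmed against each provider's own documentation. Many AI crawler operators publish IP ranges rather than reverse DNS names, so this covers only the reverse DNS path.

```python
import socket
from typing import Callable, Iterable

# Assumed hostname suffixes for illustration only.
# Confirm them in the crawler operator's documentation before relying on them.
TRUSTED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
}


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> list[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(
    ip: str,
    crawler: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Iterable[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS: IP -> hostname -> same IP."""
    suffixes = TRUSTED_SUFFIXES.get(crawler)
    if not suffixes:
        return False  # no known hostname pattern to check against
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(suffixes):
            return False  # PTR record is outside the operator's domains
        return ip in forward(hostname)  # hostname must resolve back to the IP
    except OSError:
        return False  # lookup failed, so treat the request as unverified
```

The resolver functions are injectable so the check can be exercised without network access. In production you would cache results, because DNS lookups per request are slow.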

A quick first-pass table should look like this:

| Crawler | Requests | Top URL pattern | Top status code | Action needed |
|---|---:|---|---|---|
| Googlebot | 1,240 | /blog/ | 200 | Allow |
| GPTBot | 540 | /docs/ | 200 | Review |
| ClaudeBot | 92 | /blog/ | 403 | Check WAF |
| PerplexityBot | 48 | /pricing | 200 | Allow |
| UnknownBot | 8,900 | /?sort= | 200 | Block or rate-limit |

The goal is to separate real crawler activity from noise.
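A first-pass tally like the table above can be produced with a short script. This is a sketch that assumes your logs use the common "combined" access log format; CDN exports usually have a different shape, so the `LOG_LINE` pattern would need adjusting. The `CRAWLER_TOKENS` list and the sample log lines are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumes the "combined" access log format used by nginx and Apache defaults.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Illustrative list of user-agent tokens; extend it from your own logs.
CRAWLER_TOKENS = [
    "Googlebot", "GPTBot", "OAI-SearchBot", "OAI-AdsBot",
    "ClaudeBot", "Claude-User", "PerplexityBot",
]


def crawler_name(user_agent):
    """Return the first known crawler token found in the user agent, else None."""
    ua = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


def tally(lines):
    """Count requests, paths, and status codes per crawler."""
    stats = defaultdict(
        lambda: {"requests": 0, "paths": Counter(), "statuses": Counter()}
    )
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        name = crawler_name(match["ua"])
        if name is None:
            continue  # not a crawler we are tracking
        row = stats[name]
        row["requests"] += 1
        row["paths"][match["url"].split("?", 1)[0]] += 1
        row["statuses"][match["status"]] += 1
    return dict(stats)


# Made-up sample lines, not real traffic.
sample = [
    '203.0.113.7 - - [10/May/2025:10:00:00 +0000] "GET /docs/intro HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [10/May/2025:10:00:05 +0000] "GET /docs/intro?x=1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.9 - - [10/May/2025:10:01:00 +0000] "GET /blog/post HTTP/1.1" 403 0 "-" "ClaudeBot/1.0"',
    '192.0.2.1 - - [10/May/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0 Chrome/120"',
]
stats = tally(sample)
```

This only matches on the user-agent string, so treat the output as a starting list to verify, not as proof of crawler identity.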

Step 2: classify each crawler by purpose

Do not treat "AI crawler" as one bucket.

Different crawlers can support different jobs:

| Crawler type | What it may be used for | Typical policy |
|---|---|---|
| Search crawler | Finding or refreshing pages for search/answer surfaces | Allow on public useful pages |
| Live retrieval crawler | Fetching a page when a user asks an AI tool about it | Allow on public useful pages |
| Training crawler | Collecting data for model training | Business/legal decision |
| Ad validation crawler | Checking landing pages submitted to ad systems | Allow on ad landing pages |
| Unknown scraper | Unclear, spoofed, abusive, or undocumented behavior | Restrict, challenge, or block |

This distinction matters because blocking one crawler may not mean the same thing as blocking another.

For example, Anthropic describes different Claude-related crawler behaviors. Blocking Claude-User can affect user-directed retrieval, while blocking ClaudeBot is a different decision.

OpenAI also has multiple crawler contexts. OAI-AdsBot is relevant to ad landing page validation. OAI-SearchBot is a different kind of crawler decision.

Perplexity documents that PerplexityBot respects robots.txt for indexing site text.

The practical point: your policy should say more than "AI bots allowed" or "AI bots blocked."

It should say which crawler types are allowed on which page types.
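The classification can live in a small lookup that the rest of your audit reuses. The sketch below is one possible shape. The purpose assigned to each crawler is an assumption based on a general reading of provider documentation and should be rechecked, since operators change what their user agents do. The policy strings are taken from the table above.

```python
from enum import Enum


class Purpose(Enum):
    SEARCH = "search"
    LIVE_RETRIEVAL = "live retrieval"
    TRAINING = "training"
    AD_VALIDATION = "ad validation"
    UNKNOWN = "unknown"


# Assumed mapping for illustration; confirm each entry in the provider's docs.
CRAWLER_PURPOSE = {
    "Googlebot": Purpose.SEARCH,
    "OAI-SearchBot": Purpose.SEARCH,
    "PerplexityBot": Purpose.SEARCH,
    "Claude-User": Purpose.LIVE_RETRIEVAL,
    "GPTBot": Purpose.TRAINING,
    "ClaudeBot": Purpose.TRAINING,
    "OAI-AdsBot": Purpose.AD_VALIDATION,
}

# Typical policies, copied from the classification table.
TYPICAL_POLICY = {
    Purpose.SEARCH: "Allow on public useful pages",
    Purpose.LIVE_RETRIEVAL: "Allow on public useful pages",
    Purpose.TRAINING: "Business/legal decision",
    Purpose.AD_VALIDATION: "Allow on ad landing pages",
    Purpose.UNKNOWN: "Restrict, challenge, or block",
}


def typical_policy(crawler):
    """Return (purpose, starting policy) for a crawler name."""
    purpose = CRAWLER_PURPOSE.get(crawler, Purpose.UNKNOWN)
    return purpose, TYPICAL_POLICY[purpose]
```

Anything not in the mapping falls through to the unknown bucket, which keeps undocumented user agents from being allowed by default.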

Step 3: check which URLs crawlers are requesting

Next, group crawler requests by URL pattern.

This is where the decision becomes clearer.

Useful URL patterns usually include:

  • homepage
  • pricing page
  • docs
  • blog posts
  • feature pages
  • product pages
  • comparison pages
  • support pages

Low-value or risky URL patterns often include:

  • internal search pages
  • faceted navigation
  • session URLs
  • cart or checkout paths
  • logged-in pages
  • staging paths
  • duplicate parameter URLs
  • infinite calendar pages
  • sort/filter combinations
  • private or user-specific content

If a trusted crawler is mostly reading useful pages, blocking it may reduce visibility.

If a crawler is mostly hammering low-value paths, the issue may not be the crawler itself. The issue may be that your site gives bots too many wasteful URLs to crawl.

A useful audit table:

| URL pattern | Crawler examples | Status | Decision |
|---|---|---|---|
| /blog/* | Googlebot, OAI-SearchBot, PerplexityBot | 200 | Allow |
| /docs/* | Googlebot, GPTBot, ClaudeBot | 200 | Allow |
| /?filter=* | Unknown bots, GPTBot | 200 | Disallow or canonicalize |
| /search?q=* | Unknown bots | 200 | Disallow |
| /checkout/* | Any crawler | 403/200 | Block |

This step prevents overreaction. You may not need to block GPTBot everywhere. You may only need to block crawl traps and duplicate URL patterns.
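Grouping requests by URL pattern can be sketched in a few lines. The prefixes and parameter names below are assumptions drawn from the examples in this guide; replace them with the patterns your own site uses.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical patterns for illustration; adapt them to your site structure.
USEFUL_PREFIXES = ("/blog/", "/docs/", "/pricing")
LOW_VALUE_PREFIXES = ("/search", "/cart", "/checkout")
LOW_VALUE_PARAMS = {"sort", "filter", "session"}


def bucket(url):
    """Label a requested URL as 'useful', 'low-value', or 'review'."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    # Any low-value parameter marks the URL, wherever it sits in the query.
    if any(name in LOW_VALUE_PARAMS for name in params):
        return "low-value"
    if parts.path.startswith(LOW_VALUE_PREFIXES):
        return "low-value"
    if parts.path == "/" or parts.path.startswith(USEFUL_PREFIXES):
        return "useful"
    return "review"  # not classified yet, so a human should look


def summarize(requests):
    """Count (bucket, crawler) pairs from an iterable of (crawler, url)."""
    return Counter((bucket(url), crawler) for crawler, url in requests)
```

A crawler whose counts sit mostly in the "low-value" bucket is a candidate for pattern-level Disallow rules, while a large "review" bucket means the pattern lists are incomplete.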

Step 4: check whether crawlers are getting blocked by your CDN or WAF

This is the part many teams miss.

A crawler can be allowed in robots.txt and still fail to access the page.

Why?

Because the request may be blocked before it reaches your application.

Common blockers include:

  • Cloudflare Bot Fight Mode
  • Cloudflare WAF custom rules
  • Fastly security rules
  • Akamai Bot Manager
  • Vercel firewall rules
  • DataDome or other bot protection tools
  • rate limiting
  • country blocks
  • JavaScript challenges
  • managed challenge pages
  • IP reputation rules

Check whether important crawlers receive:

  • 200: page loaded successfully
  • 301/302: crawler was redirected
  • 403: blocked or forbidden
  • 404: page not found
  • 429: rate limited
  • 5xx: server or edge failure

A common bad state looks like this:

  • robots.txt allows OAI-SearchBot
  • sitemap includes the URL
  • browser loads the page normally
  • CDN gives the crawler a 403
  • nobody notices because Google Analytics never records the request

That is why the verification step has to use logs or crawler analytics, not only page views.
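A log-based check for this bad state might look like the sketch below. It assumes you have already reduced your logs to `(crawler, path, status)` tuples; the function name, the set of statuses treated as blocking, and the sample records are choices made for this example.

```python
from collections import Counter

# Statuses treated as "the crawler did not get the page" in this sketch.
BLOCKING_STATUSES = {403, 429}


def edge_block_report(records, allowed_crawlers, important_prefixes):
    """Find crawlers you intend to allow that are failing on important pages.

    records: iterable of (crawler, path, status) tuples from logs.
    Returns {crawler: (failed_requests, total_requests)} for affected crawlers.
    """
    prefixes = tuple(important_prefixes)
    totals = Counter()
    failed = Counter()
    for crawler, path, status in records:
        if crawler not in allowed_crawlers:
            continue  # only crawlers your policy intends to allow
        if not path.startswith(prefixes):
            continue  # only pages you want discovered
        totals[crawler] += 1
        if status in BLOCKING_STATUSES or status >= 500:
            failed[crawler] += 1
    return {c: (failed[c], totals[c]) for c in totals if failed[c]}


# Made-up records for illustration.
records = [
    ("OAI-SearchBot", "/blog/a", 403),
    ("OAI-SearchBot", "/blog/b", 403),
    ("OAI-SearchBot", "/blog/c", 200),
    ("Googlebot", "/blog/a", 200),
    ("UnknownBot", "/blog/a", 403),
    ("OAI-SearchBot", "/checkout/1", 403),
]
report = edge_block_report(records, {"OAI-SearchBot", "Googlebot"}, ["/blog/", "/docs/"])
```

Here `report` shows OAI-SearchBot failing on two of its three blog requests, which is the cue to look at CDN or WAF rules and not at robots.txt.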

Step 5: choose a crawler access policy

After the audit, choose one of four actions for each crawler/page pattern.

1. Allow

Use this for trusted crawlers on public pages you want discovered.

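Examples:

  • Googlebot on /blog/* and /docs/*
  • OAI-SearchBot on docs, blog, and product pages
  • PerplexityBot on /pricing
  • OAI-AdsBot on ad landing pages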

2. Disallow in robots.txt

Use this for compliant crawlers on URL patterns you do not want crawled.

Examples:

  • internal search pages
  • filtered URLs
  • duplicate parameter paths
  • staging directories
  • thin auto-generated pages

3. Rate-limit or challenge

Use this when a crawler might be useful but is too aggressive.

Examples:

  • repeated hits to the same URL
  • expensive dynamic pages
  • bursts that affect performance
  • crawl traps

4. Block at the edge

Use this for abusive, spoofed, or non-compliant traffic.

Examples:

  • fake user agents
  • bots ignoring robots.txt
  • suspicious scraping behavior
  • obvious attack traffic
  • high-volume hits to private or expensive endpoints

The important part is documenting why each decision was made.

A good policy note looks like this:

| Crawler | Important pages | Low-value pages | Enforcement |
|---|---|---|---|
| Googlebot | Allow | Disallow crawl traps | robots.txt + canonical cleanup |
| OAI-SearchBot | Allow docs/blog/product | Disallow search/filter URLs | robots.txt |
| GPTBot | Business decision | Disallow duplicate paths | robots.txt |
| Unknown scrapers | Block | Block | WAF |
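The four actions can also be written down as a small decision function, which makes the reasoning behind each policy note explicit and reviewable. This is a sketch under stated assumptions: the `Observation` fields are hypothetical names for what the audit found, and the order of the checks (block first, then rate-limit, then disallow, then allow) is one reasonable reading of the four actions, not a rule from any standard.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What the audit found for one crawler on one URL pattern (hypothetical fields)."""
    spoofed_or_abusive: bool   # fake user agent, attack traffic, suspicious scraping
    obeys_robots: bool         # the crawler follows robots.txt rules
    too_aggressive: bool       # bursts or repeated hits that affect performance
    wanted_page: bool          # a public page you want discovered


def choose_action(obs):
    """Map an audit observation to one of the four actions."""
    if obs.spoofed_or_abusive or not obs.obeys_robots:
        return "block at the edge"       # robots.txt cannot enforce anything here
    if obs.too_aggressive:
        return "rate-limit or challenge"  # possibly useful, but too costly as-is
    if not obs.wanted_page:
        return "disallow in robots.txt"   # compliant crawler, unwanted URL pattern
    return "allow"
```

For example, a compliant crawler requesting internal search pages at a normal rate maps to "disallow in robots.txt", while the same crawler on your docs maps to "allow".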

Step 6: write robots.txt rules without blocking the wrong thing

Keep robots.txt simple.

Do not add a giant copied list of AI bots unless you understand what each rule does.

Example: allow useful public pages but block low-value paths for a specific crawler.

User-agent: GPTBot
Disallow: /search
Disallow: /cart
Disallow: /checkout
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /

Example: allow search/retrieval crawlers but block crawl traps.

User-agent: OAI-SearchBot
Disallow: /search
Disallow: /*?session=
Disallow: /*?sort=
Allow: /

User-agent: PerplexityBot
Disallow: /search
Disallow: /*?session=
Disallow: /*?sort=
Allow: /

Example: block a crawler from the whole site.

User-agent: ExampleBadBot
Disallow: /

Be careful with broad patterns.

A rule like this can create accidental damage:

User-agent: *
Disallow: /blog

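That blocks far more than an AI crawler. User-agent: * applies to every crawler that has no more specific group of its own, including normal search crawlers. And because /blog is a prefix match, the rule also covers any other path that starts with /blog, such as /blog-archive.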

Also remember: robots.txt is not a privacy mechanism. The file itself is public, so every path you list in it is visible to anyone who requests it. If a page should not be public, do not rely on robots.txt. Use authentication, access control, or remove the page from the public web.
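Before publishing, it can help to test sample URLs against your rules. The sketch below is a simplified matcher for a single user-agent group: it handles the `*` wildcard and the `$` end anchor, and it picks the longest matching pattern, with Allow winning ties. That follows my reading of the Robots Exclusion Protocol (RFC 9309), but individual crawlers may differ in details, and percent-encoding normalization is left out. The function names are made up for this example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # "*" matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Decide whether a path (including query string) may be crawled.

    rules: list of ("allow" | "disallow", pattern) for ONE user-agent group.
    The longest matching pattern wins; on equal length, allow wins.
    """
    best_len, best_allow = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


# The GPTBot group from the first example in this step.
gptbot_rules = [
    ("disallow", "/search"),
    ("disallow", "/cart"),
    ("disallow", "/checkout"),
    ("disallow", "/*?sort="),
    ("disallow", "/*?filter="),
    ("allow", "/"),
]
```

Two results are worth noticing. `/products?page=2&sort=price` is still allowed, because `/*?sort=` only matches when `sort` is the first query parameter; covering other positions would need an additional pattern such as `/*&sort=`. And `/searching-tips` is blocked, because `/search` is a prefix match.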

Step 7: verify the change after publishing

After you update robots.txt or CDN rules, verify the result.

Do not assume the change worked.

Check these items:

  1. Fetch https://example.com/robots.txt and confirm the live file is updated.
  2. Check the robots.txt report and the URL Inspection tool in Google Search Console where relevant.
  3. Watch logs for the next crawler visit.
  4. Confirm the crawler receives the expected status code.
  5. Confirm useful pages still return 200.
  6. Confirm blocked patterns return the expected behavior.
  7. Confirm the CDN/WAF is not silently challenging trusted crawlers.
  8. Recheck after deployment, CDN cache clears, or security rule changes.

The important verification question is:

Did the crawler receive the response you intended?

Not:

Did the page load for me in Chrome?

Those are different tests.
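The robots.txt part of that check can be automated by writing your intentions down as expectations and testing the published file against them. The sketch below uses Python's standard `urllib.robotparser`. As far as I know, that parser does plain prefix matching and does not expand `*` wildcards the way major crawlers do, so this example sticks to prefix rules. The `check_expectations` helper and the sample file are made up for illustration; in practice the text would come from fetching your live robots.txt.

```python
from urllib.robotparser import RobotFileParser


def check_expectations(robots_txt, expectations):
    """Compare a robots.txt body against intended outcomes.

    expectations: list of (user_agent, url, should_allow).
    Returns a list of (user_agent, url, expected, actual) for every mismatch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for agent, url, should_allow in expectations:
        actual = parser.can_fetch(agent, url)
        if actual != should_allow:
            failures.append((agent, url, should_allow, actual))
    return failures


# Sample file using prefix rules only.
robots_txt = """\
User-agent: GPTBot
Disallow: /search
Disallow: /checkout
Allow: /

User-agent: ExampleBadBot
Disallow: /
"""

expectations = [
    ("GPTBot", "https://example.com/blog/post", True),
    ("GPTBot", "https://example.com/search?q=x", False),
    ("ExampleBadBot", "https://example.com/", False),
    ("Googlebot", "https://example.com/search", True),  # no group applies
]
failures = check_expectations(robots_txt, expectations)
```

An empty `failures` list only means the file says what you intended. It does not show what the CDN or WAF returned to the crawler, so the log checks in the list above still apply.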

Common mistakes to avoid

Mistake 1: blocking every AI crawler because one crawler was noisy

A noisy scraper does not mean every documented AI crawler should be blocked. Separate known crawlers from unknown or abusive traffic.

Mistake 2: allowing every AI crawler everywhere

This can waste crawl resources and expose low-value pages. Useful public pages and parameter-heavy crawl traps should not have the same policy.

Mistake 3: only checking robots.txt

Robots.txt is not enough. CDN and WAF rules can override the practical outcome by blocking requests before they reach the site.

Mistake 4: using Google Analytics as the crawler source of truth

Many crawlers do not execute client-side analytics scripts. Use logs or crawler-specific monitoring.

Mistake 5: forgetting ad and landing page crawlers

If you run AI-native ad campaigns, crawlers like OAI-AdsBot may need access to landing pages. Blocking them can create validation problems that look unrelated to SEO.

Mistake 6: never revisiting the policy

Crawler documentation changes. New bots appear. Product priorities change. Review your policy regularly, especially after adding new pages, new CDN rules, or new AI visibility goals.

Final checklist

Before blocking an AI crawler, answer these questions:

  • Which crawler is it?
  • Is it documented?
  • What pages is it requesting?
  • Is it hitting useful pages or low-value paths?
  • What status code is it receiving?
  • Is the traffic causing real cost, performance, or security problems?
  • Is this crawler connected to search, retrieval, training, ads, or unknown scraping?
  • Could blocking it reduce AI search visibility, citations, or landing page validation?
  • Can you restrict only the bad URL patterns instead of blocking the entire crawler?
  • After the change, can you prove the crawler received the intended response?

A practical policy is usually selective:

  • Allow trusted crawlers on public pages that matter.
  • Disallow crawl traps and low-value patterns.
  • Rate-limit expensive behavior.
  • Block abusive or spoofed traffic.
  • Verify everything with logs.

For crawler identity lookup, use the Web Crawlers directory. For agent-facing site readiness, use WebMCP and the WebMCP Checker. For repeatable audit prompts, use the Prompt Library.

The bottom line: do not make AI crawler policy from fear or hype. Make it from evidence: crawler, URL, status code, purpose, and business value.