AI Bot Traffic Is Being Spoofed. How to Verify Real Crawlers Before You Allow Them

Learn how to verify AI bot traffic before you allow, block, or throttle crawlers. Check user agents, status codes, behavior, and crawler profiles with a practical workflow.

Brittany Jiao · Crawler Guides

If your logs say an AI bot visited your website, that is a useful clue.

It is not proof.

A request can claim to be PerplexityBot, ClaudeBot, Meta-ExternalFetcher, Googlebot, Sleepbot, or another crawler. But a user-agent string is only a claim. It tells you what the request says it is, not whether the request is legitimate, useful, compliant, or safe to allow.

That matters because AI-agent and bot traffic is no longer a small edge case. More websites are seeing crawler-like visits from search engines, AI assistants, feed readers, monitoring tools, social preview systems, agentic browsers, and scrapers. Some are useful. Some are noisy. Some are spoofed.

The practical question is not:

What does this bot call itself?

The better question is:

What evidence do we have before we allow, block, throttle, or monitor it?

This guide walks through a practical workflow for verifying AI bot traffic and crawler user agents before you make access-policy decisions.

Why user-agent strings are not enough

Almost every crawler request includes a user-agent string.

That string might look like a known crawler. It might include a crawler name, product URL, browser token, or version number. For example, site owners may see crawler names like Sleepbot, LinkupBot, MediaToolkitBot, YisouSpider, or Meta-ExternalFetcher.

That is a starting point.

It is not the final answer.

User-agent strings can be:

  • incomplete
  • outdated
  • copied from a real crawler
  • changed by third-party infrastructure
  • spoofed by a scraper
  • misclassified by analytics tools
  • attached to behavior that does not match the claimed purpose

If you automatically allow anything with a trusted-looking name, a fake crawler can become a shortcut through your defenses. If you automatically block anything unfamiliar, you can cut off useful search, AI, feed, preview, or monitoring traffic.

The goal is not to trust every crawler.

The goal is to create a repeatable crawler verification workflow.

The five levels of crawler verification

Think about crawler verification in levels.

Each level adds more confidence.

| Level | Evidence | What it tells you |
|---|---|---|
| 1 | Claimed user agent | What the request says it is |
| 2 | Requested URL and status code | What the bot tried to access and what it received |
| 3 | Frequency and path behavior | Whether the crawl pattern looks useful, abusive, or broken |
| 4 | Identity verification | Whether the claimed crawler can be matched to stronger evidence |
| 5 | Business decision | Whether to allow, monitor, throttle, challenge, or block |

Most teams stop at level one.

That is where mistakes happen.

Level 1: identify the claimed crawler

Start by finding the exact crawler name or user-agent string.

Use the Web Crawlers directory to look up the bot and compare what you see in logs with known crawler context.

For example:

  • Sleepbot may appear in feed or content-discovery contexts.
  • dcrawl bot may appear as a search or crawler-related request.
  • LinkupBot may show up around link, content, or discovery workflows.
  • MediaToolkitBot may be connected to media monitoring or mention tracking.
  • YisouSpider is tied to search crawler discovery.
  • Meta-ExternalFetcher is tied to Meta link preview and fetch behavior.

At this stage, do not make a policy decision yet.

Just classify the claim.

Ask:

  • Is this a known crawler?
  • Is this an AI crawler, search crawler, feed crawler, preview fetcher, monitoring bot, or unknown scraper?
  • Does the crawler profile explain why it may visit this kind of page?
  • Is the user-agent string exact, partial, or suspiciously generic?
  • Have we seen this crawler before?

This gives you the first layer of evidence.
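The classification step above can be sketched in a few lines of Python. This is a minimal illustration, not an authoritative registry: the `KNOWN_CRAWLERS` table, its category labels, and the `classify_claim` helper are assumptions made up for this example, loosely based on the crawler descriptions in this section.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table. The category labels are illustrative guesses
# based on the descriptions in this article, not an official crawler registry.
KNOWN_CRAWLERS = {
    "sleepbot": "feed",
    "linkupbot": "discovery",
    "mediatoolkitbot": "monitoring",
    "yisouspider": "search",
    "meta-externalfetcher": "preview",
    "perplexitybot": "ai",
    "claudebot": "ai",
    "googlebot": "search",
}

# User agents that start with these tokens are treated as generic HTTP clients.
GENERIC_PREFIXES = ("python-requests", "curl", "wget", "go-http-client")


@dataclass(frozen=True)
class Claim:
    crawler: Optional[str]  # matched crawler token, if any
    category: str           # claimed purpose, or "unknown"
    note: str               # reminder of how much this result is worth


def classify_claim(user_agent: str) -> Claim:
    """Classify what a request *claims* to be. This verifies nothing."""
    ua = user_agent.strip().lower()
    if not ua:
        return Claim(None, "unknown", "empty user agent")
    for token, category in KNOWN_CRAWLERS.items():
        if token in ua:
            return Claim(token, category, "claimed only, not verified")
    if ua.startswith(GENERIC_PREFIXES):
        return Claim(None, "unknown", "generic HTTP client")
    return Claim(None, "unknown", "no matching profile")
```

The result is deliberately labeled "claimed only": a match here means the string resembles a known crawler, nothing more.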

Level 2: inspect the URL and status code

Next, check what the crawler actually requested.

The requested URL matters as much as the crawler name.

A bot visiting your homepage once is different from a bot requesting every filtered product URL, every internal search result, or every private-looking path.

For each crawler request, capture:

  • timestamp
  • claimed user agent
  • requested URL
  • status code
  • redirect target
  • response size
  • country or network metadata if available
  • whether the page is indexable or useful
  • whether the path is public, sensitive, duplicate, or expensive
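Most of the fields above can be pulled straight from an access log. The sketch below assumes the common "combined" log layout that Apache and nginx use by default; if your log format differs, the pattern needs adjusting. The `parse_log_line` helper and the sample line are made up for illustration.

```python
import re
from datetime import datetime

# Assumes the "combined" access-log layout:
# ip ident user [time] "METHOD url PROTOCOL" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of request fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "timestamp": datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "ip": fields["ip"],
        "user_agent": fields["user_agent"],
        "url": fields["url"],
        "status": int(fields["status"]),
        # Servers log "-" when no body was sent.
        "response_size": 0 if fields["size"] == "-" else int(fields["size"]),
    }


# Made-up sample line using a documentation-only IP address.
SAMPLE = (
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] '
    '"GET /blog/new-guide HTTP/1.1" 200 5120 "-" "Sleepbot/1.0"'
)
record = parse_log_line(SAMPLE)
```

Redirect targets, network metadata, and indexability are not part of this log format, so they would have to come from other sources such as CDN logs or your own page inventory.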

The status code is especially important.

| Status | What to check |
|---|---|
| 200 | The crawler received a page. Was it the right page? |
| 301/302 | Did the crawler follow a clean redirect or hit a loop? |
| 403 | Was the crawler blocked by WAF, CDN, bot rules, or app logic? |
| 404 | Is the crawler finding stale URLs? |
| 429 | Is rate limiting too aggressive or correctly protecting the site? |
| 5xx | Is crawler activity exposing server reliability issues? |

This is where CrawlConsole becomes useful: you are not only asking whether a bot appeared. You are checking which URL it requested and what response it received.

Level 3: compare behavior to purpose

Once you know the claimed crawler and the response it received, look at behavior.

Good crawler behavior usually has a pattern:

  • it requests crawlable public pages
  • it follows links or sitemap-like discovery paths
  • it does not hammer expensive parameter URLs
  • it does not repeatedly hit blocked paths
  • it does not request login, checkout, account, or admin paths
  • it returns over time in a way that matches discovery or monitoring

Suspicious behavior looks different:

  • high-volume scraping across the whole site
  • repeated requests to filtered or sorted URLs
  • many 403, 404, or 429 responses
  • user-agent rotation with similar behavior
  • requests that claim to be a trusted bot but behave like a scraper
  • traffic to pages that should not be useful to the claimed crawler

This is how you avoid treating every bot as either "good" or "bad."

The same crawler name can appear in a harmless request, a misconfigured crawl, or a spoofed session.

Behavior is the difference.
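Several of the signals listed above can be computed per claimed user agent once requests are parsed. The sketch below assumes each request is a dict with `user_agent`, `url`, `status`, and `ip` keys; that record shape, the `summarize_behavior` name, and the `SENSITIVE_PREFIXES` list are assumptions for illustration, not a fixed schema.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Hypothetical prefixes. Replace with the paths that are sensitive on your site.
SENSITIVE_PREFIXES = ("/login", "/checkout", "/account", "/admin")


def summarize_behavior(records):
    """Group requests by claimed user agent and compute simple behavior signals."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["user_agent"]].append(record)

    summary = {}
    for agent, rows in grouped.items():
        statuses = Counter(row["status"] for row in rows)
        total = len(rows)
        blocked_or_missing = sum(
            count for code, count in statuses.items() if code in (403, 404, 429)
        )
        summary[agent] = {
            "requests": total,
            # Share of requests that ended in 403, 404, or 429.
            "error_ratio": blocked_or_missing / total,
            # Requests carrying a query string (filters, sorts, searches).
            "parameter_urls": sum(1 for row in rows if urlsplit(row["url"]).query),
            # Requests to login, checkout, account, or admin style paths.
            "sensitive_hits": sum(
                1 for row in rows
                if urlsplit(row["url"]).path.startswith(SENSITIVE_PREFIXES)
            ),
            # The same claimed agent arriving from many addresses deserves a look.
            "distinct_ips": len({row["ip"] for row in rows}),
        }
    return summary
```

None of these numbers is a verdict on its own. They are inputs for comparing what a crawler does against what it claims to be for.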

Level 4: verify identity when the decision matters

You do not need the same level of verification for every bot.

If a low-volume feed crawler visits a public blog post and gets a 200, monitoring may be enough.

If a bot is requesting pricing, product, inventory, lead-gen, or account-adjacent paths, you need more confidence.

Depending on the crawler, verification can include:

  • matching the request against known crawler documentation
  • checking published IP ranges when available
  • using reverse DNS verification where appropriate
  • comparing CDN bot labels or security logs
  • checking whether multiple requests follow the same behavior pattern
  • reviewing whether the crawler respects robots.txt and rate limits
  • watching whether the same claimed user agent appears from unrelated networks
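Two of the checks above, published IP ranges and reverse DNS, can be sketched with the Python standard library. The reverse DNS check here is the forward-confirmed variant: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. The function names, the example ranges, and the `bot.example` suffix are placeholders; use whatever ranges and hostnames the crawler operator actually documents.

```python
import ipaddress
import socket


def ip_in_ranges(ip, cidr_ranges):
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in cidr_ranges
    )


def _reverse_lookup(ip):
    # PTR lookup: IP address to hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    # A/AAAA lookup: hostname to the set of addresses it resolves to.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse_lookup=_reverse_lookup,
                           forward_lookup=_forward_lookup):
    """Forward-confirmed reverse DNS check.

    The lookup functions are injectable so the logic can be tested offline.
    Addresses are compared as strings, which is fine for IPv4 but may need
    normalizing for IPv6.
    """
    suffixes = [s.strip(".").lower() for s in allowed_suffixes]
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    # Match whole DNS labels so "evilbot.example" does not pass for "bot.example".
    if not any(hostname == s or hostname.endswith("." + s) for s in suffixes):
        return False
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

A failed check does not prove abuse, since not every operator publishes ranges or maintains PTR records. It only means the claim could not be confirmed this way.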

Do not turn this into fake precision.

Some crawlers are well documented. Some are not. Some intermediaries and agentic systems may fetch through infrastructure that is hard to classify.

When you cannot fully verify identity, classify the request honestly:

  • verified
  • likely legitimate
  • unknown
  • suspicious
  • abusive

That is better than pretending a user-agent string is proof.

Level 5: choose the right access policy

Once you have evidence, decide what to do.

Do not use one rule for all AI bot traffic.

Use a tiered policy:

| Classification | Recommended policy |
|---|---|
| Verified useful crawler | Allow and monitor |
| Likely useful crawler | Allow limited public paths and monitor |
| Unknown low-volume bot | Monitor before changing policy |
| Unknown high-volume bot | Throttle or challenge |
| Spoofed or abusive traffic | Block or restrict |
| Bot hitting sensitive paths | Block through real access controls, not only robots.txt |
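The table above can be expressed as a small decision function. This is one possible encoding, not the only reasonable one: the `Classification` enum reuses the labels from the previous section, and treating "suspicious" the same as unknown high-volume traffic is an assumption, since the table does not name that case.

```python
from enum import Enum


class Classification(Enum):
    VERIFIED = "verified"
    LIKELY_LEGITIMATE = "likely legitimate"
    UNKNOWN = "unknown"
    SUSPICIOUS = "suspicious"
    ABUSIVE = "abusive"


def choose_policy(classification, high_volume=False, sensitive_path=False):
    """Map evidence to a policy string taken from the table above."""
    # Sensitive paths are handled first: identity does not matter there.
    if sensitive_path:
        return "block through access controls"
    if classification is Classification.ABUSIVE:
        return "block or restrict"
    if classification is Classification.VERIFIED:
        return "allow and monitor"
    if classification is Classification.LIKELY_LEGITIMATE:
        return "allow limited public paths and monitor"
    # Assumption: suspicious traffic is throttled or challenged, like
    # unknown high-volume traffic, until it is confirmed as abusive.
    if classification is Classification.SUSPICIOUS or high_volume:
        return "throttle or challenge"
    return "monitor before changing policy"
```

For example, `choose_policy(Classification.UNKNOWN, high_volume=True)` yields "throttle or challenge", while the same bot at low volume is only monitored.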

For crawler access, robots.txt is a policy signal. It is not security.

If a page should not be public, do not rely on robots.txt. Use authentication, server-side authorization, and proper access controls.

For public pages, use robots.txt and crawler monitoring together:

  • robots.txt expresses what compliant crawlers may access
  • WAF/CDN rules protect the site from abuse
  • crawler logs show what actually happened
  • CrawlConsole helps classify and monitor the crawler layer
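To see what your robots.txt tells a compliant crawler, Python's standard `urllib.robotparser` module can evaluate rules offline. The robots.txt content, the `ExampleBot` name, and the `is_allowed` helper below are hypothetical and exist only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /account/

User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is needed.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent, path):
    """What the rules ask of a compliant crawler. Nothing is enforced here."""
    return parser.can_fetch(user_agent, path)
```

Comparing these answers with your logs is a quick way to spot a crawler that keeps requesting paths the rules disallow. The rules themselves still block nothing.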

A practical verification worksheet

Use this before changing crawler policy.

| Question | Why it matters |
|---|---|
| What user agent did the request claim? | First classification layer |
| Is there a matching crawler profile? | Helps identify purpose and context |
| Which URL did it request? | Shows whether the request is useful or risky |
| What status code did it receive? | Reveals blocks, redirects, errors, or successful access |
| Did it request public or sensitive paths? | Separates visibility traffic from risk traffic |
| Was the behavior low-volume or aggressive? | Helps decide monitor vs throttle |
| Can the identity be verified beyond user agent? | Reduces spoofing risk |
| Does the crawler support search, AI visibility, previews, feeds, or monitoring? | Helps estimate upside |
| What policy should apply? | Turns evidence into action |
| What should be monitored after the change? | Prevents one-time audits from going stale |

Example: do not treat all unfamiliar crawlers the same

Imagine you see these requests:

| Claimed crawler | URL | Status | Initial interpretation |
|---|---|---:|---|
| Sleepbot | /blog/new-guide | 200 | Likely content/feed discovery worth monitoring |
| dcrawl bot | /web-crawlers | 200 | Search/crawler discovery signal worth classifying |
| LinkupBot | /pricing | 200 | Check purpose and behavior before deciding |
| Unknown bot | /search?q=* | 429 | Likely expensive path; throttle or restrict |
| Claimed PerplexityBot | /products?sort=price&page=900 | 200 | Needs behavior and identity verification |
| Claimed Meta crawler | /account/settings | 403 | Access control likely doing its job |
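Much of the interpretation in the table above comes from the URL rather than the crawler name. A rough path classifier might look like the sketch below. The prefixes, the `sort` parameter check, and the page-depth threshold are all assumptions chosen to fit this example; a real site would need its own rules.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical rules chosen to match the example table.
SENSITIVE_PREFIXES = ("/account", "/admin", "/login", "/checkout")
EXPENSIVE_PREFIXES = ("/search",)
DEEP_PAGE_THRESHOLD = 100  # arbitrary cutoff for "deep" pagination


def classify_path(url):
    """Label a requested URL as sensitive, expensive, duplicate, or public."""
    parts = urlsplit(url)
    if parts.path.startswith(SENSITIVE_PREFIXES):
        return "sensitive"
    if parts.path.startswith(EXPENSIVE_PREFIXES):
        return "expensive"
    query = parse_qs(parts.query)
    page = query.get("page", ["1"])[0]
    # Sorted or deeply paginated listings are usually duplicates of other pages.
    if "sort" in query or (page.isdigit() and int(page) > DEEP_PAGE_THRESHOLD):
        return "duplicate"
    return "public"
```

Under these rules, `/pricing` comes out as "public", which is a reminder that the path label is only one input: whether a given crawler should read your pricing page is still a business decision.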

The point is not that one crawler is always safe and another is always bad.

The point is that crawler policy should be based on evidence, not fear, guesswork, or a copied user-agent string.

How this connects to AI visibility

There is a real SEO and AI-visibility risk here.

If you block every unfamiliar bot, you may prevent useful systems from discovering your pages. That can affect search visibility, AI search visibility, link previews, feed discovery, and downstream agent workflows.

If you allow every bot that claims to be useful, you may expose your site to scraping, waste server resources, or leak public-but-sensitive business data at scale.

The practical middle ground is:

  1. Identify crawler traffic.
  2. Verify the claim when the decision matters.
  3. Allow useful public-page discovery.
  4. Restrict expensive, duplicate, private, or abusive paths.
  5. Monitor changes after publishing or updating policy.

This is also where crawler data can support agent-readiness work.

If you are building pages for AI agents, MCP discovery, or WebMCP, you need to know whether the pages are actually being discovered. If you are testing prompts in the Prompt Library, crawler evidence can help separate "the answer did not mention us" from "the system may never have reached the page."

AI visibility starts with access evidence.

What to monitor after you allow a crawler

After you decide to allow or monitor a crawler, do not stop.

Track:

  • visits to the exact pages you want crawled
  • status codes for those pages
  • repeated 403, 404, 429, and 5xx responses
  • crawl frequency before and after publishing
  • whether crawlers follow internal links to related pages
  • whether strategic pages start receiving more search impressions
  • whether AI-search prompt tests begin surfacing the page
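The log-based items in this list can be tracked with a small recurring report. The sketch below assumes parsed request records with `url` and `status` keys and a list of pages you care about; the function names and the record shape are hypothetical.

```python
from collections import Counter


def crawl_report(records, target_pages):
    """Count status codes per target page, ignoring query strings."""
    report = {page: Counter() for page in target_pages}
    for record in records:
        path = record["url"].split("?", 1)[0]
        if path in report:
            report[path][record["status"]] += 1
    return report


def pages_needing_attention(report):
    """Flag pages that were never crawled or that returned 403, 404, 429, or 5xx."""
    flagged = []
    for page, statuses in report.items():
        total = sum(statuses.values())
        problems = sum(
            count for code, count in statuses.items()
            if code in (403, 404, 429) or code >= 500
        )
        if total == 0 or problems > 0:
            flagged.append(page)
    return flagged
```

Running this before and after publishing gives a simple comparison of crawl frequency and error rates for the pages that matter. Search impressions and prompt-test results have to come from other tools.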

For example, if you publish a new crawler guide, watch whether useful bots request:

  • the new article
  • the linked crawler profile
  • your Web Crawlers directory
  • relevant product or workflow pages
  • related WebMCP or prompt-testing pages

That gives you a feedback loop:

publish useful content -> internally link to strategic pages -> monitor crawlers -> watch GSC and AI-search signals -> update the next article

The bottom line

AI bot traffic is not one category.

Some crawlers help your pages get discovered. Some power previews, feeds, search, monitoring, or AI answers. Some are noisy. Some are pretending to be something they are not.

The mistake is making policy from the crawler name alone.

Before you allow, block, or throttle a bot, verify:

  • what it claims to be
  • what page it requested
  • what response it received
  • how it behaved
  • whether the identity is credible
  • what business value or risk it creates

Use the Web Crawlers directory to identify crawler profiles, then use CrawlConsole request evidence to see the full picture: crawler name, URL, status code, timing, and behavior.

That is how you move from "a bot visited my site" to a real crawler visibility workflow.