
AI crawler directory: who fetches your site, and what to allow

The working list of AI crawlers worth a robots.txt decision: who operates each one, what it feeds (training, search index, or live answers), and a decision framework, with this site's 15-crawler allowlist as the example.

[Illustration: small robot crawlers queueing at an open gate, one passing through under a mint pass mark]
By Lars Nyman · 5 min read · Updated

The directory

Engines change their agents more often than most sites update robots.txt; this table is reviewed quarterly (last review dated above).

| Crawler | Operator | What it feeds | If you block it |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Model training corpus | Future GPT models know less about you |
| OAI-SearchBot | OpenAI | ChatGPT search index | You drop out of ChatGPT search results |
| ChatGPT-User | OpenAI | Live fetches during user chats | ChatGPT cannot open your pages when asked |
| ClaudeBot | Anthropic | Training and index crawling | Future Claude models know less about you |
| Claude-Web | Anthropic | Live fetches during user chats | Claude cannot open your pages when asked |
| Anthropic-ai | Anthropic | Legacy training agent | Belt-and-suspenders companion to ClaudeBot |
| PerplexityBot | Perplexity | Answer-engine index | You vanish from Perplexity citations |
| Google-Extended | Google | Gemini training (not Search) | Gemini training opt-out; Search unaffected |
| Applebot-Extended | Apple | Apple Intelligence training | Same trade as Google-Extended, Apple edition |
| Bingbot | Microsoft | Bing index, feeds Copilot | You exit both Bing and Copilot answers |
| CCBot | Common Crawl | Open web corpus used by many labs | You leave the default dataset of new models |
| Bytespider | ByteDance | Model training | Known to ignore robots.txt at times; blocking is partly symbolic |
| cohere-ai | Cohere | Model training | Enterprise-model exposure, minor for most |
| Amazonbot | Amazon | Alexa and Amazon AI surfaces | Alexa-adjacent answers lose you |
| Meta-ExternalAgent | Meta | Meta AI training and retrieval | Meta AI surfaces know less about you |

The decision framework

The blanket question "should I block AI crawlers" hides three separate trades.

Trainers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, cohere-ai, Bytespider, Meta-ExternalAgent)

The trade is IP versus ambient familiarity. If your content is the product (publishers, course creators), blocking is a coherent licensing position. If your content exists to make buyers choose you, models that learned your category from your pages describe you better forever after.

Indexers (OAI-SearchBot, PerplexityBot, Bingbot)

The trade is crawl load versus presence in AI search results. For a commercial site there is no real trade; blocking an indexer is delisting yourself from the surface buyers are migrating to.

Fetchers (ChatGPT-User, Claude-Web)

The trade does not exist. These act on behalf of a human asking about you right now. Blocking them is hanging up on a prospect mid-question.

Blocking a trainer is an IP position. Blocking a fetcher is hanging up on a buyer who just asked about you.

For a B2B company selling on expertise, the resolution is usually: allow everything, then make the crawlable surface excellent. That is the position this site takes, and the benchmark data shows it is now the norm: bot access is the highest-scoring section in the field, averaging 98/100. The control plane is solved; the differentiation has moved to what crawlers find once inside.
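
The three trades can be sketched as a small lookup. This is an illustrative sketch, not a tool this site ships: the grouping copies the lists in this section, while `decide` and its `content_is_product` flag are hypothetical names, and treating "content is the product" as an automatic block simplifies what the text calls a licensing position. Agents the framework does not classify (Anthropic-ai and Amazonbot in the directory) come back as "review".

```python
# Grouping copied from the framework above; everything else is a hypothetical sketch.
CRAWLER_KINDS = {
    "trainer": [
        "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
        "CCBot", "cohere-ai", "Bytespider", "Meta-ExternalAgent",
    ],
    "indexer": ["OAI-SearchBot", "PerplexityBot", "Bingbot"],
    "fetcher": ["ChatGPT-User", "Claude-Web"],
}

# Lowercased lookup so matching does not depend on how the agent name is capitalized.
KIND_BY_AGENT = {
    agent.lower(): kind
    for kind, agents in CRAWLER_KINDS.items()
    for agent in agents
}


def kind_of(agent):
    """Return 'trainer', 'indexer', 'fetcher', or None if the agent is unclassified."""
    return KIND_BY_AGENT.get(agent.lower())


def decide(agent, content_is_product):
    """Suggest 'allow', 'block', or 'review' for one crawler.

    Assumption: only trainers are ever blocked, and only when the content
    itself is what the site sells (publishers, course creators).
    """
    kind = kind_of(agent)
    if kind is None:
        return "review"  # not covered by the three lists; decide by hand
    if kind == "trainer" and content_is_product:
        return "block"
    return "allow"
```

For a publisher, `decide("GPTBot", content_is_product=True)` suggests blocking, while `decide("ChatGPT-User", content_is_product=True)` still suggests allowing, which mirrors the point that fetchers act for a person asking about you right now.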

Implementation notes
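
One way to produce an allow-everything file for the 15 agents in the directory is sketched below. It is not this site's actual robots.txt: the agent names come from the table, while `render_allowlist` and the one-group-per-agent layout are assumptions made for illustration. Crawlers are allowed by default when no rule matches them, so explicit `Allow: /` groups mostly document intent; they start to matter when a `User-agent: *` group restricts paths, because a crawler follows the group that names it rather than the wildcard.

```python
# Agent names taken from the directory table; the layout is an assumed convention.
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-Web", "Anthropic-ai",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Bingbot", "CCBot", "Bytespider",
    "cohere-ai", "Amazonbot", "Meta-ExternalAgent",
]


def render_allowlist(agents):
    """Return robots.txt text with one 'Allow: /' group per agent."""
    groups = [f"User-agent: {agent}\nAllow: /" for agent in agents]
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(render_allowlist(AI_CRAWLERS), end="")
```

To take a licensing position on trainers instead, the same function could be given only the indexers and fetchers, with separate `Disallow: /` groups written for the trainers you choose to block.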

What to do next

Open your robots.txt now and check it against the table; if you cannot say which of the three kinds each rule affects, run the audit and let the bot-access section grade it for you.
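
A quick self-check is possible with Python's standard `urllib.robotparser`. The sketch below parses robots.txt text you paste in and reports, per agent, whether the root path may be fetched. The `audit` helper and the sample rules are hypothetical, and the standard-library parser's matching is an approximation of how each real crawler interprets the file, so treat the result as a first pass rather than a verdict.

```python
from urllib.robotparser import RobotFileParser


def audit(robots_txt, agents, path="/"):
    """Map each agent name to True (may fetch `path`) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# Hypothetical rules: block one trainer, allow everyone else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for agent, allowed in audit(SAMPLE, ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

To check a live file rather than pasted text, `RobotFileParser` also offers `set_url()` and `read()`, which fetch and parse the file over the network.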
