Will blocking trainers protect our content from AI?

Partially at best. Compliance with robots.txt is voluntary: major operators honor it, some agents historically have not, and your content also reaches training corpora through third-party scrapes, syndication, and archives. Treat blocking as a statement of licensing position, not as encryption.

Does allowing these crawlers cost us bandwidth or rankings?

Bandwidth: yes, but trivially. AI crawler traffic is real but modest for a typical B2B site, and if it ever stops being modest, rate limiting belongs in your CDN. Rankings: no. Google-Extended explicitly does not affect Search, and allowing the others does not lower classical rankings.

We blocked everything a year ago during the backlash. Undo it?

If buyers research your category inside AI engines, yes, and the undo is one robots.txt deploy. Re-crawl and re-description then happen on the engines' cadence: typically days to weeks for fetchers and indexers, the next training cycle for trainers.

AI crawler directory: who fetches your site, and what to allow

The directory

Engines change their agents more often than most sites update robots.txt; this table is reviewed quarterly (last review dated above).

User agent	Operator	What it feeds	If you block it
GPTBot	OpenAI	Model training corpus	Future GPT models know less about you
OAI-SearchBot	OpenAI	ChatGPT search index	You drop out of ChatGPT search results
ChatGPT-User	OpenAI	Live fetches during user chats	ChatGPT cannot open your pages when asked
ClaudeBot	Anthropic	Training and index crawling	Future Claude models know less about you
Claude-Web	Anthropic	Live fetches during user chats	Claude cannot open your pages when asked
Anthropic-ai	Anthropic	Legacy training agent	Belt-and-suspenders companion to ClaudeBot
PerplexityBot	Perplexity	Answer-engine index	You vanish from Perplexity citations
Google-Extended	Google	Gemini training (not Search)	Gemini training opt-out; Search unaffected
Applebot-Extended	Apple	Apple Intelligence training	Same trade as Google-Extended, Apple edition
Bingbot	Microsoft	Bing index, feeds Copilot	You exit both Bing and Copilot answers
CCBot	Common Crawl	Open web corpus used by many labs	You leave the default dataset of new models
Bytespider	ByteDance	Model training	Known to ignore robots.txt at times; blocking is partly symbolic
cohere-ai	Cohere	Model training	Enterprise-model exposure, minor for most
Amazonbot	Amazon	Alexa and Amazon AI surfaces	Alexa-adjacent answers lose you
Meta-ExternalAgent	Meta	Meta AI training and retrieval	Meta AI surfaces know less about you

The decision framework

The blanket question "should I block AI crawlers" hides three separate trades.

Trainers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, cohere-ai, Bytespider, Meta-ExternalAgent)

The trade is IP versus ambient familiarity. If your content is the product (publishers, course creators), blocking is a coherent licensing position. If your content exists to make buyers choose you, models that learned your category from your pages describe you better forever after.

Indexers (OAI-SearchBot, PerplexityBot, Bingbot)

The trade is crawl load versus presence in AI search results. For a commercial site there is no real trade; blocking an indexer is delisting yourself from the surface buyers are migrating to.

Fetchers (ChatGPT-User, Claude-Web)

The trade does not exist. These act on behalf of a human asking about you right now. Blocking them is hanging up on a prospect mid-question.

For a B2B company selling on expertise, the resolution is usually: allow everything, then make the crawlable surface excellent. That is the position this site takes, and the benchmark data shows it is now the norm: bot access is the highest-scoring section in the field, averaging 98/100. The control plane is solved; the differentiation has moved to what crawlers find once inside.
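As a rough illustration of the three kinds above, the sketch below groups the agents the way this section does and reports which kinds a given robots.txt blocks. It relies on Python's standard urllib.robotparser, whose matching may differ in detail from how each crawler reads the file. The names AGENT_KINDS, blocked_by_kind, and SAMPLE are hypothetical, and the sample rules are invented for the example rather than recommended.

```python
from urllib.robotparser import RobotFileParser

# Grouping copied from the decision framework; not an official taxonomy.
AGENT_KINDS = {
    "trainer": ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
                "CCBot", "cohere-ai", "Bytespider", "Meta-ExternalAgent"],
    "indexer": ["OAI-SearchBot", "PerplexityBot", "Bingbot"],
    "fetcher": ["ChatGPT-User", "Claude-Web"],
}


def blocked_by_kind(robots_txt: str, path: str = "/") -> dict[str, list[str]]:
    """Return, per kind, the agents that robots.txt forbids from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        kind: [agent for agent in agents if not parser.can_fetch(agent, path)]
        for kind, agents in AGENT_KINDS.items()
    }


# Hypothetical policy: block two trainers, allow everyone else.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_by_kind(SAMPLE))
# {'trainer': ['GPTBot', 'CCBot'], 'indexer': [], 'fetcher': []}
```

An empty list under "indexer" and "fetcher" is the outcome this section argues for on a commercial site; anything listed there is worth a second look.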

Implementation notes

Audit list

Name agents explicitly: A bare "User-agent: *" allow rule works, but crawlers apply the most specific group that names them, so explicit per-agent rules survive a later wildcard restriction and document intent. Fifteen named rules cost nothing.
Check all three control surfaces: robots.txt, the meta robots tag, and the X-Robots-Tag header. A stray noai in any one of them overrides good intentions in the others; the audit checks all three.
Verify the real visitors: Crawler names are spoofable. The major operators publish IP ranges or verification endpoints; spot-check server logs before drawing conclusions from agent strings.
Point agents at your briefing: Our robots.txt carries a comment directing AI agents to the llms-handshake file, which turns a crawl-permission file into a routing hint.
Re-review quarterly: Agents appear, rename, and split (OpenAI alone runs three with different jobs). A calendar reminder beats discovering a year later that you never allowed an agent that did not exist when you last looked.
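Two of the checks in this list can be sketched with the Python standard library: scanning all three control surfaces for one agent, and testing a logged IP against an operator's published ranges. This is a minimal sketch, not the audit itself. The names FLAGGED, MetaRobotsParser, audit_surfaces, and ip_in_published_ranges are hypothetical, the flagged tokens are an assumption, and the CIDR range shown in the test data is a documentation placeholder, not a real operator range.

```python
from html.parser import HTMLParser
from ipaddress import ip_address, ip_network
from urllib.robotparser import RobotFileParser

# Assumption: the directive tokens worth flagging. Adjust to your own policy.
FLAGGED = {"noai", "noindex"}


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [t.strip().lower() for t in content.split(",")]


def audit_surfaces(robots_txt, html, headers, agent, path="/"):
    """Return the control surfaces that restrict `agent` on `path`."""
    findings = []

    # Surface 1: robots.txt. A group naming the agent wins over "*".
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, path):
        findings.append("robots.txt")

    # Surface 2: the meta robots tag in the page HTML.
    meta = MetaRobotsParser()
    meta.feed(html)
    if FLAGGED & set(meta.directives):
        findings.append("meta robots")

    # Surface 3: the X-Robots-Tag response header. Simplified: this ignores
    # the optional "agent: directive" prefix form of the header.
    header = next((v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    if FLAGGED & {t.strip().lower() for t in header.split(",")}:
        findings.append("X-Robots-Tag")

    return findings


def ip_in_published_ranges(ip, cidrs):
    """True if `ip` falls inside any CIDR range the operator publishes."""
    address = ip_address(ip)
    return any(address in ip_network(cidr) for cidr in cidrs)


# ChatGPT-User is named explicitly, so the wildcard block does not reach it,
# but a stray noai in the meta tag still shows up.
robots = "User-agent: ChatGPT-User\nAllow: /\n\nUser-agent: *\nDisallow: /\n"
page = '<meta name="robots" content="index, noai">'
print(audit_surfaces(robots, page, {}, "ChatGPT-User"))  # ['meta robots']
```

In practice the CIDR list passed to ip_in_published_ranges would come from whatever ranges the operator actually publishes; the function only does the membership test.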

What to do next

Open your robots.txt now and check it against the table; if you cannot say which of the three kinds each rule affects, run the audit and let the bot-access section grade it for you.
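A quick first pass at checking a robots.txt against the table could look like the sketch below, which lists the directory's agents that have no group naming them and therefore fall through to the wildcard. DIRECTORY_AGENTS is copied from the table; the function names are hypothetical, and the line parsing is deliberately simple.

```python
import re

# The fifteen user agents from the directory table.
DIRECTORY_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "Anthropic-ai", "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Bingbot", "CCBot", "Bytespider", "cohere-ai", "Amazonbot",
    "Meta-ExternalAgent",
]

USER_AGENT_LINE = re.compile(r"\s*user-agent\s*:\s*(\S.*?)\s*$", re.IGNORECASE)


def named_agents(robots_txt: str) -> set[str]:
    """Lower-cased agent names that appear on User-agent lines."""
    names = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = USER_AGENT_LINE.match(line)
        if match:
            names.add(match.group(1).lower())
    return names


def agents_without_explicit_rule(robots_txt: str) -> list[str]:
    """Directory agents that no User-agent line names explicitly."""
    named = named_agents(robots_txt)
    return [agent for agent in DIRECTORY_AGENTS if agent.lower() not in named]
```

An agent on the returned list is not necessarily blocked; it is simply governed by whatever the wildcard group says today.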
