The Rise of the AI Crawler
For decades, SEO professionals only really cared about Googlebot and maybe Bingbot. Those days are over. Today, a significant share of web traffic comes from AI agents and LLM training bots that crawl the web for context, answers, and training data.
If you don't explicitly manage these bots in your robots.txt, your site may be absorbing avoidable server load or inadvertently feeding proprietary content into public AI models.
The Major AI User-Agents
To control AI access, you need to know who is crawling. Here are the most prominent AI user-agents hitting your servers right now:
- GPTBot: OpenAI's main crawler for training data.
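- ChatGPT-User: OpenAI's agent for user-initiated, real-time browsing. It fetches pages on demand when a ChatGPT query calls for them, rather than crawling on its own schedule.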
- ClaudeBot: Anthropic's crawler for Claude.
- PerplexityBot: The crawler that powers Perplexity's answer engine.
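- Amazonbot: Amazon's general web crawler, whose data feeds various Amazon foundation models.
- Applebot-Extended: Apple's AI opt-out token. It does not crawl on its own; it controls whether content fetched by Applebot may be used to train Apple's generative AI models.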
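To see which of these agents actually reach your site, you can count their requests in your access logs. The sketch below is a minimal illustration, assuming each log line contains the raw user-agent string; `count_ai_hits` and `SAMPLE_LOG` are hypothetical names, and the sample lines use simplified, made-up user-agent strings rather than real traffic.

```python
from collections import Counter

# Tokens taken from the list above. Applebot-Extended is omitted on purpose:
# Apple documents it as a robots.txt token, so it does not show up in requests.
AI_AGENT_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Amazonbot"]


def count_ai_hits(log_lines):
    """Count log lines per AI token, using a case-insensitive substring match."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_AGENT_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each request once
    return hits


# Hypothetical log lines for illustration only (simplified user-agent strings).
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [01/Jan/2025:10:00:02 +0000] "GET /pricing HTTP/1.1" 200 734 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [01/Jan/2025:10:00:05 +0000] "GET /blog HTTP/1.1" 200 901 "-" "PerplexityBot/1.0"',
    '192.0.2.44 - - [01/Jan/2025:10:00:09 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

if __name__ == "__main__":
    for token, count in count_ai_hits(SAMPLE_LOG).most_common():
        print(f"{token}: {count}")
```

Keep in mind that user-agent strings can be spoofed, so counts like these are a rough signal of who is crawling rather than proof of identity.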
Should You Block Them?
There are two schools of thought on blocking AI bots:
1. The "Protect Everything" Approach
Publishers who rely entirely on ad revenue often block training bots (like GPTBot) because they fear their content will be used to train models that then give away the publisher's answers for free. To do this, add an explicit Disallow rule for each training bot.
2. The "Agent SEO" Approach
If you run a SaaS company, a B2B service, or an agency, you want AI models to understand what you do. When a user asks an AI "What is the best CRM in London?", you want your site cited. In this case, you should allow discovery bots (like ChatGPT-User and PerplexityBot) while optionally blocking pure training scrapers.
Recommendation: Do NOT block ChatGPT-User or PerplexityBot if you want to be discovered in the new era of generative search. Being invisible to LLMs is the new equivalent of being deindexed by Google.
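Either approach boils down to a per-bot allow or block decision, which can be kept as data and rendered into a robots.txt file. The following is a rough sketch under that assumption; `build_robots_txt`, `PROTECT_EVERYTHING`, and `AGENT_SEO` are hypothetical names introduced for illustration, and the bots in each policy are only the ones named in this article.

```python
def build_robots_txt(policy, default_allow=True):
    """Render a robots.txt string from a {user_agent_token: allowed} mapping."""
    # Start with the catch-all group that applies to every other crawler.
    lines = ["User-agent: *", "Allow: /" if default_allow else "Disallow: /"]
    for agent, allowed in policy.items():
        # One group per bot, separated by a blank line for readability.
        lines += ["", f"User-agent: {agent}", "Allow: /" if allowed else "Disallow: /"]
    return "\n".join(lines) + "\n"


# Approach 1: block the training bots named in this article.
PROTECT_EVERYTHING = {"GPTBot": False, "ClaudeBot": False}

# Approach 2: block training, keep real-time discovery agents allowed.
AGENT_SEO = {"GPTBot": False, "ChatGPT-User": True, "PerplexityBot": True}

if __name__ == "__main__":
    print(build_robots_txt(AGENT_SEO))
```

Keeping the policy in one mapping makes it easy to review which bots are allowed and to regenerate the file when a new user-agent appears.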
Example: The Balanced Robots.txt
Here is a modern, balanced robots.txt configuration that blocks aggressive training scrapers but allows real-time answer engines to read your site:
User-agent: *
Allow: /

# Block OpenAI training data scraper
User-agent: GPTBot
Disallow: /

# ALLOW real-time ChatGPT browsing for citations
User-agent: ChatGPT-User
Allow: /

# Block Anthropic training scraper
User-agent: ClaudeBot
Disallow: /

# Allow Perplexity Answer Engine for discoverability
User-agent: PerplexityBot
Allow: /
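One way to sanity-check a configuration like the one above is Python's standard-library `urllib.robotparser`. The sketch below parses the same rules from a string and reports which agents may fetch the homepage. The helper `allowed_agents` and the `https://example.com/` URL are placeholders, and the standard-library parser's matching may not mirror exactly how each crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The balanced configuration from this article, copied as a string.
ROBOTS_TXT = """\
User-agent: *
Allow: /

# Block OpenAI training data scraper
User-agent: GPTBot
Disallow: /

# ALLOW real-time ChatGPT browsing for citations
User-agent: ChatGPT-User
Allow: /

# Block Anthropic training scraper
User-agent: ClaudeBot
Disallow: /

# Allow Perplexity Answer Engine for discoverability
User-agent: PerplexityBot
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False} for whether each agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Googlebot"]
    for agent, ok in allowed_agents(ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Googlebot is included only to confirm that agents without their own group fall through to the `User-agent: *` rule.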
Verifying Your Setup
Our free SEO Agent Scanner checks your robots.txt file for these modern bot rules and flags any rule that accidentally blocks a crucial discovery engine.