robots.txt for AI Crawlers: Stop Being Invisible to ChatGPT

Step-by-step guide to configuring robots.txt for 9 major AI crawlers. Includes 2026 templates to make your content visible to ChatGPT, Claude, and Perplexity.

Tags: robots.txt, AI crawlers, GEO, AI search, technical SEO

Why robots.txt Is the First Gate to AI Visibility

You've invested in great content, built authority, and optimized for SEO. But ChatGPT and Perplexity still don't mention your brand. The culprit might be the most overlooked file on your website — robots.txt.

robots.txt is a plain text file in your site's root directory that tells search engines and AI crawlers which pages they can access. If your robots.txt blocks AI crawlers, your content is effectively invisible to the AI world.
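For context, a minimal robots.txt looks like this (the path and domain are placeholders):

```
# Applies to every crawler that has no more specific rules
User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
```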

AI Crawlers vs. Traditional Crawlers

Traditional search crawlers like Googlebot index content for ranking. AI crawlers serve two distinct purposes:

  • Training crawlers: Collect web content to train large language models (e.g., GPTBot gathers data for OpenAI's model training)
  • Search/retrieval crawlers: Fetch content in real-time to answer user queries (e.g., ChatGPT-User retrieves fresh information when users ask questions)

This distinction matters because you can make granular decisions: allow AI to cite your content in answers while blocking your data from model training.

The Data Tells a Stark Story

According to research by Paul Calvano, 5.14% of domains block GPTBot. That sounds small, but the impact is dramatic — GPTBot's actual page coverage has plummeted from 84% to just 12% because the sites blocking it tend to be major publishers and high-authority domains.

More critically, sites that block GPTBot see a 73% reduction in citation frequency across ChatGPT responses. When you close the door, AI truly stops mentioning you.

The 9 AI Crawlers You Need to Know in 2026

Here's a comprehensive table of the major AI crawlers currently active:

| Crawler | Company | Purpose | robots.txt Identifier |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Model training | GPTBot |
| ChatGPT-User | OpenAI | Real-time search retrieval | ChatGPT-User |
| OAI-SearchBot | OpenAI | Search functionality | OAI-SearchBot |
| ClaudeBot | Anthropic | Model training | ClaudeBot |
| anthropic-ai | Anthropic | AI training | anthropic-ai |
| Google-Extended | Google | Gemini training | Google-Extended |
| PerplexityBot | Perplexity | Search + training | PerplexityBot |
| Bytespider | ByteDance | Training + search | Bytespider |
| cohere-ai | Cohere | Model training | cohere-ai |

Key insight: ClaudeBot's training crawler is blocked by a staggering 69% of websites. AI training traffic accounts for 42% of all AI crawler requests. Most sites selectively block training crawlers while keeping search crawlers accessible.

Three robots.txt Strategies: Pick Yours

Strategy 1: Allow Everything (Recommended for SMBs)

If maximum AI visibility is your goal, let all AI crawlers access your content freely:

# AI Crawlers - Allow All
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

Best for: Brand websites, content sites, and SaaS product pages that want to be recommended by AI assistants. For small and mid-size brands, the indirect brand exposure from appearing in training data far outweighs the "data used for training" risk.

Strategy 2: Block Training, Allow Search (Recommended for Publishers)

Allow AI to cite your content when answering questions, but prevent it from being used to train models:

# Block Training Crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: cohere-ai
Disallow: /

# Allow Search/Retrieval Crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Best for: News outlets, paywalled content platforms, and large publishers who want AI citations without contributing to model training.

Strategy 3: Block Everything (Not Recommended)

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Warning: This strategy effectively erases your brand from AI search. Given the 73% citation reduction data, this approach is only justified for sites with strict copyright protection requirements.

5-Minute Fix: Check and Update Your robots.txt

Step 1: Check Your Current Configuration

Visit https://yourdomain.com/robots.txt in your browser. Look for any rules targeting AI crawlers. If there's no mention of GPTBot, ClaudeBot, etc., those crawlers fall back to your catch-all User-agent: * rules — and if there's no catch-all either, access is allowed by default. Either way, explicit declarations are better practice.
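You can also check programmatically with Python's built-in urllib.robotparser. A sketch, run here against an inline sample rather than a live site (paste in your own file's contents, or use the parser's set_url/read methods to fetch the real thing):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content; replace with your live file's contents.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

for agent in AI_CRAWLERS:
    allowed = parser.can_fetch(agent, "/any-page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this sample, GPTBot reports blocked while the other three fall through to the catch-all rule and report allowed.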

Step 2: Run an AI Visibility Audit

Use RankWeave's free AI visibility audit to instantly check whether your robots.txt is AI-crawler friendly. The tool analyzes your robots.txt and identifies which AI crawlers are blocked.

Step 3: Edit Based on Your Strategy

Choose your strategy and edit the robots.txt file in your site's root directory. Here's how on popular platforms:

  • WordPress: Install Yoast SEO or Rank Math, then edit robots.txt under Tools → File Editor
  • Shopify: Online Store → Themes → Edit code → add or edit the robots.txt.liquid template
  • Next.js / Nuxt: Create or modify the robots.txt file directly in the public directory
  • Wix: SEO Settings → robots.txt editor

Step 4: Verify the Changes

After editing, revisit https://yourdomain.com/robots.txt to confirm the changes are live. Then run RankWeave's audit again to verify all AI crawlers show the expected status.
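To double-check that the live file matches the strategy you picked, a small sketch (the expectations below assume Strategy 2; edit the dict for yours, and paste in your real robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Expected access per crawler for Strategy 2 (block training, allow search).
EXPECTED = {
    "GPTBot": False,
    "ClaudeBot": False,
    "Google-Extended": False,
    "ChatGPT-User": True,
    "OAI-SearchBot": True,
    "PerplexityBot": True,
}

# Paste your live robots.txt contents here.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

mismatches = [
    agent for agent, want in EXPECTED.items()
    if parser.can_fetch(agent, "/") != want
]
print("OK" if not mismatches else f"Mismatched crawlers: {mismatches}")
```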

Advanced: The Cloudflare Pitfall

If you use Cloudflare, watch out for these common issues:

Bot Fight Mode May Block AI Crawlers

Cloudflare's Bot Fight Mode and Super Bot Fight Mode actively intercept traffic they consider malicious automation. The problem: some AI crawlers may be misclassified as malicious bots and blocked — even if your robots.txt explicitly allows them.

Fix: In your Cloudflare dashboard under Security → Bots, review Bot Fight Mode settings. If you see 403 errors in AI crawler logs, consider adding known AI crawler IP ranges to your allowlist.
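One way to spot this is to scan your access logs for 4xx responses served to AI crawler user agents. A sketch against sample combined-log-format lines (field positions vary by log format, so adjust the regex for yours):

```python
import re

AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]

# Sample lines in combined log format; read these from your real log file.
LOG_LINES = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 403 0 "-" "Mozilla/5.0 GPTBot/1.0"',
    '198.51.100.7 - - [10/Jan/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 200 5120 "-" "Mozilla/5.0 ClaudeBot/1.0"',
]

# Matches the status code that follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')

blocked = []
for line in LOG_LINES:
    match = STATUS_RE.search(line)
    if not match:
        continue
    status = int(match.group(1))
    agent = next((a for a in AI_AGENTS if a in line), None)
    if agent and 400 <= status < 500:
        blocked.append((agent, status))

print(blocked)
```

With the sample lines above, this reports the GPTBot request that got a 403 while ignoring the successful ClaudeBot hit.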

WAF Rule Conflicts

Cloudflare's Web Application Firewall rules may conflict with AI crawler request patterns, especially when crawlers send high-volume requests in short intervals.

Recommendation: Create WAF exemption rules for known AI crawler User-Agents like GPTBot and ChatGPT-User.
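In Cloudflare's Rules language, the matching expression for such a skip rule might look like this (a sketch; verify the exact user-agent strings against each vendor's documentation before deploying):

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "ChatGPT-User")
or (http.user_agent contains "OAI-SearchBot")
or (http.user_agent contains "PerplexityBot")
```

Attach it to a custom rule with the Skip action so matching requests bypass the conflicting WAF rules.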

Cloudflare AI Audit

Cloudflare's AI Audit feature lets you see which AI crawlers visit your site and how many pages they crawl, directly from your dashboard. This is far more convenient than parsing server logs manually.

After robots.txt: What's Next?

Getting robots.txt right is step one. Once AI crawlers can access your content, you need to make sure they understand it:

  1. Add structured data: Use Schema.org JSON-LD to help AI engines parse your content. Pages with structured data are 2.5x more likely to be cited by AI. Read our Schema.org Structured Data Guide.

  2. Build knowledge graph presence: Create a Wikidata entry for your brand so AI systems can verify your identity through this trusted source. See our Wikidata Brand Guide.

  3. Full GEO optimization: From technical foundations to content strategy, systematically boost your AI visibility. Learn what GEO is and explore our AI Search Optimization Guide.
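For step 1, a minimal Schema.org Article snippet in JSON-LD looks like this (all values are placeholders for your own content):

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "robots.txt for AI Crawlers",
  "author": { "@type": "Organization", "name": "Your Brand" },
  "datePublished": "2026-01-10"
}
```

Embed it in a `<script type="application/ld+json">` tag in the page's head.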

Remember: robots.txt determines whether AI can see you. Structured data determines whether AI can understand you. Knowledge graphs determine whether AI trusts you. All three are essential.

Run a free audit with RankWeave to see how your website looks through AI crawlers' eyes.