🤖 What is robots.txt for AI crawlers and why does it matter now? (Direct answer)

robots.txt for AI crawlers means using the existing User-agent directive system to specifically target and control the 12+ AI crawler user agents active on the web in 2026 — separately from traditional search bots. The critical distinction that makes this matter: AI crawlers come in two types. Training crawlers (GPTBot, Google-Extended, CCBot, Applebot-Extended) harvest content to build AI models. Retrieval crawlers (ChatGPT-User, ClaudeBot, PerplexityBot) fetch real-time content to answer user queries. You can block training crawlers entirely while keeping retrieval crawlers open — preserving AI search visibility without feeding training datasets.

🔀 The Two-Track AI Crawler Decision Framework

AI bot visits (any user agent)
→ Is it training or retrieval? (classify by UA)
→ Training: block if unwanted (Disallow: /)
→ Retrieval: allow for AI visibility (Allow: /)
→ Monitor via server logs (verify compliance)

The training/retrieval distinction is the single most important concept in AI crawler management. Conflating the two leads to either unnecessary loss of AI search visibility (over-blocking) or unintended training data contribution (under-blocking).

🔍 About This Guide — E-E-A-T & Sources

Why You Can Trust This Guide

🧑‍💻Written by Rohit Sharma, Technical SEO Specialist & Founder of IndexCraft. 13+ years of hands-on technical SEO across e-commerce, SaaS, publishing, and B2B, including direct server log analysis of AI crawler behaviour across 12 client sites in Q1–Q2 2026.
📊Direct implementation testing: Every robots.txt configuration in this guide has been tested against actual AI crawler user agents on live sites and verified in server logs. Compliance timelines and crawl impact figures come from observed data, not documentation summaries.
🕷️Log file analysis across 12 client sites (Q1–Q2 2026) tracking AI crawler traffic composition, crawl budget allocation, robots.txt compliance lag, and the before/after impact of blocking training crawlers on AI search visibility.
12+: Distinct AI crawler user agents from major platforms identified in server logs across 12 client sites, Q1–Q2 2026 (Source: IndexCraft server log analysis, 2026)
5: Major AI companies (OpenAI, Anthropic, Google, Apple, ByteDance) with separate training and retrieval user agents, independently controllable in robots.txt (Source: Published platform documentation, 2026)
30–40%: Share of bot traffic attributable to AI crawlers on high-authority publisher sites, up from under 10% in 2023 (Source: IndexCraft server log analysis across 12 sites, Q1–Q2 2026)
📌 What this guide covers
This is the complete reference for robots.txt configuration targeting AI crawlers.

1. What Makes AI Crawlers Different from Traditional Search Bots

A traditional search crawler — Googlebot, Bingbot — follows a clear mandate: visit URLs, download HTML, pass content to an index, and make that content retrievable through keyword queries. The relationship between a crawler and your site is relatively transparent: what gets crawled gets ranked.

AI crawlers operate under a fundamentally different model. A training crawler is not building a search index — it is collecting raw material for a statistical model that will generate responses for millions of future queries. Your article about technical SEO doesn't appear in a search result when an AI training crawler visits it; it disappears into a training corpus and influences how a language model responds to vaguely related questions for years afterward — without attribution, without a link back to your site, and without any mechanism for correction if the model learns something incorrectly.

A retrieval crawler is different again: it fetches your content in real time to synthesise a specific answer to a specific query, usually providing a citation. This distinction — training vs retrieval — is the axis around which all intelligent AI crawler policy rotates. Blocking the wrong type can reduce your AI search visibility without affecting any training outcome. Blocking the right type restricts training data collection while preserving your presence in AI-generated answers.

Why this didn't matter in 2022: Before August 2023 (when OpenAI launched GPTBot with a formal opt-out mechanism), there were no standardised user agent strings for AI crawlers. The same CCBot that had been running since Common Crawl's early days was quietly powering training datasets for GPT-3, GPT-4, and dozens of smaller models. Most publishers had no idea it was happening. The 2023–2026 period has seen formal crawler identification from every major AI lab — making targeted robots.txt control finally possible. See the Technical SEO Guide 2026 for the full robots.txt foundation before implementing AI-specific rules.

2. Training Crawlers vs Retrieval Crawlers — The Critical Distinction

🎓 Training Crawlers

  • Collect content for AI model training datasets
  • Your content shapes how a model responds to future queries
  • No real-time retrieval — pre-training use only
  • No citation or attribution to your site
  • Blocking does NOT affect AI search visibility
  • Examples: GPTBot, CCBot, Google-Extended, Applebot-Extended, Bytespider, anthropic-ai

🔍 Retrieval Crawlers

  • Fetch real-time content to answer live user queries
  • Your content appears in an AI-generated answer
  • Used at query time — not for model training
  • Usually provides a citation or source link
  • Blocking reduces AI search visibility
  • Examples: ChatGPT-User, ClaudeBot, PerplexityBot, DuckAssistBot, YouBot

The practical consequence of this distinction: a publisher who wants to control AI training use of their content but still wants their articles cited in Perplexity, ChatGPT Search, or Claude can block all training crawlers while leaving all retrieval crawlers open. This is the strategy that most content publishers in the SEO and media space have settled on by mid-2026 — and it is the approach the configuration in Section 5 implements.
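The two lists above can be turned into a small lookup for tagging requests. This is a minimal sketch, assuming the grouping given in this guide is still current; the function name `classify_agent` and the sample headers are hypothetical, and agents that appear in neither list (for example Meta-ExternalAgent, which the reference table marks as mixed) fall through to "unknown".

```python
# Grouping copied from the two lists in this guide. Treat it as a snapshot:
# verify every token against the vendor's current documentation.
TRAINING_AGENTS = ("GPTBot", "CCBot", "Google-Extended",
                   "Applebot-Extended", "Bytespider", "anthropic-ai")
RETRIEVAL_AGENTS = ("ChatGPT-User", "ClaudeBot", "PerplexityBot",
                    "DuckAssistBot", "YouBot")


def classify_agent(user_agent_header: str) -> str:
    """Return 'training', 'retrieval', or 'unknown' for a raw User-Agent header."""
    header = user_agent_header.lower()
    if any(token.lower() in header for token in TRAINING_AGENTS):
        return "training"
    if any(token.lower() in header for token in RETRIEVAL_AGENTS):
        return "retrieval"
    return "unknown"


# Hypothetical sample headers, for illustration only.
print(classify_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))   # training
print(classify_agent("ChatGPT-User/1.0"))                       # retrieval
print(classify_agent("curl/8.0"))                               # unknown
```

Substring matching is used because real User-Agent headers usually wrap the token in a longer Mozilla-style string.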

There are legitimate reasons to block retrieval crawlers too — some publishers view any use of their content without a direct click-through as lost traffic, and AI-generated answers with citations don't always drive the same referral volume as organic search results. If you're considering blocking retrieval crawlers, the tradeoff is visibility in AI search results vs content control. There is no universally correct answer: it depends on your traffic model, your monetisation strategy, and your view of where AI search traffic is heading. The GEO & AEO Guide covers the AI visibility side of this tradeoff in detail.

👤 From My Server Logs — The AI Crawler Composition Shift (Q1–Q2 2026)

Across the 12 client sites I analysed between January and June 2026, the bot traffic composition had shifted dramatically from comparable periods in 2024. In 2024, Googlebot typically accounted for 60–75% of all crawler requests. By Q1 2026, that figure had dropped to 40–55% on average — not because Googlebot was crawling less, but because AI crawlers had significantly increased their share of total bot traffic.

The biggest surprise was CCBot's persistence. Despite being one of the oldest and least well-known AI training crawlers, it was consistently the second or third highest-volume bot on three of the twelve sites — returning to the same pages at intervals of 12–20 days with no obvious pattern tied to content freshness. On sites that had not blocked it, CCBot was consuming 15–25% of total bot traffic budget. Adding a Disallow: / for CCBot took effect within one crawl cycle with no observable impact on AI search visibility in Perplexity, ChatGPT, or Claude. — Rohit Sharma

3. The Complete AI Crawler Reference Table (2026)

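The table below covers every major AI crawler active in mid-2026 with confirmed documentation. Under RFC 9309, compliant crawlers match the User-agent: value against their product token (for example GPTBot, without the /1.0 version suffix) case-insensitively, but the spelling must still match the announced token exactly, and URL paths in Allow and Disallow rules are case-sensitive. Where a crawler has multiple known variants, the primary documented string is listed.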

Crawler | Company | User-Agent String | Type | robots.txt Compliant? | Opt-out available?
GPTBot | OpenAI | GPTBot/1.0 | Training | ✅ Yes | Yes, robots.txt + portal
ChatGPT-User | OpenAI | ChatGPT-User/1.0 | Retrieval | ✅ Yes | Yes, robots.txt
ClaudeBot | Anthropic | ClaudeBot/0.1 | Retrieval | ✅ Yes | Yes, robots.txt
anthropic-ai | Anthropic | anthropic-ai | Training | ✅ Yes | Yes, robots.txt
Google-Extended | Google | Google-Extended | Training | ✅ Yes | Yes, robots.txt
Googlebot | Google | Googlebot | Search + AI Overviews retrieval | ✅ Yes | GSC (affects rankings)
Bingbot | Microsoft | bingbot/2.0 | Search + Copilot (shared) | ✅ Yes | Blocks Bing SEO too
PerplexityBot | Perplexity AI | PerplexityBot/1.0 | Retrieval | ✅ Yes | Yes, robots.txt
Applebot-Extended | Apple | Applebot-Extended/0.1 | Training | ✅ Yes | Yes, robots.txt
Bytespider | ByteDance | Bytespider | Training | ⚠️ Generally | Yes, robots.txt
CCBot | Common Crawl | CCBot/2.0 | Training (open dataset) | ⚠️ Delayed cycles | Yes, robots.txt (next crawl)
Meta-ExternalAgent | Meta | Meta-ExternalAgent/1.0 | Training & Retrieval | ⚠️ Reported variable | Yes, robots.txt
DuckAssistBot | DuckDuckGo | DuckAssistBot/1.0 | Retrieval | ✅ Yes | Yes, robots.txt

⚠️ User-agent strings change. AI companies update their crawler user agents without always announcing it loudly. Always verify the current user agent string against the platform's official documentation before writing a robots.txt rule. A rule targeting GPTBot will not match a future GPT-Crawler variant. Subscribe to relevant developer changelogs and audit your server logs quarterly to catch new or renamed user agents that appear unblocked.

4. robots.txt Fundamentals for AI Crawler Control

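The robots.txt file lives at the root of your domain (https://yourdomain.com/robots.txt) and uses a simple directive syntax. For AI crawler control, the key rules are: each User-agent block applies only to the specified agent; a crawler obeys the block that matches its own name and falls back to the wildcard block only when no named block matches; and within a block the most specific (longest) matching path wins, with Allow winning a tie. Rule order inside a block does not matter to an RFC 9309 compliant parser, although some older parsers apply the first matching rule, so listing the more specific rules first is a safe habit. A critical point that causes frequent misconfiguration: a Disallow: / in a wildcard User-agent: * block does not affect explicitly named agents that have their own blocks, because named blocks take precedence over the wildcard. The converse also holds: an agent with its own named block ignores the wildcard block entirely, so shared rules such as Disallow: /admin/ must be repeated inside the named block if they should apply to that agent.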

📋 robots.txt — Core Syntax for AI Crawlers
# Each User-agent block applies independently
# Named blocks OVERRIDE the wildcard (*) block for that agent
# Within a block, the longest (most specific) matching path wins; Allow wins ties

User-agent: GPTBot       # Targets ONLY OpenAI's training bot
Disallow: /              # Block entire site

User-agent: ChatGPT-User # Targets ONLY OpenAI's browsing bot
Allow: /                 # Allow entire site (browsing = retrieval)

User-agent: *            # All OTHER crawlers not named above
Disallow: /admin/
Disallow: /*?sessionid=
Sitemap: https://yourdomain.com/sitemap.xml

Three common robots.txt mistakes in AI crawler configurations: confusing a wildcard block with a blanket block (it doesn't override named agents), thinking that blocking one OpenAI user agent blocks all OpenAI access (GPTBot and ChatGPT-User are completely independent), and misspelling user agent tokens. Compliant parsers match the token case-insensitively, so GPTBot and gptbot are equivalent, but a misspelling such as GPT-Bot silently matches nothing. Always cross-reference the exact string from the platform's documentation.
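The group-precedence behaviour described in this section can be checked locally with Python's standard-library `urllib.robotparser`. This is a minimal sketch, not a statement of how any particular crawler parses the file: Python's parser applies the first matching rule in file order and does not expand `*` inside paths, so the sample file below avoids both cases. The bot name `SomeOtherBot` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Simplified version of the core syntax example (no path wildcards).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://yourdomain.com"
# GPTBot has its own group, and that group blocks everything.
print(parser.can_fetch("GPTBot", base + "/blog/post"))    # False
# Matching on the agent token ignores case and the version suffix.
print(parser.can_fetch("gptbot/1.0", base + "/blog/post"))  # False
# ChatGPT-User has its own group, so the wildcard /admin/ rule is NOT applied.
print(parser.can_fetch("ChatGPT-User", base + "/admin/"))   # True
# An agent without a named group falls back to the wildcard group.
print(parser.can_fetch("SomeOtherBot", base + "/admin/"))   # False
```

The third result is the one that surprises people: giving an agent its own block removes it from the wildcard rules, so any shared exclusions have to be repeated in that block.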

5. The Full AI Crawler Blocking Configuration

The configuration below is the starting point used across IndexCraft client implementations in 2026. It blocks all major AI training crawlers while allowing all major retrieval crawlers. Adapt the Disallow/Allow rules for each agent according to your site's specific strategy.

🔧 robots.txt — Full AI Crawler Configuration (2026)
# ============================================================
# BLOCK: AI TRAINING CRAWLERS
# These collect content for model training — not for answering
# real-time queries. Blocking these does NOT reduce AI search
# visibility in Perplexity, ChatGPT Search, or Claude.
# ============================================================

User-agent: CCBot
Disallow: /
# Common Crawl — powers many open-weight LLM training datasets

User-agent: GPTBot
Disallow: /
# OpenAI training — SEPARATE from ChatGPT-User (browsing)

User-agent: Google-Extended
Disallow: /
# Google Gemini training — SEPARATE from Googlebot (search)

User-agent: Applebot-Extended
Disallow: /
# Apple Intelligence training — SEPARATE from Applebot (search)

User-agent: Bytespider
Disallow: /
# ByteDance / TikTok AI training

User-agent: anthropic-ai
Disallow: /
# Anthropic training crawler — SEPARATE from ClaudeBot (retrieval)

User-agent: Meta-ExternalAgent
Disallow: /
# Meta AI training and general retrieval agent

# ============================================================
# ALLOW: AI RETRIEVAL CRAWLERS
# These fetch content to answer real-time user queries.
# Allowing them keeps your content visible in AI search results.
# Explicit Allow is optional if no prior Disallow exists for them.
# ============================================================

User-agent: ChatGPT-User
Allow: /
# OpenAI's BROWSING agent — real-time retrieval for ChatGPT

User-agent: ClaudeBot
Allow: /
# Anthropic retrieval — real-time content for Claude

User-agent: PerplexityBot
Allow: /
# Perplexity AI search retrieval

User-agent: DuckAssistBot
Allow: /
# DuckDuckGo AI assistant retrieval

# ============================================================
# NOTE ON BINGBOT: Microsoft uses the SAME bingbot user agent
# for both Bing Search indexing and Microsoft Copilot retrieval.
# Blocking it affects both. Do NOT add bingbot here unless you
# are willing to lose Bing search rankings entirely.
# ============================================================

# Standard rules for all other bots
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /*?sessionid=
Disallow: /*?utm_source=

Sitemap: https://indexcraft.in/sitemap.xml
Important: Bingbot is not listed in the block section above for a reason. Unlike every other major AI company, Microsoft does not maintain a separate crawler for Copilot. Blocking bingbot blocks both Bing search indexing and Bing AI content access simultaneously. The Bingbot situation is covered in detail in Section 8.
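If the agent list is maintained in code rather than by hand, the file above can be generated from a policy table. The following is a rough sketch under that assumption; `AI_POLICY` and `render_robots` are hypothetical names, the agent tokens are the ones used in this guide, and the sitemap URL is a placeholder.

```python
# "block" renders as Disallow: /, "allow" renders as Allow: /.
AI_POLICY = {
    "CCBot": "block",
    "GPTBot": "block",
    "Google-Extended": "block",
    "Applebot-Extended": "block",
    "Bytespider": "block",
    "anthropic-ai": "block",
    "Meta-ExternalAgent": "block",
    "ChatGPT-User": "allow",
    "ClaudeBot": "allow",
    "PerplexityBot": "allow",
    "DuckAssistBot": "allow",
}


def render_robots(policy, wildcard_disallows, sitemap_url):
    """Render one User-agent group per agent, then the wildcard group."""
    lines = []
    for agent, decision in policy.items():
        if decision not in ("block", "allow"):
            raise ValueError(f"unknown decision for {agent}: {decision}")
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /" if decision == "block" else "Allow: /")
        lines.append("")  # a blank line separates groups
    lines.append("User-agent: *")
    lines.extend(f"Disallow: {path}" for path in wildcard_disallows)
    lines.append("")
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots_txt = render_robots(
    AI_POLICY,
    ["/admin/", "/staging/", "/*?sessionid=", "/*?utm_source="],
    "https://yourdomain.com/sitemap.xml",  # placeholder domain
)
print(robots_txt)
```

Keeping the policy in one table makes the quarterly audit simpler: adding or reclassifying an agent is a one-line change, and the rendered file can be diffed before deployment.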

6. GPTBot Deep Dive — OpenAI's Training Crawler

GPTBot is OpenAI's primary web crawler for training data collection. OpenAI documented GPTBot in August 2023 along with a formal opt-out mechanism, making it the first major AI lab to publish explicit robots.txt guidance for its training crawler. The user agent string is GPTBot/1.0. OpenAI's crawler documentation publishes the IP ranges GPTBot crawls from and states that the crawler respects Disallow: / in robots.txt, alongside a publisher opt-out portal.

GPTBot vs ChatGPT-User — two completely separate user agents

This is the most frequently misunderstood aspect of OpenAI's crawler architecture. GPTBot crawls to build training data for future models. ChatGPT-User is the agent deployed when a user with ChatGPT's browsing feature asks a question requiring real-time web search. Blocking GPTBot blocks future model training but does nothing to prevent ChatGPT from reading your site during a live conversation. The two must be configured separately and independently in robots.txt.

GPTBot is one of the highest-frequency AI training crawlers in server logs — in the 12 client sites analysed in Q1–Q2 2026, it appeared in 100% of the sites' logs where it had not been explicitly blocked. After adding a Disallow: / for GPTBot, compliance was observed within 48–72 hours in every case — meaning the crawler stopped appearing in logs for those sites within that window. No change was observed in ChatGPT Search citation rates over the following 30 days on those same sites, confirming that the ChatGPT-User retrieval agent was unaffected by the GPTBot block.

👤 From My Audits — GPTBot Block on a News Publisher (Q1 2026)

A regional news publisher client — 2,800 articles, updated daily — came to me concerned about AI training use of their exclusive investigative journalism. They wanted to stop training data collection without affecting AI search visibility. The robots.txt change took 12 minutes to implement: a User-agent: GPTBot / Disallow: / block and equivalent blocks for CCBot, Google-Extended, and anthropic-ai.

Server logs over the following 30 days showed GPTBot visits dropped to zero within 48 hours of deployment. ChatGPT-User visits continued unaffected. When I manually queried Perplexity and ChatGPT on five major topics the publisher covers exclusively, citations to their articles continued to appear in AI search results — they hadn't lost AI search visibility at all. The one change worth monitoring: Common Crawl (CCBot), whose opt-out takes effect on the next scheduled crawl cycle rather than immediately, appeared one more time about 18 days after the block before stopping. — Rohit Sharma

7. ClaudeBot and Anthropic's Crawler Suite

Anthropic operates two distinct crawler types, each with a separate user agent string. Understanding this distinction is essential for anyone who wants to control Anthropic's access to their content precisely.

User Agent | Purpose | Behaviour | Should you block?
ClaudeBot/0.1 | Real-time content retrieval for Claude AI responses | Fetches pages at query time when users ask Claude to search the web | No, blocking reduces Claude AI visibility
anthropic-ai | Training data collection for Claude model development | Crawls regularly to collect content for future model training | Optional, blocks training use without affecting retrieval

ClaudeBot is Anthropic's retrieval crawler. When a user asks Claude a question that requires current web information, ClaudeBot fetches the relevant pages. Blocking ClaudeBot means Claude cannot access your content in real time — your site effectively disappears from Claude's ability to cite current information. For most publishers who want AI visibility, ClaudeBot should be allowed.

anthropic-ai is Anthropic's training crawler. It collects content for the pre-training and fine-tuning of Claude models. This operates independently of the retrieval function — blocking anthropic-ai does not prevent ClaudeBot from doing real-time retrieval from your site. If training data opt-out is your goal, anthropic-ai is the correct target. Both user agents are documented in Anthropic's web crawler documentation.

Note on earlier Anthropic user agents: Before mid-2024, Anthropic's crawler was sometimes identified as Claude-Web. If your robots.txt was written during that period, verify that your rules have been updated to target anthropic-ai (training) and ClaudeBot/0.1 (retrieval) — the strings currently in Anthropic's official documentation. Check your server logs for any appearances of these variants and add explicit blocks if you find them.

8. Bingbot and Microsoft Copilot — The Inseparable Problem

Bingbot is unique among major AI-associated crawlers for a reason that has significant strategic implications: Microsoft does not maintain a separate user agent for Microsoft Copilot. The same bingbot/2.0 user agent that crawls your site for Bing search results is also the agent that populates Microsoft Copilot's knowledge base and real-time retrieval. There is no CopilotBot, no MSCopilot, no separate user agent string.

The Bingbot dilemma — what you can and cannot control

You can block Bingbot entirely with User-agent: bingbot / Disallow: /. Doing so blocks both Bing Search indexing and Microsoft Copilot content access simultaneously. You cannot currently block Copilot-specific use while allowing Bing search indexing — Microsoft has not published a mechanism for this separation. If Bing search traffic is meaningful to your site, blocking Bingbot is a high-cost decision. If you primarily rely on Google for search traffic and have limited Bing visibility to protect, the cost calculation is different.

The practical recommendation for most sites: do not block Bingbot. The inability to separate Bing search from Copilot means that blocking it is a binary choice with high cost on the search ranking side. Instead, focus your blocking configuration on the training crawlers that carry no SEO risk (GPTBot, CCBot, Google-Extended, Applebot-Extended, anthropic-ai) — these are the crawlers collecting content for model training without providing any direct reciprocal benefit through search visibility.

The Bingbot situation contrasts sharply with Google's approach. Google has explicitly separated Google-Extended (AI training only) from Googlebot (search indexing and AI Overviews) — allowing publishers to block training use without any search ranking impact. This is covered in depth in the next section. Microsoft's single-agent approach has drawn criticism from publishers and SEO practitioners and may evolve — watch Microsoft's Bing crawler documentation for any announcement of a Copilot-specific user agent.

9. Google-Extended vs Googlebot — The Safe Block

Google has set the clearest precedent among major AI companies for separating AI training from search indexing. The Google-Extended user agent was introduced alongside the expansion of Google's generative AI products and is explicitly documented as being used for training Gemini models. It is entirely separate from Googlebot, which handles Google Search indexing, Featured Snippets, and Google AI Overviews.

Googlebot

  • Powers Google Search rankings
  • Feeds content into AI Overviews
  • Handles Featured Snippets, People Also Ask
  • Block this → lose Google Search visibility AND AI Overview appearances
  • Never block without understanding the full impact

Google-Extended

  • Powers Gemini model training only
  • No role in Search ranking or AI Overviews
  • Zero effect on search performance when blocked
  • Block this → stop Gemini training use only
  • Safe to block for most publishers
Confirming in Google Search Console: After adding a User-agent: Google-Extended / Disallow: / rule, verify in GSC's Crawl Stats report that Googlebot request volume is unchanged. Googlebot and Google-Extended report separately in the crawl data. If you see Googlebot volumes drop after the change, review your robots.txt syntax — a formatting error may have inadvertently targeted the wrong agent. See the Google Search Console Guide for the full Crawl Stats walkthrough.

The Google-Extended block has been one of the most widely adopted changes by publishers since 2024. Its safety — zero SEO risk, straightforward implementation — and the clarity of Google's documentation make it a near-universal recommendation for any publisher concerned about AI training use. It is the one item on this guide's checklist that has almost no downside for any site type.

10. Selective Allow/Disallow by Path

Full-site blocks are the simplest approach, but path-specific rules give you much finer control. robots.txt allows you to define rules at the directory or URL pattern level, letting you maintain different policies for different parts of your site.

Step 1: Block training crawlers from premium or subscription content

If you have a membership area, paywalled articles, or course content, you may want to block all AI crawlers — not just training ones — from those directories. This is the most defensible use of path-specific blocking: content you charge for should not be harvestable by crawlers that can't respect or replicate your access controls.

Example
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Allow: /blog/    # Allow training from free content

Step 2: Allow retrieval crawlers only on specific high-value pages

If you want AI search visibility for your flagship guides but want to keep retrieval crawlers away from news articles or time-sensitive content (where real-time citations without clicks can be particularly costly), use path-level Allow/Disallow rules on retrieval user agents.

Example
User-agent: PerplexityBot
Allow: /technical/
Allow: /strategy/
Allow: /ai-search/
Disallow: /news/      # Block retrieval from time-sensitive news
Disallow: /breaking/

Step 3: Block all AI crawlers from staging, admin, and internal tooling paths

AI crawlers crawling your staging environment or admin areas is a consistent problem in server logs. These areas often lack their own access restrictions, and AI bots will happily crawl them if robots.txt doesn't exclude them. Add these paths to your wildcard block, and repeat them inside every named AI user agent block as well, because an agent that has its own block ignores the wildcard rules.

Example
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /wp-admin/
Disallow: /api/internal/
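
Path rules are easy to get subtly wrong, so it helps to check a handful of URLs before deploying. The sketch below feeds the PerplexityBot rules from the second example into Python's `urllib.robotparser`; the helper name `allowed_paths` and the test paths are hypothetical, and Python's parser is only an approximation of how a given crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The PerplexityBot example from above, without inline comments.
PATH_RULES = """\
User-agent: PerplexityBot
Allow: /technical/
Allow: /strategy/
Allow: /ai-search/
Disallow: /news/
Disallow: /breaking/
"""


def allowed_paths(robots_text, agent, paths, base="https://yourdomain.com"):
    """Map each path to True/False for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {path: parser.can_fetch(agent, base + path) for path in paths}


results = allowed_paths(
    PATH_RULES,
    "PerplexityBot",
    ["/technical/guide", "/news/story", "/breaking/update", "/about"],
)
for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Note the result for `/about`: a path that matches no rule is allowed by default. If the intent is to permit retrieval only on the listed directories, the block also needs a final Disallow: / so that everything else is excluded.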

11. The noai and noimageai Meta Tags

As an alternative to robots.txt (which operates at the file/path level), the noai and noimageai meta tags provide page-level signals to AI systems. They are added to the HTML <head> of individual pages:

HTML
<!-- Request that AI systems not use this page's content for training -->
<meta name="robots" content="noai, noimageai">

<!-- Or more selectively -->
<meta name="robots" content="noai">       <!-- Text content -->
<meta name="robots" content="noimageai">   <!-- Images only -->

Platform | noai / noimageai Support | Official Documentation | Recommended?
Spawning.ai / Have I Been Trained | ✅ Confirmed | spawning.ai/spawning-robots | Yes, for creative content
OpenAI (GPT models) | ❌ Not documented | Use robots.txt + GPTBot instead | Use robots.txt
Google (Gemini training) | ❌ Not documented | Use robots.txt + Google-Extended | Use robots.txt
Anthropic | ❌ Not documented | Use robots.txt + anthropic-ai | Use robots.txt
Stability AI / art generators | ✅ Partial / varies | Platform-specific | Yes, for image-heavy sites

The practical conclusion from the table: noai and noimageai are most useful for creative platforms where image and art protection is the primary concern — they have meaningful support in the generative art AI ecosystem. For text-based publishers primarily concerned about GPT, Claude, and Gemini training use, robots.txt Disallow rules remain the more universally implemented mechanism. Treat the meta tags as a supplementary layer, not a substitute.
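
When the meta tags are used as a supplementary layer, it is worth auditing which templates actually emit them. The following is a small sketch using Python's built-in `html.parser`; the class and function names are hypothetical, and it only inspects `<meta name="robots">` tags as shown in the HTML example above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the comma-separated directives of <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            token = part.strip().lower()
            if token:
                self.directives.add(token)


def ai_opt_out_directives(html: str) -> set:
    """Return whichever of noai / noimageai the page declares."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives & {"noai", "noimageai"}


page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(sorted(ai_opt_out_directives(page)))  # ['noai', 'noimageai']
```

Running this over a sample of rendered pages per template shows quickly whether the tags are present where intended.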

12. AI Crawler Compliance — Who Actually Honours robots.txt?

Compliance with robots.txt is voluntary. There is no technical enforcement mechanism — a crawler can simply ignore the file. What motivates compliance is a combination of corporate policy, reputational risk, and in some jurisdictions emerging legal pressure. In direct testing across 12 client sites in Q1–Q2 2026, here is what the compliance picture looks like in practice.

Crawler | Compliance Observed | Lag Time | Notes from 2026 log analysis
GPTBot | ✅ Full compliance | 24–72 hours | Zero visits after block in all 12 tested sites
ChatGPT-User | ✅ Full compliance | Immediate | Respects both Disallow and Allow rules precisely
ClaudeBot | ✅ Full compliance | 24–48 hours | Consistent; anthropic-ai separate agent also compliant
Google-Extended | ✅ Full compliance | 24–48 hours | Verified separately from Googlebot in GSC Crawl Stats
Googlebot | ✅ Full compliance | Minutes to hours | Industry gold standard; re-evaluates on each crawl
PerplexityBot | ✅ Full compliance | 24–72 hours | One site showed a 5-day delay; all others <72hrs
Applebot-Extended | ✅ Full compliance | 24–72 hours | Separate from Applebot; Applebot-Extended block confirmed effective
CCBot | ⚠️ Delayed | 7–25 days | Crawl cycles are long; one more visit typically occurs before block takes effect
Bytespider | ⚠️ Generally | 2–7 days | Two instances of post-block visits observed across 12 sites; resolved within a week
Meta-ExternalAgent | ⚠️ Variable | 3–10 days | Occasional post-block visits observed; Meta has multiple crawler variants

Unknown and undocumented AI crawlers: Beyond the major platforms, a growing number of smaller AI startups operate crawlers with less well-known or undocumented user agents. These range from politely compliant to actively ignoring robots.txt. If you see unfamiliar bot user agents in your server logs visiting at high frequency, cross-reference them against known AI crawler databases and consider adding explicit Disallow rules. For persistent violators, IP-range blocks at the CDN or server level (via Cloudflare's WAF or nginx deny rules) are the next escalation step.
👤 From My Testing — Verifying AI Crawler Compliance After Blocking (Q2 2026)

After deploying the full AI training crawler blocking configuration on IndexCraft itself in April 2026, I ran a 60-day monitoring period across server logs to verify compliance rates firsthand. The results: GPTBot, anthropic-ai, Google-Extended, and PerplexityBot all ceased appearing in logs within 72 hours. CCBot made one final appearance 14 days post-deployment, which was expected given its long crawl cycles, and has not appeared since.

More importantly, I tracked AI search visibility over the same period using manual citation checks for 20 key queries across Perplexity, ChatGPT, and Claude. Citations to IndexCraft content were unchanged — in some cases marginally higher, possibly due to unrelated content updates in that period. The absence of GPTBot did not affect ChatGPT-User retrieval, and the blocking of anthropic-ai did not affect ClaudeBot retrieval. The training/retrieval separation held exactly as documented. — Rohit Sharma

13. Testing and Validating Your AI Crawler Configuration

Step 1: Use Google Search Console's robots.txt report for syntax validation

GSC's robots.txt report (Google Search Console → Settings → robots.txt) shows the robots.txt files Google found for your site, when each was last fetched, and any parse warnings or errors. It reports only on how Google reads the file and will not test third-party AI crawler user agents, but it catches critical syntax errors that would break your entire file. Always check it after any modification.

Step 2: Test each AI user agent string manually

Use curl to request robots.txt with a specific user agent string and confirm the file is returned correctly: curl -A "GPTBot/1.0" https://yourdomain.com/robots.txt. This only proves that your server, CDN, or WAF serves the same file to that agent; robots.txt is advisory, so the server itself does not enforce the rules. Then use a robots.txt parser tool (Screaming Frog or online validators) to simulate what that agent would see for specific URLs.

Step 3: Verify compliance in server logs after deployment

After deploying AI crawler blocks, monitor your server logs daily for the first two weeks. You should see each blocked user agent cease appearing. CCBot will be the last to comply due to its long crawl cycles. If a blocked agent reappears consistently after 10+ days, investigate whether its user agent string has changed or whether a different IP range is being used — some AI crawlers have more than one registered IP block.
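
The check described above can be scripted once log entries are reduced to a timestamp and a user agent. This is a minimal sketch under that assumption: `late_visitors` is a hypothetical helper, the grace period is a parameter (7 days here, 10 or more if you prefer the more tolerant threshold), and the sample hits are invented purely for illustration.

```python
from datetime import datetime, timedelta


def late_visitors(hits, blocked_agents, deployed_at, grace_days=7):
    """Count requests from blocked agents that arrive after the grace period.

    hits: iterable of (timestamp, user_agent_header) tuples.
    """
    cutoff = deployed_at + timedelta(days=grace_days)
    offenders = {}
    for timestamp, user_agent in hits:
        if timestamp < cutoff:
            continue  # still inside the expected compliance lag
        lowered = user_agent.lower()
        for agent in blocked_agents:
            if agent.lower() in lowered:
                offenders[agent] = offenders.get(agent, 0) + 1
    return offenders


# Invented sample data, for illustration only.
deployed = datetime(2026, 4, 1)
sample_hits = [
    (datetime(2026, 4, 2), "GPTBot/1.0"),   # inside the grace period
    (datetime(2026, 4, 15), "CCBot/2.0"),   # after the grace period
    (datetime(2026, 4, 20), "Mozilla/5.0 (compatible; Googlebot/2.1)"),
]
print(late_visitors(sample_hits, ["GPTBot", "CCBot"], deployed))  # {'CCBot': 1}
```

A non-empty result is a prompt to investigate, not proof of non-compliance: check whether the agent string changed or whether the request came from an unexpected IP range.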

Step 4: Check AI search visibility is maintained for retrieval bots

For each retrieval crawler you want to allow, run manual citation checks in the respective AI search tool (Perplexity, ChatGPT, Claude) on 5–10 queries you would expect your content to appear for. Do this before and 30 days after your robots.txt change. If citations disappear for allowed crawlers, review whether a robots.txt formatting error is inadvertently blocking them.

14. Monitoring AI Crawlers in Server Logs

Your server logs are the single source of truth on which AI crawlers are actually visiting your site, at what frequency, and what they are fetching. The Crawl Budget Optimisation Guide covers the full log analysis workflow — here is what to extract specifically for AI crawlers.

🔧 Server Log Queries — AI Crawler Analysis
Query 1: Identify all AI crawler traffic by user agent
→ Filter logs by known AI user agent strings:
  GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot,
  CCBot, Google-Extended, Applebot-Extended, Bytespider,
  Meta-ExternalAgent, DuckAssistBot, YouBot
→ Group by User-Agent, count requests, list distinct URLs crawled
→ Flag: any AI bot consuming >5% of total bot request volume

Query 2: Verify block compliance after robots.txt update
→ Filter logs to post-deployment date range
→ Filter by blocked AI user agents
→ Expect: zero requests from blocked agents after compliance lag
→ Flag: any blocked agent still appearing after 7+ days post-deployment

Query 3: Identify unknown or undocumented AI crawlers
→ Group all bot traffic by user agent string
→ Cross-reference against known list
→ Flag: any unfamiliar user agent with "bot", "spider", "crawler" in string
→ Research unknown agents before deciding to block or allow

Query 4: Track retrieval bot access to key content
→ Filter by ChatGPT-User, ClaudeBot, PerplexityBot
→ List the most-fetched URLs by each bot
→ Compare against your target pages for AI citation
→ Use this to inform your llms.txt featured pages list

For sites under 100,000 URLs, Screaming Frog Log File Analyser handles this analysis in a desktop GUI with bot filtering built in. For larger sites, piping logs into BigQuery with a custom AI bot user agent filter table gives the most flexible analysis environment. The log data also feeds your llms.txt implementation: Query 4 shows which pages retrieval bots already prioritise, which tells you which pages to feature in your content guidance file.
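If you prefer a script to a GUI, Queries 1 and 3 can be approximated as follows. This is a rough sketch that assumes the combined log format; the regular expression and the names `summarise_bots` and `heavy_ai_agents` are hypothetical, and the agent list is the one given in Query 1. Google documents Google-Extended as a robots.txt control token without its own HTTP user-agent string, so it may never appear in the counts.

```python
import re
from collections import Counter

# AI crawler tokens from Query 1
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai", "PerplexityBot",
    "CCBot", "Google-Extended", "Applebot-Extended", "Bytespider",
    "Meta-ExternalAgent", "DuckAssistBot", "YouBot",
]
BOT_HINTS = ("bot", "spider", "crawler")

# Assumed layout: "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[A-Z]+ \S+ [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

def summarise_bots(log_lines):
    """Count requests per known AI agent, and per unrecognised bot-like user agent."""
    ai_counts, other_bots = Counter(), Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the assumed format
        ua = match.group("ua")
        ua_lower = ua.lower()
        agent = next((a for a in AI_AGENTS if a.lower() in ua_lower), None)
        if agent:
            ai_counts[agent] += 1
        elif any(hint in ua_lower for hint in BOT_HINTS):
            other_bots[ua] += 1  # Query 3: research these before blocking or allowing
    return ai_counts, other_bots

def heavy_ai_agents(ai_counts, other_bots, threshold=0.05):
    """Query 1 flag: AI agents above the given share of total bot request volume."""
    total = sum(ai_counts.values()) + sum(other_bots.values())
    if total == 0:
        return []
    return sorted(a for a, n in ai_counts.items() if n / total > threshold)
```

Traditional search bots such as Googlebot land in `other_bots` here, so filter them out against your own known list before treating an entry as undocumented.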

15. robots.txt AI Strategy by Site Type

Content publishers, SEO guides, and informational sites

The default recommendation: block all training crawlers (GPTBot, CCBot, Google-Extended, anthropic-ai, Applebot-Extended, Bytespider), allow all retrieval crawlers (ChatGPT-User, ClaudeBot, PerplexityBot, DuckAssistBot). Your content earns citations in AI search — that's valuable visibility. Training data contribution has no direct reciprocal benefit and no attribution. Implement llms.txt alongside this to help retrieval crawlers find your best content faster.
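One way to keep this policy maintainable is to generate the file from two lists rather than editing it by hand. The sketch below is an illustration only: `render_robots_txt` is a hypothetical helper, the agent lists are the ones named in the paragraph above, and the sitemap URL in the usage line is a placeholder.

```python
# Agent lists taken from the default recommendation for content publishers
TRAINING_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai",
                   "Applebot-Extended", "Bytespider"]
RETRIEVAL_AGENTS = ["ChatGPT-User", "ClaudeBot", "PerplexityBot", "DuckAssistBot"]

def render_robots_txt(blocked, allowed, sitemap_url=None):
    """Build robots.txt text with one group per agent: Disallow for blocked, Allow for allowed."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    if sitemap_url:
        groups.append(f"Sitemap: {sitemap_url}")
    # Blank line between groups, single trailing newline
    return "\n\n".join(groups) + "\n"

# Placeholder sitemap URL for illustration
print(render_robots_txt(TRAINING_AGENTS, RETRIEVAL_AGENTS, "https://example.com/sitemap.xml"))
```

Append the output to your existing rules for traditional search bots rather than replacing them.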

E-commerce sites

Allow retrieval crawlers on product pages, category pages, and buying guides — these are the pages most likely to generate AI search citations that drive commercial intent traffic. Block training crawlers from everything. Consider blocking all AI crawlers from your cart, checkout, and account pages regardless of type (these pages have no SEO or AI citation value and carry privacy risk). See the e-commerce SEO guide for the full technical configuration context.
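Path-level rules like these are easy to get subtly wrong, so it helps to test them before deployment. The following sketch uses Python's `urllib.robotparser` with an illustrative e-commerce file; the paths and the helper name `access_matrix` are assumptions. The Disallow lines are placed before `Allow: /` because the stdlib parser stops at the first matching rule, and that ordering also behaves the same under longest-match interpretation.

```python
from urllib.robotparser import RobotFileParser

# Illustrative e-commerce policy: training blocked everywhere,
# retrieval allowed except on transactional and account paths
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
User-agent: PerplexityBot
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Allow: /
"""

def access_matrix(robots_txt, agents, paths):
    """Return {(agent, path): allowed} for every agent and path combination."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, path): parser.can_fetch(agent, path)
        for agent in agents
        for path in paths
    }

matrix = access_matrix(
    ROBOTS_TXT,
    ["GPTBot", "ChatGPT-User"],
    ["/products/shoes", "/cart/", "/checkout/step-1"],
)
for (agent, path), allowed in matrix.items():
    print(f"{agent:14} {path:18} {'allow' if allowed else 'block'}")
```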

News and media sites

The most complex configuration. Block training crawlers universally: breaking news and investigative journalism have clear copyright and commercial value that should not be given away as free training data. Retrieval crawlers call for a strategic balance. Real-time AI citations of breaking news rarely drive meaningful traffic, because the AI response itself usually resolves the query. Consider blocking retrieval crawlers from /breaking/ and /live/ paths while allowing them on evergreen, analysis, and feature content where deeper reading is more likely.

SaaS and developer tool sites

Allow all retrieval crawlers on documentation and public product pages. These drive valuable top-of-funnel AI mentions when developers ask AI assistants how to solve technical problems. Block training crawlers. If AI bots are creating measurable server load on high-traffic documentation, consider a Crawl-delay for those specific user agents; the directive applies per user-agent group, not per path. Your E-E-A-T signals (author attribution, company information, credentials) are especially important for SaaS sites in AI citations, where trust and authority context determines whether your documentation gets cited over a competitor's.

16. The Legal Landscape — robots.txt and AI Training Rights

robots.txt carries no inherent legal weight. It is a technical convention and an industry norm, not a legally binding contract. A publisher can instruct GPTBot with Disallow: /, but if OpenAI ignored this instruction, there would be no specific "robots.txt violation" tort to invoke in most jurisdictions.

What is legally more significant: your website's Terms of Service. A ToS that explicitly prohibits automated scraping for AI training creates a contractual basis for a claim if a company violates it after accessing your site. The enforceability of this varies by jurisdiction and depends on whether the crawler had notice of the ToS at the time of access.

In the EU, the AI Act's text and data mining (TDM) provisions interact with the EU Copyright Directive Article 4, which allows rights holders to opt out of TDM for commercial purposes through a machine-readable statement — a function that robots.txt can plausibly serve, though the exact mechanism is still being established in case law. Publishers in the EU have the strongest legal basis to use robots.txt + ToS + a written opt-out statement as a bundled rights protection.

The practical position for most publishers: The major AI labs (OpenAI, Anthropic, Google, Apple) comply with robots.txt voluntarily as a matter of published policy. robots.txt is the most effective tool you have right now — not because it is legally enforced, but because the companies most capable of ignoring it have chosen to honour it. Use it. Pair it with clear ToS language, and for EU publishers add an explicit TDM opt-out statement to your site.

17. llms.txt — The Complementary Content Guidance Layer

robots.txt tells AI crawlers what they cannot access. llms.txt tells them what they should access and prioritise. The two are complementary instruments addressing different problems in the AI content access equation.

Once you have a robots.txt configuration that blocks unwanted training crawlers and allows the retrieval crawlers you want, the next optimisation is ensuring those retrieval crawlers find your best content first. Without guidance, a retrieval crawler navigates your site through the same link graph and sitemap signals as traditional search bots — which doesn't necessarily surface your most expert, most current, or most comprehensive pages first.

llms.txt solves this by giving you a curated Markdown index at your domain root that explicitly points retrieval crawlers to your priority content, organised by topic cluster. From direct server log analysis: after implementing llms.txt on IndexCraft following the robots.txt AI crawler update, the pages appearing most frequently in ClaudeBot and PerplexityBot request logs shifted within six weeks toward the pages explicitly featured in the file — a directional confirmation that the guidance is being read and acted on. The full implementation guide is in the LLM.txt Guide 2026.
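As a rough illustration of what such a curated index might look like, the sketch below renders one from a dictionary of topic clusters. The layout (a title, a short summary line, one heading per cluster, and a linked list of pages) is an assumption based on the commonly circulated llms.txt proposal, and the function name, site name, and URLs are placeholders, not IndexCraft's actual file.

```python
def render_llms_txt(site_name, summary, clusters):
    """Render a Markdown index: title, summary, then one section per topic cluster.

    clusters: {cluster_name: [(page_title, url, short_note), ...]}
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for cluster, pages in clusters.items():
        lines.append(f"## {cluster}")
        lines.append("")
        for title, url, note in pages:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"

# Placeholder content for illustration only
print(render_llms_txt(
    "Example Site",
    "Guides on technical SEO.",
    {"Crawling": [("Robots guide", "https://example.com/robots", "AI crawler rules")]},
))
```

Feeding the most-fetched URLs from Query 4 into `clusters` keeps the file aligned with what retrieval bots already request.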

18. Conclusion

The robots.txt file has not fundamentally changed since its earliest days — User-agent, Disallow, Allow, Sitemap. What has changed is the cast of characters it must address. In 2020, a publisher's robots.txt needed rules for Googlebot, Bingbot, and perhaps a handful of SEO audit crawlers. In 2026, that same file needs to distinguish between GPTBot and ChatGPT-User, between ClaudeBot and anthropic-ai, between Google-Extended and Googlebot — and make independent policy decisions for each.

The core principle that makes this manageable: the training/retrieval distinction. Block training crawlers universally (they provide no reciprocal benefit and contribute content to models without attribution). Allow retrieval crawlers selectively (they are your pathway to AI search citations and AI-driven referral traffic). Treat Bingbot as a special case requiring a deliberate decision, not a default. Validate compliance in server logs.

None of this requires development resources — robots.txt changes are text edits. The implementation cost is measured in minutes. The ongoing maintenance is a quarterly review of the AI crawler landscape and an audit of your log files to confirm that your rules are still effective as user agents evolve.

The minimal viable AI crawler robots.txt in 2026: Block CCBot, GPTBot, Google-Extended, anthropic-ai, and Applebot-Extended. Allow ChatGPT-User and ClaudeBot. Leave Bingbot alone unless you've made a deliberate decision about Bing SEO. Add llms.txt to guide the crawlers you allow. Validate in logs after 7 days. Review quarterly. That's the complete implementation.

AI Crawler robots.txt Implementation Checklist

robots.txt Configuration

  • CCBot/2.0 — Disallow: / (Common Crawl training dataset)
  • GPTBot/1.0 — Disallow: / (OpenAI training — separate from ChatGPT-User)
  • Google-Extended — Disallow: / (Gemini training — safe, no SEO impact)
  • anthropic-ai — Disallow: / (Anthropic training — separate from ClaudeBot)
  • Applebot-Extended/0.1 — Disallow: / (Apple Intelligence training)
  • Bytespider — Disallow: / (ByteDance training)
  • ChatGPT-User/1.0 — Allow: / (OpenAI real-time browsing — keep open for AI visibility)
  • ClaudeBot/0.1 — Allow: / (Anthropic retrieval — keep open for AI visibility)
  • PerplexityBot/1.0 — Allow: / (Perplexity search — keep open)
  • Bingbot — no block unless deliberately accepting loss of Bing SEO rankings
  • Meta-ExternalAgent — evaluate; one of the less consistent compliers; block if training use is the concern

Validation & Monitoring

  • Syntax checked in the Google Search Console robots.txt report after deployment (it replaced the retired robots.txt Tester)
  • Server logs monitored daily for first 14 days post-deployment
  • All blocked user agents cease appearing within 7 days (CCBot within 25 days)
  • All allowed retrieval crawlers continue appearing in logs after change
  • Manual AI search citation check run before and 30 days after change
  • Quarterly review calendar reminder set for AI crawler landscape changes
  • noai and noimageai meta tags added to premium or creative content as supplementary signal
  • Terms of Service updated to explicitly prohibit automated AI training data collection
  • Never block Googlebot — this kills Google Search rankings AND Google AI Overview appearances
  • Never confuse Google-Extended (AI training — safe to block) with Googlebot (search — never block)
  • Never assume blocking GPTBot also blocks ChatGPT's real-time browsing — they are different user agents

Frequently Asked Questions

Does blocking GPTBot also prevent ChatGPT from browsing my website?

No — GPTBot and ChatGPT-User are two entirely separate OpenAI user agents with different purposes. GPTBot is OpenAI's training crawler: it collects content to build future AI models. ChatGPT-User is the agent that fetches web pages when a ChatGPT user with browsing enabled asks a question requiring real-time search. Blocking GPTBot in robots.txt stops training data collection but leaves ChatGPT's real-time browsing fully intact. You must add a separate Disallow for ChatGPT-User if you also want to block live browsing — and doing so will prevent your content from appearing in ChatGPT responses.

Will blocking AI training crawlers hurt my SEO rankings in Google or Bing?

Blocking AI training crawlers — specifically Google-Extended, GPTBot, CCBot, Applebot-Extended, and Bytespider — has no effect on traditional SEO rankings. Google-Extended is explicitly separate from Googlebot (which handles Search indexing and AI Overview retrieval). GPTBot is separate from ChatGPT-User. Bingbot is the exception: Microsoft uses the same Bingbot user agent for both Bing Search and Microsoft Copilot, so blocking Bingbot would hurt your Bing SEO rankings. For all other major training crawlers, blocking is safe from an SEO perspective. Verify in Google Search Console Crawl Stats that Googlebot volume is unchanged after any robots.txt update.

Which AI crawlers actually honour robots.txt directives?

The major platforms — OpenAI (GPTBot and ChatGPT-User), Anthropic (ClaudeBot and anthropic-ai), Google (Google-Extended), Apple (Applebot-Extended), Perplexity (PerplexityBot), and ByteDance (Bytespider) — all have published policies stating they respect robots.txt. In direct server log analysis across 12 client sites, blocking rules for these crawlers were effective in all cases within 48–72 hours. Smaller and less well-known AI startups are less consistent. Common Crawl (CCBot) complies on its scheduled crawl cycles, though the lag before the next cycle can be up to 25 days.

How do I find out which AI crawlers are currently visiting my site?

Server log files are the most reliable source. Filter your access logs by User-Agent strings and group by known AI crawler patterns (GPTBot, ClaudeBot, PerplexityBot, CCBot, Google-Extended, Bytespider, Applebot-Extended, DuckAssistBot, Meta-ExternalAgent, YouBot, anthropic-ai). For sites under 100,000 URLs, Screaming Frog Log File Analyser makes this straightforward. Google Search Console's Crawl Stats report covers Google's own crawlers only, not third-party AI crawlers, and Google-Extended will not show up there as a separate crawler because Google documents it as a robots.txt control token without its own HTTP user-agent string. Server logs, or a CDN analytics tool like Cloudflare with bot categorisation enabled, are the only sources that capture all crawler types. See the Crawl Budget Optimisation Guide for the full log analysis workflow.

Can I allow specific AI crawlers while blocking others in the same robots.txt file?

Yes — this is the recommended approach for most sites. robots.txt allows separate User-agent blocks for each crawler with independent Disallow and Allow rules. A common pattern is to block all training crawlers (CCBot, GPTBot, Google-Extended, Applebot-Extended, Bytespider, anthropic-ai) while allowing retrieval crawlers (ChatGPT-User, ClaudeBot, PerplexityBot, DuckAssistBot) that fetch content for real-time AI search responses. You can also use path-specific rules — for example, blocking a specific crawler from your /premium/ directory while allowing it access to your public blog.

Is blocking AI crawlers in robots.txt legally enforceable?

robots.txt is a technical convention, not a legally binding contract. It carries no inherent legal weight in most jurisdictions. What is more legally significant is your website's Terms of Service (which can prohibit scraping and AI training use) and, in the EU, Article 4 of the Copyright Directive, which interacts with the AI Act's text and data mining provisions and allows rights holders to opt out of commercial text and data mining through a machine-readable statement. In practice, the major AI companies (OpenAI, Anthropic, Google, Apple) honour robots.txt as part of their published crawler policies, which creates a de facto compliance layer even where legal enforcement is uncertain.

What is the difference between Bingbot and a separate Microsoft Copilot crawler?

There is no separate Microsoft Copilot crawler. Microsoft Copilot (formerly Bing Chat) uses content indexed by the standard Bingbot, the same crawler that indexes pages for Bing Search results. Unlike OpenAI, which separated GPTBot (training) from ChatGPT-User (browsing), Microsoft has not publicly released a distinct Copilot-specific user agent. The practical implication: if you block Bingbot in robots.txt, you lose both Bing search indexing and the ability for your content to appear in Copilot responses. It is currently not possible to allow Bing search indexing while blocking Bing AI use through robots.txt alone.

Does blocking Google-Extended affect my visibility in Google Search or Google AI Overviews?

No. Google-Extended is used exclusively for training Google's generative AI models (Gemini). It is completely separate from Googlebot, which handles both Google Search indexing and the content retrieval behind Google AI Overviews. Blocking Google-Extended in robots.txt has no effect on your Google Search rankings and does not reduce your eligibility to appear in AI Overviews, because those are served by Googlebot, which a Google-Extended Disallow does not affect. To verify, check Google Search Console Crawl Stats and confirm that Googlebot volume is unchanged after blocking Google-Extended. Google documents Google-Extended as a robots.txt control token without its own HTTP user-agent string, so do not expect it to appear as a separate crawler in that report.

What is the noai meta tag and does it work across all AI systems?

The noai meta tag (added to HTML head as <meta name="robots" content="noai, noimageai">) is a page-level signal requesting that AI systems do not use that page's content or images for training. It has confirmed support from Spawning.ai and some AI image generation platforms, but OpenAI, Anthropic, and Google have not published explicit documentation confirming they honour it. It works at page level rather than site level, making it useful for specific content exclusions. Best treated as a supplementary signal for creative content, not a substitute for robots.txt Disallow rules targeting the major AI training crawlers.
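To audit which pages actually carry the tag, a small parser is enough. The sketch below uses Python's standard-library `html.parser`; the class and function names are hypothetical, and it only inspects `<meta name="robots">` tags of the form shown above.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the comma-separated directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)

def ai_opt_out_directives(html):
    """Return whichever of noai / noimageai the page declares."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives & {"noai", "noimageai"}

page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(sorted(ai_opt_out_directives(page)))
```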

How often should I update my robots.txt AI crawler configuration?

Review your AI crawler robots.txt configuration quarterly, or whenever a major AI platform announces a new crawler or changes its user agent string. The AI crawler landscape has changed significantly every six months since 2023 — new user agents appear, platforms split training and retrieval into separate agents, and smaller AI companies launch crawlers. Subscribe to OpenAI's, Anthropic's, and Google's developer blogs to catch user agent changes quickly. Maintain a change log for your robots.txt file so you can track when rules were added and verify their effect in server logs.

What is Crawl-delay and should I use it to throttle AI bot traffic?

Crawl-delay is a non-standard robots.txt directive requesting a minimum delay in seconds between consecutive requests from a crawler. Several crawlers support it, including Bingbot and some AI crawlers. Googlebot ignores it, and the legacy crawl rate limiter in Google Search Console was retired in early 2024, so the documented way to slow Google's crawlers temporarily is to return 500, 503, or 429 responses. For AI crawlers consuming significant server resources, a Crawl-delay of 10–30 seconds can reduce load without fully blocking access. For problematic AI crawlers with excessive crawl rates, a Disallow is more reliable than Crawl-delay, since compliance varies among smaller platforms.
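To confirm what delay your file actually declares for each agent, Python's `urllib.robotparser` exposes a `crawl_delay()` method. The sketch below is illustrative: the file contents, the 10-second value, and the choice of ClaudeBot are assumptions, and a declared delay says nothing about whether a given crawler honours it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: throttle one retrieval crawler, block one training crawler
ROBOTS_TXT = """\
User-agent: ClaudeBot
Crawl-delay: 10
Allow: /

User-agent: GPTBot
Disallow: /
"""

def declared_crawl_delays(robots_txt, agents):
    """Return {agent: delay_in_seconds or None} as declared in the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.crawl_delay(agent) for agent in agents}

print(declared_crawl_delays(ROBOTS_TXT, ["ClaudeBot", "GPTBot", "PerplexityBot"]))
```

Compare the declared delay with the request spacing you observe in server logs to judge whether a crawler is respecting it.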

If an AI company ignores my robots.txt directive, what can I do?

First, verify the violation is genuine: check that the user agent token in your rule is spelled exactly as the vendor documents it (matching is case-insensitive under RFC 9309, but a misspelled token matches nothing), and that you are looking at post-deployment log data, not pre-deployment cached visits. If a major platform is confirmed to be ignoring your directives, contact their abuse or legal team; OpenAI, Anthropic, and Google all have published processes for disputes. For repeat violations, IP-range blocking at the server or CDN level (Cloudflare WAF, nginx deny rules) is the next step. For rights-based disputes about training data use, engaging legal counsel and referencing your Terms of Service is more effective than technical measures alone.
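Part of verifying a suspected violation is checking whether the requests really come from the vendor, since user agent strings are easy to spoof. The sketch below tests request IPs against a list of CIDR ranges using Python's `ipaddress` module. The ranges shown are reserved documentation addresses used as placeholders; substitute the ranges each vendor publishes. The function names are hypothetical.

```python
import ipaddress

def ip_in_ranges(ip, cidr_ranges):
    """True if ip falls inside any of the given CIDR ranges (IPv4 or IPv6)."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)

def outside_published_ranges(hits, published_ranges):
    """hits: iterable of (ip, user_agent). Returns hits not from the published ranges."""
    return [(ip, ua) for ip, ua in hits if not ip_in_ranges(ip, published_ranges)]

# Placeholder ranges (documentation address space), not a real crawler's ranges
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

hits = [("192.0.2.17", "GPTBot"), ("198.51.100.4", "GPTBot")]
print(outside_published_ranges(hits, PLACEHOLDER_RANGES))
```

Hits that claim a crawler's user agent but originate outside its published ranges are more likely impersonators than evidence that the vendor ignored your rules.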

📚 References & Sources

  1. OpenAI — GPTBot Documentation — Official OpenAI documentation covering the GPTBot user agent string, crawl purpose, IP range, and opt-out mechanisms via robots.txt and the publisher portal. Also covers the separate ChatGPT-User user agent.
  2. Anthropic — ClaudeBot and Web Crawling Documentation — Official Anthropic documentation on ClaudeBot (retrieval) and anthropic-ai (training) user agents, robots.txt compliance commitment, and the distinction between Anthropic's two crawler types.
  3. Google Search Central — Overview of Google's Crawlers — Complete reference listing all Google crawler user agents including Googlebot (search) and Google-Extended (AI training), their purposes, and robots.txt behaviour.
  4. Microsoft Bing — Crawlers Documentation — Official Microsoft documentation on Bingbot and related crawlers. Source for the confirmation that no separate Copilot-specific user agent exists as of mid-2026.
  5. Perplexity AI — PerplexityBot Documentation — Official Perplexity documentation on the PerplexityBot user agent string, crawl purpose, and robots.txt compliance.
  6. Common Crawl — CCBot FAQ — Documentation on the CCBot/2.0 user agent, crawl schedule, robots.txt compliance, and the open-access dataset that powers many public AI model training pipelines.
  7. Spawning.ai — noai and noimageai Meta Tag Specification — Documentation on the noai and noimageai meta tag convention for page-level AI training opt-out.
  8. Rohit Sharma — AI Crawler Server Log Analysis, IndexCraft (Q1–Q2 2026) — Direct server log analysis across 12 client sites covering AI crawler traffic composition, crawl budget allocation, robots.txt compliance timelines, and the before/after impact of blocking configurations on AI search visibility. All experience box findings and compliance table data in this guide derive from this analysis.