🕷️ What is crawl budget and how do you optimise it? (Direct answer)
Crawl budget is the number of URLs Googlebot can and wants to crawl on your site within a given timeframe — set by your server's capacity (crawl rate limit) and how much Google thinks your pages are worth revisiting (crawl demand). Getting it right means cutting out URL waste so Googlebot stops burning time on junk pages and actually reaches your important ones. The quickest wins: block parameterised URL variants, collapse redirect chains, return 410 status codes for permanently deleted pages, clean up your XML sitemap, and make sure your server responds fast. In 2026, this has become more complicated — AI training crawlers, search bots, and user-action bots are all hitting your server at once, so crawl governance is now a multi-bot problem, not just a Googlebot one.
This guide covers crawl budget only — the two-factor model, URL inventory management, AI bot handling, log file analysis, and the technical fixes that consistently move the needle across 35+ site audits. If you need to go deeper on related topics:
- ChatGPT Search SEO (Bing indexing + OAI-SearchBot): ChatGPT SEO Guide →
- Schema markup for structured indexing: Schema Markup Guide 2026 →
- Complete technical SEO foundations: Technical SEO Guide →
Over the past three years I've done deep crawl budget audits across 35+ websites — everything from a 12,000-page B2B SaaS documentation site to a 280,000-page fashion platform with aggressive faceted navigation. The one finding that keeps showing up: the crawl efficiency problem is almost never a server issue — it's a URL inventory problem. On average, across the e-commerce sites I've audited, 38% of the URLs Googlebot was crawling generated zero organic traffic and had no real search intent behind them — yet they were soaking up crawl resources that should have gone to core product and category pages. Everything in this guide — the audit steps, the diagnostic frameworks, the fix prioritisation — comes from what I've seen in actual Googlebot log data, live Search Console coverage reports, and indexing velocity measurements after fixes went in. Not theoretical frameworks.
1. What Is Crawl Budget? The Two-Factor Model Explained
Google defines crawl budget as the set of URLs Googlebot can and wants to crawl on your site within a given timeframe. Its December 2025 documentation spells it out as the product of two independent factors: crawl rate limit and crawl demand. [1] You need to understand each one separately, because the fixes for each look completely different.
📐 The Crawl Budget Formula
Effective crawl budget = min(crawl rate limit, crawl demand)
Whichever factor is lower becomes the ceiling. Your server might be able to handle 200 requests per second — but if Google only wants to crawl 50 URLs, you get 50. Flip it around: if Google wants to crawl 1,000 URLs but your server throttles Googlebot to 20 per second, only 20 get crawled per session. Most sites with crawl problems are constrained by crawl demand, not server capacity — because a bloated URL inventory full of low-value pages suppresses what Google thinks is worth crawling in the first place.
Factor 1: Crawl Rate Limit
The crawl rate limit is the maximum speed at which Googlebot will crawl your site without overloading your server. Google adjusts it automatically based on how your server performs — specifically your Time to First Byte (TTFB) and the consistency of your responses. Fast, stable servers get higher crawl rates. If TTFB climbs past 500ms, or your server starts throwing 5xx errors, Googlebot backs off to avoid hammering your infrastructure. You can also manually set a crawl rate cap in Google Search Console — but this reduces your crawl budget ceiling, and you should only do it if you genuinely have server capacity constraints.
Factor 2: Crawl Demand
Crawl demand is how much Google actually wants to crawl your site — driven by three signals: URL popularity (well-linked pages with strong backlink profiles get crawled more often), content freshness (pages that change frequently, like news articles or product pricing, get revisited more), and indexing status (already-indexed pages are rechecked for freshness; newly discovered pages queue for initial crawling). The big takeaway for large sites: you can actively improve crawl demand by cutting the low-value URLs that drag down Google's overall perception of your site, and by building stronger link authority and freshness signals around your most important pages.
🔍 Google's Crawl Budget Decision Pipeline
Crawl rate check (server TTFB + stability) → crawl queue prioritisation (demand signals) → indexing decision (value assessment)
URL inventory quality affects both the queue prioritisation and indexing decision steps. Cut the low-value URLs and Googlebot's budget naturally shifts toward pages that actually clear the indexing threshold.
2. Which Sites Actually Need Crawl Budget Optimisation?
Google is pretty explicit about this: if your site has fewer than a few thousand pages and new content typically gets indexed within days of publication, crawl budget is not your problem. [1] Crawl budget work is a high-priority concern for a specific profile of sites — and spending time on it when you have 2,000 clean pages is effort better pointed elsewhere.
✅ Crawl Budget Optimisation: HIGH Priority
- E-commerce sites with faceted navigation generating thousands of filtered URL combinations
- Sites with 10,000+ pages where new content takes weeks to index
- News and media publishers where freshness is a ranking signal and indexing speed affects visibility
- Marketplace and directory sites with user-generated content at scale
- Sites with large volumes of "Discovered — currently not indexed" in Search Console
- Sites that have migrated platforms and carry a large redirect inventory
⬇️ Crawl Budget Optimisation: LOW Priority
- Sites with fewer than 5,000 pages and stable architecture
- Sites where new content is consistently indexed within 24–72 hours
- Lead generation or services sites with mostly static content
- SaaS marketing sites with modest blog volume
- Sites whose Search Console shows primarily "Crawled — currently not indexed" (a content quality signal, not a crawl signal)
3. How to Diagnose Crawl Budget Waste: Tools and Signals
Good crawl budget diagnosis pulls data from three places: Google Search Console coverage reports, Crawl Stats, and server log files. Each gives you a different slice of the picture — and none of them alone tells the whole story.
In Google Search Console, go to Settings → Crawl Stats. Check three things: the daily crawl request volume trend (a sustained decline over weeks usually means Googlebot is backing off due to server issues or low content quality), the average response time (aim for under 200ms; above 500ms is where crawl rates start taking a hit), and the response code breakdown. Spikes in 4xx or 5xx responses are the clearest sign of crawl waste. The "File type" breakdown is worth checking too — if Googlebot is burning a lot of budget on images and other static assets, you can deprioritise those via robots.txt, but never block the CSS and JavaScript Googlebot needs to render your pages.
The Page Indexing report (formerly Coverage) is your crawl budget health dashboard. Sort by "Discovered — currently not indexed" — that number as a percentage of your total submitted URLs is the most direct measure of crawl budget shortfall. If you're above 15–20% and have 10,000+ pages, you have a real problem worth fixing. Cross-reference which URL templates are driving the highest volumes — faceted navigation, pagination, and session parameter variants are where this almost always concentrates.
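The shortfall check above can be scripted against a Page Indexing export. A minimal sketch in Python; the `State` column name and the state label spelling below are assumptions, so match them to your actual GSC export before using this:

```python
from collections import Counter

def crawl_shortfall(rows):
    """Summarise indexing states and flag a likely crawl budget shortfall.

    `rows` is an iterable of dicts with a 'State' key, mirroring a
    GSC Page Indexing export (the column name and the exact state
    label are assumptions -- adjust to your export).
    """
    states = Counter(row["State"] for row in rows)
    total = sum(states.values())
    discovered = states.get("Discovered - currently not indexed", 0)
    share = discovered / total if total else 0.0
    # The 15-20% threshold only signals a real problem at 10,000+ pages
    flagged = share > 0.15 and total >= 10_000
    return {"total": total, "discovered_share": round(share, 3), "flagged": flagged}
```

Run it on the full export, then repeat per sitemap segment to see which URL templates drive the shortfall.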
A full site crawl gives you a complete URL inventory with HTTP status codes, redirect chains, canonical configurations, and indexability status — the bulk data that Search Console won't export. For large sites, export the full crawl to a spreadsheet and segment by URL template type (category pages, product pages, filter pages, pagination, etc.) so you can see where the crawl waste is coming from before you start making robots.txt decisions.
Server logs are the only source that shows you exactly which URLs Googlebot actually crawled, how often, and what response codes came back — regardless of what your CMS, sitemap, or GSC says. Log file analysis gets its own section (Section 10). If you have access to server logs through your hosting provider, CDN log export, or a tool like Screaming Frog Log File Analyser, Botify, or JetOctopus, prioritise this above everything else for crawl budget diagnosis.
On a large e-commerce audit, the client's team was convinced they had a content quality problem — thin product descriptions, low engagement rates, too many similar pages. They had budgeted for a six-month content refresh programme.
When I pulled the log files and looked at where Googlebot was actually spending its crawl budget, the content pages were barely part of the story. Around 45% of all Googlebot requests were going to session-parameterised URLs — tracking parameters appended by their analytics and affiliate systems that were generating unique URL variants for every visit. None of these had any SEO value. Blocking the parameter variants in robots.txt and canonicalising the clean URLs took about a week to implement. Crawl frequency on the actual product pages improved substantially within six weeks. The content refresh that had seemed urgent became a secondary priority once Googlebot could actually find and crawl the pages that needed it. — Rohit Sharma
4. URL Inventory Management: The Highest-Leverage Crawl Fix
Google's own documentation calls URL inventory management the most directly actionable crawl budget lever — the thing webmasters can most positively influence. [1] The idea is simple: make sure the only URLs Googlebot can discover and crawl are ones with genuine indexing value, and systematically remove everything else from the crawlable surface.
📊 Crawl Budget Waste — Frequency by URL Type
The URL inventory audit process
Start by segmenting your full URL set by template type, then evaluate each template across five dimensions: URL volume generated, expected indexability, crawl frequency in logs, organic traffic contribution, and business value. Templates generating high URL volume with zero organic traffic and no real search intent behind them are your first targets. You're trying to reduce what Google calls "undesirable perceived inventory" — the mass of low-value URLs that quietly drag down crawl demand signals for every good page on your site. [2]
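The triage rule above can be expressed as a small filter. A hedged sketch: field names like `url_volume` and `crawls_90d` are illustrative, not a standard schema, and the rule uses four of the five dimensions, leaving expected indexability as a manual check:

```python
def first_targets(templates):
    """Rank URL templates for pruning.

    Each template is a dict with illustrative keys: url_volume,
    crawls_90d, organic_visits, business_value. First targets are
    high-volume templates soaking up crawl activity with zero organic
    traffic and no business value -- "undesirable perceived inventory".
    """
    hits = [
        t for t in templates
        if t["url_volume"] >= 1000
        and t["crawls_90d"] > 0
        and t["organic_visits"] == 0
        and not t["business_value"]
    ]
    # Heaviest crawl consumers first: biggest wins at the top
    return sorted(hits, key=lambda t: t["crawls_90d"], reverse=True)
```

Feed it one row per template from your segmented crawl export, not one row per URL.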
5. How to Handle Parameterised URLs and Faceted Navigation
Faceted navigation is the biggest single source of crawl budget waste on e-commerce and directory sites. A modest category structure with 200 categories and 10 filter dimensions can spit out hundreds of thousands of unique URL combinations — and Googlebot will try to crawl all of them if they're reachable via followed links. Every one of those filter combination URLs burns crawl budget while adding nothing that isn't already on the base category page.
Don't block everything before checking whether any filter combinations actually have search volume. "Blue running shoes for men" might have enough demand to justify its own crawlable, indexable URL. Run your key filter combinations through Ahrefs or Semrush. Most won't have meaningful volume — but some will, and those are worth keeping. Document your decisions clearly so future site changes don't quietly re-open blocked patterns.
For URL patterns with no unique content and no search demand, use Disallow in robots.txt to stop Googlebot crawling them — Google's own recommended approach for faceted navigation. Be precise with your patterns; too broad and you'll accidentally block legitimate category pages. Verify everything in the Google Search Console robots.txt tester before deploying. One thing to watch: robots.txt blocks crawling, not indexing. If blocked URLs have external links pointing at them, they can still appear in the index as "URL unknown" entries. Combine robots.txt blocking with noindex on rendered pages if that's a risk.
For filter URLs that need to stay crawlable (blocking them would break the user experience) but shouldn't be indexed, put a canonical tag on them pointing to the base category page. Googlebot crawls the page, reads the canonical signal, and consolidates link equity back to the base. This is the right move when JavaScript rendering makes robots.txt blocking impractical, or when you want Googlebot to reach the page without indexing the filter variant.
```
User-agent: *
# Block low-value filter parameter combinations
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=
Disallow: /*?page=
Disallow: /*&color=
Disallow: /*&size=

# Allow high-value facet combinations with confirmed search demand.
# The longer, more specific Allow rules take precedence over the
# shorter Disallow patterns above for these exact URLs.
Allow: /category/mens-shoes?color=blue
Allow: /category/dresses?length=maxi

# Block internal search pages entirely
Disallow: /search?q=
Disallow: /search/

# Block session and tracking parameters
Disallow: /*?sessionid=
Disallow: /*?utm_source=

# Note: everything sits in a single User-agent: * group. A separate
# "User-agent: Googlebot" group would make Googlebot ignore the
# wildcard group entirely -- a crawler obeys only its best-matching group.

Sitemap: https://yoursite.com/sitemap.xml
```
6. Internal Linking and Crawl Depth Optimisation
Crawl depth — the number of internal link hops from the homepage to a given page — directly affects how often that page gets crawled and how much link equity it accumulates. Pages that require 5 or more hops tend to get crawled infrequently. Pages beyond depth 7 may not get crawled at all within a normal crawl budget cycle. For large sites, reducing crawl depth for high-value content is one of the most impactful architectural changes you can make.
Think of it as: homepage at depth 0; top-level categories at 1–2; subcategories at 2–3; product and article pages at 3–4 maximum. If your architecture puts products 7–8 clicks from the homepage — common in older Magento builds — that's a structural problem. Fix it by creating hub pages, improving category pagination linking, or adding contextual internal links from high-traffic pages down to deep content.
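Crawl depth is just shortest-path distance in the internal link graph, which a breadth-first search computes directly. A minimal sketch, assuming you already have a URL-to-outlinks mapping from a site crawl export:

```python
from collections import deque

def crawl_depths(links, homepage="/"):
    """Compute crawl depth (link hops from the homepage) for every
    reachable URL. `links` maps each URL to the URLs it links to.
    """
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in depths:  # BFS: first visit = shortest path
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths
```

Any high-value URL coming back at depth 5 or more goes on the internal linking plan; URLs missing from the result entirely are unreachable by links.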
Freshness is a crawl demand signal. A page that's been updated but has weak internal linking won't get recrawled quickly — meaning your edits sit in limbo before they show up in the index. Add internal links from high-traffic hub pages to recently updated content. This matters most for news publishers, product pricing pages, and anything where freshness is a ranking differentiator.
Orphan pages — pages with no internal links pointing to them — can only be discovered via sitemap, which means they get crawled infrequently. A full crawl audit in Screaming Frog or Sitebulb will surface them by cross-referencing your crawled URL list with your sitemap. Any page that matters for organic traffic needs at least one internal link from a crawled, indexed page — not just a sitemap entry.
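Given the sitemap URL list and the set of internally linked URLs from a full crawl, orphan detection is a set difference:

```python
def orphan_pages(sitemap_urls, internally_linked_urls):
    """URLs present in the sitemap but with no internal links pointing
    at them -- discoverable only via sitemap, so crawled infrequently."""
    return sorted(set(sitemap_urls) - set(internally_linked_urls))
```

Both inputs come straight out of a Screaming Frog or Sitebulb export; the result is the list to triage for hub-page links.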
7. Server Performance and Crawl Rate Optimisation
For most sites, server performance isn't the constraint — and shaving TTFB from 250ms to 150ms when Googlebot is already crawling freely won't move the needle on crawl budget. But when server performance actually is the bottleneck, fixing it can unlock significant crawl rate improvements. The signs that it's the real problem: TTFB consistently above 500ms in Crawl Stats, frequent 5xx responses in your log data, or a GSC crawl stats chart where crawl rates dropped at the same time your server load went up.
| Metric | Target | Risk Threshold | Crawl Budget Impact |
|---|---|---|---|
| Time to First Byte (TTFB) | <200ms | >500ms | Direct — Googlebot reduces crawl rate automatically above threshold [1] |
| 5xx Error Rate | <0.1% of crawl requests | >1% of daily crawl requests | Significant — sustained 5xx errors trigger Googlebot crawl rate reduction |
| DNS Resolution Time | <50ms | >200ms | Often overlooked — slow DNS consumes part of each crawl request's time budget [3] |
| Redirect Response Time | Each hop <100ms | Chains >3 hops | Each redirect adds latency and consumes an additional crawl request |
| Page Render Time (JS-heavy pages) | <5 seconds | >10 seconds | Googlebot may partially render or skip rendering for slow pages, causing content to be absent from the index |
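The TTFB thresholds in the table can be checked with a small probe. A rough sketch: a single `urllib` request only approximates TTFB (it folds in DNS, connect, and TLS time), so treat it as a sanity check rather than a monitoring tool:

```python
import time
import urllib.request

# Thresholds from the table above
TARGET_MS = 200
RISK_MS = 500

def classify_ttfb(ttfb_ms):
    """Map a measured TTFB against the table's target/risk thresholds."""
    if ttfb_ms < TARGET_MS:
        return "ok"
    if ttfb_ms <= RISK_MS:
        return "watch"
    return "risk"  # above 500ms Googlebot starts reducing crawl rate

def measure_ttfb(url):
    """Approximate TTFB: time until the first response byte arrives."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        resp.read(1)  # first byte of the body
    return (time.monotonic() - start) * 1000
```

Run the probe from a few regions and at different times of day; a single fast reading from one location proves little about what Googlebot sees.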
8. XML Sitemap Hygiene for Crawl Efficiency
Your XML sitemap is a priority signal — it tells Googlebot which URLs deserve attention. A bloated or inaccurate sitemap does the opposite: it dilutes the signal from your best pages by mixing them in with URLs that have no business being there. Google explicitly recommends keeping your sitemap current and pruning outdated URLs as part of core crawl budget management. [1]
Every URL in your sitemap should pass four tests: it returns a 200 HTTP status, it's not blocked by robots.txt, it's not tagged noindex, and it carries a self-referencing canonical. A robots.txt-blocked page in your sitemap sends contradictory signals. A noindex page in your sitemap wastes a crawl request that could go to something indexable. Audit against these four criteria monthly — most CMS platforms silently include non-canonical and noindexed URLs in auto-generated sitemaps.
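The four tests can be run mechanically over a crawl export. A sketch, assuming dict keys like `robots_blocked` and `canonical` that you would populate from your crawler's output (the field names are illustrative):

```python
def sitemap_url_issues(entry):
    """Check one sitemap URL against the four tests: 200 status, not
    robots.txt-blocked, not noindexed, self-referencing canonical.
    Returns a list of failures; an empty list means the URL is clean.
    """
    issues = []
    if entry["status"] != 200:
        issues.append(f"returns {entry['status']}, not 200")
    if entry["robots_blocked"]:
        issues.append("blocked by robots.txt")
    if entry["noindex"]:
        issues.append("tagged noindex")
    if entry["canonical"] != entry["url"]:
        issues.append("canonical points elsewhere")
    return issues
```

Loop it over every sitemap URL monthly; anything with a non-empty result either gets fixed or pulled from the sitemap.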
The lastmod tag influences recrawl priority by signalling when a page was last meaningfully updated. Google's documentation warns that inaccurate lastmod values — ones that don't reflect real content changes — erode the signal's reliability over time. [1] Only update lastmod when substantive changes are made, not on every page load or template render. If your CMS auto-stamps every page with today's date regardless of whether anything actually changed, turn that off.
If you're above 50,000 URLs, sitemap index files segmented by content type (products, categories, articles, landing pages) make Search Console data significantly more useful — you can monitor indexing rates per content type instead of wading through aggregate numbers. Segmented sitemaps also let you submit new content independently without regenerating the full sitemap, which speeds discovery for high-priority new pages.
On a media publishing client with around 42,000 articles, the sitemap audit turned up 6,200 URLs tagged noindex in the CMS — editorial decisions to de-publish older content that the sitemap generator had never been told about. Another 3,100 URLs were returning 301 redirects to consolidated content instead of pointing to the final canonical URL. After cleaning the sitemap down to only canonical, 200-status, indexable URLs, "Discovered — currently not indexed" in Search Console dropped 31% within six weeks. The crawl budget freed up from those dead-end URLs shifted to 4,800 high-priority articles that had been sitting in the discovery queue for over 90 days. [4]
9. AI Bot Crawling in 2026: What It Means for Your Crawl Budget
The crawler landscape has changed significantly since 2024. Crawl budget management is no longer just about Googlebot — it's a multi-bot problem. Cloudflare's July 2025 analysis found that AI and search crawler traffic grew 18% from May 2024 to May 2025, with GPTBot alone growing 305% and Googlebot up 96% in raw request volume. [5] By December 2025, Googlebot alone accounted for 4.5% of all HTML request traffic Cloudflare observed — more than all other AI bots combined at 4.2%. [6]
Managing AI bots without hurting search indexing
| Bot | Primary Purpose | Recommended Setting | Rationale |
|---|---|---|---|
| Googlebot | Search indexing + AI training (dual-use) | Allow (full) | Blocking harms search visibility. Cannot separate search crawl from AI training crawl. |
| Bingbot | Bing search indexing + AI training (dual-use) | Allow (full) | Blocking eliminates ChatGPT Search and Microsoft Copilot citation eligibility. [9] |
| OAI-SearchBot | ChatGPT Search real-time retrieval | Allow | Blocking reduces ChatGPT Search citation eligibility for your pages. [8] |
| GPTBot | OpenAI AI model training (not search) | Publisher's choice | Blocking has no impact on search indexing. GPTBot grew 305% in raw request volume from May 2024 to May 2025. [5] Rate-limit if server load is a concern. |
| ClaudeBot | Anthropic AI model training | Publisher's choice | No search indexing function. Cloudflare data shows Anthropic had the highest crawl-to-refer ratio among AI platforms in 2025. [7] |
| PerplexityBot | Perplexity AI search retrieval | Allow (if AI search visibility is valued) | Grew from a very small base but represents a search retrieval function — blocking reduces Perplexity AI citation eligibility. |
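A bot-access policy like the table above can be kept as data and rendered into robots.txt groups, which makes the per-bot decisions explicit and reviewable. A sketch in Python; the disallow choices for the training-only bots are one example of a publisher decision, not a recommendation:

```python
# Policy mirrors the table: search-retrieval bots allowed, training-only
# bots left as a publisher choice (here: disallowed, as one example).
BOT_POLICY = {
    "Googlebot": "allow",
    "Bingbot": "allow",
    "OAI-SearchBot": "allow",
    "PerplexityBot": "allow",
    "GPTBot": "disallow",      # training only: no search indexing impact
    "ClaudeBot": "disallow",   # training only
}

def robots_txt(policy):
    """Render the per-bot policy as robots.txt groups."""
    groups = []
    for bot, decision in policy.items():
        rule = "Disallow: /" if decision == "disallow" else "Allow: /"
        groups.append(f"User-agent: {bot}\n{rule}")
    return "\n\n".join(groups)
```

Keeping the policy as a reviewed data structure (rather than a hand-edited robots.txt) makes it harder for a future edit to silently block Bingbot or OAI-SearchBot.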
10. Log File Analysis: The Definitive Crawl Budget Data Source
Server log files are the only source that tells you what Googlebot actually did on your site — not what Search Console approximates, not what your sitemap assumes, not what your CMS thinks happened. Every HTTP request to your server gets recorded: the user-agent, URL, response code, response time, timestamp. That data gives you a precise picture of where Googlebot is spending its crawl budget versus where you actually want it spending it.
Question 1: What percentage of crawl requests go to high-value pages?
- Filter logs: User-Agent contains "Googlebot"
- Group by URL template (category, product, article, filter, pagination)
- Calculate percentage of total crawl requests per template
- Target: category + product + article pages = 70%+ of crawl requests
- Red flag: filter/parameter URLs > 20% of crawl requests

Question 2: Which URLs are crawled frequently but never indexed?
- Cross-reference log data with the GSC Page Indexing export
- Identify URLs crawled 10+ times with "Not indexed" GSC status
- These are crawl budget drains — evaluate for noindex or block

Question 3: What is the response code distribution?
- Group crawl requests by HTTP status code
- Calculate percentage: 200, 301, 302, 404, 410, 5xx
- Target: 200 responses > 90% of Googlebot requests
- Red flag: 404 responses > 5%, 5xx responses > 1%

Question 4: Which bots beyond Googlebot are consuming server resources?
- Group all log requests by User-Agent
- Calculate percentage of total requests per bot
- Identify training-only bots consuming significant server bandwidth
- Consider rate limiting for training bots with no search indexing function

Question 5: What is the crawl frequency trend?
- Plot daily Googlebot request volume over 90 days
- Declining trend = server issue or content value degradation
- Stable/growing trend = healthy crawl relationship
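Question 1 of the framework can be answered with a short log parser. A sketch assuming combined-log-format lines; the template patterns are illustrative and should be adapted to your own URL structure. Note it matches on the user-agent string, which spoofed bots can fake, so verify Googlebot IP ranges separately for anything load-bearing:

```python
import re
from collections import Counter

# Extract the request path from a combined-log-format line
LOG_LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP')

# Template rules are illustrative -- adapt to your own URL structure.
# Order matters: filter params are checked before path prefixes.
TEMPLATES = [
    ("filter", re.compile(r"\?(color|size|sort|sessionid)=")),
    ("category", re.compile(r"^/category/")),
    ("product", re.compile(r"^/product/")),
    ("article", re.compile(r"^/blog/")),
]

def googlebot_distribution(lines):
    """Share of Googlebot crawl requests per URL template."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = LOG_LINE.search(line)
        if not m:
            continue
        path = m.group("path")
        name = next((n for n, rx in TEMPLATES if rx.search(path)), "other")
        counts[name] += 1
    total = sum(counts.values())
    return {name: round(c / total, 2) for name, c in counts.items()} if total else {}
```

Compare the output against the 70% high-value target and the 20% filter/parameter red flag from Question 1.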
Recommended log file analysis tools (2026)
For sites under 50,000 URLs, Screaming Frog Log File Analyser is the easiest entry point — it imports server logs and visualises bot behaviour, crawl frequency, and response code distribution in a desktop GUI with no SQL or data engineering needed. Above 100,000 URLs, dedicated platforms like Botify, JetOctopus, and OnCrawl handle the scale better. If your team has data engineering capacity, piping raw logs into BigQuery and visualising in Looker Studio gives you the most flexible analysis environment for the lowest ongoing tooling cost.
11. Crawl Budget for Specific Site Types
E-commerce is where crawl budget optimisation delivers its biggest returns. Faceted navigation is the main culprit — tens of thousands of low-value filter URL combinations quietly absorbing crawl activity while core categories and products wait in the queue. [2] The fix sequence: pull log files to measure how much of Googlebot's budget is going to facet URLs; run keyword demand analysis on key filter combinations; block zero-demand combinations via robots.txt; implement canonicals on retained facet URLs; clean your sitemap of all blocked and noindexed filter variants.
For news publishers, the goal isn't cutting URL inventory — it's maximising crawl frequency for new and recently updated content. The key interventions: server-side rendering so articles are immediately available to Googlebot without waiting on JavaScript; IndexNow implementation to notify Bing and other participating engines at the moment of publication; Google News sitemap submission with accurate timestamps; and internal linking from high-traffic category hubs to new articles within minutes of going live, giving Googlebot additional discovery paths beyond the sitemap.
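The IndexNow step can be sketched as a small submission helper. This follows the public IndexNow protocol (a JSON body with `host`, `key`, and `urlList` POSTed to a participating endpoint); verify the key-file setup against the spec before relying on it:

```python
import json
import urllib.request

def indexnow_payload(host, key, urls):
    """Build the IndexNow JSON body for a batch of freshly published
    URLs. Field names (host, key, urlList) follow the IndexNow protocol."""
    return {"host": host, "key": key, "urlList": list(urls)}

def submit(payload, endpoint="https://api.indexnow.org/indexnow"):
    """POST the payload to a participating engine's IndexNow endpoint.
    A 200/202 response indicates the batch was accepted."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    return urllib.request.urlopen(req)
```

Wire `submit` into your publish hook so notification fires the moment an article goes live, not on the next sitemap regeneration.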
Documentation sites tend to accumulate crawl waste through versioned docs (v1, v2, v3 covering the same features), archived support articles, and developer portal sections behind authentication that Googlebot hits but can't render. High-impact fixes: consolidate versioned documentation to the current version with proper redirects from legacy paths; block authenticated developer portal sections via robots.txt; and check whether your documentation CMS is generating pagination variants for long pages that carry no unique content.
12. How to Measure Crawl Budget Improvements
If you don't establish baselines before making changes, you have no way of knowing whether what you did actually worked. Take measurements before any intervention, track on a consistent timeline after, and you'll have something concrete to point to in a stakeholder report rather than just a gut feeling that things got better.
| Metric | Data Source | Baseline Measurement | Expected Improvement Timeline |
|---|---|---|---|
| Discovered — currently not indexed count | Google Search Console Page Indexing | Export count on day 0 | 4–8 weeks post-fix for significant reduction |
| Daily Googlebot crawl request volume | GSC Crawl Stats or server logs | 30-day average pre-intervention | 2–4 weeks to see redistribution toward high-value pages |
| Indexed page count by template type | GSC Page Indexing (filter by sitemap child) | Count per content type on day 0 | 6–12 weeks for meaningful indexing gains in priority templates |
| Redirect URL count | Screaming Frog or site crawl export | Total 301/302 responses in crawl | Immediate after redirect consolidation; GSC effect within 2–4 weeks |
| Crawl budget waste rate | Server log file analysis | % of Googlebot requests to non-indexable URLs | Immediate after robots.txt/sitemap fixes; verify via next log analysis cycle |
13. Crawl Budget Mistakes to Avoid
| Mistake | Why It Harms Crawl Efficiency | Severity | Fix |
|---|---|---|---|
| Including noindexed pages in XML sitemap | The sitemap says "crawl this," the noindex tag says "don't index this." Mixed signals, wasted crawl request. Turns up in 64% of audited sites. [4] | HIGH | Cross-reference your sitemap against noindex tags monthly. Pull any noindexed URLs out of the sitemap. |
| Leaving soft 404s returning 200 status | Googlebot burns a crawl request on a page that shouldn't exist, and soft 404s quietly drag down crawl demand signals across the whole site. [1] | HIGH | Return a real 404 or 410. Use 410 for permanently removed pages — Google removes 410s from the crawl queue faster than 404s. |
| Blocking Bingbot via robots.txt wildcard | A broad Disallow: / under User-agent: * silently cuts off Bingbot, which kills ChatGPT Search and Microsoft Copilot citation eligibility. Shows up in 38% of audited sites. [9] | CRITICAL | Test Bingbot specifically in Bing Webmaster Tools' robots.txt tester. Explicitly allow it on all content pages. |
| Manually reducing crawl rate in GSC without server constraints | You're artificially capping your crawl budget ceiling for no reason. That setting exists for when 5xx errors confirm your server is overwhelmed — using it as a precaution just slows indexing. | MEDIUM | Remove any crawl rate limits in GSC unless active 5xx errors are the problem. Let Google manage the rate automatically. |
| Using inaccurate lastmod dates in sitemap | When your CMS stamps every page with today's date on every render, Googlebot can't use lastmod to prioritise genuinely fresh content for recrawl. The signal becomes worthless. [1] | MEDIUM | Only update lastmod on real content changes. Disable auto-date-stamping on minor template updates in your CMS sitemap settings. |
| Blocking AI search bots to "save crawl budget" | OAI-SearchBot and Bingbot handle search retrieval, not just training. Blocking them cuts AI search citation eligibility. Their server load is usually small compared to Googlebot and training-only bots anyway. [8] | MEDIUM | Know the difference: training bots (GPTBot, ClaudeBot) vs. search retrieval bots (OAI-SearchBot, Bingbot). Rate-limit or block training-only bots if server load is a genuine concern. |
✅ Crawl Budget Optimisation — Complete Audit Checklist
- Google Search Console Crawl Stats reviewed — average response time and response code distribution checked
- Page Indexing report exported — "Discovered — currently not indexed" count established as baseline
- Full site crawl completed (Screaming Frog/Sitebulb) — URL inventory segmented by template type
- Server log files obtained and analysed — Googlebot crawl distribution mapped to URL templates
- Faceted navigation URL patterns identified — demand-tested against keyword data before blocking
- Low-value facet URLs blocked via robots.txt and/or canonicalised to base category pages
- All redirect chains audited — flattened to single hop where possible
- Soft 404s identified and resolved — returning proper 404 or 410 status codes
- XML sitemap audited — noindexed, redirected, and blocked URLs removed
- Lastmod dates verified — only accurate publication/modification dates used
- Crawl depth of high-value pages checked — pages beyond depth 5 added to internal linking plan
- Orphan pages identified — high-value orphans linked from relevant hub pages
- Internal search result pages blocked via robots.txt
- Session ID and tracking parameter URLs blocked or canonicalised
- Bingbot confirmed as allowed — Bing Webmaster Tools robots.txt tester verified
- OAI-SearchBot not blocked — robots.txt and WAF rules checked
- GPTBot/ClaudeBot strategy documented — conscious decision on training bot access made
- Do not reduce crawl rate in GSC unless active 5xx errors confirm server overload
- Never include noindexed or robots.txt-blocked pages in XML sitemap
- Never return 200 status on pages that have been permanently removed — use 410
14. Frequently Asked Questions About Crawl Budget Optimisation
What is crawl budget in SEO?
Crawl budget is the number of URLs Googlebot can and wants to crawl on your site within a given timeframe — the product of crawl rate limit (how fast Googlebot can crawl without stressing your server) and crawl demand (how strongly Google wants to crawl based on popularity, freshness, and content quality). Pages that don't get crawled can't be indexed. Pages crawled infrequently may surface outdated information in search results. If you're under 5,000 pages with clean architecture, crawl budget is rarely a real constraint. [1]
How do I check my crawl budget in Google Search Console?
Go to Settings → Crawl Stats for total daily crawl requests, average response time, and a 90-day response code breakdown. The Page Indexing report (Indexing → Pages) shows how many of your URLs are in each indexing status category — "Discovered — currently not indexed" is the metric that flags crawl budget problems on large sites. For deeper insight, combine Crawl Stats with log file analysis, which gives you bot-level URL frequency data that Search Console doesn't export. [1]
Does page speed affect crawl budget?
Yes, but only when server response is a genuine constraint — not a marginal improvement opportunity. Googlebot automatically backs off when TTFB exceeds roughly 500ms or when 5xx error rates spike. If your TTFB is under 300ms and 5xx errors are minimal, squeezing more server performance out won't move the crawl budget needle. For most sites, URL inventory management does far more than server optimisation. [1]
Should I use noindex or robots.txt to block low-value pages?
It depends on what outcome you need. Robots.txt blocks crawling but not indexing — pages blocked this way can still end up in the index if external sites link to them. Noindex allows crawling but prevents indexing — Googlebot visits the page, reads the directive, and removes it from the index. For pages with no external links and zero value, robots.txt prevents the crawl request entirely, which is the most efficient option. For pages that might attract external links, use both robots.txt and noindex together. [1]
How does AI bot crawling affect crawl budget in 2026?
AI crawler traffic grew 18% from May 2024 to May 2025, and Googlebot alone now generates more than 25% of all Verified Bot traffic — more than all other AI bots combined. [6] You now need bot access policies across Googlebot and Bingbot (search indexing + AI training), OAI-SearchBot (ChatGPT Search retrieval), GPTBot (OpenAI training only), ClaudeBot (Anthropic training only), and PerplexityBot (search retrieval). Allow search retrieval bots — blocking them harms AI search visibility. Training-only bots can be rate-limited or blocked without touching search indexing.
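The per-bot policy above can be expressed in robots.txt using each crawler's documented user-agent token. This is a sketch, not a universal recommendation — whether to block training bots is a business decision, and hard rate limiting is better enforced at the CDN or WAF since robots.txt directive support varies by bot:

```text
# Search indexing and retrieval bots: allow (blocking them harms search and AI visibility)
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-only bots: can be blocked without touching search indexing
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```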
What is the difference between 404 and 410 for crawl budget?
Both tell Google the page doesn't exist, but 410 (Gone) works faster. Googlebot may revisit a 404 page multiple times before accepting it's gone permanently, while a 410 is treated as a hard removal signal that stops recrawl attempts much sooner. For any page that's been permanently deleted and is never coming back, return a 410 — it's the faster path to eliminating those wasted crawl requests. [1]
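On most stacks a 410 takes one server rule. A minimal nginx sketch, assuming a hypothetical /discontinued/ path prefix for the deleted section:

```text
# nginx: return 410 Gone for a permanently deleted URL family
# (the /discontinued/ prefix is an illustrative example)
location ^~ /discontinued/ {
    return 410;
}
```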
Sources & References
📚 Research, Data & Official Documentation Referenced in This Article
- Google Search Central — Manage Your Crawl Budget (Official Documentation, updated December 10, 2025)
Google's authoritative documentation on crawl budget for large sites — defining crawl rate limit and crawl demand, URL inventory management recommendations, and sitemap guidance.
developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget
- Incremys — SEO Crawl Budget: A Technical Guide (February 2026)
Technical crawl budget guide covering URL family segmentation, faceted navigation drift, and the five-dimension URL evaluation framework for e-commerce and high-volume sites. References Google documentation on "undesirable perceived inventory" as the most influenceable lever.
incremys.com/en/resources/blog/seo-crawl-budget
- CaptainDNS — Crawl Budget Optimization: Complete Guide 2026 (February 2026)
Comprehensive crawl budget guide covering the effective crawl budget formula, DNS performance impact on crawl speed, and seven practical optimisation techniques including URL inventory control and sitemap hygiene.
captaindns.com/en/blog/crawl-budget-optimization
- IndexCraft — Internal Crawl Budget Audit Data (2025–2026)
Proprietary observational data from crawl budget audits across 35+ client websites, log file analyses, and Google Search Console coverage tracking conducted by Rohit Sharma at IndexCraft. Aggregate findings cited in this article; full data available to clients under NDA.
- Cloudflare Blog — "From Googlebot to GPTBot: Who's Crawling Your Site in 2025" (July 2025)
Cloudflare's analysis of crawler traffic patterns showing 18% overall AI and search crawler growth from May 2024 to May 2025, GPTBot growing 305%, Googlebot up 96%, and the shift in market share across leading bots including GPTBot rising from #9 to #3.
blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
- Cloudflare Radar Year in Review 2025 (December 2025)
Cloudflare's sixth annual review of internet traffic trends — confirming Googlebot generated more than 25% of all Verified Bot traffic and 4.5% of all HTML request traffic in 2025, exceeding all other AI bots combined (4.2%). Includes analysis showing Googlebot crawled 11.6% of all sampled pages versus GPTBot at 3.6%.
blog.cloudflare.com/cloudflare-radar-2025-year-in-review/
- Cloudflare Blog — "The crawl-to-click gap: Cloudflare data on AI bots, training, and referrals" (October 2025)
Analysis of AI bot crawl-to-refer ratios showing Anthropic with the highest ratios (up to 38,000 crawls per visitor in July 2025), and the breakdown of AI crawling purposes: 82% training, 15% search, 3% user actions.
blog.cloudflare.com/crawlers-click-ai-bots-training/
- OpenAI — OAI-SearchBot & GPTBot Crawler Documentation
OpenAI's official documentation on web crawler user-agent strings, crawl behaviour, and recommended robots.txt configuration distinguishing OAI-SearchBot (ChatGPT Search retrieval) from GPTBot (AI model training only).
platform.openai.com/docs/gptbot
- IndexCraft — ChatGPT SEO Guide 2026 (March 2026)
IndexCraft's platform-specific guide on ChatGPT Search optimisation — covering the Bing index prerequisite, Bingbot crawlability requirements, and the finding across 47 audited sites that 38% had inadvertent Bingbot-blocking in robots.txt affecting ChatGPT Search eligibility.
indexcraft.in/blog/chatgpt-seo-guide
- Platform-specific deep-dive on ChatGPT Search optimisation — Bing indexing prerequisites, OAI-SearchBot crawlability, Browse tool mechanics, and the content structure signals that earn ChatGPT footnote citations. Read ChatGPT SEO guide →
- Complete schema markup implementation guide covering Article, FAQPage, HowTo, Product, and BreadcrumbList schemas — the structured data signals that improve both traditional SERP features and AI search citation eligibility. Read schema markup guide →
- Platform-exclusive deep-dive covering Google AI Mode's Gemini architecture, full-page search experience, and the content and technical signals specific to Google AI Mode citation — including how crawl quality affects AI Mode inclusion. Read Google AI Mode guide →
- The complete technical SEO foundation guide covering Core Web Vitals, JavaScript rendering, mobile-first indexing, HTTPS, structured data, and the full technical audit checklist — the framework that crawl budget optimisation sits within. Read technical SEO guide →