🕷️ What is crawl budget and how do you optimise it? (Direct answer)
Crawl budget is the number of URLs Googlebot can and wants to crawl on your site within a given timeframe — set by your server's capacity (crawl rate limit) and how much Google thinks your pages are worth revisiting (crawl demand). Getting it right means cutting out URL waste so Googlebot stops burning time on junk pages and actually reaches your important ones. The quickest wins: block parameterised URL variants, collapse redirect chains, return 410 status codes for permanently deleted pages, clean up your XML sitemap, and make sure your server responds fast. In 2026, this has become more complicated — AI training crawlers, search bots, and user-action bots are all hitting your server at once, so crawl governance is now a multi-bot problem, not just a Googlebot one.
This guide covers crawl budget only — the two-factor model, URL inventory management, AI bot handling, log file analysis, and the technical fixes that consistently move the needle across 35+ site audits. If you need to go deeper on related topics:
- ChatGPT Search SEO (Bing indexing + OAI-SearchBot): ChatGPT SEO Guide →
- Schema markup for structured indexing: Schema Markup Guide 2026 →
- Complete technical SEO foundations: Technical SEO Guide →
Over the past three years I've done deep crawl budget audits across 35+ websites — everything from a 12,000-page B2B SaaS documentation site to a 280,000-page fashion platform with aggressive faceted navigation. The one finding that keeps showing up: the crawl efficiency problem is almost never a server issue — it's a URL inventory problem. On average, across the e-commerce sites I've audited, 38% of the URLs Googlebot was crawling generated zero organic traffic and had no real search intent behind them — yet they were soaking up crawl resources that should have gone to core product and category pages. Everything in this guide — the audit steps, the diagnostic frameworks, the fix prioritisation — comes from what I've seen in actual Googlebot log data, live Search Console coverage reports, and indexing velocity measurements after fixes went in. Not theoretical frameworks.
1. What Is Crawl Budget? The Two-Factor Model Explained
Google defines crawl budget as the set of URLs Googlebot can and wants to crawl on your site within a given timeframe. Its December 2025 documentation spells it out as the product of two independent factors: crawl rate limit and crawl demand. [1] You need to understand each one separately, because the fixes for each look completely different.
📐 The Crawl Budget Formula
Effective crawl budget = min(crawl rate limit, crawl demand)
Whichever factor is lower becomes the ceiling. Your server might be able to handle 200 requests per second — but if Google only wants to crawl 50 URLs, you get 50. Flip it around: if Google wants to crawl 1,000 URLs but your server throttles Googlebot to 20 per second, only 20 get crawled per session. Most sites with crawl problems are constrained by crawl demand, not server capacity — because a bloated URL inventory full of low-value pages suppresses what Google thinks is worth crawling in the first place.
Factor 1: Crawl Rate Limit
The crawl rate limit is the maximum speed at which Googlebot will crawl your site without overloading your server. Google adjusts it automatically based on how your server performs — specifically your Time to First Byte (TTFB) and the consistency of your responses. Fast, stable servers get higher crawl rates. If TTFB climbs past 500ms, or your server starts throwing 5xx errors, Googlebot backs off to avoid hammering your infrastructure. You can also manually set a crawl rate cap in Google Search Console — but this reduces your crawl budget ceiling, and you should only do it if you genuinely have server capacity constraints.
Factor 2: Crawl Demand
Crawl demand is how much Google actually wants to crawl your site — driven by three signals: URL popularity (well-linked pages with strong backlink profiles get crawled more often), content freshness (pages that change frequently, like news articles or product pricing, get revisited more), and indexing status (already-indexed pages are rechecked for freshness; newly discovered pages queue for initial crawling). The big takeaway for large sites: you can actively improve crawl demand by cutting the low-value URLs that drag down Google's overall perception of your site, and by building stronger link authority and freshness signals around your most important pages.
🔍 Google's Crawl Budget Decision Pipeline
Crawl rate check (server TTFB + stability) → crawl queue prioritisation (demand signals) → indexing decision (value assessment)
URL inventory quality affects both the queue prioritisation and indexing decision steps. Cut the low-value URLs and Googlebot's budget naturally shifts toward pages that actually clear the indexing threshold.
2. Which Sites Actually Need Crawl Budget Optimisation?
Google is pretty explicit about this: if your site has fewer than a few thousand pages and new content typically gets indexed within days of publication, crawl budget is not your problem. [1] Crawl budget work is a high-priority concern for a specific profile of sites — and spending time on it when you have 2,000 clean pages is effort better pointed elsewhere.
✅ Crawl Budget Optimisation: HIGH Priority
- E-commerce sites with faceted navigation generating thousands of filtered URL combinations
- Sites with 10,000+ pages where new content takes weeks to index
- News and media publishers where freshness is a ranking signal and indexing speed affects visibility
- Marketplace and directory sites with user-generated content at scale
- Sites with large volumes of "Discovered — currently not indexed" in Search Console
- Sites that have migrated platforms and carry a large redirect inventory
⬇️ Crawl Budget Optimisation: LOW Priority
- Sites with fewer than 5,000 pages and stable architecture
- Sites where new content is consistently indexed within 24–72 hours
- Lead generation or services sites with mostly static content
- SaaS marketing sites with modest blog volume
- Sites whose Search Console shows primarily "Crawled — currently not indexed" (a content quality signal, not a crawl signal)
3. How to Diagnose Crawl Budget Waste: Tools and Signals
Good crawl budget diagnosis pulls data from three places: Google Search Console coverage reports, Crawl Stats, and server log files. Each gives you a different slice of the picture — and none of them alone tells the whole story.
In Google Search Console, go to Settings → Crawl Stats. Check three things: the daily crawl request volume trend (a sustained decline over weeks usually means Googlebot is backing off due to server issues or low content quality), the average response time (aim for under 200ms; above 500ms is where crawl rates start taking a hit), and the response code breakdown. Spikes in 4xx or 5xx responses are the clearest sign of crawl waste. The "File type" breakdown is worth checking too — if Googlebot is burning a lot of budget on images and other static assets, you can deprioritise those via robots.txt, but never block the CSS and JavaScript Googlebot needs to render your pages.
The Page Indexing report (formerly Coverage) is your crawl budget health dashboard. Sort by "Discovered — currently not indexed" — that number as a percentage of your total submitted URLs is the most direct measure of crawl budget shortfall. If you're above 15–20% and have 10,000+ pages, you have a real problem worth fixing. Cross-reference which URL templates are driving the highest volumes — faceted navigation, pagination, and session parameter variants are where this almost always concentrates.
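The shortfall check above can be scripted against a Page Indexing export. A minimal sketch in Python; the `State` column name and the state label spelling below are assumptions, so match them to your actual GSC export before using this:

```python
from collections import Counter

def crawl_shortfall(rows):
    """Summarise indexing states and flag a likely crawl budget shortfall.

    `rows` is an iterable of dicts with a 'State' key, mirroring a
    GSC Page Indexing export (the column name and the exact state
    label are assumptions -- adjust to your export).
    """
    states = Counter(row["State"] for row in rows)
    total = sum(states.values())
    discovered = states.get("Discovered - currently not indexed", 0)
    share = discovered / total if total else 0.0
    # The 15-20% threshold only signals a real problem at 10,000+ pages
    flagged = share > 0.15 and total >= 10_000
    return {"total": total, "discovered_share": round(share, 3), "flagged": flagged}
```

Run it on the full export, then repeat per sitemap segment to see which URL templates drive the shortfall.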
A full site crawl gives you a complete URL inventory with HTTP status codes, redirect chains, canonical configurations, and indexability status — the bulk data that Search Console won't export. For large sites, export the full crawl to a spreadsheet and segment by URL template type (category pages, product pages, filter pages, pagination, etc.) so you can see where the crawl waste is coming from before you start making robots.txt decisions.
Server logs are the only source that shows you exactly which URLs Googlebot actually crawled, how often, and what response codes came back — regardless of what your CMS, sitemap, or GSC says. Log file analysis gets its own section (Section 10). If you have access to server logs through your hosting provider, CDN log export, or a tool like Screaming Frog Log File Analyser, Botify, or JetOctopus, prioritise this above everything else for crawl budget diagnosis.
On a large e-commerce audit, the client's team was convinced they had a content quality problem — thin product descriptions, low engagement rates, too many similar pages. They had budgeted for a six-month content refresh programme.
When I pulled the log files and looked at where Googlebot was actually spending its crawl budget, the content pages were barely part of the story. Around 45% of all Googlebot requests were going to session-parameterised URLs — tracking parameters appended by their analytics and affiliate systems that were generating unique URL variants for every visit. None of these had any SEO value. Blocking the parameter variants in robots.txt and canonicalising the clean URLs took about a week to implement. Crawl frequency on the actual product pages improved substantially within six weeks. The content refresh that had seemed urgent became a secondary priority once Googlebot could actually find and crawl the pages that needed it. — Rohit Sharma
4. URL Inventory Management: The Highest-Leverage Crawl Fix
Google's own documentation calls URL inventory management the most directly actionable crawl budget lever — the thing webmasters can most positively influence. [1] The idea is simple: make sure the only URLs Googlebot can discover and crawl are ones with genuine indexing value, and systematically remove everything else from the crawlable surface.
📊 Crawl Budget Waste — Frequency by URL Type
The URL inventory audit process
Start by segmenting your full URL set by template type, then evaluate each template across five dimensions: URL volume generated, expected indexability, crawl frequency in logs, organic traffic contribution, and business value. Templates generating high URL volume with zero organic traffic and no real search intent behind them are your first targets. You're trying to reduce what Google calls "undesirable perceived inventory" — the mass of low-value URLs that quietly drag down crawl demand signals for every good page on your site. [2]
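The triage rule above can be expressed as a small filter. A hedged sketch: field names like `url_volume` and `crawls_90d` are illustrative, not a standard schema, and the rule uses four of the five dimensions, leaving expected indexability as a manual check:

```python
def first_targets(templates):
    """Rank URL templates for pruning.

    Each template is a dict with illustrative keys: url_volume,
    crawls_90d, organic_visits, business_value. First targets are
    high-volume templates soaking up crawl activity with zero organic
    traffic and no business value -- "undesirable perceived inventory".
    """
    hits = [
        t for t in templates
        if t["url_volume"] >= 1000
        and t["crawls_90d"] > 0
        and t["organic_visits"] == 0
        and not t["business_value"]
    ]
    # Heaviest crawl consumers first: biggest wins at the top
    return sorted(hits, key=lambda t: t["crawls_90d"], reverse=True)
```

Feed it one row per template from your segmented crawl export, not one row per URL.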
5. How to Handle Parameterised URLs and Faceted Navigation
Faceted navigation is the biggest single source of crawl budget waste on e-commerce and directory sites. A modest category structure with 200 categories and 10 filter dimensions can spit out hundreds of thousands of unique URL combinations — and Googlebot will try to crawl all of them if they're reachable via followed links. Every one of those filter combination URLs burns crawl budget while adding nothing that isn't already on the base category page.
Don't block everything before checking whether any filter combinations actually have search volume. "Blue running shoes for men" might have enough demand to justify its own crawlable, indexable URL. Run your key filter combinations through Ahrefs or Semrush. Most won't have meaningful volume — but some will, and those are worth keeping. Document your decisions clearly so future site changes don't quietly re-open blocked patterns.
For URL patterns with no unique content and no search demand, use Disallow in robots.txt to stop Googlebot crawling them — Google's own recommended approach for faceted navigation. Be precise with your patterns; too broad and you'll accidentally block legitimate category pages. Verify everything in the Google Search Console robots.txt tester before deploying. One thing to watch: robots.txt blocks crawling, not indexing. If blocked URLs have external links pointing at them, they can still appear in the index as "URL unknown" entries. Combine robots.txt blocking with noindex on rendered pages if that's a risk.
For filter URLs that need to stay crawlable (blocking them would break the user experience) but shouldn't be indexed, put a canonical tag on them pointing to the base category page. Googlebot crawls the page, reads the canonical signal, and consolidates link equity back to the base. This is the right move when JavaScript rendering makes robots.txt blocking impractical, or when you want Googlebot to reach the page without indexing the filter variant.
```
User-agent: *
# Block low-value filter parameter combinations
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=
Disallow: /*?page=
Disallow: /*&color=
Disallow: /*&size=

# Allow high-value facet combinations with confirmed search demand.
# The longer, more specific Allow rules take precedence over the
# shorter Disallow patterns above for these exact URLs.
Allow: /category/mens-shoes?color=blue
Allow: /category/dresses?length=maxi

# Block internal search pages entirely
Disallow: /search?q=
Disallow: /search/

# Block session and tracking parameters
Disallow: /*?sessionid=
Disallow: /*?utm_source=

# Note: everything sits in a single User-agent: * group. A separate
# "User-agent: Googlebot" group would make Googlebot ignore the
# wildcard group entirely -- a crawler obeys only its best-matching group.

Sitemap: https://yoursite.com/sitemap.xml
```
6. Internal Linking and Crawl Depth Optimisation
Crawl depth — the number of internal link hops from the homepage to a given page — directly affects how often that page gets crawled and how much link equity it accumulates. Pages that require 5 or more hops tend to get crawled infrequently. Pages beyond depth 7 may not get crawled at all within a normal crawl budget cycle. For large sites, reducing crawl depth for high-value content is one of the most impactful architectural changes you can make.
Think of it as: homepage at depth 0; top-level categories at 1–2; subcategories at 2–3; product and article pages at 3–4 maximum. If your architecture puts products 7–8 clicks from the homepage — common in older Magento builds — that's a structural problem. Fix it by creating hub pages, improving category pagination linking, or adding contextual internal links from high-traffic pages down to deep content.
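Crawl depth is just shortest-path distance in the internal link graph, which a breadth-first search computes directly. A minimal sketch, assuming you already have a URL-to-outlinks mapping from a site crawl export:

```python
from collections import deque

def crawl_depths(links, homepage="/"):
    """Compute crawl depth (link hops from the homepage) for every
    reachable URL. `links` maps each URL to the URLs it links to.
    """
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in depths:  # BFS: first visit = shortest path
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths
```

Any high-value URL coming back at depth 5 or more goes on the internal linking plan; URLs missing from the result entirely are unreachable by links.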
Freshness is a crawl demand signal. A page that's been updated but has weak internal linking won't get recrawled quickly — meaning your edits sit in limbo before they show up in the index. Add internal links from high-traffic hub pages to recently updated content. This matters most for news publishers, product pricing pages, and anything where freshness is a ranking differentiator.
Orphan pages — pages with no internal links pointing to them — can only be discovered via sitemap, which means they get crawled infrequently. A full crawl audit in Screaming Frog or Sitebulb will surface them by cross-referencing your crawled URL list with your sitemap. Any page that matters for organic traffic needs at least one internal link from a crawled, indexed page — not just a sitemap entry.
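Given the sitemap URL list and the set of internally linked URLs from a full crawl, orphan detection is a set difference:

```python
def orphan_pages(sitemap_urls, internally_linked_urls):
    """URLs present in the sitemap but with no internal links pointing
    at them -- discoverable only via sitemap, so crawled infrequently."""
    return sorted(set(sitemap_urls) - set(internally_linked_urls))
```

Both inputs come straight out of a Screaming Frog or Sitebulb export; the result is the list to triage for hub-page links.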
7. Server Performance and Crawl Rate Optimisation
For most sites, server performance isn't the constraint — and shaving TTFB from 250ms to 150ms when Googlebot is already crawling freely won't move the needle on crawl budget. But when server performance actually is the bottleneck, fixing it can unlock significant crawl rate improvements. The signs that it's the real problem: TTFB consistently above 500ms in Crawl Stats, frequent 5xx responses in your log data, or a GSC crawl stats chart where crawl rates dropped at the same time your server load went up.
| Metric | Target | Risk Threshold | Crawl Budget Impact |
|---|---|---|---|
| Time to First Byte (TTFB) | <200ms | >500ms | Direct — Googlebot reduces crawl rate automatically above threshold [1] |
| 5xx Error Rate | <0.1% of crawl requests | >1% of daily crawl requests | Significant — sustained 5xx errors trigger Googlebot crawl rate reduction |
| DNS Resolution Time | <50ms | >200ms | Often overlooked — slow DNS consumes part of each crawl request's time budget [3] |
| Redirect Response Time | Each hop <100ms | Chains >3 hops | Each redirect adds latency and consumes an additional crawl request |
| Page Render Time (JS-heavy pages) | <5 seconds | >10 seconds | Googlebot may partially render or skip rendering for slow pages, causing content to be absent from the index |
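The TTFB thresholds in the table can be checked with a small probe. A rough sketch: a single `urllib` request only approximates TTFB (it folds in DNS, connect, and TLS time), so treat it as a sanity check rather than a monitoring tool:

```python
import time
import urllib.request

# Thresholds from the table above
TARGET_MS = 200
RISK_MS = 500

def classify_ttfb(ttfb_ms):
    """Map a measured TTFB against the table's target/risk thresholds."""
    if ttfb_ms < TARGET_MS:
        return "ok"
    if ttfb_ms <= RISK_MS:
        return "watch"
    return "risk"  # above 500ms Googlebot starts reducing crawl rate

def measure_ttfb(url):
    """Approximate TTFB: time until the first response byte arrives."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        resp.read(1)  # first byte of the body
    return (time.monotonic() - start) * 1000
```

Run the probe from a few regions and at different times of day; a single fast reading from one location proves little about what Googlebot sees.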
8. XML Sitemap Hygiene for Crawl Efficiency
Your XML sitemap is a priority signal — it tells Googlebot which URLs deserve attention. A bloated or inaccurate sitemap does the opposite: it dilutes the signal from your best pages by mixing them in with URLs that have no business being there. Google explicitly recommends keeping your sitemap current and pruning outdated URLs as part of core crawl budget management. [1]
Every URL in your sitemap should pass four tests: it returns a 200 HTTP status, it's not blocked by robots.txt, it's not tagged noindex, and it carries a self-referencing canonical. A robots.txt-blocked page in your sitemap sends contradictory signals. A noindex page in your sitemap wastes a crawl request that could go to something indexable. Audit against these four criteria monthly — most CMS platforms silently include non-canonical and noindexed URLs in auto-generated sitemaps.
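The four tests can be run mechanically over a crawl export. A sketch, assuming dict keys like `robots_blocked` and `canonical` that you would populate from your crawler's output (the field names are illustrative):

```python
def sitemap_url_issues(entry):
    """Check one sitemap URL against the four tests: 200 status, not
    robots.txt-blocked, not noindexed, self-referencing canonical.
    Returns a list of failures; an empty list means the URL is clean.
    """
    issues = []
    if entry["status"] != 200:
        issues.append(f"returns {entry['status']}, not 200")
    if entry["robots_blocked"]:
        issues.append("blocked by robots.txt")
    if entry["noindex"]:
        issues.append("tagged noindex")
    if entry["canonical"] != entry["url"]:
        issues.append("canonical points elsewhere")
    return issues
```

Loop it over every sitemap URL monthly; anything with a non-empty result either gets fixed or pulled from the sitemap.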
The lastmod tag influences recrawl priority by signalling when a page was last meaningfully updated. Google's documentation warns that inaccurate lastmod values — ones that don't reflect real content changes — erode the signal's reliability over time. [1] Only update lastmod when substantive changes are made, not on every page load or template render. If your CMS auto-stamps every page with today's date regardless of whether anything actually changed, turn that off.
If you're above 50,000 URLs, sitemap index files segmented by content type (products, categories, articles, landing pages) make Search Console data significantly more useful — you can monitor indexing rates per content type instead of wading through aggregate numbers. Segmented sitemaps also let you submit new content independently without regenerating the full sitemap, which speeds discovery for high-priority new pages.
On a media publishing client with around 42,000 articles, the sitemap audit turned up 6,200 URLs tagged noindex in the CMS — editorial decisions to de-publish older content that the sitemap generator had never been told about. Another 3,100 URLs were returning 301 redirects to consolidated content instead of pointing to the final canonical URL. After cleaning the sitemap down to only canonical, 200-status, indexable URLs, "Discovered — currently not indexed" in Search Console dropped 31% within six weeks. The crawl budget freed up from those dead-end URLs shifted to 4,800 high-priority articles that had been sitting in the discovery queue for over 90 days. [4]
9. AI Bot Crawling in 2026: What It Means for Your Crawl Budget
The crawler landscape has changed significantly since 2024. Crawl budget management is no longer just about Googlebot — it's a multi-bot problem. Cloudflare's July 2025 analysis found that AI and search crawler traffic grew 18% from May 2024 to May 2025, with GPTBot alone growing 305% and Googlebot up 96% in raw request volume. [5] By December 2025, Googlebot alone accounted for 4.5% of all HTML request traffic Cloudflare observed — more than all other AI bots combined at 4.2%. [6]
Managing AI bots without hurting search indexing
| Bot | Primary Purpose | Recommended Setting | Rationale |
|---|---|---|---|
| Googlebot | Search indexing + AI training (dual-use) | Allow (full) | Blocking harms search visibility. Cannot separate search crawl from AI training crawl. |
| Bingbot | Bing search indexing + AI training (dual-use) | Allow (full) | Blocking eliminates ChatGPT Search and Microsoft Copilot citation eligibility. [9] |
| OAI-SearchBot | ChatGPT Search real-time retrieval | Allow | Blocking reduces ChatGPT Search citation eligibility for your pages. [8] |
| GPTBot | OpenAI AI model training (not search) | Publisher's choice | Blocking has no impact on search indexing. GPTBot grew 305% in raw request volume from May 2024 to May 2025. [5] Rate-limit if server load is a concern. |
| ClaudeBot | Anthropic AI model training | Publisher's choice | No search indexing function. Cloudflare data shows Anthropic had the highest crawl-to-refer ratio among AI platforms in 2025. [7] |
| PerplexityBot | Perplexity AI search retrieval | Allow (if AI search visibility is valued) | Grew from a very small base but represents a search retrieval function — blocking reduces Perplexity AI citation eligibility. |
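A bot-access policy like the table above can be kept as data and rendered into robots.txt groups, which makes the per-bot decisions explicit and reviewable. A sketch in Python; the disallow choices for the training-only bots are one example of a publisher decision, not a recommendation:

```python
# Policy mirrors the table: search-retrieval bots allowed, training-only
# bots left as a publisher choice (here: disallowed, as one example).
BOT_POLICY = {
    "Googlebot": "allow",
    "Bingbot": "allow",
    "OAI-SearchBot": "allow",
    "PerplexityBot": "allow",
    "GPTBot": "disallow",      # training only: no search indexing impact
    "ClaudeBot": "disallow",   # training only
}

def robots_txt(policy):
    """Render the per-bot policy as robots.txt groups."""
    groups = []
    for bot, decision in policy.items():
        rule = "Disallow: /" if decision == "disallow" else "Allow: /"
        groups.append(f"User-agent: {bot}\n{rule}")
    return "\n\n".join(groups)
```

Keeping the policy as a reviewed data structure (rather than a hand-edited robots.txt) makes it harder for a future edit to silently block Bingbot or OAI-SearchBot.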
10. Log File Analysis: The Definitive Crawl Budget Data Source
Server log files are the only source that tells you what Googlebot actually did on your site — not what Search Console approximates, not what your sitemap assumes, not what your CMS thinks happened. Every HTTP request to your server gets recorded: the user-agent, URL, response code, response time, timestamp. That data gives you a precise picture of where Googlebot is spending its crawl budget versus where you actually want it spending it.
Question 1: What percentage of crawl requests go to high-value pages?
- Filter logs: User-Agent contains "Googlebot"
- Group by URL template (category, product, article, filter, pagination)
- Calculate percentage of total crawl requests per template
- Target: category + product + article pages = 70%+ of crawl requests
- Red flag: filter/parameter URLs > 20% of crawl requests

Question 2: Which URLs are crawled frequently but never indexed?
- Cross-reference log data with the GSC Page Indexing export
- Identify URLs crawled 10+ times with "Not indexed" GSC status
- These are crawl budget drains — evaluate for noindex or block

Question 3: What is the response code distribution?
- Group crawl requests by HTTP status code
- Calculate percentage: 200, 301, 302, 404, 410, 5xx
- Target: 200 responses > 90% of Googlebot requests
- Red flag: 404 responses > 5%, 5xx responses > 1%

Question 4: Which bots beyond Googlebot are consuming server resources?
- Group all log requests by User-Agent
- Calculate percentage of total requests per bot
- Identify training-only bots consuming significant server bandwidth
- Consider rate limiting for training bots with no search indexing function

Question 5: What is the crawl frequency trend?
- Plot daily Googlebot request volume over 90 days
- Declining trend = server issue or content value degradation
- Stable/growing trend = healthy crawl relationship
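Question 1 of the framework can be answered with a short log parser. A sketch assuming combined-log-format lines; the template patterns are illustrative and should be adapted to your own URL structure. Note it matches on the user-agent string, which spoofed bots can fake, so verify Googlebot IP ranges separately for anything load-bearing:

```python
import re
from collections import Counter

# Extract the request path from a combined-log-format line
LOG_LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP')

# Template rules are illustrative -- adapt to your own URL structure.
# Order matters: filter params are checked before path prefixes.
TEMPLATES = [
    ("filter", re.compile(r"\?(color|size|sort|sessionid)=")),
    ("category", re.compile(r"^/category/")),
    ("product", re.compile(r"^/product/")),
    ("article", re.compile(r"^/blog/")),
]

def googlebot_distribution(lines):
    """Share of Googlebot crawl requests per URL template."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = LOG_LINE.search(line)
        if not m:
            continue
        path = m.group("path")
        name = next((n for n, rx in TEMPLATES if rx.search(path)), "other")
        counts[name] += 1
    total = sum(counts.values())
    return {name: round(c / total, 2) for name, c in counts.items()} if total else {}
```

Compare the output against the 70% high-value target and the 20% filter/parameter red flag from Question 1.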
Recommended log file analysis tools (2026)
For sites under 50,000 URLs, Screaming Frog Log File Analyser is the easiest entry point — it imports server logs and visualises bot behaviour, crawl frequency, and response code distribution in a desktop GUI with no SQL or data engineering needed. Above 100,000 URLs, dedicated platforms like Botify, JetOctopus, and OnCrawl handle the scale better. If your team has data engineering capacity, piping raw logs into BigQuery and visualising in Looker Studio gives you the most flexible analysis environment for the lowest ongoing tooling cost.
11. Crawl Budget for Specific Site Types
E-commerce is where crawl budget optimisation delivers its biggest returns. Faceted navigation is the main culprit — tens of thousands of low-value filter URL combinations quietly absorbing crawl activity while core categories and products wait in the queue. [2] The fix sequence: pull log files to measure how much of Googlebot's budget is going to facet URLs; run keyword demand analysis on key filter combinations; block zero-demand combinations via robots.txt; implement canonicals on retained facet URLs; clean your sitemap of all blocked and noindexed filter variants.
For news publishers, the goal isn't cutting URL inventory — it's maximising crawl frequency for new and recently updated content. The key interventions: server-side rendering so articles are immediately available to Googlebot without waiting on JavaScript; IndexNow implementation to notify Bing and other participating engines at the moment of publication; Google News sitemap submission with accurate timestamps; and internal linking from high-traffic category hubs to new articles within minutes of going live, giving Googlebot additional discovery paths beyond the sitemap.
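The IndexNow step can be sketched as a small submission helper. This follows the public IndexNow protocol (a JSON body with `host`, `key`, and `urlList` POSTed to a participating endpoint); verify the key-file setup against the spec before relying on it:

```python
import json
import urllib.request

def indexnow_payload(host, key, urls):
    """Build the IndexNow JSON body for a batch of freshly published
    URLs. Field names (host, key, urlList) follow the IndexNow protocol."""
    return {"host": host, "key": key, "urlList": list(urls)}

def submit(payload, endpoint="https://api.indexnow.org/indexnow"):
    """POST the payload to a participating engine's IndexNow endpoint.
    A 200/202 response indicates the batch was accepted."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    return urllib.request.urlopen(req)
```

Wire `submit` into your publish hook so notification fires the moment an article goes live, not on the next sitemap regeneration.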
Documentation sites tend to accumulate crawl waste through versioned docs (v1, v2, v3 covering the same features), archived support articles, and developer portal sections behind authentication that Googlebot hits but can't render. High-impact fixes: consolidate versioned documentation to the current version with proper redirects from legacy paths; block authenticated developer portal sections via robots.txt; and check whether your documentation CMS is generating pagination variants for long pages that carry no unique content.
12. How to Measure Crawl Budget Improvements
If you don't establish baselines before making changes, you have no way of knowing whether what you did actually worked. Take measurements before any intervention, track on a consistent timeline after, and you'll have something concrete to point to in a stakeholder report rather than just a gut feeling that things got better.
| Metric | Data Source | Baseline Measurement | Expected Improvement Timeline |
|---|---|---|---|
| Discovered — currently not indexed count | Google Search Console Page Indexing | Export count on day 0 | 4–8 weeks post-fix for significant reduction |
| Daily Googlebot crawl request volume | GSC Crawl Stats or server logs | 30-day average pre-intervention | 2–4 weeks to see redistribution toward high-value pages |
| Indexed page count by template type | GSC Page Indexing (filter by sitemap child) | Count per content type on day 0 | 6–12 weeks for meaningful indexing gains in priority templates |
| Redirect URL count | Screaming Frog or site crawl export | Total 301/302 responses in crawl | Immediate after redirect consolidation; GSC effect within 2–4 weeks |
| Crawl budget waste rate | Server log file analysis | % of Googlebot requests to non-indexable URLs | Immediate after robots.txt/sitemap fixes; verify via next log analysis cycle |
13. Crawl Budget Mistakes to Avoid
| Mistake | Why It Harms Crawl Efficiency | Severity | Fix |
|---|---|---|---|
| Including noindexed pages in XML sitemap | The sitemap says "crawl this," the noindex tag says "don't index this." Mixed signals, wasted crawl request. Turns up in 64% of audited sites. [4] | HIGH | Cross-reference your sitemap against noindex tags monthly. Pull any noindexed URLs out of the sitemap. |
| Leaving soft 404s returning 200 status | Googlebot burns a crawl request on a page that shouldn't exist, and soft 404s quietly drag down crawl demand signals across the whole site. [1] | HIGH | Return a real 404 or 410. Use 410 for permanently removed pages — Google removes 410s from the crawl queue faster than 404s. |
| Blocking Bingbot via robots.txt wildcard | A broad Disallow: / under User-agent: * silently cuts off Bingbot, which kills ChatGPT Search and Microsoft Copilot citation eligibility. Shows up in 38% of audited sites. [9] | CRITICAL | Test Bingbot specifically in Bing Webmaster Tools' robots.txt tester. Explicitly allow it on all content pages. |
| Manually reducing crawl rate in GSC without server constraints | You're artificially capping your crawl budget ceiling for no reason. That setting exists for when 5xx errors confirm your server is overwhelmed — using it as a precaution just slows indexing. | MEDIUM | Remove any crawl rate limits in GSC unless active 5xx errors are the problem. Let Google manage the rate automatically. |
| Using inaccurate lastmod dates in sitemap | When your CMS stamps every page with today's date on every render, Googlebot can't use lastmod to prioritise genuinely fresh content for recrawl. The signal becomes worthless. [1] | MEDIUM | Only update lastmod on real content changes. Disable auto-date-stamping on minor template updates in your CMS sitemap settings. |
| Blocking AI search bots to "save crawl budget" | OAI-SearchBot and Bingbot handle search retrieval, not just training. Blocking them cuts AI search citation eligibility. Their server load is usually small compared to Googlebot and training-only bots anyway. [8] | MEDIUM | Know the difference: training bots (GPTBot, ClaudeBot) vs. search retrieval bots (OAI-SearchBot, Bingbot). Rate-limit or block training-only bots if server load is a genuine concern. |
✅ Crawl Budget Optimisation — Complete Audit Checklist
- Google Search Console Crawl Stats reviewed — average response time and response code distribution checked
- Page Indexing report exported — "Discovered — currently not indexed" count established as baseline
- Full site crawl completed (Screaming Frog/Sitebulb) — URL inventory segmented by template type
- Server log files obtained and analysed — Googlebot crawl distribution mapped to URL templates
- Faceted navigation URL patterns identified — demand-tested against keyword data before blocking
- Low-value facet URLs blocked via robots.txt and/or canonicalised to base category pages
- All redirect chains audited — flattened to single hop where possible
- Soft 404s identified and resolved — returning proper 404 or 410 status codes
- XML sitemap audited — noindexed, redirected, and blocked URLs removed
- Lastmod dates verified — only accurate publication/modification dates used
- Crawl depth of high-value pages checked — pages beyond depth 5 added to internal linking plan
- Orphan pages identified — high-value orphans linked from relevant hub pages
- Internal search result pages blocked via robots.txt
- Session ID and tracking parameter URLs blocked or canonicalised
- Bingbot confirmed as allowed — Bing Webmaster Tools robots.txt tester verified
- OAI-SearchBot not blocked — robots.txt and WAF rules checked
- GPTBot/ClaudeBot strategy documented — conscious decision on training bot access made
- Do not reduce crawl rate in GSC unless active 5xx errors confirm server overload
- Never include noindexed or robots.txt-blocked pages in XML sitemap
- Never return 200 status on pages that have been permanently removed — use 410
14. Frequently Asked Questions About Crawl Budget Optimisation
What is crawl budget in SEO?
Crawl budget is the number of URLs Googlebot can and wants to crawl on your site within a given timeframe — the product of crawl rate limit (how fast Googlebot can crawl without stressing your server) and crawl demand (how strongly Google wants to crawl based on popularity, freshness, and content quality). Pages that don't get crawled can't be indexed. Pages crawled infrequently may surface outdated information in search results. If you're under 5,000 pages with clean architecture, crawl budget is rarely a real constraint. [1]
How do I check my crawl budget in Google Search Console?
Go to Settings → Crawl Stats for total daily crawl requests, average response time, and a 90-day response code breakdown. The Page Indexing report (Indexing → Pages) shows how many of your URLs are in each indexing status category — "Discovered — currently not indexed" is the metric that flags crawl budget problems on large sites. For deeper insight, combine Crawl Stats with log file analysis, which gives you bot-level URL frequency data that Search Console doesn't export. [1]
Does page speed affect crawl budget?
Yes, but only when server response is a genuine constraint — not a marginal improvement opportunity. Googlebot automatically backs off when TTFB exceeds roughly 500ms or when 5xx error rates spike. If your TTFB is under 300ms and 5xx errors are minimal, squeezing more server performance out won't move the crawl budget needle. For most sites, URL inventory management does far more than server optimisation. [1]
Should I use noindex or robots.txt to block low-value pages?
It depends on what outcome you need. Robots.txt blocks crawling but not indexing — pages blocked this way can still end up in the index if external sites link to them. Noindex allows crawling but prevents indexing — Googlebot visits the page, reads the directive, and removes it from the index. For pages with no external links and zero value, robots.txt prevents the crawl request entirely, which is the most efficient option. For pages that might attract external links, use both robots.txt and noindex together. [1]
How does AI bot crawling affect crawl budget in 2026?
AI crawler traffic grew 18% from May 2024 to May 2025, and Googlebot alone now generates more than 25% of all Verified Bot traffic — more than all other AI bots combined. [6] You now need bot access policies across Googlebot and Bingbot (search indexing + AI training), OAI-SearchBot (ChatGPT Search retrieval), GPTBot (OpenAI training only), ClaudeBot (Anthropic training only), and PerplexityBot (search retrieval). Allow search retrieval bots — blocking them harms AI search visibility. Training-only bots can be rate-limited or blocked without touching search indexing.
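The per-bot policy above can be expressed in robots.txt using each crawler's documented user-agent token. This is a sketch, not a universal recommendation — whether to block training bots is a business decision, and hard rate limiting is better enforced at the CDN or WAF since robots.txt directive support varies by bot:

```text
# Search indexing and retrieval bots: allow (blocking them harms search and AI visibility)
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-only bots: can be blocked without touching search indexing
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```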
What is the difference between 404 and 410 for crawl budget?
Both tell Google the page doesn't exist, but 410 (Gone) works faster. Googlebot may revisit a 404 page multiple times before accepting it's gone permanently, while a 410 is treated as a hard removal signal that stops recrawl attempts much sooner. For any page that's been permanently deleted and is never coming back, return a 410 — it's the faster path to eliminating those wasted crawl requests. [1]
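On most stacks a 410 takes one server rule. A minimal nginx sketch, assuming a hypothetical /discontinued/ path prefix for the deleted section:

```text
# nginx: return 410 Gone for a permanently deleted URL family
# (the /discontinued/ prefix is an illustrative example)
location ^~ /discontinued/ {
    return 410;
}
```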
Sources & References
📚 Research, Data & Official Documentation Referenced in This Article
- Google Search Central — Manage Your Crawl Budget (Official Documentation, updated December 10, 2025)
Google's authoritative documentation on crawl budget for large sites — defining crawl rate limit and crawl demand, URL inventory management recommendations, and sitemap guidance.
developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget
- Incremys — SEO Crawl Budget: A Technical Guide (February 2026)
Technical crawl budget guide covering URL family segmentation, faceted navigation drift, and the five-dimension URL evaluation framework for e-commerce and high-volume sites. References Google documentation on "undesirable perceived inventory" as the most influenceable lever.
incremys.com/en/resources/blog/seo-crawl-budget
- CaptainDNS — Crawl Budget Optimization: Complete Guide 2026 (February 2026)
Comprehensive crawl budget guide covering the effective crawl budget formula, DNS performance impact on crawl speed, and seven practical optimisation techniques including URL inventory control and sitemap hygiene.
captaindns.com/en/blog/crawl-budget-optimization
- IndexCraft — Internal Crawl Budget Audit Data (2025–2026)
Proprietary observational data from crawl budget audits across 35+ client websites, log file analyses, and Google Search Console coverage tracking conducted by Rohit Sharma at IndexCraft. Aggregate findings cited in this article; full data available to clients under NDA.
- Cloudflare Blog — "From Googlebot to GPTBot: Who's Crawling Your Site in 2025" (July 2025)
Cloudflare's analysis of crawler traffic patterns showing 18% overall AI and search crawler growth from May 2024 to May 2025, GPTBot growing 305%, Googlebot up 96%, and the shift in market share across leading bots including GPTBot rising from #9 to #3.
blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
- Cloudflare Radar Year in Review 2025 (December 2025)
Cloudflare's sixth annual review of internet traffic trends — confirming Googlebot generated more than 25% of all Verified Bot traffic and 4.5% of all HTML request traffic in 2025, exceeding all other AI bots combined (4.2%). Includes analysis showing Googlebot crawled 11.6% of all sampled pages versus GPTBot at 3.6%.
blog.cloudflare.com/cloudflare-radar-2025-year-in-review/
- Cloudflare Blog — "The crawl-to-click gap: Cloudflare data on AI bots, training, and referrals" (October 2025)
Analysis of AI bot crawl-to-refer ratios showing Anthropic with the highest ratios (up to 38,000 crawls per visitor in July 2025), and the breakdown of AI crawling purposes: 82% training, 15% search, 3% user actions.
blog.cloudflare.com/crawlers-click-ai-bots-training/
- OpenAI — OAI-SearchBot & GPTBot Crawler Documentation
OpenAI's official documentation on web crawler user-agent strings, crawl behaviour, and recommended robots.txt configuration distinguishing OAI-SearchBot (ChatGPT Search retrieval) from GPTBot (AI model training only).
platform.openai.com/docs/gptbot
- IndexCraft — ChatGPT SEO Guide 2026 (March 2026)
IndexCraft's platform-specific guide on ChatGPT Search optimisation — covering the Bing index prerequisite, Bingbot crawlability requirements, and the finding across 47 audited sites that 38% had inadvertent Bingbot-blocking in robots.txt affecting ChatGPT Search eligibility.
indexcraft.in/blog/chatgpt-seo-guide
- Platform-specific deep-dive on ChatGPT Search optimisation — Bing indexing prerequisites, OAI-SearchBot crawlability, Browse tool mechanics, and the content structure signals that earn ChatGPT footnote citations. Read ChatGPT SEO guide →
- Complete schema markup implementation guide covering Article, FAQPage, HowTo, Product, and BreadcrumbList schemas — the structured data signals that improve both traditional SERP features and AI search citation eligibility. Read schema markup guide →
- Platform-exclusive deep-dive covering Google AI Mode's Gemini architecture, full-page search experience, and the content and technical signals specific to Google AI Mode citation — including how crawl quality affects AI Mode inclusion. Read Google AI Mode guide →
- The complete technical SEO foundation guide covering Core Web Vitals, JavaScript rendering, mobile-first indexing, HTTPS, structured data, and the full technical audit checklist — the framework that crawl budget optimisation sits within. Read technical SEO guide →