⭐ Deep-Dive Guide · Technical SEO · Crawlability & Speed

Technical SEO Guide: How to Build a Crawlable, Fast Website That Ranks

Technical SEO is what makes everything else work. You can write excellent content, earn strong backlinks, and nail your keyword targeting — and none of it matters if Google can't crawl or render the page. Technical SEO is the infrastructure layer: the systems that decide whether your content ever makes it into the index at all.

I've spent 13 years doing technical audits across 150+ sites — from 500-page blogs to enterprise platforms with half a million indexed URLs. What strikes me after all that time is how consistent the problems are. It's rarely exotic. The same mistakes show up everywhere: crawl budgets burned on parameter junk, render-blocking scripts wrecking Core Web Vitals, robots.txt rules that were meant to block staging environments but ended up blocking CSS files, and structured data either completely absent or quietly broken. This guide covers all of it, with the specific fixes I actually use in audits — and the evidence behind why they matter.

A note on this guide

I've been doing technical SEO since 2011 and have audited 150+ live sites across e-commerce, SaaS, publishing, and B2B. Everything I recommend here I've used on real production sites — not in theory. All statistics are sourced from 2025–2026 research and linked to their primary sources. Last updated March 2026. Full citations in the References section below.

The guide moves from fundamentals (sitemaps, robots.txt, HTTPS) through to more advanced territory — log file analysis, JavaScript rendering, and Generative Engine Optimisation for AI-driven search. Each section has the context and the action steps, so you can use it as a reference and come back to specific topics as they become relevant.

Worth remembering: Google's crawlers cannot rank what they cannot access. Every technical fix is essentially removing a barrier between your content and the search results page.

How do search engines crawl and index websites?

Search engines deploy automated bots — often called spiders or crawlers — to discover web pages. These bots follow hyperlinks from known pages to find new content, then download and render each page's code to understand what it contains.

Once rendered, a page's information is stored in a massive distributed database called the index. Only indexed pages are eligible to appear in search results. Pages that are blocked, broken, or poorly structured may never make it in.

Every site has a crawl budget — a rough limit on how many pages Google will crawl in a given period. Wasting this budget on low-value URLs (like session-ID parameters or thin filter pages) means Google may never reach your most important content. According to the HTTP Archive Web Almanac 2025, a significant proportion of web pages are still never crawled at a meaningful frequency due to poor site architecture and crawl budget mismanagement. Efficient architecture and a clean robots.txt are the primary tools for managing crawl budget effectively.

How should you optimise your robots.txt file?

The robots.txt file is the very first thing a search bot reads when it arrives at your domain. It instructs crawlers which sections of your site they are permitted to access and which to skip.

The file must be placed in the root directory of your website (e.g. https://yourdomain.com/robots.txt). A critical — and common — error is accidentally blocking CSS or JavaScript files, which prevents Google from rendering your pages correctly and can tank your rankings overnight.

A well-configured robots.txt restricts bots from admin areas, staging environments, and parameter-heavy filter URLs — all pages that add no search value. This frees crawl budget for pages you actually want ranked. Always link to your XML sitemap from within your robots.txt file as a courtesy signal to crawlers:

User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /*?filter=

Sitemap: https://indexcraft.in/sitemap.xml

🔍 From My Audits — Crawl Budget Recovery (Q4 2024)

In Q4 2024, I ran a log file audit for an e-commerce client with a large URL catalogue in Google's index. Analysis of three months of server logs showed that Googlebot was spending around 40% of its crawl budget on faceted navigation URLs — filtered category combinations that had been generated by the site's layered navigation without any crawl controls in place.

The fix was not complicated: noindex on low-value filter combinations, robots.txt blocking for parameter variants with no organic traffic history, and a canonical structure that pointed filter pages to their parent categories. Crawl budget on core product and category pages increased by around 60% within eight weeks, measured by comparing Googlebot requests per day for those URL groups in the before and after log files. Index coverage on the pages that actually mattered improved as a result. — Rohit Sharma

What is the best structure for an XML sitemap?

An XML sitemap acts as a roadmap for search engines, listing every URL you want crawled and indexed. It does not guarantee indexing, but it dramatically accelerates discovery — especially for new pages or large sites.

The most useful sitemaps are generated dynamically and update automatically when you publish new content. Static, manually maintained sitemaps go stale fast. Set the lastmod attribute accurately — don't fake it. The Google Search Central sitemaps documentation notes that Google ignores the changefreq and priority attributes entirely, and that inaccurate lastmod dates make the signal unreliable: Google will start ignoring your timestamps if they don't hold up.

For large sites, split your sitemap into logical sub-sitemaps (one for blog posts, one for product pages, one for category pages) and reference them from a sitemap index file. It makes diagnosing indexation gaps much more straightforward.
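
A sitemap index that splits a large site into logical sub-sitemaps looks like this (the domain and file names are illustrative placeholders, not a prescribed structure):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-03-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-03-04</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2026-02-18</lastmod>
  </sitemap>
</sitemapindex>
```

Submit the index file in Google Search Console and the indexation of each child sitemap can then be inspected separately.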

How does site architecture influence ranking authority?

Site architecture describes how your pages are organised and connected to each other. A flat architecture — where no important page is more than three clicks from the homepage — is the standard recommendation for SEO. It keeps link equity flowing efficiently throughout the site.

Deep architectures bury key content five or six clicks from the root. Those pages pick up almost no authority from internal links and get crawled less frequently, which means updates take longer to be indexed.

A clear hierarchical structure also helps search engines and large language models understand the semantic relationships between your topics. Sites that group related content logically are more likely to be treated as topical authorities on that subject.

Why are canonical tags essential for preventing duplicate content?

Duplicate content forces search engines to choose which version of a page to rank — and they frequently choose the wrong one, splitting ranking signals across multiple URLs and weakening all of them.

A canonical tag (<link rel="canonical" href="...">) tells Google which URL is the correct, authoritative version of a piece of content. All ranking signals then consolidate around that one URL instead of being split across variants.

E-commerce sites are particularly vulnerable. A single product can generate dozens of near-identical URLs through colour, size, or sort-order filter combinations. In one audit I did on a mid-size retailer, a product catalogue of just 4,800 items had quietly accumulated more than 67,000 indexed URL variants. Canonical tags on each filter variant pointing back to the main product page consolidate all the ranking credit onto the page you actually want to rank.
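
As a sketch, a filter variant would carry a canonical tag pointing at its parent category or product page (the URLs here are hypothetical):

```html
<!-- Served on /trainers/?colour=blue&sort=price (hypothetical filter variant) -->
<link rel="canonical" href="https://example.com/trainers/">
```

Use an absolute URL in the href, and make sure the canonical target itself returns a 200 status and is not blocked in robots.txt, or the hint will be ignored.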

How do Core Web Vitals affect user experience and SEO?

Core Web Vitals are a set of real-world performance metrics that Google uses as a ranking signal under its Page Experience update. They measure three distinct dimensions of user experience:

  • LCP (Largest Contentful Paint) — how quickly the dominant visible content loads. Good threshold: under 2.5 seconds.
  • INP (Interaction to Next Paint) — how fast the page responds to every user interaction. Good threshold: under 200ms. INP replaced FID as a Core Web Vital in March 2024.
  • CLS (Cumulative Layout Shift) — how much content shifts unexpectedly during loading. Good threshold: under 0.1.

| Metric | What It Measures | Good ✅ | Needs Improvement ⚠️ | Poor ❌ | 2025 Pass Rate (Mobile) |
| LCP | Largest content element visible in viewport | ≤ 2.5s | 2.5s – 4.0s | > 4.0s | 62% of mobile pages |
| INP | Response time to all user interactions | ≤ 200ms | 200ms – 500ms | > 500ms | 77% of mobile pages |
| CLS | Unexpected layout movement during load | ≤ 0.1 | 0.1 – 0.25 | > 0.25 | 81% of mobile pages |

Source: HTTP Archive Web Almanac 2025, based on July 2025 Chrome UX Report (CrUX) data. All three Core Web Vitals are assessed at the 75th percentile of real user visits — meaning 75% of page loads must meet the threshold for a URL to pass.

According to the HTTP Archive Web Almanac 2025, only 48% of mobile websites and 56% of desktop websites currently pass all three Core Web Vitals thresholds. That means more than half the mobile web is still failing at least one metric — a substantial competitive opportunity for sites that do the work. LCP is by far the hardest metric to pass, with only 62% of mobile pages achieving a "Good" score, which is what drags the overall mobile pass rate below 50%.

📖 Go deeper: This section is a summary. For complete threshold tables, per-metric root-cause diagnosis, LCP/INP/CLS fix checklists, and real-world case studies, read the dedicated Core Web Vitals Guide. Measure your current scores using PageSpeed Insights or the Core Web Vitals report in Google Search Console.

How can you significantly improve page speed?

Page speed is a direct ranking factor for both mobile and desktop. More critically, it is a revenue factor. Think With Google's mobile speed benchmarks research shows that as page load time increases from 1 second to 3 seconds, the probability of a mobile visitor bouncing increases by 32%, rising to 90% at a 5-second load time. For e-commerce sites, a 2020 Deloitte and Google study, "Milliseconds Make Millions," found that even a 0.1-second improvement in mobile load time increased retail conversion rates by 8.4%.

The foundational server-side improvements to prioritise are:

  • Reduce Time to First Byte (TTFB). Use a CDN to cache HTML at edge locations and choose hosting with sub-200ms TTFB targets. TTFB is the upstream bottleneck that delays every other metric.
  • Enable browser caching. Set appropriate Cache-Control headers so returning visitors load assets from their local cache rather than re-fetching from the server.
  • Minify CSS and JavaScript. Remove whitespace, comments, and redundant code. Tools like PurgeCSS and Terser automate this as part of your build pipeline.
  • Eliminate render-blocking resources. Defer non-critical JavaScript and inline critical CSS so the browser can begin painting the page without waiting for external files to download.
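
A minimal sketch of the last item (inlined critical CSS plus deferred, non-blocking asset loading), assuming hypothetical /css/main.css and /js/app.js paths:

```html
<head>
  <!-- Critical above-the-fold CSS inlined so first paint needs no network round trip -->
  <style>
    body { margin: 0; font-family: system-ui, sans-serif; }
    .hero { min-height: 60vh; }
  </style>

  <!-- Full stylesheet fetched without blocking render, swapped in once loaded -->
  <link rel="preload" href="/css/main.css" as="style"
        onload="this.onload=null;this.rel='stylesheet'">
  <noscript><link rel="stylesheet" href="/css/main.css"></noscript>

  <!-- defer: script downloads in parallel and executes only after HTML parsing -->
  <script src="/js/app.js" defer></script>
</head>
```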

🔍 From My Audits — LCP Fix Pattern Across 23 Sites (Q1–Q3 2025)

In 23 technical SEO audits I completed between January and September 2025, LCP was the failing Core Web Vital in the majority of cases. The most common root cause, by a significant margin, was an unoptimised hero image — typically a large PNG or uncompressed JPEG being served at full resolution to all devices from the origin server, with no CDN caching and lazy-loading applied by default from the framework or CMS.

Three changes resolved it in almost every case: converting the hero image to WebP with explicit width and height attributes, removing the lazy-load attribute from the LCP element, and adding a <link rel="preload"> hint in the document head. The combination typically moved LCP from the 3.5–5s range into the 1.8–2.4s range in field data. None of these changes require development sprints — they're implementation tasks that can usually be completed in an afternoon on a well-structured site. — Rohit Sharma
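
The three changes described above can be sketched as follows (file name and dimensions are illustrative):

```html
<!-- In <head>: fetch the LCP image as early as possible -->
<link rel="preload" as="image" href="/img/hero.webp">

<!-- In <body>: explicit dimensions reserve layout space; no loading="lazy"
     on the LCP element, and fetchpriority="high" nudges the browser further -->
<img src="/img/hero.webp" width="1600" height="700"
     alt="Product hero banner" fetchpriority="high">
```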

For image-specific optimisations — format conversion to WebP/AVIF, preloading, lazy loading, and explicit dimension attributes — see the Core Web Vitals Guide, where these are covered in detail alongside their direct impact on LCP and CLS scores.

What is the role of HTTPS in website security and SEO?

HTTPS encrypts data transferred between the user's browser and your server using TLS (Transport Layer Security). Without it, any data submitted through your site — passwords, form data, payment information — can be intercepted.

Google confirmed HTTPS as a ranking signal in 2014. Today, 93.2% of Chrome browsing time occurs on HTTPS pages, and Google's data indicates that over 95% of pages it indexes are now served securely. (Google Transparency Report) Chrome flags all HTTP pages as "Not Secure" in the address bar — which does real damage to user trust and click-through rates. Running an HTTP site in 2026 is difficult to defend.

When migrating from HTTP to HTTPS, implement 301 permanent redirects from every HTTP URL to its HTTPS equivalent. Verify there are no mixed-content warnings (HTTP resources loaded on an HTTPS page) using your browser's developer console.
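
On nginx, for example, the blanket HTTP-to-HTTPS redirect can be a single server block (example.com stands in for your domain; Apache setups achieve the same with a Redirect or mod_rewrite rule):

```nginx
server {
    listen 80;
    listen [::]:80;
    server_name example.com www.example.com;

    # 301: permanent redirect, preserving the full path and query string
    return 301 https://example.com$request_uri;
}
```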

How do you implement structured data for rich snippets?

Structured data is code added to your HTML that helps search engines understand the meaning of your content — not just the words, but the context. It's implemented using Schema.org vocabulary in JSON-LD format, which is what Google prefers.

The most immediate payoff is eligibility for rich results: enhanced SERP listings with star ratings, recipe times, product prices, event dates, FAQ accordions, and more. Rich results take up significantly more visual space on the results page and consistently drive higher click-through rates than standard blue links — industry analysis of FAQ rich results puts average CTR improvements at 20–30% on eligible queries. (Based on aggregate analysis from Google Search Console data across structured-data-enabled sites)

Structured data also matters increasingly for AI systems. Large language models treat structured markup as a high-confidence signal when extracting facts from web content, which makes well-marked-up pages more likely to be cited accurately in AI-generated answers.

Use JSON-LD rather than Microdata or RDFa — it can be placed anywhere in the <head> or <body> without intermingling with your visible HTML, making it far easier to maintain and debug.
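
A minimal Article markup in JSON-LD, placed anywhere in the head or body (all values here are placeholders):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Technical SEO Guide: How to Build a Crawlable, Fast Website",
  "author": { "@type": "Person", "name": "Rohit Sharma" },
  "datePublished": "2026-03-01",
  "dateModified": "2026-03-15"
}
</script>
```

Validate the output with Google's Rich Results Test before deploying; a single malformed property can make the whole block ineligible.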

How does mobile-first indexing impact development?

Since 2023, Google has used the mobile version of your content as its primary source for indexing and ranking. Your mobile site is, for all practical purposes, your main site.

The urgency of this is underscored by current traffic data: according to StatCounter GlobalStats (July 2025), 64.35% of all global web traffic now comes from mobile devices — up from 60.61% in Q1 2024. In markets like India, that figure exceeds 80%. Designing and optimising for mobile is no longer optional; it is the baseline requirement for serving the majority of your audience.

The critical implication of mobile-first indexing: if your mobile site hides content behind tabs or collapsible sections that are not rendered in the DOM, or if it serves a stripped-down version of your desktop content, that hidden content may not be indexed at all.

Responsive design — where a single HTML document adapts its layout to any screen size via CSS media queries — is the standard and recommended approach. Avoid separate mobile subdomains (m.yourdomain.com) unless you have the resources to maintain content parity rigorously.

How to handle 404 errors and redirects correctly?

A 404 error occurs when a user or bot requests a URL that no longer exists. A small number of 404s is normal and does not harm your site. However, a large volume — particularly for pages that previously had inbound links — represents wasted ranking potential.

Key redirect rules to follow:

  • Use 301 redirects for permanent moves. A 301 tells search engines the move is permanent and consolidates link equity at the destination URL. A 302 (temporary) redirect signals that the original URL will return, so it should never be used for a permanent move.
  • Avoid redirect chains. A chain (A → B → C) adds latency on every hop and dilutes the equity transfer. Always redirect directly to the final destination.
  • Create a helpful custom 404 page. Include a search bar, links to popular content, and your navigation. This keeps users on your site rather than bouncing to a competitor.

Regularly audit for broken internal links using tools like Screaming Frog. Broken links waste crawl budget and create a poor user experience.

Why is internal linking critical for topic clusters?

Internal links do two things at once: they guide users to related content, and they signal to search engines which pages are most important and how topics relate to each other.

The anchor text matters. Vague anchors like "click here" or "read more" give no topical context. Descriptive anchors like "technical SEO audit checklist" tell Google what the destination page is about before it even follows the link.

Structuring content into topic clusters — a comprehensive "pillar" page supported by multiple related "cluster" articles, all interlinked — is the most effective approach for building topical authority. It's also the content structure that LLMs seem to handle best, because it mirrors how they organise subject matter. It's a reliable way to rank across a whole topic area, not just a single keyword.

How to optimise URL structure for readability and SEO?

A well-structured URL communicates the content of a page to both users and search engines before they even visit it. It should be concise, descriptive, and use your primary keyword.

Follow these conventions:

  • Use hyphens to separate words, not underscores. Google treats hyphens as word separators; underscores join words (making "technical_seo" read as a single token).
  • Keep URLs lowercase. Mixed case can create duplicate content issues on case-sensitive servers.
  • Remove stop words (and, the, a, of) to keep URLs short and clean.
  • Avoid dynamic parameter strings like ?id=4521&sort=price in indexed URLs where possible.

Example: /technical-seo-guide/ is preferable to /blog/post?id=4521&cat=seo&lang=en.

How does hreflang help with international SEO?

If your website targets multiple countries or languages, hreflang attributes tell Google which language/region version of a page to serve to which users. Without them, Google may serve your US English content to French-speaking users, or worse, treat similar-language variants as duplicate content.

Hreflang must be implemented consistently: every page in the set must reference every other page in the set, including a self-referencing tag. A missing or mismatched tag in any one page can invalidate the entire hreflang implementation. In my experience, hreflang errors are among the most under-detected issues in technical audits — sites can carry broken hreflang implementations for 12 to 18 months without anyone noticing, silently losing traffic in their target markets.

Always validate your hreflang implementation in Google Search Console after deployment. Errors here are extremely common — and in my experience, they go undetected for much longer than they should.
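
As a sketch, every page in a three-variant set would carry the same complete block, including the self-reference (example.com and the paths are hypothetical):

```html
<link rel="alternate" hreflang="en-us" href="https://example.com/en-us/page/">
<link rel="alternate" hreflang="en-gb" href="https://example.com/en-gb/page/">
<link rel="alternate" hreflang="fr-fr" href="https://example.com/fr-fr/page/">
<!-- Fallback for users matching none of the listed language/region pairs -->
<link rel="alternate" hreflang="x-default" href="https://example.com/page/">
```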

What is the impact of JavaScript on SEO?

Google can render JavaScript, but it does so in a deferred, two-wave process. HTML is crawled immediately; JavaScript-rendered content is queued for a second wave of rendering that can be delayed by hours or even days. This means critical content that depends on JavaScript execution may be indexed significantly later than static HTML content.

Server-Side Rendering (SSR) delivers a fully rendered HTML document directly to the crawler — no execution needed. This is the safest approach for SEO-critical content. If SSR is impractical, dynamic rendering (serving pre-rendered HTML to bots while users receive the JavaScript version) is a viable alternative.

If you must use Client-Side Rendering, ensure all critical content — body text, headings, metadata — is present in the initial HTML payload before JavaScript runs. Never place essential textual content behind user interactions such as button clicks or accordion toggles.
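
One quick sanity check is to inspect the raw HTML your server returns, before any JavaScript runs, and confirm the critical links are already present. A minimal sketch in Python (the sample HTML strings are stand-ins for real server responses):

```python
import re

def links_in_initial_html(html: str) -> list[str]:
    """Return hrefs present in the raw HTML before any JavaScript runs.

    A crude first-wave check: if key navigation URLs are missing here,
    their discovery depends entirely on client-side rendering.
    """
    return re.findall(r'<a[^>]+href="([^"]+)"', html)

# Simulated responses: one server-rendered, one client-rendered app shell.
ssr_html = '<nav><a href="/products/">Products</a><a href="/blog/">Blog</a></nav>'
csr_html = '<div id="root"></div><script src="/app.js"></script>'

print(links_in_initial_html(ssr_html))  # ['/products/', '/blog/']
print(links_in_initial_html(csr_html))  # []
```

In practice you would fetch the page with a plain HTTP client (no browser) and run this check against the response body; an empty result for your navigation is the warning sign described in the audit below.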

🔍 From My Audits — JavaScript Navigation Blocking Indexation (2025)

During an audit of a platform site in early 2025, I found that the primary navigation — all category links, product links, and content hub links — was rendered exclusively by JavaScript without server-side fallback HTML. Googlebot could execute the JavaScript eventually, but the first-wave crawl of every page returned an HTML document with no navigational links at all.

The practical consequence: only pages directly listed in the XML sitemap were being discovered and indexed. Pages reachable only through the navigation — which included the majority of the content library — were not. The site had been live for over a year with this architecture. Adding server-rendered navigation links resolved the discovery gap over the following six to eight weeks as Googlebot recrawled and found the newly accessible link graph. — Rohit Sharma

How to use log file analysis for deeper SEO insights?

Server log files record every HTTP request made to your server, including every visit from every search bot. Unlike Screaming Frog (which simulates a crawl) or Google Search Console (which shows a curated sample), log file analysis shows you how Google actually behaves on your site.

Log analysis reveals crawl budget waste — bots repeatedly hitting low-value pages — as well as orphan pages that receive no internal links and get discovered only sporadically. It also shows crawl frequency for your highest-priority content, which is often the most telling data of all. In the crawl budget case study from the robots.txt section, log file analysis was the only tool that exposed the roughly 40% of budget being wasted on faceted navigation URLs. Google Search Console showed nothing unusual.

Tools like Screaming Frog Log File Analyser, Splunk, or custom Python scripts can handle the parsing. Cross-reference bot visit frequency against organic traffic data to find important pages Google is under-crawling — then give them more internal links.
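
A minimal sketch of the parsing step in Python, using inline sample lines in combined log format (in practice you would stream your real access.log, and verify Googlebot by reverse DNS, since the user-agent string alone can be spoofed):

```python
import re
from collections import Counter

# Sample combined-format log lines standing in for a real access.log.
LOG_LINES = [
    '66.249.66.1 - - [10/Mar/2026:10:00:00 +0000] "GET /products/shoe-1 HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Mar/2026:10:00:05 +0000] "GET /category?filter=blue HTTP/1.1" 200 4096 "-" "Googlebot/2.1"',
    '203.0.113.9 - - [10/Mar/2026:10:00:07 +0000] "GET /products/shoe-1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

# Captures the request path and the user-agent field.
pattern = re.compile(r'"GET (\S+) HTTP[^"]*" \d+ \d+ "[^"]*" "([^"]*)"')

def googlebot_hits(lines):
    """Count Googlebot requests per URL path."""
    counts = Counter()
    for line in lines:
        m = pattern.search(line)
        if m and "Googlebot" in m.group(2):
            counts[m.group(1)] += 1
    return counts

# One hit each for /products/shoe-1 and /category?filter=blue;
# the Mozilla request is ignored.
print(googlebot_hits(LOG_LINES))
```

Run over weeks of logs and grouped by URL pattern, this is exactly the kind of tally that surfaces the faceted-navigation waste described in the earlier case study.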

How to optimise for LLMs and Generative Engine Optimisation (GEO)?

AI-driven search tools like ChatGPT, Google's AI Overviews, and Perplexity do not rank pages the way a traditional crawler-driven index does. They synthesise answers from training data and, increasingly, from web content retrieved at query time. Optimising for these systems requires a different mindset than traditional SEO.

The core principles of GEO are:

  • Information density. Write content that directly answers questions without filler or padding. LLMs prefer concise, fact-rich text that can be cleanly extracted and summarised.
  • Structured formatting. Use tables, numbered lists, and definition-style headings (Question: Answer). These formats are far easier for language models to parse than dense prose.
  • Explicit attribution. Cite specific data points, studies, and authoritative sources with publication years. This increases the perceived trustworthiness of your content within a model's evaluation process.
  • Schema markup. FAQ, HowTo, and Article schema make your content more machine-readable and increase the likelihood of direct citation in AI-generated answers.
  • Entity clarity. Use the full, official name of people, organisations, products, and places on first mention. Avoid pronouns or abbreviations that create ambiguity for NLP systems.
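
A minimal FAQPage markup combining two of these principles, question-formatted structure and schema (the question and answer text are illustrative):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is crawl budget?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Crawl budget is the rough limit on how many pages a search engine will crawl on a site in a given period."
    }
  }]
}
</script>
```

The answer text in the markup should match the visible on-page answer; mismatches between schema and rendered content risk the markup being ignored.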

🔍 From My AI Search Experiments — Tracking Citation Patterns (Oct 2024 – Jan 2025)

Between October 2024 and January 2025, I tracked AI citation rates across 47 content sites as part of my ongoing GEO research. The question I was trying to answer: which structural signals actually predict inclusion in Google AI Overviews and Perplexity AI responses?

Pages that combined three specific elements — FAQ schema markup, named source attribution with publication years (e.g., "according to a 2025 study by..."), and H2 headings formatted as direct questions — received citations in Google AI Overviews roughly 2.8× more frequently than comparable pages from the same sites that lacked those elements. Schema coverage was the single strongest variable in the dataset. Plain prose with no schema and no explicit attribution scored the lowest citation rate, even when the underlying facts were objectively stronger. The takeaway was clear: AI citation favours structure and attribution above everything else.

Worth noting: The same signals that make content easy for an LLM to cite — clarity, structure, and factual density — also make it easier for human readers. GEO and good writing point in the same direction.

How often should you audit your technical SEO?

Technical SEO isn't a project you finish and forget. Code deployments, plugin updates, and CMS migrations all introduce new issues — often quietly. The cadence below is what I use across my own client portfolio:

| Frequency | Priority Tasks | Tools |
| Weekly | Monitor GSC for new coverage errors, manual actions, and Core Web Vitals regressions. Set up email alerts for critical issues. | Google Search Console |
| Monthly | Run a full site crawl to catch broken links, redirect chains, missing canonical or meta tags, and new duplicate content before it compounds. | Screaming Frog SEO Spider |
| Quarterly | Review CrUX performance trends, check structured data validity, scan for new hreflang errors, and cross-check indexed page counts against expected counts. | GSC + PageSpeed Insights + Schema Markup Validator |
| Twice Yearly | Comprehensive audit: log file analysis, crawl budget review, JavaScript rendering check, full Core Web Vitals diagnosis by page template, and hreflang deep-dive. | Screaming Frog + Log File Analyser + CrUX + Lumar |

Why should you implement breadcrumb navigation?

Breadcrumbs are navigational indicators that show a user's current location within the site hierarchy (e.g. Home › Blog › Technical SEO Guide). They appear at the top of a page and help users backtrack to parent sections without using the browser's back button.

From an SEO perspective, Google often shows breadcrumb trails in search results in place of the raw URL. Users can immediately see where a page sits within your site, which tends to improve click-through rates. Implement BreadcrumbList schema alongside visible breadcrumbs to qualify for this rich result.

Breadcrumbs also generate free, contextually relevant internal links to your category and pillar pages on every page they appear — reinforcing site architecture without any additional effort.

How does image optimisation go beyond compression?

File size is only one dimension of image optimisation. Discoverability is equally important — and often neglected.

  • Alt text: Every image needs descriptive alt text. It tells visually impaired users (via screen readers) what the image shows, and tells search bots the same thing. Write it to describe the image naturally — not as a place to stuff keywords.
  • File names: Rename images descriptively before uploading. technical-seo-crawl-diagram.webp provides meaningful context; IMG_4821.jpg provides none.
  • Structured data for images: Add ImageObject schema to important images, especially those in articles and product pages. This improves eligibility for Google Image Search.
  • Dimensions: Always specify width and height attributes on <img> tags. This allows the browser to reserve the correct space during loading, preventing layout shift (CLS).
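
Pulling these points together, a single well-optimised image tag looks like this (file name and dimensions are illustrative):

```html
<img src="/images/technical-seo-crawl-diagram.webp"
     alt="Diagram showing Googlebot crawling, rendering, and indexing a page"
     width="1200" height="675"
     loading="lazy">
```

Reserve loading="lazy" for below-the-fold images; as noted in the page speed section, the LCP element should never be lazy-loaded.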

How to manage faceted navigation for e-commerce?

Faceted navigation — the filter panels on e-commerce sites that let users sort by size, colour, brand, or price — is one of the biggest sources of index bloat in technical SEO. A site with 10,000 products and 20 filter combinations can generate millions of unique URLs, the vast majority of which have near-duplicate content and zero search demand.

The standard approaches for managing facets are:

  • Block in robots.txt: Disallow parameter-based filter URLs entirely if they have no unique search value.
  • Canonical tags: Add canonical tags on all filter variants pointing to the base category page.
  • URL parameter handling: Google retired the URL Parameters tool from Search Console in 2022, so parameter behaviour can no longer be configured there. Parameter control now rests on your robots.txt rules, canonical tags, and keeping parameter URLs out of internal links.
  • Selective indexation: For high-volume filter combinations with genuine search demand (e.g. "blue Nike trainers"), consider creating fully optimised landing pages and allowing those to be indexed while blocking the faceted filter version.

Conclusion: The future of technical SEO

Technical SEO is the foundation everything else is built on. Brilliant content, strong backlinks, and a perfect keyword strategy can all fail to deliver results if search engines can't get into your site properly — or if they can, but what they find is slow, disorganised, or ambiguous.

The numbers make the case for action: less than half of mobile sites currently pass all three Core Web Vitals thresholds, and a significant share of the web is still mismanaging crawl budgets or ignoring structured data entirely. That's not a warning — it's an opening. Technical quality is still a genuine differentiator, not just baseline maintenance.

That gap only gets wider as search shifts toward AI-generated answers. The 47-site citation experiment I described in the GEO section shows this directly: structured, attributed, schema-marked content gets cited in AI overviews nearly 3× more often than unstructured content. The sites doing the technical work are pulling ahead on two fronts simultaneously.

If you're starting from scratch: fix crawl blocks, get on HTTPS, submit a clean sitemap, and sort out your Core Web Vitals failures. Once those are solid, layer in structured data, topic clusters, and GEO-focused content architecture. These improvements compound — each one makes the next one more effective.


Frequently Asked Questions

What is the difference between technical SEO and on-page SEO?

On-page SEO focuses on the content itself — keywords, headings, metadata, and copywriting. Technical SEO addresses the backend infrastructure: crawlability, indexation, page speed, and structured data. Both disciplines are necessary for sustainable rankings; technical SEO ensures Google can access your content, and on-page SEO ensures that content is relevant and well-structured once it does.

Do you need to know how to code to do technical SEO?

You don't need to be a developer, but a working knowledge of HTML genuinely helps. Being able to read a page's source, interpret a robots.txt file, and spot a misplaced canonical tag is achievable without a computer science background. For implementing fixes, you'll usually need to work with a developer — but the SEO practitioner typically leads the diagnosis and writes the specification.

How long does it take to see results from technical SEO fixes?

It varies a lot depending on the fix. Resolving a robots.txt block that was preventing indexation can show results within days — as soon as Google recrawls. Page speed improvements and canonical consolidation take longer, typically several weeks, as Google re-evaluates affected pages. Core Web Vitals changes show up in CrUX field data after a rolling 28-day window. Redirect chain fixes and authority-transfer improvements often take one to three months to fully show in rankings.

Is technical SEO a one-time task?

No — and this is something people underestimate. Every code deployment, CMS plugin update, or content migration can introduce new issues. A site that was technically clean six months ago may now have broken canonical tags, redirect chains, or newly blocked resources that nobody has noticed yet. I recommend monthly checks on the critical stuff and a thorough audit twice a year.

Does technical SEO help with AI search visibility?

Yes. Structured data, logical content organisation, FAQ schema, and information-dense prose all increase the likelihood of your content being parsed and cited by large language models. Based on my own citation-tracking experiment across 47 sites, pages with FAQ schema and named-source attribution received AI Overview citations 2.8× more often than equivalent pages lacking those elements. The overlap between traditional technical SEO and Generative Engine Optimisation (GEO) is growing significantly.

What's the most damaging technical SEO mistake?

Accidentally blocking important resources in robots.txt — CSS files, JavaScript files — is far and away the most damaging mistake I see. It prevents Google from rendering pages correctly, and rankings can drop fast. It usually happens after a CMS migration or a security plugin update. Lack of mobile optimisation and missing structured data are both very common too, but the robots.txt issue tends to cause the most immediate, visible damage.
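As an illustration of that failure mode (the paths here are hypothetical), a security plugin or staging-block rule can easily overreach. Google supports * and $ wildcards in robots.txt, which is exactly what makes the broken version below so destructive:

```text
# BROKEN: meant to hide a staging directory, but the wildcard rules
# also block every stylesheet and script, so Google can't render pages
User-agent: *
Disallow: /staging/
Disallow: /*.js$
Disallow: /*.css$

# FIXED: block only the staging environment and leave assets crawlable
User-agent: *
Disallow: /staging/
```

You can confirm the effect with the URL Inspection tool in Search Console: a blocked CSS or JS file shows up as "blocked by robots.txt" under the page resources report.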

How much does page speed affect revenue and conversions?

The impact is significant and well-documented. Think With Google research shows that as load time increases from 1 to 3 seconds, bounce probability increases by 32%. A joint Deloitte and Google study, "Milliseconds Make Millions" (2020), found that a 0.1-second improvement in load time on mobile increased retail conversion rates by 8.4%. Technical speed optimisation is not purely an SEO task; it is a direct revenue lever. For a full breakdown of the data and the specific fixes that move the needle, see the Core Web Vitals Guide.

Which tools do I need for technical SEO?

Google Search Console is where I start on every audit — it's free, it reflects how Google actually sees your site, and it surfaces indexation errors, Core Web Vitals data, and manual actions. For deeper crawl analysis, Screaming Frog SEO Spider and Lumar (formerly DeepCrawl) are the go-to tools. Ahrefs and SEMrush both have solid site audit features with decent issue prioritisation. For log file work, Screaming Frog Log File Analyser is the best specialist option for most practitioners.

Why isn't my page being indexed?

The most common causes are: a noindex meta robots tag on the page, a disallow rule in robots.txt, thin or duplicate content that Google judges not worth indexing, or the page being an orphan with no internal links pointing to it. Check the Page Indexing report in Google Search Console — it will tell you the specific reason Google has not indexed each URL. Use the URL Inspection tool to see exactly how Google renders the page.
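The first two causes can be checked programmatically. This is a minimal sketch using only the Python standard library — the function names are my own, not from any SEO tool — that takes a page's HTML and the site's robots.txt content and reports any indexation blockers it finds:

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def diagnose(html, robots_txt, url, user_agent="Googlebot"):
    """Return a list of indexation blockers found for `url`."""
    issues = []

    # Cause 1: a noindex meta robots tag in the page's HTML
    parser = MetaRobotsParser()
    parser.feed(html)
    if "noindex" in parser.directives:
        issues.append("meta robots noindex")

    # Cause 2: a disallow rule in robots.txt matching this URL
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch(user_agent, url):
        issues.append("blocked by robots.txt")

    return issues
```

Fetching the live HTML and robots.txt is left out deliberately; in practice you would pull both over HTTP and also check the X-Robots-Tag response header, which can carry a noindex that never appears in the markup.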

Does technical SEO matter for small websites?

Yes, completely. Small sites still need to be crawlable, fast, and secure — SSL, mobile responsiveness, clean code, the same baseline requirements. The advantage is that small sites have simpler architectures, so issues are usually easier to find and fix. Done properly, strong technical foundations actually give small sites a real edge against larger, more authoritative domains that have accumulated years of technical debt that nobody is addressing.

Technical SEO Tools: At-a-Glance Comparison

| Tool | Primary Use | Free Tier? | Best For |
|---|---|---|---|
| Google Search Console | Coverage errors, CWV field data, indexation, manual actions | ✅ Yes — full access | Every site, no exceptions |
| PageSpeed Insights | CWV lab + CrUX field data per URL | ✅ Yes — full access | Performance diagnosis |
| Screaming Frog SEO Spider | Deep crawl, broken links, canonical checks, redirect chains | ✅ Up to 500 URLs | Technical audits of any size |
| Screaming Frog Log Analyser | Server log file parsing, crawl budget analysis | ✅ Free tool | Crawl budget investigation |
| Ahrefs Site Audit | Crawl analysis, technical issues, internal link mapping | Limited (paid) | Combined SEO suites |
| SEMrush Site Audit | Technical issue prioritisation, on-page checks | Limited (paid) | Enterprise & agency workflows |
| Lumar (DeepCrawl) | Enterprise-scale crawling, JavaScript rendering validation | Paid only | Large sites (100k+ pages) |
| CrUX Dashboard / CrUX API | Real-user Core Web Vitals field data at origin/URL level | ✅ Free (via BigQuery) | CWV trend analysis over time |

📚 References & Sources

All statistics and data claims in this guide are sourced from primary research reports and official documentation. Time-sensitive data points cite only 2025 or 2026 sources.

  1. HTTP Archive Web Almanac 2025 — Core Web Vitals pass rates, page performance, and crawlability statistics based on July 2025 CrUX data. (Cited for CWV pass rates: 48% mobile, 56% desktop; metric-level breakdown: LCP 62%, INP 77%, CLS 81%.)
  2. Chrome User Experience Report (CrUX) — Google — Real-user performance data collected from Chrome users. Updated monthly. (Cited as the underlying data source for all Core Web Vitals field data referenced in this guide.)
  3. Google Search Central — Robots.txt Introduction — Official guidance on robots.txt syntax, directives, and crawler behaviour.
  4. Google Search Central — XML Sitemaps Overview — Official documentation on sitemap format, attributes, and submission.
  5. Google Search Central — Structured Data Search Gallery — Complete reference for all supported schema types and rich result eligibility.
  6. Google Search Central — Canonical Tags and Duplicate URL Consolidation — Official guidance on canonical tag implementation.
  7. Google Transparency Report — HTTPS Encryption on the Web — Data on Chrome HTTPS browsing time and HTTPS adoption rates. (Cited for 93.2% Chrome HTTPS browsing time statistic.)
  8. StatCounter GlobalStats — Platform Market Share — Global browser, OS, and device traffic share data. (Cited for 64.35% global mobile traffic share as of July 2025.)
  9. web.dev — Core Web Vitals — Google's official technical reference for all Core Web Vitals metric definitions, thresholds, and measurement methodology.
  10. PageSpeed Insights — Google's free tool for measuring CWV lab data and viewing CrUX field data at the URL level.
  11. Think With Google — Mobile Page Speed New Industry Benchmarks (thinkwithgoogle.com) — Research establishing the correlation between mobile page load time and bounce probability. (Cited for 32% bounce increase from 1s to 3s load time.)
  12. Deloitte & Google — "Milliseconds Make Millions" (2020) — Study quantifying the revenue impact of mobile speed improvements in retail. (Cited for 8.4% conversion rate uplift from 0.1s load time improvement.)
  13. Author's Direct Research — AI Citation Pattern Tracking (Oct 2024 – Jan 2025) — Rohit Sharma's proprietary experiment tracking citation frequency in Google AI Overviews and Perplexity AI across 47 content sites over 90 days. (Cited for the 2.8× citation rate improvement from FAQ schema + attribution + question H2 headings.)

Written by

Rohit Sharma

Rohit Sharma is the Technical SEO Specialist and AI Search Researcher at IndexCraft. He's been doing this since 2011 — over 13 years of hands-on work across technical SEO, Core Web Vitals, GA4, log file analysis, JavaScript rendering, and AI-powered search. In that time he's run comprehensive technical audits on 150+ websites, from founder-led SMBs to enterprise platforms with 500,000+ indexed URLs across e-commerce, SaaS, publishing, and B2B.

Since Google AI Overviews launched globally in May 2024, Rohit has been tracking AI citation patterns in detail — running structured experiments across 47 site launches to test which technical and content signals actually predict inclusion in Google AI Overviews, Perplexity, and ChatGPT Search. His citation-pattern research (October 2024 – January 2025) is referenced in the GEO section of this guide and shapes his ongoing work in Generative Engine Optimisation (GEO) and Answer Engine Optimisation (AEO).

His work on crawl budget remediation, Core Web Vitals diagnosis, faceted navigation, and structured data implementation has driven measurable organic growth across all client verticals. He writes and speaks on the intersection of technical SEO and AI search.