Robots.txt is a plain text file that lives at the root of your domain. It's been around since 1994. And yet I still regularly find SEO errors caused by misunderstanding what it actually does. The most important thing to understand upfront: robots.txt controls whether crawlers visit your pages. It does not control whether your pages appear in search results. These are different things, and confusing them leads to mistakes that can either waste crawl budget or leave private content exposed.
What Robots.txt Actually Does
When a search engine crawler (Googlebot, Bingbot, etc.) visits your site, the first thing it does is check your robots.txt file. This file contains instructions that tell crawlers which pages or directories they're allowed to access. If you disallow a path, a compliant crawler will not request that URL.
That's it. That's all it does.
What it does NOT do:
- Remove pages from the search index
- Block pages from ranking in search results
- Prevent pages from being linked to by other sites
- Stop all bots (only compliant, well-behaved crawlers follow it — malicious scrapers don't)
The Crawl vs. Index Distinction
This is where most robots.txt confusion comes from. Here's the important truth: Google can index a URL it has never crawled, if that URL appears as a link on another page Google has crawled.
If you block a page in robots.txt but another indexed page links to it, Google can still show that blocked URL in search results. It won't know the page's content (because it was never crawled), but it can show the URL with a "description not available" snippet, based on the anchor text of links pointing to it.
This means: if you want a page removed from search results, robots.txt is the wrong tool. You need a noindex meta tag on the page itself, or an X-Robots-Tag: noindex HTTP header in the response. And the crawler has to be able to access the page to read that directive — which means you cannot combine noindex with a robots.txt Disallow for the same URL. A disallowed page whose noindex tag Google has never read can remain in the index indefinitely.
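For reference, the two noindex mechanisms look like this. The meta tag goes in the page's HTML head:

```html
<meta name="robots" content="noindex">
```

The header variant is set in the HTTP response, which is the only option for non-HTML resources like PDFs:

```
X-Robots-Tag: noindex
```

Either one works, but only if the crawler can actually fetch the page to see it.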
Real example of this going wrong: a client disallowed their entire staging subdomain in robots.txt, but linked from their production site to it in a blog post. Google couldn't crawl the staging site, so it couldn't read the noindex tags on staging pages. But because the production site linked to staging URLs, Google indexed them with "no description available" snippets. The client was confused why staging pages were appearing in search results despite the robots.txt block. The fix: remove the staging link from production, and add password protection to the staging environment.
Common Robots.txt Mistakes
Blocking CSS and JavaScript files: This was a widespread mistake before Google started rendering JavaScript. If you block /assets/ or /static/ or your CSS directories, Googlebot can't render your pages properly. It sees unstyled HTML and potentially can't access content rendered by JavaScript. Check your robots.txt for any rules that block resource files.
Blocking important pages accidentally: A bare Disallow: / blocks your entire site. I've seen this in production. Rules are prefix matches: Disallow: /product blocks all URLs that start with /product — including /products/, /product-reviews/, /product-launches/. Always check whether your rules are broader than you intended.
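You can see the prefix-matching behavior with Python's standard-library robots.txt parser — a quick sketch, using a made-up example.com rule set:

```python
from urllib import robotparser

rules = """
User-agent: *
Disallow: /product
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Disallow: /product is a prefix match, so it also blocks these:
print(rp.can_fetch("*", "https://example.com/products/"))         # False
print(rp.can_fetch("*", "https://example.com/product-reviews/"))  # False
# Unrelated paths are still crawlable:
print(rp.can_fetch("*", "https://example.com/blog/"))             # True
```

If you only want to block the exact /product page and nothing beneath it, you need a more specific rule set, not a shorter prefix.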
Forgetting the robots.txt on staging: If your staging environment is publicly accessible and doesn't have robots.txt with Disallow: /, crawlers will index it. This creates duplicate content issues. Staging environments should also have password protection as a secondary measure, since robots.txt is voluntary.
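The staging robots.txt that refuses all compliant crawlers is just two lines:

```
User-agent: *
Disallow: /
```

But remember the earlier point: this only stops polite crawlers from fetching pages. Password protection is what actually keeps staging content private.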
Case sensitivity errors: Path matching in robots.txt is case-sensitive. Disallow: /Admin/ and Disallow: /admin/ are different rules. Make sure your paths match the actual case of the URLs on your site.
How to Write It Correctly
The syntax is simpler than most people think. A basic robots.txt:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://www.example.com/sitemap.xml
User-agent: * means "all crawlers." You can specify individual bots instead. Disallow: blocks a path. Allow: explicitly permits a path (useful for allowing specific subdirectories within a broader disallow). The Sitemap: line tells crawlers where to find your sitemap.
Rule order within a group doesn't matter to Google — the most specific (longest) matching rule wins, and ties are resolved in favor of the least restrictive rule. If you have Disallow: /products/ and Allow: /products/featured/, the Allow rule wins for the /products/featured/ path because it's the longer match. Not every parser implements longest-match, though, so listing Allow exceptions before the broader Disallow is a safe habit.
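Here's that Allow/Disallow pair checked with Python's standard-library parser. One caveat: `urllib.robotparser` applies rules in file order (first match wins) rather than Google's longest-match rule, so the Allow line is listed first — which also keeps the file unambiguous for simpler parsers:

```python
from urllib import robotparser

rules = """
User-agent: *
Allow: /products/featured/
Disallow: /products/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The more specific Allow carves an exception out of the broader Disallow:
print(rp.can_fetch("*", "https://example.com/products/featured/widget"))  # True
print(rp.can_fetch("*", "https://example.com/products/clearance/"))       # False
```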
AI Bot Directives
Since 2023, there's been growing interest in controlling which AI companies can crawl your site for training data. The major AI crawlers have specific user-agent names:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
A few things to note. Blocking PerplexityBot may reduce how often Perplexity cites your content in its answers — a real traffic and visibility consideration if you care about AI search. Blocking GPTBot prevents OpenAI from using your content for training, but doesn't necessarily affect ChatGPT's ability to cite you via real-time web search, which uses separate user agents. And blocking Google-Extended tells Google not to use your content for training its AI models; it's separate from Googlebot, so it has no effect on regular search rankings.
Whether to block AI crawlers is a business decision, not a purely technical one. Publishers blocking all AI crawlers are betting that the traffic and citation value from AI discovery isn't worth the training data cost. That calculation is genuinely uncertain right now.
How to Test Your Robots.txt
Google Search Console has a robots.txt report under Settings. It shows which robots.txt files Google has fetched for your site, when it last fetched them, and any parse errors or warnings. (The old standalone robots.txt Tester under Legacy Tools has been retired.) To check whether a specific URL is crawlable, use the URL Inspection tool, which tells you whether your current robots.txt allows Googlebot to fetch it.
You can also just visit yourdomain.com/robots.txt directly in a browser to see what it says. For a quick sanity check before launching a site, this should be part of every pre-launch checklist.
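That pre-launch sanity check can be scripted. This is a sketch under stated assumptions: the `blocked_urls` helper and the example URLs are hypothetical, and it uses Python's standard-library parser rather than Google's exact matcher (which additionally supports wildcards):

```python
from urllib import robotparser

def blocked_urls(robots_txt: str, urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the subset of urls that `agent` may NOT crawl under robots_txt."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [u for u in urls if not rp.can_fetch(agent, u)]

robots_txt = """
User-agent: *
Disallow: /admin/
Disallow: /private/
"""

# URLs that must be crawlable at launch (illustrative):
must_crawl = [
    "https://www.example.com/",
    "https://www.example.com/products/",
]

problems = blocked_urls(robots_txt, must_crawl)
if problems:
    raise SystemExit(f"Blocked by robots.txt: {problems}")
print("All critical URLs are crawlable")
```

In practice you'd fetch the live robots.txt (or read it from your repo) and feed it to the same helper in CI.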
Robots.txt and Crawl Budget
Crawl budget is the number of pages Googlebot crawls on your site within a given time window. For small sites (under a few thousand pages), crawl budget is rarely a concern. For large sites, especially e-commerce or news sites with millions of URLs, it matters significantly.
Robots.txt is the tool for managing crawl budget at scale. If your e-commerce site generates thousands of faceted navigation URLs (filter combinations like ?color=blue&size=medium&sort=price), you can disallow those URL patterns to prevent Googlebot from wasting crawl budget on them. This frees up crawl capacity for your important product and category pages.
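For the faceted-navigation case, wildcard patterns can block parameter combinations wholesale. A caveat: `*` and `$` are extensions — standardized in RFC 9309 and supported by major crawlers like Googlebot, but not by every parser (Python's `urllib.robotparser`, for instance, treats them literally). The parameter names below are illustrative:

```
User-agent: *
# Block any URL whose query string contains these facet parameters
Disallow: /*?*sort=
Disallow: /*?*color=
Disallow: /*?*size=
```

Test patterns like these carefully before deploying — an overly broad wildcard can block legitimate pages that happen to use the same parameter names.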
The practical approach: use Google Search Console's Page indexing report (formerly called Coverage) to see what URLs Google is discovering and crawling. If you see a large number of URLs you don't want indexed, consider whether robots.txt can systematically block those patterns at the crawl level, and whether noindex tags or canonical consolidation is more appropriate at the indexing level. Usually the answer is a combination of both.
Robots.txt is a powerful file. Its power comes from its ability to redirect crawl resources toward your most important pages. Most sites need very little in their robots.txt — maybe blocking admin directories and staging paths. The complexity comes at scale, where every misconfigured rule has amplified consequences. Start simple, test every change, and use GSC's indexing reports as your feedback loop.