Robots.txt Explained Simply

What is the robots.txt?

The robots.txt is a simple text file located in the root directory of a website and is accessible at a fixed address, namely https://www.example.com/robots.txt. It is used to provide search engine crawlers (also known as bots or robots) with instructions on which areas of a website they are allowed to crawl and which they are not. It is therefore one of the most fundamental tools in technical search engine optimization.

It is important to understand its function correctly: The robots.txt controls crawling, i.e., the retrieval and scanning of pages, but not necessarily indexing, i.e., inclusion in the search index. This very difference is the most common source of misunderstandings, more on which below.

How is a robots.txt structured?

The file consists of simple rules. The most important components are:

User-agent: Specifies which crawler the following rule applies to. An asterisk (*) means "for all bots".
Disallow: Prohibits the crawling of a specific directory or page.
Allow: Explicitly permits crawling, for example, to make an exception within a blocked directory.
Sitemap: Points to the location of the XML sitemap.

A typical example:

User-agent: *
Disallow: /internal/
Allow: /internal/public/
Sitemap: https://www.example.com/sitemap.xml

This file instructs all bots not to crawl the "/internal/" directory, with the exception of the subfolder "/internal/public/", and also indicates the location of the sitemap.

The most important misconception: A crawling ban does not mean exclusion from the index

This is the crucial point that is often misunderstood: If you exclude a page from crawling via Disallow in the robots.txt, this does not guarantee that this page will not appear in the Google index.

The reason: The robots.txt merely prevents Google from retrieving the content of the page. However, if Google knows the URL from other sources, such as because other websites link to it, the address can still be indexed. In the search results, an entry often appears with the note "No information is available for this page" because Google knows the URL but was not allowed to read the content.

The result is paradoxical: A page blocked via robots.txt can end up in the index, but without a meaningful description. Therefore, if you want to reliably keep a page out of the index, you should not block it via robots.txt.

Prevent crawling or prevent indexing? Choosing the right tool

The above leads to an important rule of thumb:

If a page should not be crawled (e.g., to save crawling budget or exclude unimportant areas): Use robots.txt with Disallow.
If a page should reliably not appear in the index: The page must remain crawlable and include a noindex meta tag in the <head>. Only this way can Google read the "do not index" instruction.

The most common mistake is combining both: Blocking a page via robots.txt and setting a noindex achieves the opposite. Google cannot read the noindex because crawling is blocked, and the page may remain in the index.

Typical potential for errors

Accidental complete block: A line like Disallow: / blocks the entire website for all bots. This error often occurs after a relaunch when a test setting is accidentally deployed live and can have devastating consequences for visibility.
Blocking CSS and JavaScript files: If these resources are blocked, Google cannot render and evaluate the page correctly. They should remain crawlable.
Attempting to "hide" sensitive data: The robots.txt is publicly accessible. If you list secret directories there, you are actually pointing them out. Confidential areas should be secured with password protection, not listed in the robots.txt.
Typos and incorrect paths: Even a small error in spelling can render a rule ineffective or unintentionally block too much.
Case sensitivity: Paths in the robots.txt are case-sensitive. /Internal/ and /internal/ are not the same.

Practical tips

Include the sitemap: The reference to the XML sitemap helps search engines find all important pages.
Check before going live: Especially after a relaunch, make sure no accidental block is active.
Test with tools: The Google Search Console offers ways to check the robots.txt and test individual URLs.
Use sparingly: The robots.txt is not a universal tool. When in doubt, it is better to block only what is necessary.
Understand as a hint: Reputable search engines comply with the robots.txt, but it is not technically enforceable. Malicious bots ignore it. It is therefore not a security tool.

Conclusion

The robots.txt is a simple yet powerful tool for giving search engine crawlers instructions on crawling a website. The crucial point that many overlook: A crawling ban via robots.txt does not reliably prevent a page from appearing in the Google index. If you want to keep a page out of the index for sure, it must remain crawlable, and you should use a noindex meta tag instead. Since small errors in the robots.txt can have major impacts on visibility, it is worth exercising particular care, thorough testing, and conscious, sparing use here.