Robots.txt and SEO: What you need to know in 2025

The Robots Exclusion Protocol (REP), commonly known as robots.txt, has been a web standard since 1994 and remains a key tool for website optimization today.
This simple yet powerful file helps control how search engines and other bots interact with a site.
Recent updates have made it important to understand the best ways to use it.
Why robots.txt matters
Robots.txt is a set of instructions for web crawlers, telling them what they can and can’t do on your site.
It helps you keep certain parts of your website private or avoid crawling pages that aren’t important.
This way, you can improve your SEO and keep your site running smoothly.
Setting up your robots.txt file
Creating a robots.txt file is straightforward.
It uses simple commands to instruct crawlers on how to interact with your site.
The essential ones are:
User-agent, which specifies the bot you’re targeting.
Disallow, which tells the bot where it can’t go.
Here are two basic examples that demonstrate how robots.txt controls crawler access.
This one allows all bots to crawl the entire site:
User-agent: *
Disallow:
This one directs bots to crawl the entire site except the “Keep Out” folder:
User-agent: *
Disallow: /keep-out/
You can also specify certain crawlers to stay out:
User-agent: Googlebot
Disallow: /
This example instructs Googlebot not to spider any part of the site. It is not recommended, but you get the idea.
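If you want to sanity-check rules like these before deploying them, Python’s standard library ships a robots.txt parser. Below is a minimal sketch using the “Keep Out” rules from the example above; the site name example.com is hypothetical, and the stdlib parser uses simple prefix matching, which is sufficient for rules like these:

```python
from urllib import robotparser

# Hypothetical robots.txt content mirroring the "Keep Out" example above.
RULES = """\
User-agent: *
Disallow: /keep-out/
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

# Ordinary pages remain crawlable...
print(parser.can_fetch("*", "https://www.example.com/blog/post.html"))        # True
# ...but anything under the "Keep Out" folder is off limits.
print(parser.can_fetch("*", "https://www.example.com/keep-out/secret.html"))  # False
```

This is handy in a deployment check: run it against your staging robots.txt and a list of URLs you expect to stay crawlable.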
Using wildcards
As you can see in the examples above, wildcards (*) are handy for making flexible robots.txt files.
They let you apply rules to many bots or pages without listing each one.
Page-level control
You also have page-level control over spidering when you need it.
Instead of blocking an entire directory, you can block just specific files, which gives you more flexibility and precision.
Example:
User-agent: *
Disallow: /keep-out/file1.html
Disallow: /keep-out/file2.html
Only the necessary pages are restricted, so your valuable content stays visible.
Combining commands
In the past, the Disallow directive was the only one available, and Google tended to apply the most restrictive directive in the file.
The protocol was later extended with the Allow directive (now formalized in RFC 9309), giving website owners more granular control over how their sites are crawled.
For example, you can instruct bots to crawl only the “Important” folder and stay out of everywhere else:
User-agent: *
Disallow: /
Allow: /important/
It’s also possible to combine commands to create complex rules.
You can use Allow directives alongside Disallow to fine-tune access.
Example:
User-agent: *
Disallow: /private/
Allow: /private/public-file.html
This lets you keep certain files accessible while protecting others.
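One caveat if you test combined rules with Python’s standard library: urllib.robotparser applies rules first-match-wins, whereas Google applies the longest (most specific) match. This sketch lists the Allow line first so the stdlib parser resolves it the way Google would; example.com is hypothetical:

```python
from urllib import robotparser

# Allow is listed before Disallow because the stdlib parser stops at the
# first matching rule (Google instead picks the most specific match).
RULES = """\
User-agent: *
Allow: /private/public-file.html
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("*", "https://www.example.com/private/public-file.html"))  # True
print(parser.can_fetch("*", "https://www.example.com/private/secret.html"))       # False
```

Real crawlers vary in how they resolve conflicting rules, which is another reason to keep Allow/Disallow combinations as simple as possible.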
Since robots.txt’s default is to allow all, combining Disallow and Allow directives is usually unnecessary, and keeping the file simple is generally best.
There are situations, though, that require more advanced configurations.
If you manage a website that uses URL parameters on menu links to track clicks through the site and you can’t implement canonical tags, you could leverage robots.txt directives to mitigate duplicate content issues.
Example:
User-agent: *
Disallow: /*?*
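Python’s stdlib parser doesn’t implement wildcard matching, but Google’s rules (* matches any sequence of characters, $ anchors the end of the URL) can be approximated with a small regex translation. This is a rough sketch to illustrate what a pattern like /*?* actually matches; the function name and sample paths are hypothetical:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # Treat * as "any sequence of characters" and a trailing $ as
    # end-of-URL, roughly mirroring Google's documented matching rules.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if regex.endswith(re.escape("$")):
        regex = regex[: -len(re.escape("$"))] + "$"
    return re.compile(regex)

rule = robots_pattern_to_regex("/*?*")
print(bool(rule.match("/menu/item?src=nav")))  # True: URL has a query string
print(bool(rule.match("/menu/item")))          # False: no "?" present
```

In other words, /*?* blocks any URL containing a query string, which is why it works for parameter-based duplicate content.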
Another scenario in which an advanced configuration might be needed is if a misconfiguration causes random low-quality URLs to pop up in randomly named folders.
In this case, you could use the robots.txt file to disable all folders except the ones with valuable content.
Example:
User-agent: *
Disallow: /
Allow: /essential-content/
Allow: /valuable-content-1/
Allow: /valuable-content-2/
Comments
Comments are a handy way to annotate the file in a more human-friendly way.
They begin with the pound sign (#).
On files that are manually updated, I recommend adding the date the file was created or last updated.
That can help with troubleshooting if an older version is accidentally restored from a backup.
Example:
#robots.txt file for www.example-site.com – updated 3/22/2025
User-agent: *
#disallowing low-value content
Disallow: /bogus-folder/
Managing crawl rate
Managing the crawl rate is key to keeping your server load in check and ensuring efficient indexing.
The Crawl-delay directive lets you set a delay between bot requests.
Example:
User-agent: *
Crawl-delay: 10
In this example, you’re asking bots to wait 10 seconds between requests, preventing overload and keeping things smooth.
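If you want to read this value programmatically, urllib.robotparser exposes it via crawl_delay(). A minimal sketch (the bot name is hypothetical, and an empty Disallow line is included to keep the default allow-everything behavior):

```python
from urllib import robotparser

# Hypothetical rules mirroring the example above; the empty Disallow
# line preserves the default "allow everything" behavior.
RULES = """\
User-agent: *
Disallow:
Crawl-delay: 10
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

delay = parser.crawl_delay("MyPoliteBot")  # any bot name matches the * group
print(delay)  # 10
```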
Advanced bots can sense when they are overloading a server, so the Crawl-delay directive isn’t needed as much as it was in the past. Note that Google ignores Crawl-delay entirely, while Bing still honors it.
Dig deeper: Crawl budget: What you need to know in 2025
XML sitemap link
Although Google and Bing prefer website owners to submit their XML sitemaps via Google Search Console and Bing Webmaster Tools, it is still an accepted standard to add a link to the site’s XML sitemap at the bottom of the robots.txt file.
It may not be necessary, but including it doesn’t hurt and could be helpful.
Example:
User-agent: *
Disallow:
Sitemap: https://www.my-site.com/sitemap.xml
If you add a link to your XML sitemap, ensure the URL is fully qualified.
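Python’s urllib.robotparser (3.8+) can also read the Sitemap line via site_maps(), which is a quick way to confirm the URL is being picked up. A sketch mirroring the example above:

```python
from urllib import robotparser

# Hypothetical file mirroring the sitemap example above.
RULES = """\
User-agent: *
Disallow:
Sitemap: https://www.my-site.com/sitemap.xml
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.site_maps())  # ['https://www.my-site.com/sitemap.xml']
```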
Common pitfalls with robots.txt
Incorrect syntax
Make sure your directives are correctly formatted and in the right order; mistakes can lead to misinterpretation.
You can check your robots.txt file for errors in Google Search Console – the robots.txt report is under Settings.
Over-restricting access
Blocking too many pages can harm the indexing of your site.
Use Disallow directives wisely and think about the impact on search visibility.
This also applies to blocking the bots that feed the newer AI search tools.
If you block those bots, you have no chance of appearing in the answers those services generate.
Forgetting that bots don’t always follow the protocol
Not all spiders obey the Robots Exclusion Protocol.
If you need to block bots that don’t “behave” well, you will need to take other measures to keep them out.
It’s also important to remember that blocking spiders in robots.txt does not guarantee information won’t end up in an index.
For example, Google specifically warns that pages with inbound links from other websites may appear in its index.
If you want to make sure pages don’t end up in an index, use the noindex meta tag instead.
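For reference, the noindex directive is a meta tag placed in the page’s head:

```html
<meta name="robots" content="noindex">
```

Note that crawlers can only see this tag if the page is not blocked in robots.txt, so don’t disallow a page you’re trying to noindex.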
Wrapping up
As mentioned above, it’s generally best to keep things simple with robots.txt files. Updates in how they are interpreted, though, make them a much more powerful tool than in the past.
For more insights and detailed examples, check out these articles from Google Search Central:
- Introduction to robots.txt
- Robots Refresher: page-level granularity
- Robots Refresher: robots.txt — a flexible way to control how machines explore your website