Free Robots.txt Generator
Build a robots.txt file for your website quickly.
Quick Start Presets
How to Use
- Choose a preset or build rules manually.
- Add user-agent groups with Allow/Disallow paths.
- Add your Sitemap URL(s).
- Copy or download the file and upload it to your site's root directory.
Frequently Asked Questions
Where should robots.txt be placed?
It must be at the root of your domain: https://yourdomain.com/robots.txt. It won't work in subdirectories.
Does robots.txt block pages from Google?
Robots.txt prevents crawlers from accessing pages, but it doesn't remove them from search results. For that, use a noindex meta tag.
What are AI crawlers?
AI crawlers like GPTBot, CCBot, and Google-Extended scrape content for training AI models. You can block them specifically in your robots.txt.
A Standard Born in 1994, Standardised in 2022
The Robots Exclusion Protocol was designed by Martijn Koster in February 1994 and was not formally codified until RFC 9309 was published in September 2022. For 28 years it was a de facto standard that everyone agreed to follow without anyone agreeing on the details. The RFC nailed down the syntax (User-agent / Disallow / Allow lines), the precedence rules, the file-size limit (parsers must accept at least 500 KiB), and how to handle errors (4xx → crawler may access anything; 5xx → crawler must assume complete disallow). Most major search-engine crawlers already behaved in roughly the same way before the RFC, but small differences mattered.
Where the File Lives
A robots.txt file must be served at the exact URL /robots.txt from the root of your origin (one per scheme + host + port). Subdirectories don't work; crawlers never request /blog/robots.txt, so a file placed there is simply ignored. Each subdomain needs its own file (www.example.com/robots.txt and blog.example.com/robots.txt are independent). The file is plain text, served as text/plain and encoded as UTF-8 (a BOM is allowed but strongly discouraged).
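To make the "one file per scheme + host + port" rule concrete, here is a minimal sketch that derives the governing robots.txt URL from any page URL. The function name robots_txt_url is hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme, host, port)."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Keep scheme and authority; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("https://blog.example.com/2024/post?x=1"))
    print(robots_txt_url("http://example.com:8080/shop/item"))
```

Note that the subdomain and the port both survive the transformation, which is why blog.example.com and www.example.com never share a file.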
The Syntax in One Page
# Comments start with #
User-agent: * # Apply to all crawlers
Disallow: /admin/ # Block this directory
Disallow: /search # Block search results
Allow: /admin/login # Allow this path even within /admin/
User-agent: Googlebot # Specific Googlebot rules
Disallow: /test/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
Key behaviours:
- Group structure. A "group" starts with one or more User-agent: lines and continues with Allow: / Disallow: rules until the next User-agent: line begins a new group.
- Specificity wins. The most specific user-agent group applies to a given crawler. Googlebot reads its User-agent: Googlebot rules and ignores the User-agent: * rules entirely.
- Empty Disallow: means "allow everything." Disallow: / means "block everything." The presence or absence of the slash is critical.
- Wildcards. * matches any sequence of characters; $ matches end-of-URL. Disallow: /*.pdf$ blocks all PDFs. These began as search-engine extensions and are now part of RFC 9309, and most major crawlers accept them.
- Case sensitivity. User-agent names are case-insensitive (Googlebot = googlebot). URL paths in rules are matched case-sensitively (/Page ≠ /page), whatever the server's filesystem does.
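The path-matching behaviours above can be sketched in a few lines. This is a simplified illustration, not a full RFC 9309 parser: it assumes the rules of the applicable group have already been extracted, it skips percent-encoding normalisation, and it resolves conflicts with the RFC's longest-match rule (the most specific pattern wins, and Allow wins a tie). The names pattern_to_regex and is_allowed are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let every '*' match any characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules holds ("allow" | "disallow", pattern) pairs from one group."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if not pattern:  # an empty Disallow: matches nothing
            continue
        # re.match anchors at the start of the path (prefix match).
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            # Longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


if __name__ == "__main__":
    group = [
        ("disallow", "/admin/"),
        ("allow", "/admin/login"),
        ("disallow", "/*.pdf$"),
    ]
    for p in ("/admin/users", "/admin/login", "/docs/a.pdf", "/docs/a.pdf?v=2"):
        print(p, is_allowed(group, p))
```

Matching is done on the raw path string, so /Page and /page are treated as different URLs, in line with the case-sensitivity note above.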
What robots.txt Does NOT Do
The most common misunderstanding, and one Google calls out directly in its own documentation: robots.txt prevents crawling, not indexing. If a page is linked from elsewhere on the web, Google may index the URL (and show it in search results with a stub like "No information available for this page") even though it never crawled the page itself. To genuinely keep a page out of search results, use a <meta name="robots" content="noindex"> tag on the page itself, or an X-Robots-Tag: noindex HTTP header. The crawler has to be allowed to access the page in order to see the noindex directive, meaning you should not Disallow a page in robots.txt if you've also added noindex to it, because the crawler will never see the noindex.
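As a rough illustration of the HTTP-header route, the sketch below wraps a WSGI application so that selected paths are served with X-Robots-Tag: noindex. It assumes a Python WSGI stack, which the original text does not specify; add_noindex_header and should_noindex are hypothetical names. The paths it marks must remain crawlable in robots.txt, otherwise the header is never seen.

```python
def add_noindex_header(app, should_noindex):
    """Wrap a WSGI app; paths accepted by should_noindex get X-Robots-Tag."""

    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            if should_noindex(environ.get("PATH_INFO", "")):
                # Copy the list so the inner app's headers are not mutated.
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped


def demo_app(environ, start_response):
    """Tiny stand-in application used only for the demonstration."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


# Example policy (an assumption): keep everything under /drafts/ out of the index.
application = add_noindex_header(demo_app, lambda path: path.startswith("/drafts/"))
```

Most web servers and frameworks offer an equivalent setting; the point is only that the directive travels with the response, not with robots.txt.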
Other things robots.txt doesn't do:
- Hide sensitive URLs. The file is publicly readable; anyone can fetch yoursite.com/robots.txt. Listing paths like /admin/ or /internal-tools/ tells the world those URLs exist. For genuine secrecy, use authentication; for "just don't index it," use noindex.
- Stop malicious crawlers. Bad actors ignore robots.txt entirely. It's a request to well-behaved crawlers, not enforcement.
- Block embed-card crawlers without side effects. Facebook's facebookexternalhit, Twitter's Twitterbot, LinkedIn's LinkedInBot, and the Slack and Discord bots all read robots.txt, but if you block them, your shared links won't render previews on those platforms. Allow them explicitly if you want share cards.
User Agents Worth Knowing
| User-agent | Owner | What it crawls for |
|---|---|---|
| Googlebot | Google | Web Search indexing |
| Bingbot | Microsoft | Bing search indexing |
| DuckDuckBot | DuckDuckGo | DuckDuckGo search |
| Slurp | Yahoo | Yahoo search |
| YandexBot | Yandex | Russian search |
| Baiduspider | Baidu | Chinese search |
| facebookexternalhit | Meta | Facebook share-card metadata |
| LinkedInBot | LinkedIn | LinkedIn share preview |
| Twitterbot | X / Twitter | Tweet card metadata |
| Slackbot | Slack | Slack link unfurling |
| Discordbot | Discord | Discord link previews |
The AI-Crawler Question
Since 2023, a wave of AI-training crawlers has appeared, and many sites have added robots.txt rules to opt out of AI training. The major ones to know:
- GPTBot: OpenAI's training crawler.
- ChatGPT-User: OpenAI's on-demand crawler when ChatGPT users ask it to fetch a URL.
- OAI-SearchBot: OpenAI's search-product crawler.
- Google-Extended: Google's separate token for AI training (Bard / Gemini), independent of Googlebot. Lets you allow Search but block AI training.
- ClaudeBot / Claude-User / Claude-SearchBot: Anthropic's various crawlers.
- PerplexityBot: Perplexity AI.
- CCBot: Common Crawl, the open dataset most LLMs train on.
- Bytespider: ByteDance / TikTok.
- Applebot-Extended: Apple Intelligence's separate token, similar to Google-Extended.
Two important things to know: (1) the opt-out is voluntary, since only crawlers that respect robots.txt are affected, and (2) the line between "search" and "AI training" is blurring quickly, so blocking all AI crawlers may also affect how your content appears in AI-summarised search results. Use the "Block AI Crawlers" preset above as a starting point, then decide which trade-offs make sense for your site.
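A minimal sketch of how an opt-out group might be generated from a list of crawler tokens. The selection in AI_TRAINING_AGENTS is an illustrative assumption drawn from the names listed above, not the contents of the preset, and block_agents is a hypothetical helper. It relies on the fact that one group may start with several User-agent lines.

```python
# Illustrative selection only; adjust to the crawlers you actually want to block.
AI_TRAINING_AGENTS = [
    "GPTBot",
    "Google-Extended",
    "ClaudeBot",
    "CCBot",
    "Applebot-Extended",
]


def block_agents(agents: list[str]) -> str:
    """Build a single group that disallows everything for the given agents."""
    lines = [f"User-agent: {name}" for name in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(block_agents(AI_TRAINING_AGENTS))
```

Leaving a token such as OAI-SearchBot out of the list keeps that crawler under whatever the rest of the file says, which is one way to separate training from search.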
Crawl-Delay and Why Google Ignores It
The non-standard Crawl-delay: directive asks crawlers to wait N seconds between requests to your server. Bing, Yandex, and many smaller crawlers honour it. Google does not. Google's documentation states explicitly that Googlebot ignores Crawl-delay. Search Console used to expose a manual Crawl rate setting, but Google deprecated it for most sites in early 2024 and now adjusts crawl rate automatically based on server response. If your goal is to slow Googlebot specifically, robots.txt is the wrong tool.
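For crawlers that do honour the directive, Python's standard-library urllib.robotparser can read it back. The sketch below parses an inline sample file; the contents and the bot name SomeOtherBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample file for illustration: only the bingbot group sets a delay.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The bingbot group carries the delay; the * group does not define one.
bing_delay = parser.crawl_delay("bingbot")
other_delay = parser.crawl_delay("SomeOtherBot")

if __name__ == "__main__":
    print("bingbot:", bing_delay)
    print("SomeOtherBot:", other_delay)
```

A polite custom crawler would sleep for the returned number of seconds between requests and fall back to its own default when the value is None.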
Sitemap Directive
Listing your sitemap in robots.txt is a hint to crawlers about where to find your URL list. Use absolute URLs (full https://), one per line. You can list multiple sitemaps for sites that split content into separate sitemaps (a main sitemap, a news sitemap, a video sitemap, an image sitemap). The Sitemap directive isn't formally part of the original robots.txt protocol, but every major search engine reads it.
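A small sketch of the "absolute URLs, one per line" rule: it formats Sitemap lines and rejects relative URLs. The helper name sitemap_lines is hypothetical.

```python
from urllib.parse import urlsplit


def sitemap_lines(urls: list[str]) -> list[str]:
    """Return one 'Sitemap:' line per URL, rejecting non-absolute URLs."""
    lines = []
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"Sitemap URL must be absolute: {url!r}")
        lines.append(f"Sitemap: {url}")
    return lines


if __name__ == "__main__":
    for line in sitemap_lines([
        "https://example.com/sitemap.xml",
        "https://example.com/news-sitemap.xml",
    ]):
        print(line)
```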
Common Mistakes
- Disallowing CSS / JavaScript files. Google uses rendered content for ranking. If Googlebot can't fetch your /css/ or /js/, it can't render your pages correctly, which hurts SEO. Don't block resource directories.
- Confusing Disallow with noindex. Disallow stops crawling; the page may still appear in search via inbound links. Use noindex meta tags for actual indexing control.
- Listing private URLs. Anything in robots.txt is publicly readable. Don't advertise /admin/ or /wp-admin/ if you don't want attackers to know they exist; instead, use proper authentication and rely on noindex.
- Empty Disallow: when you meant Disallow: /. Empty allows everything; Disallow: / blocks everything. They're opposites.
- Trying to block embed crawlers and then wondering why share previews don't work. If you Disallow facebookexternalhit, your shared links won't render Facebook cards. Allow social-media bots explicitly if you want previews.
- Forgetting to add the Sitemap line. It's free, useful, and most generators omit it.
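- Trusting Crawl-delay to throttle Google. It doesn't. Googlebot ignores the directive, and the manual crawl-rate setting in Search Console has been deprecated for most sites; Google now adjusts crawl rate based on how your server responds.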
- Using robots.txt to block by IP / region / device. The protocol has no concept of these. Use server-side rules instead.
- Not testing the file. Google retired the standalone robots.txt Tester in November 2023 and replaced it with the robots.txt Report inside Search Console; that report flags syntax errors and shows the most recent crawl. Always check the file (there, or with an open-source robots.txt validator) before deploying.
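A few of the mistakes above can be caught mechanically before deployment. The following is a rough lint sketch, not a substitute for a real validator: the checks and warning texts are assumptions chosen for illustration, and lint_robots is a hypothetical name.

```python
def lint_robots(text: str) -> list[str]:
    """Return warnings for a handful of common robots.txt mistakes."""
    warnings = []
    agents: list[str] = []  # user agents of the group currently being read
    in_rules = False        # True once the current group has a rule line
    has_sitemap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line was seen, so this starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field == "sitemap":
            has_sitemap = True
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow":
                if value == "/" and "*" in agents:
                    warnings.append("Disallow: / applies to all crawlers")
                if value.rstrip("/").lower() in ("/css", "/js"):
                    warnings.append(f"resource directory blocked: {value}")
    if not has_sitemap:
        warnings.append("no Sitemap line")
    return warnings


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /\nDisallow: /css/\n"
    for warning in lint_robots(sample):
        print(warning)
```

A check like this fits naturally into a deployment script, with the Search Console report used afterwards to confirm what Google actually fetched.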
More Frequently Asked Questions
Where exactly do I put the file?
The root of your domain, accessible at exactly https://yoursite.com/robots.txt. On most hosts that means putting robots.txt in your public / htdocs / www root directory. WordPress and many CMSs generate one dynamically; check whether yours does before adding a static file (the static one wins if both exist).
Do I need a robots.txt at all?
Technically no. Without one, crawlers default to "allow everything," which is fine for most public sites. But you almost always want one to point to your sitemap, to block obvious crawl traps (search-result pages, paginated archives, parametric URLs), and increasingly to opt out of AI training. A blank or default-allow robots.txt is still useful as a place to put the Sitemap line.
How big can robots.txt be?
RFC 9309 requires parsers to accept at least 500 KiB (512,000 bytes). Google enforces a 500 KiB limit and ignores anything beyond that. The vast majority of robots.txt files are well under 1 KiB. If yours is approaching the limit, you're probably listing too many specific URLs and should use wildcard patterns instead.
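The arithmetic behind the limit, as a minimal sketch: 500 KiB is 500 × 1024 = 512,000 bytes. The helper name size_report is hypothetical, and the check measures the encoded bytes, not the character count.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB = 512,000 bytes


def size_report(content: bytes) -> tuple[int, bool]:
    """Return (size in bytes, whether it fits within the 500 KiB limit)."""
    size = len(content)
    return size, size <= MAX_ROBOTS_BYTES


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow:\n".encode("utf-8")
    print(size_report(sample))
```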
What happens if my robots.txt returns a 500 error?
Per RFC 9309, when a crawler can't fetch robots.txt due to a server error (5xx) it must assume complete disallow, meaning Google and other compliant crawlers will stop crawling your site entirely until the file is reachable again. If your robots.txt endpoint goes down, your search visibility goes down with it. Make sure it stays available.
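The error handling described here and in the RFC summary earlier reduces to a small decision table. The sketch below is a simplified illustration for a custom crawler; the return labels and the name policy_for_status are assumptions, and redirects and caching are deliberately left out.

```python
def policy_for_status(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl policy."""
    if 200 <= status < 300:
        return "parse"         # file fetched: apply its rules
    if 400 <= status < 500:
        return "allow_all"     # file unavailable: crawler may access anything
    if 500 <= status < 600:
        return "disallow_all"  # server error: assume complete disallow
    return "unspecified"       # e.g. redirects, not covered by this sketch


if __name__ == "__main__":
    for code in (200, 404, 503):
        print(code, policy_for_status(code))
```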
What about Crawl-delay for Google?
Google explicitly ignores Crawl-delay. The directive does work for Bing, Yandex, and most other crawlers. The manual Crawl rate setting that Search Console used to expose was deprecated for most sites in early 2024; Google now adjusts crawl rate automatically based on how your server responds. Setting Crawl-delay in robots.txt won't break anything; it just won't change Googlebot's behaviour.
Should I block AI crawlers?
It's a trade-off. Blocking GPTBot, Google-Extended, ClaudeBot, etc. opts your content out of training data for those models, which is the right call if you want to limit your content's reuse. The cost: as AI-summarised search results become more common, blocked content may also be less likely to be cited or surfaced. Many publishers block AI training crawlers but allow the AI search crawlers (OAI-SearchBot, etc.) so that their content remains citable. The "Block AI Crawlers" preset takes the maximalist approach; tweak it to match your priorities.