Meet LLMs.txt, a proposed standard for AI website content crawling

Australian technologist Jeremy Howard has proposed a new standard, llms.txt, to address the web content crawlability and indexability needs of large language models.
His proposed llms.txt works somewhat like the robots.txt and XML sitemap protocols, making entire websites easier to crawl and read while putting less of a resource strain on LLMs as they discover your website content.
But it also offers an additional benefit – full content flattening – and this may be a good thing for brands and content creators.
While many content creators are interested in the proposal’s potential merits, it also has detractors.
But given the rapidly changing landscape for content produced in a world of artificial intelligence, llms.txt is certainly worth discussing.
The new proposed standard for AI accessibility to website content
Bluesky CEO Jay Graber propelled the discussion of content creator rights and data control, as it relates to AI training, on March 10 at SXSW Interactive in Austin, Texas.
Robust and ambitious in its detail, the cited proposal offers much to consider about the future of user content control amid LLMs’ vast appetite for data and content.
But a simpler potential protocol emerged for web content creators last September. While not as broad as the other proposal, llms.txt offers owners some assurance of increased control over what is accessed, and how much.
These two proposals are not mutually exclusive, but the new llms.txt protocol seems to be further along.
Howard’s llms.txt proposal is a website crawl and indexing standard using simple markdown language.
With AI models consuming and generating vast amounts of web content, content owners are seeking better control over how their data is used, or at least seeking to provide context on how they would like it to be used.
Unlikely to clear the astoundingly high bar set by the crawl capabilities of a Google or Bing, LLMs need a solution that lets them focus less on becoming massive crawling engines and more on the “intelligence” part of their functions, artificial as it may be.
Theoretically, llms.txt provides a better use of technical resources for LLMs.
This article will explore:
- What llms.txt is.
- How it works.
- Some ways to think about it.
- Whether LLMs and content owners are “buying in.”
- Why you should pay attention.
What llms.txt is and what it does
For the purposes of this article, it is best to quote Howard’s proposal directly to show what he intends for this new standard to accomplish:
“Large language models increasingly rely on website information, but face a critical limitation: context windows are too small to handle most websites in their entirety. Converting complex HTML pages with navigation, ads, and JavaScript into LLM-friendly plain text is both difficult and imprecise.
“While websites serve both human readers and LLMs, the latter benefit from more concise, expert-level information gathered in a single, accessible location. This is particularly important for use cases like development environments, where LLMs need quick access to programming documentation and APIs.
“We propose adding a /llms.txt markdown file to websites to provide LLM-friendly content… llms.txt markdown is human and LLM readable, but is also in a precise format allowing fixed processing methods (i.e., classical programming techniques such as parsers and regex).”
The potential uses for this proposed protocol are quite intriguing for generative engine optimization (GEO), and I’ve been testing it since December.
In essence, llms.txt lets you provide context on how your content can be accessed and used by AI-driven models.
Similar to robots.txt, which controls (or should control) how search engine crawlers interact with a website, llms.txt would establish guidelines for AI models that scrape and process content for training and response generation.
There is no real “blocking,” and robots.txt directives (e.g., “Disallow:”) are not intended for the llms.txt file. Set up properly, it is more of a “choosing” of which content should be shown, contextually or wholly, to an AI platform.
You can simply list the URLs of a section of your site, add URLs with summaries, or even provide the full raw text of the site in single or multiple files.
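For reference, the structure the proposal describes is simple: an H1 with the site or project name, a blockquote summary, then H2 sections containing annotated link lists, with an “Optional” section marking URLs that can be skipped when an LLM’s context is short. A minimal sketch, with hypothetical URLs and section names:

```markdown
# Example Site

> A hypothetical publisher of technical guides and product documentation.

## Docs

- [Quick start](https://example.com/docs/quick-start.md): Setup and first steps
- [API reference](https://example.com/docs/api.md): Endpoints, parameters, and errors

## Optional

- [Company history](https://example.com/about.md)
```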
The llms.txt file on one of my websites is 115,378 words long and 966 KB in size, containing the complete flattened website text in a single .txt file hosted on the domain root. Your file can be smaller or larger, or broken out into multiple files stored across multiple directories of your taxonomy and architecture, as needed.
You can also create .md markdown versions of each web page you believe deserves an LLM’s attention. This is very handy when performing deep site analysis, and it is not just for the LLMs. Just as websites serve many different uses, llms.txt follows suit, with many possible variations for providing context to LLMs.
Generating an llms.txt or llms-full.txt file
It is almost “elegant” in its simplicity, in that it strips complete sites down to their bare linguistic and textual essence, making them easier fodder for your favorite platform to parse, for myriad uses in content development, site structure analysis, entity research, and just about anything else you can dream up.
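To make the flattening concrete, here is a minimal sketch of a homegrown llms-full.txt generator in Python. It assumes a hand-maintained list of page URLs (the example.com URLs are placeholders) and uses the common requests and BeautifulSoup libraries; a production build would also walk your sitemap and handle errors, rate limits, and encoding quirks:

```python
# Minimal llms-full.txt generator sketch: fetch known pages, strip markup,
# and flatten the text into one file. The URLs are placeholders.
import requests
from bs4 import BeautifulSoup

PAGES = [
    "https://example.com/",
    "https://example.com/about/",
    "https://example.com/services/",
]

def flatten(url: str) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop the non-content elements the proposal calls out: nav, scripts, etc.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

with open("llms-full.txt", "w", encoding="utf-8") as out:
    for url in PAGES:
        # Label each page with its URL so the flattened file keeps some structure.
        out.write(f"# {url}\n\n{flatten(url)}\n\n")
```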
It also provides a standardized method for website owners to explicitly allow or disallow LLMs from ingesting and utilizing their content. The proposal is gaining traction among tech industry leaders and SEO professionals as AI continues to reshape the digital landscape. The absolute utility for increasing relevance is there, with benefits for the LLM, the website owner, and the user who theoretically finds a better answer via this little textual handshake.
Llms.txt functions similarly to robots.txt only in the sense that you create a simple text file in the root directory of your website. And much like robots.txt, it can be obeyed or not, depending on whether the AI/LLM agent wants to comply. But to clear up a common misperception, robots.txt directives are NOT intended to be included in the llms.txt file.
A few sample llms.txt files, in action
- Anthropic: https://docs.anthropic.com/llms-full.txt
- Hugging Face: https://huggingface-projects-docs-llms-txt.hf.space/accelerate/llms.txt
- Perplexity: https://docs.perplexity.ai/llms-full.txt
- LLMsTxt Manager: https://llmstxtmanager.com/llms.txt
- Zapier: https://docs.zapier.com/llms-full.txt
Adoption
Many AI companies have voiced support for the llms.txt standard, and many are using it or exploring its usefulness. llms.txt Hub has compiled a list of AI developers using the standard for documentation, and claims to be one of the largest such resources for identifying them. But remember, llms.txt is not just for developers; it is for all web content owners and producers.
Website and content creators can also benefit greatly from a flattened file of their site. Once the llms.txt file is in place, the full site content can be analyzed however best fits your research method.
llms.txt generator tools
With the basic protocol outlined, a variety of tools are available to help generate your file. I have found that most will generate smaller sites for free, while larger sites can be a custom job. Of course, many website owners will choose to develop their own tool or scraper. A word of caution: research the security of any generator tool before using it, and review your files before uploading. DO NOT use any tool without first vetting its security. Here are a few free tools to check (still subject to your own validation):
- Markdowner – A free, open-source tool that converts website content into well-structured Markdown files.
- Apify – Jacob Kopecky’s llms.txt generator.
- Website LLMs – This WordPress plugin creates your llms.txt file for you. Just set the crawl to “Posts,” “Pages,” or both, and you’re in business. I was one of the first ten people to download this plugin; now it is at over 3,000 downloads in just three months.
- FireCrawl – One of the first tools to emerge for the creation of llms.txt files.
While llms.txt improves content extraction clarity, it could also introduce security risks that require careful management. This article does not address those risks, but it is highly recommended that any tool be fully vetted before deploying this file.
Why llms.txt could matter for SEO and GEO
Controlling how AI models interact with your content is critical, and just having a fully flattened version of a website can make AI extraction, training, and analysis much simpler. Here are some reasons why:
- Protecting proprietary content: It discourages AI from using original content without permission, though only for the LLMs that choose to honor the file.
- Brand reputation management: It theoretically gives businesses some control over how their information appears in AI-generated responses.
- Linguistic and content analysis: With a fully flattened version of your site that is easily consumable by AI, you can perform all kinds of analysis that would typically require a standalone tool: keyword frequency, taxonomy analysis, entity analysis, linking, competitive analysis, and more (see the sketch after this list).
- Enhanced AI interaction: llms.txt helps LLMs interact more effectively with your website, enabling them to retrieve accurate and relevant information. No standard needed for this option, just a nice clean and flattened file of your complete content.
- Improved content visibility: By guiding AI systems to focus on specific content, llms.txt can theoretically “optimize” your website for AI indexing, potentially improving your site’s visibility in AI-powered search results. Like SEO, there are no guarantees. But on its face, any preference an LLM shows toward an llms.txt file is a step forward.
- Better AI performance: The file ensures that LLMs can access the most valuable content on your site, leading to more accurate AI responses when users engage with tools like chatbots or AI-powered search engines. I use the “full” rendering of llms.txt, and personally do not find the summaries or URL lists any more helpful than robots.txt, or an XML sitemap.
- Competitive advantage: As AI technologies continue to evolve, having an llms.txt file can give your website a competitive edge by making it more AI-ready.
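As one example of the analysis mentioned above, a few lines of Python can produce a keyword-frequency profile from a flattened llms-full.txt file. The filename and stopword list here are illustrative, not part of any standard:

```python
# Quick keyword-frequency pass over a flattened llms-full.txt file,
# one of the analyses the flattened format makes easy.
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "that", "with", "your", "this", "are", "you"}

with open("llms-full.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

# Count terms longer than two characters, minus the stopwords.
counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
for term, n in counts.most_common(20):
    print(f"{n:6d}  {term}")
```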
Challenges and limitations
While llms.txt offers a promising solution, several key challenges remain:
- Adoption by AI companies: Not all AI companies will adhere to the standard; some will simply ignore the file and ingest all of your content anyway.
- Adoption by websites: Simply put, brands and website operators will have to step up and participate if llms.txt is to succeed. Maybe not all, but a critical mass will be necessary. In the absence of any other scientific “optimization” for AI, what have we got to lose? (I still think it is a mistake to apply an old term like “optimization” to generative AI. It just seems linguistically lazy.)
- Overlap with robots.txt and XML sitemaps: Potential conflicts and inconsistencies between robots.txt, XML sitemaps, and llms.txt could create confusion. To repeat, the llms.txt file is not intended to be a substitute for robots.txt. As previously mentioned, I find the most value in the “full” rendering of the text file.
- Keyword, content, and link spammability: Much like keyword stuffing in the SEO days of yore, there is nothing to stop anyone from filling their llms.txt with gratuitous loads of text, keywords, links, and content.
- Exposure of your content to competitors: Scraping is a basic cornerstone of the entire search industry, and competitive keyword and content research is nothing new. But this simple file lowers the bar a bit for competitors to easily analyze what you have, and don’t have, and use it to their advantage.
Other contrarian views about llms.txt exist in the SEO/GEO community. I had a message chat about it with Pubcon and WebmasterWorld CEO Brett Tabke, who said he doesn’t believe it offers much utility:
- “We just don’t need people thinking they [LLMs] are different from any other spider. The dividing line between a ‘search [engine]’ and an ‘llm’ is barely arguable any more. Google, Perplexity, and ChatGPT have blurred that into a very fuzzy line with AI responses on SERPs. The only distinguishing factor is that Google is a search engine with an LLM bolted on, and ChatGPT is an LLM with a search engine bolted on. Going forward, it is obvious that Google will merge their LLM directly with the code base of the search engine and blow away any remaining lines between the two. LLMs.txt simply obfuscates that fact.”
XML sitemaps and robots.txt already serve this purpose, Tabke added.
On this point, I agree wholly. But for me, the potential value lies mostly in the “full” text rendering version of this file.
Marketer David Ogletree has similar reservations:
- “If there is a bottom line, it is that I really don’t want people continuing this idea that there is a difference between a LLM and Google. They are one in the same to me and should be treated the same.”
The future of llms.txt and AI content governance
As AI adoption continues to grow, so does the need for structured content governance.
llms.txt represents an early effort to create transparency and control over AI content usage. Whether it becomes a widely accepted standard depends on industry support, website owner support, regulatory developments, and AI companies’ willingness to comply.
You should stay informed about llms.txt and be prepared to adapt your content strategies as AI-driven search and content discovery evolve.
The introduction of llms.txt marks a significant step toward balancing AI innovation with content ownership rights, and the “crawlability and indexability” of websites for consumption and analysis by LLMs.
You should proactively explore its implementation to safeguard your digital assets, and also provide LLMs a runway to better understand the structure and content of your site(s).
As AI continues to reshape online search and content distribution, having a defined strategy for AI interaction with your website will be essential.
llms.txt could create a little bit of science for GEO
In GEO, much like SEO, there are almost no scientific standards for web creators to rely on; in other words, verifiable best practices based on specific tactics.
Any buzzy acronym containing a big “O” (optimization) is black box engineering. Or, as another tech development executive I worked with calls it, “wizardry,” “alchemy,” or “digital shamanism.”
For example:
- When Google says “create great content for users, and then you will succeed in search” – that’s an art project on your part.
- When Google says, “we follow XML sitemaps as a part of our crawler journey, and there is a place for it in Google Search Console,” well, that’s a little bit of science.
- And the same for schema.org, robots.txt, and even IndexNow. These are “agreed upon” standards that search engines tell us definitively, “we do take these protocols into consideration, though at our own discretion.”
In a world of so much uncertainty with what “can be done” for improving AI and LLM performance, llms.txt sounds like a great start.
If you have a wide content audience, it may bode well for you to get your llms.txt file going now. You never know which major or specialized LLM may want to use your content for some new purpose. And in a world shifting away from the multiple decisions a cluttered results page demands of a searcher, the LLM provides the answer.
If you are playing to win, then you want your content to be that answer, as it is potentially worth a multitude of search engine searches.
I started implementing llms.txt on my own websites a few months ago, and am implementing it on all my clients’ websites. There is no harm in doing so. Anything that can potentially help “optimize” my content should be done, especially as a potentially accepted standard.
Are all the LLMs using it? Adoption is definitely nowhere near critical mass, but some have reported interest.
Can an llms.txt file also help you better access and crawl your own website for various AI uses? Absolutely.
One of the main uses I have found is in analyzing client sites in various ways. Having the entirety of your website content in a file can allow for different types of analysis that were not as easy to render previously.
Will it become a standard?
That definitely remains to be seen. llms.txt has a long road ahead, but I wouldn’t bet against it.
For companies looking for new ideas to improve their presence as “the answer” in LLMs, it offers one new signal for AI optimization, and possibly a step ahead in connecting with LLMs in a way that was previously comparable only to search engines.
And don’t be surprised if you start hearing a lot more SEO/GEO practitioners talking about llms.txt in the near term, as a basic staple for site optimization, along with robots.txt, XML sitemaps, schema, IndexNow, and others.