Overview

LLMS.txt is a proposed standard for giving large language models a concise, structured summary of a website's content. The LLMS.txt Generator crawls websites and creates standardized LLMS.txt files following the specification at llmstxt.org.

Key Features

Universal Compatibility

Automatically detects website structure and content categories for any type of site

Dynamic Section Detection

Intelligently discovers content sections from URL patterns rather than using hardcoded templates

Multiple Output Formats

Generates .txt, .md, and .json versions with comprehensive metadata

Dashboard Integration

Automatically integrates with the master dashboard for visual analysis and reporting

Getting Started

1. Clone the repository

git clone https://github.com/Airbais/airbais-tools.git
cd airbais-tools/tools/llmstxtgenerator

2. Install dependencies

pip install -r requirements.txt

3. Set up API keys (optional)

For AI-enhanced descriptions:

export OPENAI_API_KEY="your-openai-api-key"
# OR
export ANTHROPIC_API_KEY="your-anthropic-api-key"

4. Run your first generation

python llmstxtgenerator.py https://example.com --dashboard
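
The optional API key step can be checked from Python before a run. The sketch below is only an illustration: `pick_ai_provider` is a hypothetical helper, and the precedence it applies (OpenAI first, then Anthropic) is an assumption, not documented behavior of the tool.

```python
import os
from typing import Mapping, Optional


def pick_ai_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return "openai", "anthropic", or None when neither key is set.

    Hypothetical helper; the OpenAI-first precedence is an assumption.
    """
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return None  # no key available, comparable to running with --no-ai


if __name__ == "__main__":
    print(pick_ai_provider() or "no AI provider configured")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.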

Quick Examples

# Generate LLMS.txt for a website
python llmstxtgenerator.py https://example.com

Configuration

Customize the tool's behavior through config.yaml. The Advanced Usage section shows example settings for URL patterns and section detection.

How It Works

  1. Website Discovery: Crawls the target website, respecting robots.txt and following links systematically
  2. Dynamic Section Detection: Analyzes URL patterns to automatically discover content categories and sections
  3. Content Extraction: Extracts titles, descriptions, and metadata from each discovered page
  4. Intelligent Categorization: Groups pages by detected sections with configurable filtering and limits
  5. LLMS.txt Generation: Creates structured output following the official LLMS.txt specification
  6. Multi-Format Export: Generates .txt, .md, and .json versions with comprehensive metadata
  7. Dashboard Integration: Creates dashboard-compatible data for visual analysis and reporting
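
Two of the steps above, the robots.txt check during discovery and the title/description extraction, can be sketched with the Python standard library alone. This is not the tool's implementation (its dependency list names requests and beautifulsoup4); `allowed_by_robots`, `PageMeta`, and `extract_meta` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check a URL against robots.txt rules supplied as plain text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class PageMeta(HTMLParser):
    """Collect the <title> text and the meta description of one page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = (attributes.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def extract_meta(html: str) -> dict:
    """Return the title and description found in an HTML document."""
    parser = PageMeta()
    parser.feed(html)
    return {"title": parser.title, "description": parser.description}
```

A crawler built this way would call `allowed_by_robots` before each request and `extract_meta` on each response body; fetching, link following, and error handling are left out of the sketch.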

Dynamic Section Detection

Unlike template-based approaches, the tool discovers sections directly from your website’s URL structure

URL Pattern Analysis

Analyzes URL paths to identify natural content groupings and hierarchies

Smart Filtering

Ignores common URL segments like IDs, pagination, and utility pages

Universal Compatibility

Works with any website type: documentation, e-commerce, blogs, or corporate sites
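
The URL pattern analysis and smart filtering described above can be approximated as follows. This is a minimal sketch under stated assumptions: the ignore list, the ID heuristic, and the function names are invented for illustration, and the tool's real rules may differ.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional
from urllib.parse import urlparse

# Assumed ignore list; the tool's actual defaults are not documented here.
IGNORED_SEGMENTS = {"page", "tag", "search", "login", "cart"}
# Assumed heuristic: purely numeric segments or long hex strings look like IDs.
ID_LIKE = re.compile(r"^\d+$|^[0-9a-f]{8,}$")


def section_for(url: str) -> Optional[str]:
    """Return the first meaningful path segment of a URL, or None."""
    for segment in urlparse(url).path.lower().split("/"):
        if not segment or segment in IGNORED_SEGMENTS or ID_LIKE.match(segment):
            continue
        return segment
    return None


def group_by_section(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group URLs under the section name derived from their path."""
    sections = defaultdict(list)
    for url in urls:
        name = section_for(url)
        if name:
            sections[name].append(url)
    return dict(sections)
```

With this approach, https://jcrew.com/help/contact-us falls under a help section, while a pagination URL such as /page/2 yields no section at all.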

Output Structure

Results are organized by date for easy historical tracking:

results/
└── YYYY-MM-DD/
    ├── llms.txt              # Standard LLMS.txt file
    ├── llms.md               # Markdown version with metadata
    ├── llms.json             # Structured JSON data
    ├── generation_report.md  # Detailed generation report
    └── dashboard-data.json   # Dashboard integration data
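
Because each run lands in a YYYY-MM-DD folder, the newest results can be found by sorting folder names. The sketch below assumes only the layout shown above; `results_dir` and `latest_run` are hypothetical helpers, not part of the tool.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def results_dir(base: str = "results", day: Optional[date] = None) -> Path:
    """Return base/YYYY-MM-DD for the given day (today by default)."""
    day = day or date.today()
    return Path(base) / day.isoformat()


def latest_run(base: str = "results") -> Optional[Path]:
    """Return the newest dated folder; ISO dates sort correctly as text."""
    root = Path(base)
    if not root.is_dir():
        return None
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    return runs[-1] if runs else None
```

ISO 8601 dates sort chronologically when compared as strings, which is why no date parsing is needed to pick the latest run.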

Example Output

# J.Crew: Clothes, Shoes & Accessories For Women, Men & Kids

> Shop JCrew.com for the Highest Quality Women's and Men's Clothing and see the entire selection of Children's Clothing, Cashmere Sweaters, Women's Dresses and Shoes, Men's Suits, Jackets, Accessories and more.

## Company
- [Military & Medical Discount](https://jcrew.com/company/military-medical-first-responder-discount): Discounts for service members
- [J.Crew Credit Card](https://jcrew.com/company/credit-card): Store credit card information
- [Mobile App](https://jcrew.com/company/the-jcrew-mobile-app): Download the J.Crew app

## Help
- [Contact Us](https://jcrew.com/help/contact-us): Customer service and support
- [Shipping & Handling](https://jcrew.com/help/shipping-handling): Delivery information
- [Order Status](https://jcrew.com/help/order-status): Track your orders

## Shop
- [Women's Sale](https://jcrew.com/shop/cashmere-summer-sale/womens): End of season cashmere sale
- [Men's Sale](https://jcrew.com/shop/cashmere-summer-sale/mens): Men's cashmere sale items
- [Top Rated](https://jcrew.com/shop/top-rated/mens): Customer favorite products
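
The structure of the example above (an H1 title, a blockquote summary, then one H2 list of links per section) can be rendered with a few lines of Python. This is an illustrative sketch, not the generator's code; `render_llms_txt` and the `Link` tuple are hypothetical.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str, str]  # (title, url, description)


def render_llms_txt(name: str, summary: str, sections: Dict[str, List[Link]]) -> str:
    """Render an H1 title, a blockquote summary, and one H2 list per section."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for section, links in sections.items():
        lines.append(f"## {section}")
        for title, url, description in links:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render_llms_txt(
        "Example",
        "A sample site.",
        {"Help": [("Contact Us", "https://example.com/help/contact-us", "Support")]},
    ))
```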

Dashboard Features

Two dashboard options are available: a local, tool-specific dashboard and the master multi-tool dashboard.

Visual Analytics

Generation Summary

Success rates, pages crawled, and sections discovered with key metrics

Site Structure Analysis

Interactive visualization of discovered content categories and sections

Content Distribution

Breakdown of pages by section with detailed statistics

Configuration Overview

Current settings and parameters used for generation

# Launch after generation
python llmstxtgenerator.py https://example.com --dashboard

# View existing results
python llmstxtgenerator.py --dashboard-only

Local dashboard:

  • Tool-specific: Shows only LLMS.txt generator results
  • Fast Loading: Optimized for single-tool analysis
  • Detailed Metrics: Complete generation statistics and analysis

Command Line Reference

  • url (string): Website URL to generate LLMS.txt for (optional if using config file)
  • --config (string): Configuration file path (default: config.yaml)
  • --name (string): Website/project name (overrides auto-detection)
  • --description (string): Website/project description (overrides auto-detection)
  • --max-pages (number): Maximum number of pages to crawl
  • --max-depth (number): Maximum crawl depth
  • --no-ai (flag): Disable AI-generated descriptions
  • --output-dir (string): Output directory for results
  • --dashboard (flag): Launch dashboard after generation
  • --dashboard-only (flag): Launch dashboard without running generation
  • --verbose (flag): Enable verbose logging
  • --version (flag): Show program version number
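
For readers who want to wrap or mimic this interface, the documented options map onto argparse roughly as follows. This sketch mirrors the reference above and is not taken from the tool's source; `build_parser` is a hypothetical name and the version string is a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(prog="llmstxtgenerator.py")
    parser.add_argument("url", nargs="?", help="Website URL (optional with a config file)")
    parser.add_argument("--config", default="config.yaml", help="Configuration file path")
    parser.add_argument("--name", help="Website/project name")
    parser.add_argument("--description", help="Website/project description")
    parser.add_argument("--max-pages", type=int, help="Maximum number of pages to crawl")
    parser.add_argument("--max-depth", type=int, help="Maximum crawl depth")
    parser.add_argument("--no-ai", action="store_true", help="Disable AI-generated descriptions")
    parser.add_argument("--output-dir", help="Output directory for results")
    parser.add_argument("--dashboard", action="store_true", help="Launch dashboard after generation")
    parser.add_argument("--dashboard-only", action="store_true", help="Launch dashboard only")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose logging")
    # Placeholder version string; the real version is not stated in this document.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```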

Real-World Examples

# J.Crew example - discovers shopping categories
python llmstxtgenerator.py https://jcrew.com

Result: Sections such as shop, sale, help, and company are detected automatically from URL patterns such as /shop/, /sale/, /help/, and /company/

Success Rate: 47% with 7 sections discovered from 100 pages

Performance Guidelines

Processing time scales with website size and complexity

Troubleshooting

Advanced Usage

Custom URL Patterns

Control which URLs to include or exclude:

generation:
  include_patterns:
    - "https://example.com/docs/.*"
    - "https://example.com/api/.*"
  exclude_patterns:
    - ".*\\.(pdf|zip|tar\\.gz)$"
    - ".*/downloads/.*"
    - ".*/temp/.*"
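
The effect of these patterns can be previewed outside the tool. The sketch below assumes that a URL is kept when it matches at least one include pattern and no exclude pattern, and that patterns are matched from the start of the URL; both are assumptions about the tool's semantics, and `filter_urls` is a hypothetical helper.

```python
import re
from typing import Iterable, List


def filter_urls(urls: Iterable[str], include: List[str], exclude: List[str]) -> List[str]:
    """Keep URLs matching an include pattern (if any are given) and no exclude pattern."""
    include_res = [re.compile(p) for p in include]
    exclude_res = [re.compile(p) for p in exclude]
    kept = []
    for url in urls:
        if include_res and not any(r.match(url) for r in include_res):
            continue  # include patterns exist, but none matched
        if any(r.match(url) for r in exclude_res):
            continue  # explicitly excluded
        kept.append(url)
    return kept
```

In a double-quoted YAML string, `\\.` is read as the regular expression `\.`, so the Python equivalent of the first exclude pattern is `r".*\.(pdf|zip|tar\.gz)$"`.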

Section Detection Tuning

Fine-tune automatic section detection:

generation:
  min_pages_per_section: 3  # Require more pages per section
  ignore_segments:          # Additional segments to ignore
    - "temp"
    - "staging"
    - "preview"
  max_links_per_section: 15  # Limit section size
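
One way to picture how these three settings interact is shown below. The sketch assumes that ignored segments are dropped first, that sections below the minimum are discarded, and that the remaining sections are truncated to the link limit; the order and the helper name `apply_section_limits` are assumptions, not the tool's documented behavior.

```python
from typing import Dict, FrozenSet, List


def apply_section_limits(
    sections: Dict[str, List[str]],
    min_pages_per_section: int = 3,
    max_links_per_section: int = 15,
    ignore_segments: FrozenSet[str] = frozenset({"temp", "staging", "preview"}),
) -> Dict[str, List[str]]:
    """Drop ignored or undersized sections and cap the size of the rest."""
    tuned = {}
    for name, urls in sections.items():
        if name in ignore_segments:
            continue  # segment listed under ignore_segments
        if len(urls) < min_pages_per_section:
            continue  # too few pages to count as a section
        tuned[name] = urls[:max_links_per_section]
    return tuned
```

Raising min_pages_per_section removes thin sections, while lowering max_links_per_section shortens each section without removing it.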

Without AI Enhancement

For faster generation without AI descriptions:

python llmstxtgenerator.py https://example.com --no-ai

Requirements

System Requirements

  • Python 3.8+
  • 256MB RAM minimum
  • Internet connection for crawling
  • Optional: API keys for AI-enhanced descriptions

Dependencies

Key Python packages (installed from requirements.txt):

  • requests - Web crawling and HTTP requests
  • beautifulsoup4 - HTML parsing and content extraction
  • pyyaml - Configuration file handling
  • openai - AI-powered descriptions (optional)
  • anthropic - Alternative AI provider (optional)
  • plotly - Dashboard visualization
  • pandas - Data processing and analysis

AI Tools Suite Integration

The LLMS.txt Generator is part of the larger Airbais AI Tools Suite, which provides a centralized dashboard

Master Dashboard

Centralized view of all AI tool results at ../dashboard/

Auto-Discovery

New tools are automatically detected and integrated

Standard Format

JSON output compatible with other suite tools

Consistent Design

Shared Airbais design system across all tools

Suite Benefits

  • Unified Analysis: Combine LLMS.txt generation with intent analysis and brand evaluation
  • Historical Tracking: Compare website changes over time across all tools
  • Streamlined Workflow: Single dashboard for all AI-powered website analysis
  • Future-Ready: Architecture designed for easy addition of new analysis tools

Contributing

We welcome contributions in these key areas:

Section Detection

Improved algorithms for automatic content categorization

Content Extraction

Better handling of complex website structures and content types

Output Formats

Additional export formats and integration capabilities

Performance

Optimization for large-scale websites and faster processing

Development Guidelines

  1. Maintain Universal Compatibility: Ensure changes work across different website types
  2. Follow Dynamic Detection: Enhance the URL pattern-based section discovery
  3. Preserve Dashboard Integration: Keep dashboard-data.json format compatible
  4. Add Comprehensive Testing: Test with real websites of different types
  5. Update Documentation: Keep this MDX file current with new features

Future Roadmap

  1. Enhanced AI Integration: Better content summarization and automatic description generation
  2. Real-time Updates: Monitor websites for changes and automatically update LLMS.txt files
  3. API Integration: Direct integration with popular CMS platforms and website builders
  4. Multi-language Support: Generate LLMS.txt files for international websites with language detection
  5. Advanced Analytics: Content quality scoring and SEO optimization recommendations


Part of the Airbais AI Tools Suite - Comprehensive tools for AI-powered business intelligence and content optimization.