Overview
LLMS.txt is a proposed standard intended to aid context engineering for website content by giving language models a concise, structured summary of a site. The LLMS.txt Generator crawls a website and creates standardized LLMS.txt files that follow the specification at llmstxt.org.
Key Features
Universal Compatibility
Automatically detects website structure and content categories for any type of site
Dynamic Section Detection
Intelligently discovers content sections from URL patterns rather than using hardcoded templates
Multiple Output Formats
Generates .txt, .md, and .json versions with comprehensive metadata
Dashboard Integration
Automatically integrates with the master dashboard for visual analysis and reporting
Getting Started
1. Clone the repository
2. Install dependencies
3. Set up API keys (optional), for AI-enhanced descriptions
4. Run your first generation
Quick Examples
- Basic Usage
- With Custom Details
- Using Configuration
- View Results
Configuration
Customize the tool’s behavior through config.yaml:
Website Settings
Configure target website and basic information:
Generation Settings
Control crawling behavior and section detection:
Content Analysis
Configure AI-powered content analysis:
Output Options
Control output formats and organization:
How It Works
1. Website Discovery
Crawls the target website, respecting robots.txt and following links systematically
2. Dynamic Section Detection
Analyzes URL patterns to automatically discover content categories and sections
3. Content Extraction
Extracts titles, descriptions, and metadata from each discovered page
4. Intelligent Categorization
Groups pages by detected sections with configurable filtering and limits
5. LLMS.txt Generation
Creates structured output following the official LLMS.txt specification
6. Multi-Format Export
Generates .txt, .md, and .json versions with comprehensive metadata
7. Dashboard Integration
Creates dashboard-compatible data for visual analysis and reporting
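As a rough illustration of the LLMS.txt Generation step, the sketch below renders categorized pages into the layout described at llmstxt.org: an H1 title, a blockquote summary, and H2 sections containing link lists. The function name render_llms_txt, the (title, url, description) tuple layout, and the sample data are assumptions made for illustration; this is not the generator's actual code.

```python
from typing import Dict, List, Tuple

# Hypothetical page record used only in this sketch: (title, url, description)
Page = Tuple[str, str, str]


def render_llms_txt(name: str, summary: str, sections: Dict[str, List[Page]]) -> str:
    """Render an llms.txt-style document from pages grouped by section."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for section, pages in sections.items():
        lines.append(f"## {section}")
        lines.append("")
        for title, url, description in pages:
            entry = f"- [{title}]({url})"
            # The description after the link is optional
            if description:
                entry += f": {description}"
            lines.append(entry)
        lines.append("")
    # Collapse trailing blank lines into a single final newline
    return "\n".join(lines).rstrip() + "\n"


example = render_llms_txt(
    "Example Shop",
    "An online store used here only as a placeholder.",
    {"Docs": [("Getting Started", "https://example.com/docs/start", "Setup guide")]},
)
print(example)
```

The same in-memory structure could also feed the .md and .json exports, which is one plausible reason to keep rendering separate from crawling.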
Dynamic Section Detection
Unlike template-based approaches, the tool discovers sections directly from your website’s URL structure
URL Pattern Analysis
Analyzes URL paths to identify natural content groupings and hierarchies
Smart Filtering
Ignores common URL segments like IDs, pagination, and utility pages
Universal Compatibility
Works with any website type: documentation, e-commerce, blogs, or corporate sites
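The idea of discovering sections from URL paths can be sketched as follows. This is a minimal illustration, not the tool's algorithm: the function detect_sections, the ignore pattern, and the sample URLs are hypothetical, and only the min_pages_per_section name is taken from the configuration options mentioned in this document.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical ignore rules: numeric IDs, pagination, and a few utility segments
IGNORED_SEGMENT = re.compile(r"^(\d+|page|p|login|search|cart|tag)$", re.IGNORECASE)


def detect_sections(urls, min_pages_per_section=2):
    """Count the first meaningful path segment of each URL and keep frequent ones."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        meaningful = [s for s in segments if not IGNORED_SEGMENT.match(s)]
        if meaningful:
            counts[meaningful[0].lower()] += 1
    # Drop sections that have too few pages to be worth listing
    return {name: n for name, n in counts.items() if n >= min_pages_per_section}


urls = [
    "https://example.com/shop/shoes",
    "https://example.com/shop/hats",
    "https://example.com/help/returns",
    "https://example.com/page/2",
    "https://example.com/12345",
]
print(detect_sections(urls))
```

Raising min_pages_per_section makes detection more selective, while lowering it surfaces smaller sections such as help in the sample above.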
Output Structure
Results are organized by date for easy historical tracking:
Example Output
Dashboard Features
Two dashboard options are available: a local, tool-specific dashboard and a master, multi-tool dashboard
Visual Analytics
Generation Summary
Success rates, pages crawled, and sections discovered with key metrics
Site Structure Analysis
Interactive visualization of discovered content categories and sections
Content Distribution
Breakdown of pages by section with detailed statistics
Configuration Overview
Current settings and parameters used for generation
- Local Dashboard
- Master Dashboard
- Tool-specific: Shows only LLMS.txt generator results
- Fast Loading: Optimized for single-tool analysis
- Detailed Metrics: Complete generation statistics and analysis
Command Line Reference
- Website URL to generate LLMS.txt for (optional if using config file)
- Configuration file path (default: config.yaml)
- Website/project name (overrides auto-detection)
- Website/project description (overrides auto-detection)
- Maximum number of pages to crawl
- Maximum crawl depth
- Disable AI-generated descriptions (--no-ai)
- Output directory for results
- Launch dashboard after generation
- Launch dashboard without running generation (--dashboard-only)
- Enable verbose logging
- Show program version number
Real-World Examples
- E-commerce Site
- Documentation Site
- Corporate Website
Sections shop, sale, help, and company are automatically detected from URL patterns like /shop/, /sale/, /help/, and /company/.
Success Rate: 47% with 7 sections discovered from 100 pages
Performance Guidelines
Processing time scales with website size and complexity
Small Sites (<50 pages)
- Processing Time: 1-3 minutes
- Recommended Settings: Default configuration works well
- Success Rate: Typically 60-80%
Medium Sites (50-200 pages)
- Processing Time: 3-10 minutes
- Optimization: Consider increasing min_pages_per_section to 3-5
- Success Rate: Typically 40-60%
Large Sites (200+ pages)
- Processing Time: 10+ minutes
- Optimization: Set max_pages to 100-200 to limit scope
- Success Rate: Typically 30-50% (more selective)
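The guidelines above can be condensed into a small helper, shown here only as a sketch. The function recommend_settings is hypothetical, the specific values are picked from within the suggested ranges, and a site of exactly 200 pages is arbitrarily treated as medium; only the setting names min_pages_per_section and max_pages come from this document.

```python
def recommend_settings(page_count: int) -> dict:
    """Map an estimated page count to tuning hints from the guidelines."""
    if page_count < 50:
        # Small sites: the default configuration works well
        return {}
    if page_count <= 200:
        # Medium sites: require more pages per section (suggested range 3-5)
        return {"min_pages_per_section": 3}
    # Large sites: cap the crawl (suggested range 100-200)
    return {"max_pages": 200}


for size in (30, 120, 1000):
    print(size, recommend_settings(size))
```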
Troubleshooting
Low Success Rate
Symptoms: Less than 20% success rate or very few sections detected
Solutions:
- Lower min_pages_per_section in config
- Check if the site has a consistent URL structure
- Verify the site allows crawling (check robots.txt)
- Try increasing max_depth for deeper crawling
No Pages Found
Symptoms: “No pages crawled” or empty results
Solutions:
- Verify URL is accessible and correct
- Check internet connection
- Ensure site doesn’t block automated crawlers
- Try adjusting user_agent in crawling configuration
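One way to check whether a site blocks a given crawler is to evaluate its robots.txt rules with Python's standard urllib.robotparser module. The sketch below is independent of the generator: the helper is_allowed, the user agent string, and the sample rules are assumptions for illustration, and the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only
sample_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "my-crawler", "https://example.com/docs/"))
print(is_allowed(sample_rules, "my-crawler", "https://example.com/private/report"))
```

For a live site, RobotFileParser also supports set_url() followed by read() to fetch the rules directly. If the configured user_agent is disallowed, the crawl will likely return no pages.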
AI Descriptions Not Working
Symptoms: Generic descriptions despite enabling AI
Solutions:
- Verify API key is set: echo $OPENAI_API_KEY
- Check API key has sufficient credits/quota
- Try different AI model in configuration
- Use --no-ai flag to disable and test basic functionality
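The API key check can also be done from Python without echoing the secret to the terminal. This is a small sketch; the helper describe_key_status is hypothetical, and it only confirms that the variable is present, not that the key is valid or has quota.

```python
import os


def describe_key_status(env=os.environ, var="OPENAI_API_KEY"):
    """Report whether the variable is set, without revealing its value."""
    value = env.get(var, "").strip()
    if not value:
        return f"{var} is not set"
    return f"{var} is set ({len(value)} characters)"


print(describe_key_status())
```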
Dashboard Not Loading
Symptoms: Dashboard shows no data or fails to start
Solutions:
- Ensure generation completed successfully
- Check dashboard-data.json exists in results directory
- Verify no JSON formatting errors in results
- Try local dashboard: python llmstxtgenerator.py --dashboard-only
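The file checks listed above (does dashboard-data.json exist, and is it valid JSON) can be scripted. The sketch below assumes you pass the results directory yourself; the helper check_dashboard_data and the "results" path are hypothetical, since the exact output location depends on your configuration.

```python
import json
from pathlib import Path


def check_dashboard_data(results_dir):
    """Return 'ok', or a short message describing what is wrong with the file."""
    path = Path(results_dir) / "dashboard-data.json"
    if not path.is_file():
        return f"missing: {path}"
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return "ok"


# "results" is a placeholder; point this at your actual results directory
print(check_dashboard_data("results"))
```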
Advanced Usage
Custom URL Patterns
Control which URLs to include or exclude:
Section Detection Tuning
Fine-tune automatic section detection:
Without AI Enhancement
For faster generation without AI descriptions:
Requirements
System Requirements
- Python 3.8+
- 256MB RAM minimum
- Internet connection for crawling
- Optional: API keys for AI-enhanced descriptions
Dependencies
Key Python packages (automatically installed):
- requests: Web crawling and HTTP requests
- beautifulsoup4: HTML parsing and content extraction
- pyyaml: Configuration file handling
- openai: AI-powered descriptions (optional)
- anthropic: Alternative AI provider (optional)
- plotly: Dashboard visualization
- pandas: Data processing and analysis
AI Tools Suite Integration
The LLMS.txt Generator is part of the larger Airbais AI Tools Suite, which provides a centralized dashboard
Master Dashboard
Centralized view of all AI tool results at ../dashboard/
Auto-Discovery
New tools are automatically detected and integrated
Standard Format
JSON output compatible with other suite tools
Consistent Design
Shared Airbais design system across all tools
Suite Benefits
- Unified Analysis: Combine LLMS.txt generation with intent analysis and brand evaluation
- Historical Tracking: Compare website changes over time across all tools
- Streamlined Workflow: Single dashboard for all AI-powered website analysis
- Future-Ready: Architecture designed for easy addition of new analysis tools
Contributing
We welcome contributions in these key areas:
Section Detection
Improved algorithms for automatic content categorization
Content Extraction
Better handling of complex website structures and content types
Output Formats
Additional export formats and integration capabilities
Performance
Optimization for large-scale websites and faster processing
Development Guidelines
- Maintain Universal Compatibility: Ensure changes work across different website types
- Follow Dynamic Detection: Enhance the URL pattern-based section discovery
- Preserve Dashboard Integration: Keep dashboard-data.json format compatible
- Add Comprehensive Testing: Test with real websites of different types
- Update Documentation: Keep this MDX file current with new features
Future Roadmap
1. Enhanced AI Integration
Better content summarization and automatic description generation
2. Real-time Updates
Monitor websites for changes and automatically update LLMS.txt files
3. API Integration
Direct integration with popular CMS platforms and website builders
4. Multi-language Support
Generate LLMS.txt files for international websites with language detection
5. Advanced Analytics
Content quality scoring and SEO optimization recommendations
Part of the Airbais AI Tools Suite - Comprehensive tools for AI-powered business intelligence and content optimization.