Overview

IntentCrawler is a Python tool that crawls websites and analyzes the user intents their content conveys to LLMs and AI agents. Part of the Airbais AI Tools Suite, it extracts content, discovers intents dynamically using multiple ML techniques, and presents the results in interactive dashboards with light/dark mode support.

Key Features

Intelligent Web Crawling

Respectful crawling with robots.txt compliance, automatic sitemap discovery, and configurable rate limiting
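
As an illustration of what robots.txt compliance and rate limiting involve, here is a minimal sketch using only the Python standard library. It is not IntentCrawler's actual implementation: the `USER_AGENT` value, the `RateLimiter` class, and the sample robots.txt rules are all assumptions made for this example.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical user-agent string


def build_robots_parser(robots_lines):
    """Parse robots.txt rules that were already fetched as text lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


class RateLimiter:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self._last_request = None

    def wait(self):
        """Sleep just long enough to respect the minimum delay."""
        now = time.monotonic()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


# Sample rules for illustration only.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 1",
    "Sitemap: https://example.com/sitemap.xml",
]
parser = build_robots_parser(robots_lines)

allowed = parser.can_fetch(USER_AGENT, "https://example.com/docs/")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/page")
sitemaps = parser.site_maps()          # sitemap URLs declared in robots.txt
crawl_delay = parser.crawl_delay(USER_AGENT)
```

A crawler built this way would call `RateLimiter.wait()` before each request and skip any URL for which `can_fetch` returns `False`.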

Advanced Intent Discovery

User-focused analysis (default), plus LDA topic modeling, sentence embeddings, and clustering

Modern Dashboard

Professional web interface with Airbais design system, light/dark mode, and responsive layout

Structured Exports

Outputs in llmstxt format and JSON for seamless LLM tool integration

Getting Started

Installation

1. Clone the repository

git clone https://github.com/Airbais/intent-tools.git
cd intent-tools/intentcrawler

2. Install dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_sm

3. Run your first analysis

python intentcrawler.py https://example.com --dashboard

Quick Examples

# Analyze a website
python intentcrawler.py https://example.com

Configuration

Customize the tool’s behavior through config.yaml.

How It Works

Intent Discovery Process

1. Content Extraction

Crawls website pages and extracts clean, structured content

2. Text Preprocessing

Removes noise, normalizes text, and prepares it for analysis

3. Feature Extraction

  • TF-IDF: Identifies important keywords
  • Embeddings: Captures semantic meaning
  • N-grams: Detects meaningful phrases

4. Intent Clustering

  • LDA: Discovers latent topics
  • DBSCAN: Groups semantically similar content
  • Keywords: Matches known patterns

5. Intent Merging

Combines similar intents based on a configurable similarity threshold

6. Naming & Scoring

Automatically generates descriptive intent names and confidence scores
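
To make the merging step (step 5) concrete, the sketch below merges intents whose keyword sets overlap enough. The source does not say which similarity measure IntentCrawler uses, so Jaccard similarity over keywords is an assumption here, as are the `merge_intents` function, the rule for combining confidence and page counts, and the second and third sample intents.

```python
def jaccard(a, b):
    """Jaccard similarity between two keyword collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def merge_intents(intents, threshold=0.5):
    """Greedily fold each intent into the first earlier intent that is similar enough."""
    merged = []
    for intent in intents:
        for existing in merged:
            if jaccard(intent["keywords"], existing["keywords"]) >= threshold:
                # Assumed combination rule: union keywords, sum pages, keep best confidence.
                existing["keywords"] = sorted(
                    set(existing["keywords"]) | set(intent["keywords"])
                )
                existing["page_count"] += intent["page_count"]
                existing["confidence"] = max(
                    existing["confidence"], intent["confidence"]
                )
                break
        else:
            merged.append(dict(intent))  # shallow copy so the input list is untouched
    return merged


intents = [
    {"primary_intent": "learn_integration",
     "keywords": ["api", "integration", "connect"],
     "confidence": 0.85, "page_count": 23},
    {"primary_intent": "integrate_api",        # hypothetical intent
     "keywords": ["api", "integration", "sdk"],
     "confidence": 0.70, "page_count": 5},
    {"primary_intent": "compare_pricing",      # hypothetical intent
     "keywords": ["pricing", "plans"],
     "confidence": 0.60, "page_count": 4},
]

result = merge_intents(intents, threshold=0.5)
for intent in result:
    print(intent["primary_intent"], intent["page_count"], intent["keywords"])
```

With a threshold of 0.5, the first two intents share two of four distinct keywords and are merged, while the pricing intent stays separate. Raising the threshold keeps more intents distinct.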

ML Techniques Explained

LDA Topic Modeling

Discovers latent topics across all content with configurable topic counts

Embedding Clustering

Uses sentence transformers and DBSCAN for semantic understanding
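
The following is a small, dependency-free sketch of density-based clustering in the style of DBSCAN, to show how semantically close pages end up in one group while outliers are marked as noise. The hand-written 3-dimensional vectors stand in for real sentence-transformer embeddings, and the cosine distance metric, the `eps` value, and `min_samples` are assumptions chosen for the example rather than IntentCrawler's settings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dbscan(vectors, eps, min_samples):
    """Return one cluster label per vector; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(vectors)

    def neighbors(i):
        # Every point within eps of point i (including i itself).
        return [j for j in range(len(vectors))
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


# Toy vectors standing in for page embeddings.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.95, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],   # unrelated page
]
labels = dbscan(embeddings, eps=0.05, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

Unlike LDA, this approach does not need the number of clusters up front; the density parameters decide how many groups emerge.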

Keyword Fallback

Configurable keywords ensure baseline intent detection
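
A keyword fallback can be as simple as counting configured keywords in the page text. The sketch below assumes a hypothetical `INTENT_KEYWORDS` mapping and `match_intents` helper; the real keyword list would come from the tool's configuration, whose exact format is not shown here.

```python
import re

# Hypothetical keyword map, for illustration only.
INTENT_KEYWORDS = {
    "learn_integration": ["api", "integration", "connect"],
    "compare_pricing": ["pricing", "plans", "cost"],
}


def match_intents(text, intent_keywords):
    """Return {intent: number of matched keywords} for intents with at least one hit."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = {}
    for intent, keywords in intent_keywords.items():
        count = sum(1 for keyword in keywords if keyword in tokens)
        if count:
            hits[intent] = count
    return hits


page_text = "Connect your app using our API. See the integration guide."
matches = match_intents(page_text, INTENT_KEYWORDS)
print(matches)  # {'learn_integration': 3}
```

Because it needs no model, a check like this still produces a baseline result on small sites where topic modeling or clustering has too little text to work with.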

Output Structure

Results are organized by date for easy historical tracking:

results/
├── 2024-06-26/              # Today's results
│   ├── llmstxt/
│   │   ├── llms.txt         # Main llmstxt file
│   │   └── pages/           # Individual page summaries
│   ├── intent-report.json   # Detailed intent analysis
│   ├── dashboard-data.json  # Dashboard visualization data
│   ├── intent-summary.md    # Human-readable summary
│   └── llm-export.json      # Structured for LLM tools
└── 2024-06-25/              # Yesterday's results
    └── ...
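
Because each run lands in a folder named by ISO date, scripts can locate results with plain path handling. This sketch assumes only the layout shown above; the helper names `results_dir_for` and `list_result_dates` are hypothetical, and a temporary directory stands in for a real `results/` folder.

```python
import tempfile
from datetime import date
from pathlib import Path


def results_dir_for(base, day):
    """Path of the results folder for a given date, e.g. results/2024-06-26."""
    return Path(base) / day.isoformat()


def list_result_dates(base):
    """Return the dates that have a results folder under base, newest first."""
    found = []
    for child in Path(base).iterdir():
        if not child.is_dir():
            continue
        try:
            found.append(date.fromisoformat(child.name))
        except ValueError:
            continue  # ignore folders that are not named YYYY-MM-DD
    return sorted(found, reverse=True)


# Build a throwaway layout that mirrors the tree above.
with tempfile.TemporaryDirectory() as tmp:
    for day in (date(2024, 6, 25), date(2024, 6, 26)):
        results_dir_for(tmp, day).mkdir(parents=True)
    (Path(tmp) / "notes").mkdir()  # unrelated folder, skipped by the listing

    dates = list_result_dates(tmp)
    latest_report = results_dir_for(tmp, dates[0]) / "intent-report.json"
    print([d.isoformat() for d in dates])  # ['2024-06-26', '2024-06-25']
```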

Example Output

{
  "discovered_intents": [
    {
      "primary_intent": "learn_integration",
      "confidence": 0.85,
      "keywords": ["api", "integration", "connect"],
      "representative_phrases": [
        "integrate with your application",
        "api documentation and guides"
      ],
      "page_count": 23,
      "extraction_method": "lda"
    }
  ]
}
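
A downstream script could consume this structure as follows. The sketch relies only on the fields shown in the example output; the `high_confidence_intents` helper and the 0.8 cutoff are assumptions for illustration.

```python
import json


def high_confidence_intents(report, min_confidence=0.8):
    """Names of intents whose confidence meets the cutoff."""
    return [
        intent["primary_intent"]
        for intent in report.get("discovered_intents", [])
        if intent.get("confidence", 0.0) >= min_confidence
    ]


# Same structure as the example output; in practice this would be read from a file.
report = json.loads("""
{
  "discovered_intents": [
    {
      "primary_intent": "learn_integration",
      "confidence": 0.85,
      "keywords": ["api", "integration", "connect"],
      "representative_phrases": [
        "integrate with your application",
        "api documentation and guides"
      ],
      "page_count": 23,
      "extraction_method": "lda"
    }
  ]
}
""")

print(high_confidence_intents(report))        # ['learn_integration']
print(high_confidence_intents(report, 0.9))   # []
```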

Dashboard Features

Two dashboard options are available: a local, tool-specific dashboard and a master multi-tool dashboard

Modern Design System

Airbais Design

Professional orange/gray color scheme with Inter font family

Light/Dark Mode

Toggle themes with persistent user preferences

Responsive Layout

Adapts to desktop and mobile screen sizes

Fast Performance

Optimized loading and smooth interactions

  • Total Pages: Number of pages analyzed
  • Discovered Intents: Count of unique user intents
  • Site Sections: Structural breakdown of the website
  • Confidence Indicators: Visual quality scores

Command Line Reference

  • url (string, required): The website URL to analyze
  • --config (string): Path to custom configuration file
  • --output (string): Override default output directory
  • --log-level (string): Set logging level: DEBUG, INFO, WARNING, ERROR
  • --dashboard (flag): Launch dashboard after analysis completes
  • --dashboard-only (flag): View existing results without running analysis
  • --dashboard-date (string): View results from specific date (YYYY-MM-DD)
  • --list-results (flag): List all available result dates
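
For readers who want to wrap or script the tool, the options above map naturally onto Python's `argparse`. The sketch below mirrors the documented names and types only. It is not IntentCrawler's actual parser, and it treats `url` as always required because the reference marks it so; the real tool may relax that for options such as --dashboard-only or --list-results.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="intentcrawler.py",
        description="Crawl a website and analyze user intents.",
    )
    parser.add_argument("url", help="The website URL to analyze")
    parser.add_argument("--config", help="Path to custom configuration file")
    parser.add_argument("--output", help="Override default output directory")
    parser.add_argument(
        "--log-level",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        help="Set logging level",
    )
    parser.add_argument("--dashboard", action="store_true",
                        help="Launch dashboard after analysis completes")
    parser.add_argument("--dashboard-only", action="store_true",
                        help="View existing results without running analysis")
    parser.add_argument("--dashboard-date", metavar="YYYY-MM-DD",
                        help="View results from specific date")
    parser.add_argument("--list-results", action="store_true",
                        help="List all available result dates")
    return parser


args = build_parser().parse_args(
    ["https://example.com", "--dashboard", "--log-level", "DEBUG"]
)
print(args.url, args.dashboard, args.log_level)
```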

Performance Guidelines

Processing time increases with site size and enabled ML features

Troubleshooting

Requirements

System Requirements

  • Python 3.8+
  • See requirements.txt for full dependency list
  • Optional: GPU for faster embedding processing

AI Tools Suite Integration

IntentCrawler is part of the larger Airbais AI Tools Suite, which provides a centralized dashboard

Master Dashboard

Centralized view of all AI tool results at ../dashboard/

Auto-Discovery

New tools are automatically detected and integrated

Standard Format

JSON output compatible with other suite tools

Consistent Design

Shared Airbais design system across all tools

Master Dashboard Benefits

  • Multi-Tool View: See results from all AI tools in one interface
  • Tool Selection: Dropdown to choose between different analysis tools
  • Date Selection: Browse historical results across all tools
  • Future-Ready: Architecture designed for easy tool addition

Future Roadmap

1. Multi-language Support

Expand beyond English content analysis

2. Real-time Tracking

Monitor intent changes over time with the master dashboard

3. A/B Testing

Compare intents across different site versions

4. Suite Expansion

Add sentiment analysis, performance monitoring, and SEO tools

5. API Access

Programmatic access to all suite tools through a unified API

Contributing

We welcome contributions in these key areas:

Algorithms

Additional clustering algorithms and ML techniques

Visualization

Enhanced dashboard features and data visualization

Performance

Optimization for large-scale websites

Integrations

CMS plugins and third-party tool connections