Introduction
In the age of AI assistants and LLMs, your website isn’t just serving human visitors anymore. AI agents crawl, read, and interpret your content to answer user queries. But what exactly are these agents understanding about your site? What user intents are they extracting? Intent Crawler is a Python tool designed to answer these questions by analyzing websites through the lens of user intent discovery.
The Problem: Understanding AI’s Understanding
When ChatGPT, Claude, or other AI assistants reference your website, they’re making decisions about what your users want to accomplish. They’re identifying pain points, extracting action sequences, and categorizing user goals. Traditional analytics tell you what pages users visit, but they don’t reveal what AI agents think those users are trying to achieve.
Intent Crawler bridges this gap by:
- Crawling your website respectfully and comprehensively
- Processing content to extract meaningful signals
- Discovering user intents dynamically from actual content
- Generating structured data that shows how AI might interpret your site
Architecture Overview
Intent Crawler follows a modular pipeline architecture that processes websites in distinct stages:
Website URL → Crawler → Content Processor → Intent Extractor → Report Generator → Dashboard
Each component is designed to solve a specific challenge in understanding web content from an AI perspective.
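To make the stages concrete, here is a minimal sketch of how they might be chained in a script. The class names mirror those discussed below, but the constructor arguments and the extractor's entry-point method are assumptions for illustration, not the tool's exact API.

# Hypothetical end-to-end run; constructor arguments and method names are illustrative
crawler = WebCrawler(base_url="https://example.com", delay=1.0)
pages = crawler.crawl(use_sitemap=True)

processor = ContentProcessor()
processed = {page.url: processor.process_content(page.raw_html, page.url)
             for page in pages if page.raw_html}

extractor = UserIntentExtractor()
intents = extractor.extract_intents(list(processed.values()))  # assumed entry point

# The report generator and dashboard stages then consume the discovered intents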
Stage 1: Intelligent Web Crawling
The journey begins with the WebCrawler class, which implements a sophisticated crawling strategy:
Sitemap-First Approach
def crawl(self, start_url: Optional[str] = None, use_sitemap: bool = True) -> List[CrawledPage]:
    if use_sitemap:
        sitemap_urls = self._discover_urls()
        if sitemap_urls:
            urls_to_visit.extend(sitemap_urls)
The crawler prioritizes sitemap discovery because:
- Efficiency: Sitemaps provide a complete URL list without recursive crawling
- Respect: Following sitemaps shows respect for the site’s intended structure
- Completeness: Important pages might not be linked but are listed in sitemaps
Robots.txt Compliance
def _can_fetch(self, url: str) -> bool:
    if not self.robots_parser:
        return True
    return self.robots_parser.can_fetch(self.session.headers['User-Agent'], url)
The crawler respects robots.txt rules, checking both crawl permissions and discovering additional sitemaps listed in the robots file.
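A minimal sketch of how this can be wired up with Python's standard urllib.robotparser, which also exposes any sitemaps declared in robots.txt (the helper name and user agent here are illustrative, not the tool's actual code):

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def load_robots(base_url: str):
    # Fetch and parse robots.txt for the target site
    parser = RobotFileParser()
    parser.set_url(urljoin(base_url, "/robots.txt"))
    parser.read()
    # site_maps() returns the sitemap URLs listed in robots.txt, or None
    return parser, (parser.site_maps() or [])

parser, sitemaps = load_robots("https://example.com")
allowed = parser.can_fetch("IntentCrawler/1.0", "https://example.com/docs/")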
Smart Content Storage
Each crawled page stores both raw HTML and extracted content:
@dataclass
class CrawledPage:
    url: str
    title: str
    content: str
    links: List[str]
    section: Optional[str] = None
    metadata: Optional[Dict] = None
    raw_html: Optional[str] = None
This dual storage allows different processing strategies downstream without re-crawling.
Stage 2: Content Processing and Extraction
The ContentProcessor employs multiple extraction strategies to handle diverse website structures:
Multi-Library Approach
def process_content(self, html_content: str, url: str) -> Optional[ProcessedContent]:
    # Try extractors in order of quality, falling back until one succeeds
    result = self.process_with_trafilatura(html_content, url)
    if not result:
        result = self.process_with_newspaper(html_content)
    if not result:
        result = self.process_with_beautifulsoup(html_content, url)
    return result
This fallback strategy ensures content extraction even from challenging pages (a sketch of the first extractor follows the list):
- Trafilatura: Excellent for articles and blog posts
- Newspaper3k: Strong with news-style content
- BeautifulSoup: Fallback for custom structures
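As a rough illustration, the trafilatura-based helper could look like the sketch below; the function shape and the length threshold are assumptions, but trafilatura.extract is the library's real entry point:

import trafilatura

def process_with_trafilatura(html_content: str, url: str):
    # Returns the page's main text, or None so the caller can fall back
    text = trafilatura.extract(html_content, url=url, include_comments=False)
    if not text or len(text.strip()) < 100:  # illustrative quality threshold
        return None
    return text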
Intelligent Summarization
Rather than using the first N characters, the processor creates contextual summaries:
def _create_summary(self, content: str, title: str) -> str:
    # Split into sentences and drop short fragments
    sentences = re.split(r'[.!?]+', content)
    sentences = [s.strip() for s in sentences if len(s.strip()) > 20]
    if not sentences:
        return content[:self.max_summary_length]
    summary = sentences[0]
    for sentence in sentences[1:]:
        if len(summary + ' ' + sentence) <= self.max_summary_length:
            summary += ' ' + sentence
    return summary
This preserves complete thoughts and provides meaningful context for intent analysis.
Stage 3: User Intent Discovery
The heart of Intent Crawler is its intent extraction system. The UserIntentExtractor focuses on understanding what users are trying to accomplish:
Intent Pattern Recognition
self.intent_patterns = {
    'learn_and_understand': {
        'signals': [
            r'\b(?:how to|tutorial|guide|step by step)\b',
            r'\b(?:learn|understand|explain|what is)\b'
        ],
        'user_goals': ['acquire new skills', 'understand concepts'],
        'pain_points': ['lack of knowledge', 'confusion']
    },
    'solve_problem': {
        'signals': [
            r'\b(?:troubleshoot|fix|solve|error|issue)\b',
            r'\b(?:not working|broken|failed|help)\b'
        ],
        'user_goals': ['fix issues', 'get unblocked'],
        'pain_points': ['system not working', 'stuck on task']
    }
}
These patterns capture different user motivations (a scoring sketch follows this list):
- Research & Compare: Users evaluating options
- Learn & Understand: Users seeking knowledge
- Solve Problems: Users facing challenges
- Implement & Integrate: Users building solutions
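A minimal sketch of how these regex signals might be turned into per-intent scores for a page (the normalization scheme is an assumption for illustration):

import re

def score_intents(text: str, intent_patterns: dict) -> dict:
    scores = {}
    lowered = text.lower()
    for intent, spec in intent_patterns.items():
        # Count how many times any signal pattern fires in the text
        hits = sum(len(re.findall(pattern, lowered)) for pattern in spec['signals'])
        # Normalize per ~1,000 characters so long pages don't dominate (assumed scheme)
        scores[intent] = min(hits / max(len(lowered) / 1000, 1), 1.0)
    return scores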
Multi-Signal Analysis
The extractor doesn’t rely on keywords alone. It analyzes multiple signal types:
def _analyze_page_intent(self, content: ProcessedContent) -> Dict[str, any]:
    user_signals = self._extract_user_signals(full_text)
    action_sequences = self._extract_action_sequences(full_text)
    pain_indicators = self._extract_pain_indicators(full_text)
    outcome_indicators = self._extract_outcome_indicators(full_text)
Using spaCy’s NLP capabilities, the tool identifies what users want to do:
def _extract_action_sequences(self, text: str) -> List[str]:
    # Cap input size to keep spaCy's memory use bounded
    doc = self.nlp(text[:1000000])
    action_sequences = []
    for sent in doc.sents:
        # Imperative sentences typically open with a root verb
        if sent[0].pos_ == "VERB" and sent[0].dep_ == "ROOT":
            action_sequences.append(sent.text.strip())
    return action_sequences
This catches imperative sentences like “Configure your API key” or “Download the installer”.
Pain Point Detection
Understanding user frustrations helps identify problem-solving intents:
pain_patterns = [
    r'\b(?:difficult|hard|challenging|complex|confusing)\b',
    r'\b(?:can\'t|cannot|unable to|doesn\'t work)\b',
    r'\b(?:slow|expensive|time-consuming|inefficient)\b'
]
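A sketch of how an extractor might apply these patterns, keeping the matching sentences so they can be surfaced in reports (the helper shape is assumed):

import re

def extract_pain_indicators(text: str, pain_patterns: list) -> list:
    indicators = []
    for sentence in re.split(r'[.!?]+', text):
        # Keep sentences containing at least one pain signal
        if any(re.search(pattern, sentence, re.IGNORECASE) for pattern in pain_patterns):
            indicators.append(sentence.strip())
    return indicators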
Intent Clustering and Confidence
Pages are grouped by dominant intent with confidence scoring:
def _calculate_intent_confidence(self, analyses: List[Dict], intent_type: str) -> float:
    strong_signals = sum(1 for analysis in analyses
                         if analysis['intent_scores'].get(intent_type, 0) > 0.2)
    consistency_ratio = strong_signals / len(analyses)
    avg_signal_strength = np.mean([...])
    return min(consistency_ratio * avg_signal_strength, 1.0)
This ensures discovered intents are backed by consistent signals across many pages, not just one-off keyword matches.
Stage 4: Dynamic Intent Discovery Methods
Beyond pattern matching, Intent Crawler offers advanced ML-based discovery:
Topic Modeling with LDA
def _apply_lda_clustering(self, tfidf_matrix, feature_names: List[str]) -> List[Dict]:
    lda = LatentDirichletAllocation(
        n_components=self.config.get('lda_topics', 10),
        random_state=42
    )
    lda_matrix = lda.fit_transform(tfidf_matrix)
LDA discovers latent topics across all content, revealing intent patterns that might not match predefined categories.
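The topics come out as word distributions; a small self-contained example of turning an LDA fit into human-readable topic keywords (the corpus and parameters are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "how to configure the api key and install the sdk",
    "troubleshoot connection errors and fix failed requests",
    "compare pricing plans and evaluate enterprise options",
]
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(tfidf_matrix)

feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    # The five highest-weighted words characterize each topic
    top_words = [feature_names[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top_words)}")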
Semantic Clustering with Embeddings
if self.embeddings_model:
    embeddings = self.embeddings_model.encode(texts)
    clustering = DBSCAN(eps=0.5, min_samples=2, metric='cosine')
    clusters = clustering.fit_predict(embeddings)
Sentence transformers create semantic embeddings, allowing the tool to group conceptually similar pages even with different vocabulary.
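A self-contained sketch of that idea with sentence-transformers and scikit-learn (the model name is a common default and an assumption here, not necessarily what the tool ships with):

from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

texts = [
    "How to install the CLI",
    "Installing the command line tool",
    "Compare pricing tiers for teams",
]
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts)

# eps/min_samples mirror the values above; label -1 marks noise points
labels = DBSCAN(eps=0.5, min_samples=2, metric='cosine').fit_predict(embeddings)

clusters = defaultdict(list)
for text, label in zip(texts, labels):
    clusters[label].append(text)
print(dict(clusters))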
Stage 5: Structured Output Generation
Intent Crawler generates multiple output formats for different use cases:
llms.txt Format
Following the llms.txt specification, the tool creates AI-readable summaries:
def format_as_llmstxt(self, pages: List[CrawledPage],
                      processed_contents: Dict[str, ProcessedContent]) -> str:
    output.append(f"# {self.site_name}")
    output.append(f"\n> {self._generate_site_description(pages)}")
    for section, section_pages in sections.items():
        output.append(f"\n## {section.title()}")
        for page in section_pages[:5]:
            output.append(f"- {content.title}: {content.summary}")
This format is optimized for LLM consumption with clear hierarchy and concise summaries.
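For example, the generated file might look roughly like this (site name, sections, and summaries invented for illustration):

# Example Docs
> Developer documentation for the Example platform, covering setup, APIs, and troubleshooting.

## Guides
- Getting Started: Install the SDK and make your first API call in minutes.
- Authentication: Configure API keys and OAuth flows for server and client apps.

## Troubleshooting
- Common Errors: Diagnose and fix frequent connection and quota issues.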
Interactive Dashboard
The Dash-based dashboard provides visual intent analysis:
def create_intent_distribution_chart(self, intent_data: List[Dict]) -> go.Figure:
    fig = go.Figure(data=[
        go.Bar(
            x=intent_names,
            y=page_counts,
            marker_color=confidence_scores,
            text=[f"{conf:.0%}" for conf in confidence_scores]
        )
    ])
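A minimal sketch of how such a figure might be dropped into a Dash layout (assuming fig is the go.Figure built by the helper above):

from dash import Dash, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Intent Distribution"),
    dcc.Graph(figure=fig),  # fig: the chart returned by create_intent_distribution_chart
])

if __name__ == "__main__":
    app.run(debug=True)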
Features include:
- Intent distribution visualization
- Section-by-section analysis
- Confidence score indicators
- Export capabilities
Technical Optimizations
- Rate Limiting: Configurable delays prevent overwhelming servers (see the sketch after this list)
- Concurrent Processing: Multiple analysis methods run in parallel
- Memory Management: Large texts are truncated for NLP processing
- Caching: Results are date-organized for easy retrieval
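As referenced above, the rate limiter can be as simple as a configurable pause before each request; a minimal sketch (the delay value is illustrative):

import time
import requests

def polite_get(session: requests.Session, url: str, delay: float = 1.0) -> requests.Response:
    # Pause before every request so the target server is never hammered
    time.sleep(delay)
    return session.get(url, timeout=10)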
Scalability Design
def manage_results_directory(config: Dict[str, Any]) -> str:
    # Automatic cleanup of old results
    if keep_past_results >= 0:
        cleanup_old_results(base_dir, keep_past_results, date_format)
The tool automatically manages disk space by cleaning up old results while preserving recent analyses.
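A sketch of what that cleanup might look like, assuming results live in date-named directories (the directory layout and the semantics of keep_past_results are assumptions):

import os
import shutil
from datetime import datetime

def cleanup_old_results(base_dir: str, keep_past_results: int, date_format: str = "%Y-%m-%d") -> None:
    def is_dated(name: str) -> bool:
        try:
            datetime.strptime(name, date_format)
            return True
        except ValueError:
            return False

    # Date-named run directories sort chronologically with a year-first format
    runs = sorted(d for d in os.listdir(base_dir)
                  if is_dated(d) and os.path.isdir(os.path.join(base_dir, d)))
    # Keep only the most recent N runs (0 removes all past results, by assumption)
    to_delete = runs if keep_past_results == 0 else runs[:-keep_past_results]
    for old_run in to_delete:
        shutil.rmtree(os.path.join(base_dir, old_run))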
Real-World Impact
Intent Crawler reveals insights that traditional analytics miss:
- Content Gaps: Discover intents users seek but your content doesn’t address
- AI Readiness: Understand how well your content serves AI agents
- User Journey Mapping: See the problems users are trying to solve
- Content Strategy: Align content creation with discovered user needs
Conclusion
Intent Crawler solves a critical problem in the AI era: understanding how machines interpret your website’s purpose. By combining respectful crawling, intelligent content processing, and sophisticated intent discovery, it provides a window into your site’s AI-perceived value.
The tool’s modular architecture makes it extensible - new intent patterns, ML models, or output formats can be added without restructuring the core pipeline. As AI agents become increasingly important traffic sources, tools like Intent Crawler help ensure your content effectively communicates user value to both human and artificial visitors.
Whether you’re optimizing for ChatGPT citations, improving content strategy, or simply curious about your site’s AI interpretation, Intent Crawler provides the insights needed to thrive in an AI-augmented web.