Introduction
In the age of AI-powered search and recommendations, understanding how your brand is represented across different Large Language Models (LLMs) has become crucial for digital marketers and brand managers. This technical deep dive explores the architecture and implementation of LLM Evaluator, a tool that analyzes brand representation across multiple LLM providers.
The Problem: Brand Visibility in the AI Era
As users increasingly rely on LLMs for recommendations, product research, and general information, brands face a new challenge: how do AI models perceive and represent their brand? Unlike traditional search engines where SEO tactics can influence rankings, LLM responses are generated from training data and can vary significantly between providers.
LLM Evaluator addresses this challenge by providing:
- Multi-LLM comparison: How does your brand fare across GPT-4, Claude, and other models?
- Sentiment analysis: Are mentions positive, negative, or neutral?
- Context awareness: Is your brand mentioned as a recommendation, comparison, or example?
- Competitive benchmarking: How do you stack up against competitors?
Architecture Overview
LLM Evaluator follows a modular pipeline architecture designed for scalability and maintainability:
Core Components
1. Configuration Manager (config.py)
The configuration system uses markdown files for human-readable brand definitions:
class ConfigManager:
    def __init__(self, config_path: str):
        self.config_path = config_path
        self.brand_info = {}
        self.llm_configs = {}
        self.prompts = {}
        self.settings = {}
        self.parse_config()

    def parse_config(self):
        """Parse markdown configuration into structured data"""
        with open(self.config_path, 'r') as f:
            content = f.read()

        # Extract sections using regex patterns
        sections = self._extract_sections(content)
        self._parse_brand_info(sections.get('brand_info', ''))
        self._parse_llm_configs(sections.get('llm_configs', ''))
        self._parse_prompts(sections.get('prompts', ''))
This design allows non-technical users to modify brand information, competitors, and evaluation prompts without touching code.
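To make that concrete, a brand configuration in this style might look like the sketch below. The section names, brand, and competitors are illustrative assumptions, not the project's exact schema:

## Brand Info
- Name: Acme Analytics
- Aliases: Acme, AcmeA
- Competitors: DataCo, InsightHub

## LLM Configs
- provider: openai, model: gpt-4, temperature: 0.7
- provider: anthropic, model: claude-3-opus, temperature: 0.7

## Prompts
- What are the best analytics platforms for small businesses?
- How does Acme Analytics compare to its competitors?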
2. LLM Interface (llm_interface.py)
The LLM interface provides a unified API across multiple providers:
class LLMInterface:
    def __init__(self, config: Dict[str, Any]):
        self.provider = config['provider']
        self.model = config['model']
        self.temperature = config.get('temperature', 0.7)
        self.max_tokens = config.get('max_tokens', 1000)

        # Initialize provider-specific clients
        if self.provider == 'openai':
            self.client = openai.OpenAI(api_key=config['api_key'])
        elif self.provider == 'anthropic':
            self.client = anthropic.Anthropic(api_key=config['api_key'])
The interface handles provider-specific authentication, request formatting, and response parsing, abstracting these details from the evaluation logic.
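The request path itself isn't shown in the snippet above. A provider-dispatching generate_response could look roughly like the following sketch, written against the public OpenAI and Anthropic Python SDKs; the method body is an assumption, not the project's actual code:

def generate_response(self, prompt: str) -> str:
    """Send a single prompt to the configured provider and return plain text (sketch)."""
    if self.provider == 'openai':
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return resp.choices[0].message.content
    elif self.provider == 'anthropic':
        resp = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    raise ValueError(f"Unsupported provider: {self.provider}")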
3. Prompt Executor (prompt_executor.py)
The prompt executor implements intelligent caching and batch processing:
class PromptExecutor:
    def __init__(self, llm_interfaces: List[LLMInterface], cache_dir: str = None):
        self.llm_interfaces = llm_interfaces
        self.cache = Cache(cache_dir) if cache_dir else None

    def execute_prompts(self, prompts: List[str]) -> Dict[str, List[str]]:
        """Execute prompts against all LLMs with caching"""
        results = {}

        for llm in self.llm_interfaces:
            llm_key = f"{llm.provider}_{llm.model}"
            results[llm_key] = []

            for prompt in tqdm(prompts, desc=f"Processing {llm_key}"):
                cache_key = self._generate_cache_key(prompt, llm)

                if self.cache and cache_key in self.cache:
                    response = self.cache[cache_key]
                else:
                    response = llm.generate_response(prompt)
                    if self.cache:
                        self.cache[cache_key] = response

                results[llm_key].append(response)

        return results
The caching system significantly reduces API costs during development and testing, while the progress tracking provides user feedback during long evaluation runs.
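The Cache object above only needs dict-style membership tests and item access, so any store with that interface works. A minimal sketch assuming the diskcache package (the actual backing store isn't named in the article):

from diskcache import Cache

cache = Cache(".llm_cache")              # persistent on-disk key/value store
cache["example-key"] = "cached response"

if "example-key" in cache:               # same membership test used in execute_prompts
    print(cache["example-key"])          # -> "cached response"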
4. Response Analyzer (analyzer.py)
The analyzer performs sophisticated text analysis to extract brand insights:
class ResponseAnalyzer:
    def __init__(self, brand_info: Dict[str, Any]):
        self.brand_name = brand_info['name']
        self.brand_aliases = brand_info.get('aliases', [])
        self.competitors = brand_info.get('competitors', [])
        self.sentiment_analyzer = TextBlob

    def analyze_response(self, response: str) -> Dict[str, Any]:
        """Analyze a single LLM response for brand mentions"""
        analysis = {
            'mention_found': False,
            'mention_position': None,
            'context_type': None,
            'sentiment': None,
            'competitor_mentions': []
        }

        # Brand mention detection
        brand_patterns = [self.brand_name] + self.brand_aliases
        for pattern in brand_patterns:
            if self._find_mention(response, pattern):
                analysis['mention_found'] = True
                analysis['mention_position'] = self._get_mention_position(response, pattern)
                analysis['context_type'] = self._classify_context(response, pattern)
                analysis['sentiment'] = self._analyze_sentiment(response, pattern)
                break

        # Competitor analysis
        for competitor in self.competitors:
            if self._find_mention(response, competitor):
                analysis['competitor_mentions'].append(competitor)

        return analysis
The analyzer uses regex patterns for mention detection and combines TextBlob sentiment analysis with LLM-based sentiment classification for more nuanced results.
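The mention-detection helper itself isn't shown; one plausible implementation is a case-insensitive, whole-word regex search (a sketch, not the project's actual code):

import re

def _find_mention(self, response: str, pattern: str) -> bool:
    """Return True if the brand name or alias appears as a whole word (sketch)."""
    return re.search(rf"\b{re.escape(pattern)}\b", response, re.IGNORECASE) is not None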
5. Metrics Calculator (metrics.py)
The metrics calculator generates aggregate insights across all LLMs:
class MetricsCalculator:
    def calculate_metrics(self, results: Dict[str, List[Dict]]) -> Dict[str, Any]:
        """Calculate comprehensive metrics across all LLMs"""
        metrics = {
            'per_llm_metrics': {},
            'cross_llm_metrics': {}
        }

        # Per-LLM metrics
        for llm_key, responses in results.items():
            mention_rate = sum(1 for r in responses if r['mention_found']) / len(responses)
            avg_sentiment = np.mean([r['sentiment'] for r in responses if r['sentiment']])

            metrics['per_llm_metrics'][llm_key] = {
                'mention_rate': mention_rate,
                'average_sentiment': avg_sentiment,
                'total_responses': len(responses),
                'mentions_found': sum(1 for r in responses if r['mention_found'])
            }

        # Cross-LLM comparative metrics
        metrics['cross_llm_metrics'] = {
            'consensus_score': self._calculate_consensus(results),
            'sentiment_alignment': self._calculate_sentiment_alignment(results),
            'mention_rate_variance': self._calculate_variance(results)
        }

        return metrics
These metrics provide both individual LLM performance and comparative analysis, helping users understand consistency across providers.
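The consensus calculation isn't shown in the article. One reasonable definition is the share of prompts on which every LLM agrees about whether the brand was mentioned; the sketch below assumes that definition and that responses are aligned by prompt index across providers:

import numpy as np
from typing import Any, Dict, List

def _calculate_consensus(self, results: Dict[str, List[Dict[str, Any]]]) -> float:
    """Fraction of prompts where all LLMs agree on mention_found (sketch)."""
    mention_matrix = np.array([
        [r['mention_found'] for r in responses]
        for responses in results.values()
    ])                                              # shape: (num_llms, num_prompts)
    agreement = np.all(mention_matrix == mention_matrix[0], axis=0)
    return float(agreement.mean())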
Advanced Features
Intelligent Caching System
The disk-based caching system uses content hashing to ensure cache validity:
def _generate_cache_key(self, prompt: str, llm: LLMInterface) -> str:
    """Generate unique cache key for prompt + LLM configuration"""
    content = f"{prompt}_{llm.provider}_{llm.model}_{llm.temperature}_{llm.max_tokens}"
    return hashlib.md5(content.encode()).hexdigest()
This approach allows for fine-grained cache invalidation when LLM parameters change while maintaining cache hits for identical requests.
Sentiment Analysis Hybrid Approach
The tool implements a hybrid sentiment analysis system:
def _analyze_sentiment(self, response: str, brand_mention: str) -> float:
    """Hybrid sentiment analysis using TextBlob + LLM"""
    # Extract context around brand mention
    context = self._extract_context(response, brand_mention, window=100)

    # TextBlob baseline
    textblob_sentiment = TextBlob(context).sentiment.polarity

    # LLM-based sentiment for nuanced analysis
    llm_sentiment = self._llm_sentiment_analysis(context)

    # Weighted combination
    return 0.7 * llm_sentiment + 0.3 * textblob_sentiment
This approach combines the speed of rule-based analysis with the nuanced understanding of LLM-based sentiment classification.
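The LLM side of the hybrid isn't shown. A minimal sketch of _llm_sentiment_analysis, assuming a dedicated LLMInterface instance (here called self.sentiment_llm, which is hypothetical) and a numeric-only reply:

def _llm_sentiment_analysis(self, context: str) -> float:
    """Ask an LLM to score sentiment on a -1..1 scale (sketch)."""
    prompt = (
        "Rate the sentiment toward the mentioned brand in the text below on a scale "
        "from -1 (very negative) to 1 (very positive). Reply with a single number.\n\n"
        f"Text: {context}"
    )
    raw = self.sentiment_llm.generate_response(prompt)    # hypothetical helper LLM client
    try:
        return max(-1.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0                                        # fall back to neutral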
Context Classification
The system classifies brand mentions into specific contexts:
def _classify_context(self, response: str, brand_mention: str) -> str:
    """Classify the context of brand mention"""
    context_patterns = {
        'recommendation': [r'I recommend', r'suggest', r'should try'],
        'comparison': [r'compared to', r'versus', r'better than'],
        'example': [r'for example', r'such as', r'like'],
        'explanation': [r'is a', r'provides', r'offers']
    }

    mention_context = self._extract_context(response, brand_mention, window=50)

    for context_type, patterns in context_patterns.items():
        if any(re.search(pattern, mention_context, re.IGNORECASE) for pattern in patterns):
            return context_type

    return 'general'
This classification helps users understand not just whether their brand is mentioned, but how it’s being positioned.
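Both the sentiment and context routines lean on an _extract_context helper that the article doesn't show; one plausible implementation simply slices a character window around the first mention:

import re

def _extract_context(self, response: str, brand_mention: str, window: int = 50) -> str:
    """Return the text within `window` characters of the first brand mention (sketch)."""
    match = re.search(re.escape(brand_mention), response, re.IGNORECASE)
    if not match:
        return response
    start = max(0, match.start() - window)
    end = min(len(response), match.end() + window)
    return response[start:end]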
Integration and Output
Dashboard Integration
The tool generates dashboard-compatible JSON output:
def generate_dashboard_data(self, results: Dict[str, Any]) -> Dict[str, Any]:
    """Generate dashboard-compatible output"""
    dashboard_data = {
        'metadata': {
            'brand_name': self.brand_name,
            'evaluation_date': datetime.now().isoformat(),
            'llm_providers': list(results.keys())
        },
        'summary_metrics': {
            'overall_mention_rate': self._calculate_overall_mention_rate(results),
            'sentiment_distribution': self._calculate_sentiment_distribution(results),
            'consensus_score': self._calculate_consensus_score(results)
        },
        'detailed_results': results
    }

    return dashboard_data
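Writing that structure to disk is then a one-liner; the file name and variable names below are illustrative, not taken from the project:

import json

dashboard_data = evaluator.generate_dashboard_data(results)   # hypothetical evaluator instance
with open("dashboard_data.json", "w") as f:
    json.dump(dashboard_data, f, indent=2, default=str)        # default=str guards non-JSON values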
Report Generation
The system also generates human-readable reports:
def generate_text_report(self, metrics: Dict[str, Any]) -> str:
    """Generate comprehensive text report"""
    report = []
    report.append(f"# Brand Evaluation Report: {self.brand_name}")
    report.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    report.append("")

    # Summary section
    report.append("## Executive Summary")
    report.append(f"- Overall mention rate: {metrics['overall_mention_rate']:.1%}")
    report.append(f"- Average sentiment: {metrics['average_sentiment']:.2f}")
    report.append(f"- LLM consensus: {metrics['consensus_score']:.1%}")

    return "\n".join(report)
Concurrent Processing
The tool implements concurrent LLM requests to improve performance:
async def execute_prompts_async(self, prompts: List[str]) -> Dict[str, List[str]]:
    """Execute prompts across multiple LLMs concurrently"""
    tasks = []

    for llm in self.llm_interfaces:
        for prompt in prompts:
            task = asyncio.create_task(self._execute_single_prompt(llm, prompt))
            tasks.append(task)

    results = await asyncio.gather(*tasks)
    return self._organize_results(results)
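The per-prompt coroutine isn't shown. Since the underlying SDK calls are blocking, one simple approach is to push each call onto a worker thread and tag the result so _organize_results can group it by LLM; a sketch under those assumptions:

import asyncio
from typing import Tuple

async def _execute_single_prompt(self, llm: LLMInterface, prompt: str) -> Tuple[str, str, str]:
    """Run one blocking LLM call off the event loop and tag it with its LLM key (sketch)."""
    response = await asyncio.to_thread(llm.generate_response, prompt)
    return f"{llm.provider}_{llm.model}", prompt, response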
Rate Limiting
Built-in rate limiting prevents API quota exhaustion:
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def generate_response(self, prompt: str) -> str:
    """Generate response with retry logic and rate limiting"""
    time.sleep(self.rate_limit_delay)

    try:
        return self._make_api_call(prompt)
    except Exception as e:
        logger.warning(f"API call failed: {e}")
        raise
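The retry decorator above matches the API of the tenacity library; the imports and the delay setting it relies on are not shown, so the following setup is an assumption:

import time
import logging

from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

# In LLMInterface.__init__ (assumed default, not shown in the article):
# self.rate_limit_delay = config.get('rate_limit_delay', 1.0)  # seconds between requests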
Conclusion
LLM Evaluator demonstrates how to build a sophisticated AI monitoring system that provides actionable insights for brand management. By combining multiple LLM providers, intelligent caching, and comprehensive analysis, the tool offers a robust solution for understanding brand representation in the age of AI.
The modular architecture ensures maintainability and extensibility, while the markdown-based configuration system makes it accessible to non-technical users. The comprehensive metrics and dashboard integration provide both high-level insights and detailed analysis for data-driven brand management decisions.
As LLMs become increasingly important in shaping consumer perceptions, tools like this will be essential for brands seeking to understand and optimize their AI-era presence.