Introduction

In the age of AI-powered search and recommendations, understanding how your brand is represented across different Large Language Models (LLMs) has become crucial for digital marketers and brand managers. This technical deep dive explores the architecture and implementation of LLM Evaluator, a tool that analyzes brand representation across multiple LLM providers.

The Problem: Brand Visibility in the AI Era

As users increasingly rely on LLMs for recommendations, product research, and general information, brands face a new challenge: how do AI models perceive and represent their brand? Unlike traditional search engines where SEO tactics can influence rankings, LLM responses are generated from training data and can vary significantly between providers.

LLM Evaluator addresses this challenge by providing:

  • Multi-LLM comparison: How does your brand fare across GPT-4, Claude, and other models?
  • Sentiment analysis: Are mentions positive, negative, or neutral?
  • Context awareness: Is your brand mentioned as a recommendation, comparison, or example?
  • Competitive benchmarking: How do you stack up against competitors?

Architecture Overview

LLM Evaluator follows a modular pipeline architecture designed for scalability and maintainability:
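
At a high level, these components chain together in a straightforward way. The sketch below shows one plausible wiring of the pipeline; the entry-point name, cache directory, and the exact shapes of config.llm_configs and config.prompts are illustrative assumptions, not the tool's actual CLI:

import json

# Simplified wiring of the pipeline; the entry-point name and the exact
# shapes of config.llm_configs / config.prompts are assumptions.
def run_evaluation(config_path: str, output_path: str) -> None:
    config = ConfigManager(config_path)

    # One interface per configured provider/model
    llms = [LLMInterface(llm_cfg) for llm_cfg in config.llm_configs.values()]

    # Run every prompt against every LLM, with disk caching
    executor = PromptExecutor(llms, cache_dir=".cache")
    raw_responses = executor.execute_prompts(list(config.prompts.values()))

    # Analyze each response, then aggregate into metrics
    analyzer = ResponseAnalyzer(config.brand_info)
    analyses = {
        llm_key: [analyzer.analyze_response(r) for r in responses]
        for llm_key, responses in raw_responses.items()
    }
    metrics = MetricsCalculator().calculate_metrics(analyses)

    with open(output_path, 'w') as f:
        json.dump(metrics, f, indent=2)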

Core Components

1. Configuration Manager (config.py)

The configuration system uses markdown files for human-readable brand definitions:

class ConfigManager:
    def __init__(self, config_path: str):
        self.config_path = config_path
        self.brand_info = {}
        self.llm_configs = {}
        self.prompts = {}
        self.settings = {}
        self.parse_config()
    
    def parse_config(self):
        """Parse markdown configuration into structured data"""
        with open(self.config_path, 'r') as f:
            content = f.read()
        
        # Extract sections using regex patterns
        sections = self._extract_sections(content)
        self._parse_brand_info(sections.get('brand_info', ''))
        self._parse_llm_configs(sections.get('llm_configs', ''))
        self._parse_prompts(sections.get('prompts', ''))

This design allows non-technical users to modify brand information, competitors, and evaluation prompts without touching code.
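
A configuration file might look something like the following. The section headings, brand, competitors, and models here are purely illustrative; the exact headings the parser expects are whatever _extract_sections is written to match:

# Brand Info
- Name: Acme Analytics
- Aliases: Acme, AcmeA
- Competitors: DataCo, InsightHub

# LLM Configs
- provider: openai, model: gpt-4, temperature: 0.7, max_tokens: 1000
- provider: anthropic, model: claude-3-5-sonnet, temperature: 0.7, max_tokens: 1000

# Prompts
- What are the best analytics platforms for a small e-commerce business?
- Compare the most popular web analytics tools.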

2. LLM Interface (llm_interface.py)

The LLM interface provides a unified API across multiple providers:

from typing import Any, Dict

import anthropic
import openai

class LLMInterface:
    def __init__(self, config: Dict[str, Any]):
        self.provider = config['provider']
        self.model = config['model']
        self.temperature = config.get('temperature', 0.7)
        self.max_tokens = config.get('max_tokens', 1000)
        
        # Initialize provider-specific clients
        if self.provider == 'openai':
            self.client = openai.OpenAI(api_key=config['api_key'])
        elif self.provider == 'anthropic':
            self.client = anthropic.Anthropic(api_key=config['api_key'])
        else:
            raise ValueError(f"Unsupported provider: {self.provider}")

The interface handles provider-specific authentication, request formatting, and response parsing, abstracting these details from the evaluation logic.
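
As an example, the provider dispatch inside the response method could look roughly like this. This is a sketch based on the current OpenAI and Anthropic Python SDKs; retry handling and rate limiting, covered later, are omitted here:

def generate_response(self, prompt: str) -> str:
    """Send the prompt to the configured provider and return plain text."""
    if self.provider == 'openai':
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response.choices[0].message.content
    elif self.provider == 'anthropic':
        response = self.client.messages.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        # The Anthropic SDK returns a list of content blocks
        return response.content[0].text
    raise ValueError(f"Unsupported provider: {self.provider}")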

3. Prompt Executor (prompt_executor.py)

The prompt executor implements intelligent caching and batch processing:

class PromptExecutor:
    def __init__(self, llm_interfaces: List[LLMInterface], cache_dir: str = None):
        self.llm_interfaces = llm_interfaces
        # Cache is a disk-backed key-value store (e.g. diskcache.Cache)
        self.cache = Cache(cache_dir) if cache_dir else None
        
    def execute_prompts(self, prompts: List[str]) -> Dict[str, List[str]]:
        """Execute prompts against all LLMs with caching"""
        results = {}
        
        for llm in self.llm_interfaces:
            llm_key = f"{llm.provider}_{llm.model}"
            results[llm_key] = []
            
            for prompt in tqdm(prompts, desc=f"Processing {llm_key}"):
                cache_key = self._generate_cache_key(prompt, llm)
                
                if self.cache and cache_key in self.cache:
                    response = self.cache[cache_key]
                else:
                    response = llm.generate_response(prompt)
                    if self.cache:
                        self.cache[cache_key] = response
                
                results[llm_key].append(response)
        
        return results

The caching system significantly reduces API costs during development and testing, while the progress tracking provides user feedback during long evaluation runs.
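
Typical usage is a one-liner once the interfaces are built (the prompt and cache directory below are placeholders):

# Illustrative usage; the prompt and cache directory are placeholders
executor = PromptExecutor(llm_interfaces=llms, cache_dir=".llm_cache")
responses = executor.execute_prompts([
    "Which analytics platforms would you recommend for a small retailer?",
])
# `responses` maps "provider_model" keys to lists of response strings,
# e.g. responses["openai_gpt-4"][0] is GPT-4's answer to the first prompt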

4. Response Analyzer (analyzer.py)

The analyzer inspects each response for brand mentions, competitor mentions, sentiment, and the context in which the brand appears:

class ResponseAnalyzer:
    def __init__(self, brand_info: Dict[str, Any]):
        self.brand_name = brand_info['name']
        self.brand_aliases = brand_info.get('aliases', [])
        self.competitors = brand_info.get('competitors', [])
        # TextBlob supplies the rule-based half of the sentiment score (see below)
        self.sentiment_analyzer = TextBlob
        
    def analyze_response(self, response: str) -> Dict[str, Any]:
        """Analyze a single LLM response for brand mentions"""
        analysis = {
            'mention_found': False,
            'mention_position': None,
            'context_type': None,
            'sentiment': None,
            'competitor_mentions': []
        }
        
        # Brand mention detection
        brand_patterns = [self.brand_name] + self.brand_aliases
        for pattern in brand_patterns:
            if self._find_mention(response, pattern):
                analysis['mention_found'] = True
                analysis['mention_position'] = self._get_mention_position(response, pattern)
                analysis['context_type'] = self._classify_context(response, pattern)
                analysis['sentiment'] = self._analyze_sentiment(response, pattern)
                break
        
        # Competitor analysis
        for competitor in self.competitors:
            if self._find_mention(response, competitor):
                analysis['competitor_mentions'].append(competitor)
        
        return analysis

The analyzer uses regex patterns for mention detection and combines TextBlob sentiment analysis with LLM-based sentiment classification for more nuanced results.
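
The mention-detection helper itself is not shown above; a minimal version, assuming case-insensitive whole-word matching is the desired behavior, could be:

import re

def _find_mention(self, response: str, pattern: str) -> bool:
    """Return True if the brand (or alias) appears as a whole word, case-insensitively."""
    # re.escape guards against brand names that contain regex metacharacters
    return re.search(rf"\b{re.escape(pattern)}\b", response, re.IGNORECASE) is not None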

5. Metrics Calculator (metrics.py)

The metrics calculator generates aggregate insights across all LLMs:

class MetricsCalculator:
    def calculate_metrics(self, results: Dict[str, List[Dict]]) -> Dict[str, Any]:
        """Calculate comprehensive metrics across all LLMs"""
        metrics = {
            'per_llm_metrics': {},
            'cross_llm_metrics': {}
        }
        
        # Per-LLM metrics
        for llm_key, responses in results.items():
            mention_rate = sum(1 for r in responses if r['mention_found']) / len(responses)
            # `is not None` keeps legitimately neutral (0.0) sentiment scores in the average
            avg_sentiment = np.mean([r['sentiment'] for r in responses if r['sentiment'] is not None])
            
            metrics['per_llm_metrics'][llm_key] = {
                'mention_rate': mention_rate,
                'average_sentiment': avg_sentiment,
                'total_responses': len(responses),
                'mentions_found': sum(1 for r in responses if r['mention_found'])
            }
        
        # Cross-LLM comparative metrics
        metrics['cross_llm_metrics'] = {
            'consensus_score': self._calculate_consensus(results),
            'sentiment_alignment': self._calculate_sentiment_alignment(results),
            'mention_rate_variance': self._calculate_variance(results)
        }
        
        return metrics

These metrics provide both individual LLM performance and comparative analysis, helping users understand consistency across providers.
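
The consensus helper is referenced but not shown; one plausible definition, the fraction of prompts on which every LLM agrees about whether the brand was mentioned, is sketched below (the tool's actual definition may weight agreement differently):

def _calculate_consensus(self, results: Dict[str, List[Dict]]) -> float:
    """Fraction of prompts on which every LLM agrees about mention_found (sketch)."""
    llm_keys = list(results.keys())
    if not llm_keys:
        return 0.0
    num_prompts = len(results[llm_keys[0]])
    if num_prompts == 0:
        return 0.0
    agreements = 0
    for i in range(num_prompts):
        flags = {results[key][i]['mention_found'] for key in llm_keys}
        if len(flags) == 1:  # all LLMs gave the same yes/no answer
            agreements += 1
    return agreements / num_prompts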

Advanced Features

Intelligent Caching System

The disk-based caching system uses content hashing to ensure cache validity:

def _generate_cache_key(self, prompt: str, llm: LLMInterface) -> str:
    """Generate unique cache key for prompt + LLM configuration"""
    content = f"{prompt}_{llm.provider}_{llm.model}_{llm.temperature}_{llm.max_tokens}"
    return hashlib.md5(content.encode()).hexdigest()

This approach allows for fine-grained cache invalidation when LLM parameters change while maintaining cache hits for identical requests.
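
For example, two otherwise identical configurations that differ only in temperature hash to different keys, so neither can be served the other's cached response:

import hashlib

# Same prompt and model, different temperature: the keys differ, so a
# response generated at the old temperature is never reused
key_a = hashlib.md5("List analytics tools_openai_gpt-4_0.7_1000".encode()).hexdigest()
key_b = hashlib.md5("List analytics tools_openai_gpt-4_0.2_1000".encode()).hexdigest()
assert key_a != key_b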

Sentiment Analysis Hybrid Approach

The tool implements a hybrid sentiment analysis system:

def _analyze_sentiment(self, response: str, brand_mention: str) -> float:
    """Hybrid sentiment analysis using TextBlob + LLM"""
    # Extract context around brand mention
    context = self._extract_context(response, brand_mention, window=100)
    
    # TextBlob baseline
    textblob_sentiment = TextBlob(context).sentiment.polarity
    
    # LLM-based sentiment for nuanced analysis
    llm_sentiment = self._llm_sentiment_analysis(context)
    
    # Weighted combination
    return 0.7 * llm_sentiment + 0.3 * textblob_sentiment

This approach combines the speed of rule-based analysis with the nuanced understanding of LLM-based sentiment classification.
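
The LLM half of the score is produced by a helper that is not shown above. A minimal sketch, assuming a dedicated judge model is asked to return a single number in [-1, 1] (the judge_llm attribute here is hypothetical), might be:

def _llm_sentiment_analysis(self, context: str) -> float:
    """Ask a judge LLM for a sentiment score in [-1, 1]; sketch only."""
    # self.judge_llm is a hypothetical LLMInterface reserved for analysis calls
    prompt = (
        "Rate the sentiment toward the brand in the following text on a scale "
        "from -1 (very negative) to 1 (very positive). Reply with a single number only.\n\n"
        + context
    )
    reply = self.judge_llm.generate_response(prompt)
    try:
        return max(-1.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # fall back to neutral if the reply is not a plain number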

Context Classification

The system classifies brand mentions into specific contexts:

def _classify_context(self, response: str, brand_mention: str) -> str:
    """Classify the context of brand mention"""
    context_patterns = {
        'recommendation': [r'I recommend', r'suggest', r'should try'],
        'comparison': [r'compared to', r'versus', r'better than'],
        'example': [r'for example', r'such as', r'like'],
        'explanation': [r'is a', r'provides', r'offers']
    }
    
    mention_context = self._extract_context(response, brand_mention, window=50)
    
    for context_type, patterns in context_patterns.items():
        if any(re.search(pattern, mention_context, re.IGNORECASE) for pattern in patterns):
            return context_type
    
    return 'general'

This classification helps users understand not just whether their brand is mentioned, but how it’s being positioned.
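
Both the sentiment and context steps depend on _extract_context. Assuming it simply slices a character window around the first match, it can be as small as:

import re

def _extract_context(self, response: str, brand_mention: str, window: int = 50) -> str:
    """Return up to `window` characters on either side of the first mention."""
    match = re.search(re.escape(brand_mention), response, re.IGNORECASE)
    if not match:
        return ""
    start = max(0, match.start() - window)
    end = min(len(response), match.end() + window)
    return response[start:end]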

Integration and Output

Dashboard Integration

The tool generates dashboard-compatible JSON output:

def generate_dashboard_data(self, results: Dict[str, Any]) -> Dict[str, Any]:
    """Generate dashboard-compatible output"""
    dashboard_data = {
        'metadata': {
            'brand_name': self.brand_name,
            'evaluation_date': datetime.now().isoformat(),
            'llm_providers': list(results.keys())
        },
        'summary_metrics': {
            'overall_mention_rate': self._calculate_overall_mention_rate(results),
            'sentiment_distribution': self._calculate_sentiment_distribution(results),
            'consensus_score': self._calculate_consensus_score(results)
        },
        'detailed_results': results
    }
    
    return dashboard_data

Report Generation

The system also generates human-readable reports:

def generate_text_report(self, metrics: Dict[str, Any]) -> str:
    """Generate comprehensive text report"""
    report = []
    report.append(f"# Brand Evaluation Report: {self.brand_name}")
    report.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    report.append("")
    
    # Summary section
    report.append("## Executive Summary")
    report.append(f"- Overall mention rate: {metrics['overall_mention_rate']:.1%}")
    report.append(f"- Average sentiment: {metrics['average_sentiment']:.2f}")
    report.append(f"- LLM consensus: {metrics['consensus_score']:.1%}")
    
    return "\n".join(report)

Performance Considerations

Concurrent Processing

The tool implements concurrent LLM requests to improve performance:

async def execute_prompts_async(self, prompts: List[str]) -> Dict[str, List[str]]:
    """Execute prompts across multiple LLMs concurrently"""
    tasks = []
    
    for llm in self.llm_interfaces:
        for prompt in prompts:
            task = asyncio.create_task(self._execute_single_prompt(llm, prompt))
            tasks.append(task)
    
    results = await asyncio.gather(*tasks)
    return self._organize_results(results)
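
The per-prompt coroutine is not shown above. One way to write it, assuming a semaphore caps in-flight requests and the synchronous generate_response is pushed onto a worker thread (both assumptions, not confirmed by the source), is:

import asyncio

async def _execute_single_prompt(self, llm: LLMInterface, prompt: str) -> tuple:
    """Run one (llm, prompt) pair without blocking the event loop (sketch)."""
    # self._semaphore = asyncio.Semaphore(n) is assumed to be set in __init__
    # to cap the number of concurrent in-flight API requests
    async with self._semaphore:
        # generate_response is synchronous, so run it in a worker thread
        response = await asyncio.to_thread(llm.generate_response, prompt)
    return f"{llm.provider}_{llm.model}", prompt, response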

Rate Limiting

Built-in rate limiting prevents API quota exhaustion:

# retry, stop_after_attempt and wait_exponential come from the tenacity library
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def generate_response(self, prompt: str) -> str:
    """Generate response with retry logic and rate limiting"""
    # A fixed delay between calls keeps requests under provider rate limits
    time.sleep(self.rate_limit_delay)
    
    try:
        return self._make_api_call(prompt)
    except Exception as e:
        logger.warning(f"API call failed: {e}")
        raise

Conclusion

LLM Evaluator demonstrates how to build a sophisticated AI monitoring system that provides actionable insights for brand management. By combining multiple LLM providers, intelligent caching, and comprehensive analysis, the tool offers a robust solution for understanding brand representation in the age of AI.

The modular architecture ensures maintainability and extensibility, while the markdown-based configuration system makes it accessible to non-technical users. The comprehensive metrics and dashboard integration provide both high-level insights and detailed analysis for data-driven brand management decisions.

As LLMs become increasingly important in shaping consumer perceptions, tools like this will be essential for brands seeking to understand and optimize their AI-era presence.