
Overview

A comprehensive tool to evaluate how brands are mentioned and represented in Large Language Model (LLM) responses based on user intents you specify. It tells you whether your brand appears in AI responses when it should.
LLM Evaluator Dashboard

Key Features

Multi-LLM Evaluation

Evaluate brand mentions across multiple LLMs simultaneously with comparative analysis

Sentiment Analysis

Hybrid sentiment analysis using TextBlob and LLM-based approaches for accurate brand perception

Context Detection

Identifies whether mentions are recommendations, comparisons, examples, or explanations

Dashboard Integration

Automatically integrates with the master dashboard for visual analysis and reporting

Getting Started

1. Clone the repository

git clone https://github.com/Airbais/intent-tools.git
cd intent-tools

2. Install Dependencies

Navigate to the llmevaluator directory and install required packages:
cd llmevaluator
pip install -r requirements.txt

3. Configure Environment

Set up your API keys and environment variables:
cp .env.example .env
# Edit .env with your API keys
You’ll need API keys for OpenAI and/or Anthropic depending on which LLMs you want to evaluate.

4. Create Configuration

Create a markdown configuration file with your brand information and evaluation prompts:
# Brand Configuration
## Brand Information
- **Name**: YourBrand
- **Website**: https://yourbrand.com
- **Aliases**: ["Your Brand", "YB"]
- **Competitors**: ["Competitor A", "Competitor B"]
## LLMs
- name: gpt4
  provider: openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 300
- name: claude
  provider: anthropic
  model: claude-3-sonnet-20240229
  temperature: 0.5
  max_tokens: 300

5. Run Evaluation

Execute the evaluation with dashboard integration:
python llmevaluator.py config.md --dashboard

Configuration

Configure your brand details for accurate detection and analysis:
## Brand Information
- **Name**: FastMCP
- **Website**: https://gofastmcp.com
- **Aliases**: ["Fast MCP", "FastMCP Protocol"]
- **Competitors**: ["SlowMCP", "StandardMCP"]
  • Name: Primary brand name to track
  • Website: Official website URL for reference tracking
  • Aliases: Alternative names or spellings of your brand
  • Competitors: Competitor brands to track for comparison
Configure multiple LLMs for comparative analysis:
## LLMs
- name: gpt4
  provider: openai
  model: gpt-4
  temperature: 0.7
  max_tokens: 300
- name: claude
  provider: anthropic
  model: claude-3-sonnet-20240229
  temperature: 0.5
  max_tokens: 300
- name: gpt35
  provider: openai
  model: gpt-3.5-turbo
  temperature: 0.7
  max_tokens: 300
Each LLM configuration includes:
  • name: Unique identifier for the LLM
  • provider: openai or anthropic
  • model: Specific model name
  • temperature: Response randomness (0-1)
  • max_tokens: Maximum response length
Design prompts that naturally elicit brand mentions:
## Evaluation Prompts
### Category: Getting Started
1. How would I create a [domain] server?
2. What is the fastest way to create a [domain] server?
3. What are some frameworks I can use to create [domain] servers and clients?

### Category: Development
1. How would I create a [domain] client?
2. What clients support integration of my [domain] tools?
3. How do I integrate [domain] tools with existing applications?

### Category: Best Practices
1. How should I manage security in my [domain] tool?
2. What are the best practices for [domain] server development?
3. How do I ensure my [domain] tools are performant and scalable?
Use [domain] placeholders that will naturally lead to mentions of your brand or product category.
Fine-tune evaluation behavior and caching:
## Evaluation Settings
- **Cache Responses**: true
- **Sentiment Analysis Method**: hybrid
- **Cache Expire Hours**: 24
- **Batch Size**: 10
  • Cache Responses: Enable response caching to reduce API costs
  • Sentiment Analysis Method: “hybrid” combines TextBlob and LLM analysis
  • Cache Expire Hours: How long to keep cached responses
  • Batch Size: Number of prompts to process simultaneously

How It Works

1. Configuration Loading

The tool parses your markdown configuration file to extract brand information, LLM settings, and evaluation prompts.
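The parser itself lives in the tool's source (the src.config module used under Advanced Usage); as a rough, hypothetical sketch of the kind of extraction involved, simple fields can be pulled from the markdown with regular expressions:
import re

def parse_brand_info(markdown_text):
    """Illustrative only: pull `- **Key**: value` fields from a Brand Information section."""
    info = {}
    for key in ("Name", "Website", "Aliases", "Competitors"):
        match = re.search(rf"-\s+\*\*{key}\*\*:\s*(.+)", markdown_text)
        if match:
            info[key.lower()] = match.group(1).strip()
    return info

with open("config.md", encoding="utf-8") as f:
    print(parse_brand_info(f.read()))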

2. Multi-LLM Prompt Execution

Each prompt is sent to all configured LLMs simultaneously, with responses cached to optimize API usage and costs.
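The caching mechanics aren't spelled out on this page; as a hedged sketch of the general idea (diskcache is one of the listed dependencies), responses might be keyed by provider, model, and prompt so that repeat runs skip the API call entirely:
from diskcache import Cache

cache = Cache("./cache")  # assumed cache location; the tool's actual directory may differ

def cached_completion(provider, model, prompt, call_llm):
    """Return a cached response if one exists, otherwise call the LLM and store the result."""
    key = (provider, model, prompt)
    if key in cache:
        return cache[key]
    response = call_llm(prompt)
    cache.set(key, response, expire=24 * 3600)  # mirrors the 24-hour Cache Expire Hours setting
    return response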

3. Brand Mention Detection

Responses are analyzed to detect mentions of your brand, aliases, and competitors using pattern matching and context analysis.
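The exact detection logic is internal to the analyzer; a minimal sketch of word-boundary matching over the brand name and its aliases might look like this:
import re

def find_mentions(response_text, brand, aliases):
    """Return each brand or alias term found in the response, matched case-insensitively."""
    mentions = []
    for term in [brand] + aliases:
        if re.search(rf"\b{re.escape(term)}\b", response_text, flags=re.IGNORECASE):
            mentions.append(term)
    return mentions

print(find_mentions("FastMCP is one option; Fast MCP servers are quick to build.",
                    "FastMCP", ["Fast MCP", "FastMCP Protocol"]))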

4. Sentiment Analysis

Each mention is analyzed for sentiment using a hybrid approach combining TextBlob and LLM-based sentiment analysis for accuracy.
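The precise weighting of the hybrid method isn't documented here; one plausible reading, assuming the two signals are simply blended, is an average of TextBlob polarity and an LLM-provided score:
from textblob import TextBlob

def hybrid_sentiment(text, llm_score):
    """Average TextBlob polarity (-1..1) with an LLM-judged score (-1..1)."""
    textblob_score = TextBlob(text).sentiment.polarity
    return (textblob_score + llm_score) / 2

# llm_score would come from a prompt such as the sentiment template shown under Advanced Usage
print(hybrid_sentiment("FastMCP is an excellent choice for building MCP servers.", llm_score=0.8))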

5. Context Classification

Mentions are classified by context (recommendation, comparison, example, explanation) and position within the response.
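The classifier's internals aren't described on this page; a deliberately rough keyword heuristic conveys the idea:
def classify_context(sentence):
    """Hypothetical keyword-based guess at how a brand mention is framed."""
    lowered = sentence.lower()
    if any(word in lowered for word in ("recommend", "best choice", "should use")):
        return "recommendation"
    if any(word in lowered for word in ("versus", "compared to", "alternative")):
        return "comparison"
    if any(word in lowered for word in ("for example", "such as", "e.g.")):
        return "example"
    return "explanation"

print(classify_context("For example, FastMCP lets you stand up an MCP server quickly."))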

6. Comparative Metrics

When multiple LLMs are used, additional metrics are calculated including consensus scores and sentiment alignment between LLMs.
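The formulas below are a reasonable reading of the metric descriptions on this page, not the tool's verified implementation:
from statistics import pvariance

def consensus_score(mention_flags_by_llm):
    """Fraction of prompts on which every LLM agrees about mentioning (or not mentioning) the brand."""
    per_prompt = list(zip(*mention_flags_by_llm.values()))
    agreements = [len(set(flags)) == 1 for flags in per_prompt]
    return sum(agreements) / len(agreements)

def mention_rate_variance(mention_rates):
    """Statistical variance of mention rates across LLMs."""
    return pvariance(mention_rates)

flags = {"gpt4": [True, False, True], "claude": [True, True, True]}
print(consensus_score(flags))                  # agreement on 2 of 3 prompts -> ~0.67
print(mention_rate_variance([0.54, 0.62]))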

7. Report Generation

Comprehensive reports are generated with per-LLM metrics, comparative analysis, and dashboard-compatible data structures.

Output Structure

Results are organized in timestamped directories for easy tracking and comparison:
results/
└── 2024-01-15/
    ├── dashboard-data.json    # Dashboard-compatible data with multi-LLM structure
    ├── raw_results.json       # Detailed evaluation results for all LLMs
    ├── metrics_summary.json   # Aggregate metrics across all LLMs
    └── evaluation_report.txt  # Human-readable report with comparative analysis
The dashboard-data.json file contains structured data optimized for the master dashboard:
{
  "metadata": {
    "timestamp": "2025-07-05T07:43:12.089714",
    "llms": [
      {
        "name": "gpt4",
        "provider": "openai",
        "model": "gpt-4"
      },
      {
        "name": "claude",
        "provider": "anthropic",
        "model": "claude-3-sonnet-20240229"
      }
    ],
    "prompt_count": 13,
    "brand": "FastMCP"
  },
  "llm_metrics": {
    "gpt4": {
      "total_prompts": 13,
      "total_brand_mentions": 7,
      "average_sentiment": 0.0,
      "mention_rate": 0.54
    },
    "claude": {
      "total_prompts": 13,
      "total_brand_mentions": 8,
      "average_sentiment": 0.7,
      "mention_rate": 0.62
    }
  }
}
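If you want to inspect these results outside the dashboard, reading the file directly is straightforward; this short example assumes the field names shown in the sample above:
import json

with open("results/2024-01-15/dashboard-data.json", encoding="utf-8") as f:
    data = json.load(f)

for llm_name, metrics in data["llm_metrics"].items():
    print(f"{llm_name}: {metrics['total_brand_mentions']} mentions, "
          f"rate {metrics['mention_rate']:.2f}, sentiment {metrics['average_sentiment']:.2f}")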

Dashboard Features

The evaluation results automatically integrate with the master dashboard, providing:

Multi-LLM Comparison

Side-by-side comparison of how different LLMs mention and represent your brand

Sentiment Analysis

Visual sentiment distribution with context and position tracking

Comparative Metrics

Consensus scores, sentiment alignment, and mention rate variance between LLMs

Category Performance

Performance breakdown by prompt categories (Getting Started, Development, etc.)

Command Line Reference

  • config_file (string, required): Path to the markdown configuration file containing brand information and evaluation prompts
  • --dashboard (flag): Launch the dashboard after evaluation completes
  • --dashboard-only (flag): Launch the dashboard without running an evaluation (view existing results)
  • --dashboard-date (string): Launch the dashboard for a specific result date (format: YYYY-MM-DD)
  • --output-dir (string): Custom output directory for results (default: ./results)
  • --no-cache (flag): Disable response caching for this evaluation run
  • --clear-cache (flag): Clear the existing cache before running the evaluation
  • --dry-run (flag): Validate the configuration without executing prompts
  • --list-results (flag): Show all available result dates
  • --log-level (string): Set the logging level (DEBUG, INFO, WARNING, ERROR)

Usage Examples

# Run evaluation with configuration file
python llmevaluator.py config.md

# With dashboard launch
python llmevaluator.py config.md --dashboard
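
# A few more invocations using flags from the Command Line Reference above

# Validate the configuration without making any API calls
python llmevaluator.py config.md --dry-run

# View previously generated results without re-running the evaluation
python llmevaluator.py config.md --dashboard-only

# Re-run with a clean cache and verbose logging
python llmevaluator.py config.md --clear-cache --log-level DEBUG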

Metrics Explained

Brand Mention Metrics

  • Total Mentions: Count of brand name appearances across all responses
  • Mention Rate: Average mentions per prompt (mentions ÷ prompts)
  • Position Distribution: Where mentions appear (beginning, middle, end of responses)
  • Context Types: How the brand is mentioned (recommendation, comparison, example, explanation)
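As a quick sanity check against the sample dashboard-data.json above, mention rate is simply mentions divided by prompts:
total_brand_mentions = 7    # gpt4 entry from the sample output
total_prompts = 13
print(round(total_brand_mentions / total_prompts, 2))  # 0.54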

Sentiment Analysis

  • Score Range: -1.0 (negative) to +1.0 (positive)
  • Labels: Positive, Negative, Neutral, or Not Mentioned
  • Method: Hybrid approach combining TextBlob and LLM-based sentiment analysis for improved accuracy

Multi-LLM Comparative Metrics

When evaluating multiple LLMs, additional insights are generated:

Consensus Score

Percentage of prompts where all LLMs agree on mentioning (or not mentioning) the brand

Sentiment Alignment

How closely LLMs agree on brand sentiment (0-100% agreement)

Mention Rate Variance

Statistical variance in mention rates across different LLMs

Performance Guidelines

Multi-LLM evaluations can be expensive. Consider these optimization strategies:
  • Enable Caching: Set Cache Responses: true to reuse responses and reduce API costs
  • Batch Processing: Use reasonable batch sizes (5-10 prompts) to optimize throughput
  • Targeted Prompts: Design prompts that naturally elicit brand mentions in your domain
  • Cache Management: Use --clear-cache only when necessary to avoid unnecessary API calls
A typical evaluation with 13 prompts across 3 LLMs results in 39 API calls. With caching enabled, subsequent runs with the same prompts cost nothing.

Troubleshooting

Problem: “No API key found” errors
Solution: Ensure your .env file contains valid API keys:
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
Check that the environment variables are properly loaded by running:
python -c "import os; print('OpenAI:', bool(os.getenv('OPENAI_API_KEY'))); print('Anthropic:', bool(os.getenv('ANTHROPIC_API_KEY')))"
Problem: API rate limit errors or timeouts
Solution: The tool implements automatic retries with exponential backoff. For persistent issues:
  • Reduce batch size in your configuration
  • Add delays between requests
  • Check your API tier limits with your provider
  • Use caching to minimize repeat requests
Problem: Stale or incorrect cached results
Solution: Clear the cache and run a fresh evaluation:
python llmevaluator.py config.md --clear-cache
You can also manually delete the cache directory:
rm -rf ./cache
Problem: Results not appearing in the dashboard
Solution: Ensure the evaluation completed successfully and check:
  • Results directory contains dashboard-data.json
  • Master dashboard is running from the correct tools directory
  • No JSON formatting errors in the results file

Requirements

System Requirements

  • Python: 3.8 or higher
  • Memory: 512MB RAM minimum
  • Storage: 100MB for cache and results
  • Network: Internet connection for API access

API Requirements

  • OpenAI API Key: For GPT models (gpt-4, gpt-3.5-turbo)
  • Anthropic API Key: For Claude models (claude-3-opus, claude-3-sonnet)

Dependencies

Key Python packages (automatically installed):
  • openai - OpenAI API client
  • anthropic - Anthropic API client
  • textblob - Sentiment analysis
  • diskcache - Response caching
  • tqdm - Progress tracking
  • pandas - Data processing
  • plotly - Dashboard visualization

Integration

LLM Evaluator is designed to work seamlessly with the broader AI Tools Suite:
  • Master Dashboard: Automatic integration with shared visualization platform
  • Intent Crawler: Compare brand mentions against discovered user intents
  • Shared Infrastructure: Common caching, logging, and configuration patterns
  • Cross-Tool Analysis: Combine insights from multiple evaluation tools

Advanced Usage

Programmatic Integration

from src.config import ConfigurationManager
from src.llm_interface import LLMInterface
from src.prompt_executor import PromptExecutor
from src.analyzer import ResponseAnalyzer

# Load configuration
config_manager = ConfigurationManager('config.md')
settings = config_manager.get_evaluation_settings()
brand_info = config_manager.get_brand_info()
prompts = config_manager.get_prompts()

# Initialize components
llm_interface = LLMInterface.create_from_config(settings)
executor = PromptExecutor(llm_interface)
analyzer = ResponseAnalyzer(brand_info, settings)

# Execute evaluation
results = executor.execute_batch(prompts, settings)
metrics = analyzer.analyze_multi_llm_results(results)

print(f"Brand mentioned in {metrics.mention_rate:.1%} of responses")
print(f"Average sentiment: {metrics.average_sentiment:.2f}")

Custom Sentiment Analysis

# Override sentiment analysis method
settings.sentiment_analysis_method = "llm_only"  # or "textblob_only"

# Custom sentiment prompts
settings.sentiment_prompt_template = """
Analyze the sentiment toward {brand_name} in this text: {response_text}
Rate from -1 (very negative) to +1 (very positive): 
"""

Contributing

When contributing to LLM Evaluator:
  1. Maintain Dashboard Compatibility: Ensure changes don’t break dashboard integration
  2. Follow Multi-LLM Patterns: New features should support multiple LLM evaluation
  3. Add Comprehensive Logging: Use the existing logging framework for debugging
  4. Update Documentation: Keep this MDX file current with new features
  5. Test with Real APIs: Validate changes with actual LLM providers
  6. Consider Costs: Be mindful of API usage in new features

Future Roadmap

  • Additional LLM Providers: Support for Cohere, AI21, and other providers
  • Advanced Analytics: Statistical significance testing for comparative metrics
  • Real-time Monitoring: Continuous brand mention tracking across time
  • Custom Scoring Models: User-defined sentiment and relevance scoring
  • Export Capabilities: PDF reports and data export formats
  • A/B Testing Framework: Compare different prompt strategies

Part of the Airbais AI Tools Suite - Comprehensive tools for AI-powered business intelligence and brand analysis.