How We Rank AI Agents

Real benchmarks. Real data. No bullshit.

Our Approach

TopClanker aggregates scores from established, peer-reviewed benchmarks published by the AI research community. We don't make up numbers. Every score links back to published results.

Think of it like Rotten Tomatoes for AI:

  • 🎯 Benchmark Score (our "Tomatometer") - weighted aggregate of published academic benchmarks
  • 🍿 Community Score (our "Popcorn Meter") - user voting (coming soon)

Benchmarks We Use

Reasoning Models

MMLU (Massive Multitask Language Understanding)

57 subjects spanning STEM, the humanities, and the social sciences, with multiple-choice questions ranging from elementary to professional difficulty.

Source: Hendrycks et al., 2021

Weight in category: 40%

GPQA (Graduate-Level Google-Proof Q&A)

Graduate-level questions in physics, biology, and chemistry, drawn from the hardest (Diamond) subset. Tests expert-level scientific reasoning.

Source: Rein et al., 2023

Weight in category: 30%

LMSYS Chatbot Arena

Real-world human preference ranking via blind pairwise comparisons, based on more than one million votes.

Source: LMSYS Org

Weight in category: 30%

Math Models

GSM8K (Grade School Math 8K)

8,500 grade-school math word problems. Tests multi-step reasoning and arithmetic.

Source: Cobbe et al., 2021

Weight in category: 40%

MATH

12,500 competition mathematics problems with step-by-step solutions. Tests advanced mathematical reasoning.

Source: Hendrycks et al., 2021 (MATH dataset)

Weight in category: 40%

AIME (Math Competition)

American Invitational Mathematics Examination problems. High-school competition level.

Source: Official AIME problem sets (Mathematical Association of America)

Weight in category: 20%

Research Models

MMLU (General Knowledge)

The same benchmark used in Reasoning, weighted here for breadth of knowledge.

Weight in category: 35%

MMMU (Massive Multi-discipline Multimodal Understanding)

Multimodal questions requiring visual reasoning and document understanding.

Source: Yue et al., 2023

Weight in category: 30%

Citation Accuracy

Manual testing of fact-checking and source attribution.

Source: Internal testing

Weight in category: 35%

Learning/Coding Models

HumanEval

164 hand-written programming problems. Tests code generation and correctness.

Source: Chen et al., 2021

Weight in category: 40%

SWE-bench Verified

Real-world software engineering tasks drawn from GitHub issues. Tests the ability to fix bugs and produce patches that pass the project's test suite.

Source: Jimenez et al., 2023 (Verified subset curated by OpenAI)

Weight in category: 40%

Adaptive Performance

Testing context retention and learning from feedback.

Source: Internal testing

Weight in category: 20%
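
Putting the four categories together, the per-benchmark weights above amount to a configuration like the following. This is a hypothetical representation for illustration, not TopClanker's actual data format.

```typescript
// Hypothetical representation of the category weights listed above;
// TopClanker's real data format may differ.
const CATEGORY_WEIGHTS: Record<string, Record<string, number>> = {
  Reasoning: { MMLU: 0.40, GPQA: 0.30, "Chatbot Arena": 0.30 },
  Math: { GSM8K: 0.40, MATH: 0.40, AIME: 0.20 },
  Research: { MMLU: 0.35, MMMU: 0.30, "Citation Accuracy": 0.35 },
  "Learning/Coding": {
    HumanEval: 0.40,
    "SWE-bench Verified": 0.40,
    "Adaptive Performance": 0.20,
  },
};

// Sanity check: each category's weights should sum to 1.0
for (const [category, weights] of Object.entries(CATEGORY_WEIGHTS)) {
  const total = Object.values(weights).reduce((sum, w) => sum + w, 0);
  console.assert(Math.abs(total - 1) < 1e-9, `${category} weights must sum to 1`);
}
```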

Scoring Formula

Category Score Calculation

Category Score = Σ (Benchmark Score × Weight)

Example for Reasoning:

Reasoning Score = (MMLU × 0.40) + (GPQA × 0.30) + (Arena Elo × 0.30)
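
For readers who prefer code, here is a minimal sketch of that aggregation in TypeScript. The type and function names are illustrative rather than TopClanker's actual code, the example scores are made-up numbers, and we assume every benchmark result has already been normalized to a common 0-100 scale (Arena Elo, for instance, is not a percentage out of the box).

```typescript
// Minimal sketch of the category-score aggregation described above.
// Names, types, and the example scores are illustrative only.

type BenchmarkResult = {
  benchmark: string;
  score: number;  // assumed already normalized to a 0-100 scale
  weight: number; // weight within the category, e.g. 0.40 for MMLU in Reasoning
};

// Category Score = Σ (Benchmark Score × Weight)
function categoryScore(results: BenchmarkResult[]): number {
  return results.reduce((sum, r) => sum + r.score * r.weight, 0);
}

// Reasoning example using the weights listed above (scores are made up)
const reasoning: BenchmarkResult[] = [
  { benchmark: "MMLU", score: 86.4, weight: 0.40 },
  { benchmark: "GPQA", score: 59.5, weight: 0.30 },
  { benchmark: "Arena Elo (normalized)", score: 82.0, weight: 0.30 },
];

console.log(categoryScore(reasoning).toFixed(1)); // "77.0"
```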

Overall Score (Coming Soon)

The overall TopClanker score will average across all categories where a model has been evaluated, with additional weighting (sketched in code after this list) for:

  • Privacy rating (+5% for high privacy)
  • Open source (+3% bonus for open models)
  • Recency (newer benchmarks weighted slightly higher)
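
Since this feature hasn't shipped, treat the sketch below as hypothetical rather than a spec. It assumes the privacy and open-source bonuses act as simple multiplicative boosts on the category average, that the result is capped at 100, and that recency is handled earlier, when individual benchmark results are weighted.

```typescript
// Hypothetical sketch of the planned overall score; the shipped formula may differ.
// Assumes the +5% / +3% bonuses multiply the average of evaluated category scores.

type ModelProfile = {
  categoryScores: number[]; // only categories where the model has been evaluated
  highPrivacy: boolean;     // earns the +5% privacy bonus
  openSource: boolean;      // earns the +3% open-source bonus
};

function overallScore(m: ModelProfile): number {
  const avg =
    m.categoryScores.reduce((sum, s) => sum + s, 0) / m.categoryScores.length;

  let bonus = 1.0;
  if (m.highPrivacy) bonus += 0.05;
  if (m.openSource) bonus += 0.03;

  // Recency is assumed to be applied when individual benchmark results are
  // weighted, so it does not appear as a separate term here.
  return Math.min(100, avg * bonus); // assumed cap at 100
}

// Example: a high-privacy open model averaging 80 across two categories
const example: ModelProfile = {
  categoryScores: [78, 82],
  highPrivacy: true,
  openSource: true,
};
console.log(overallScore(example).toFixed(1)); // "86.4" (80 × 1.08)
```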

Privacy Rating

Privacy ratings are based on publicly available information about data handling:

High Privacy

No training on user data, clear data retention policies, GDPR compliant, allows data deletion.

Examples: Claude (does not train on user data), most open-source models (when self-hosted)

Medium Privacy

May train on user data with opt-out, 30-day retention, some data sharing with partners.

Examples: GPT-4 (opt-out available), Gemini (Google integration)

Low Privacy

Trains on user data by default, unclear retention, extensive data collection.

Note: Very few major models fall into this category as of 2024.

Update Schedule

  • Monthly: Update with new published benchmark results
  • Immediately: Add new models when major releases occur
  • Quarterly: Review and adjust category weights based on community feedback
  • As needed: Methodology updates (with transparency reports)

Data Sources

All benchmark data comes from:

  • Official model release papers and technical reports
  • LMSYS Chatbot Arena leaderboard (updated continuously)
  • Papers with Code leaderboards
  • Hugging Face Open LLM Leaderboard
  • Independent third-party evaluations (when available)

Every score on TopClanker includes a source link to the original benchmark data.

Our Commitments

No paid placements: Rankings are based solely on benchmark performance.

Open methodology: This page explains exactly how we calculate scores.

Source everything: Every claim links to published research.

Update transparently: Methodology changes are documented and dated.

Community input: User voting will complement (not replace) benchmark scores.

Limitations & Caveats

We're transparent about what our rankings can and can't tell you:

  • Benchmarks aren't perfect: They test specific capabilities, not all real-world performance.
  • Scores change: Models get updated, new benchmarks emerge. Our rankings reflect current data.
  • Context matters: The "best" model depends on your use case. A high overall score doesn't mean it's best for YOUR task.
  • Gaming is possible: Labs can optimize for benchmarks. We try to use diverse tests to minimize this.
  • Not all models are tested equally: Some models have more published benchmarks than others.

Questions or Feedback?

Think we're missing an important benchmark? Disagree with our weighting? Found an error?

Email us: [email protected]