How We Rank AI Agents

Real benchmarks. Real data. No bullshit.

Our Approach

TopClanker aggregates scores from established, peer-reviewed benchmarks published by the AI research community. We don't make up numbers. Every score links back to published results.

Think of it like Rotten Tomatoes for AI:

  • 🎯 Benchmark Score (our "Tomatometer"): a weighted aggregate of published academic benchmarks
  • 🍿 Community Score (our "Popcorn Meter"): user voting (coming soon)

Benchmarks We Use

Reasoning Models

MMLU (Massive Multitask Language Understanding)

57 subjects covering STEM, humanities, and social sciences. Multiple-choice questions range from elementary to professional level.

Source: Hendrycks et al., 2021

Weight in category: 40%

GPQA (Graduate-Level Google-Proof Q&A)

Graduate-level questions in biology, physics, and chemistry, drawn from the hardest (Diamond) subset. Tests expert-level reasoning.

Source: Rein et al., 2023

Weight in category: 30%

LMSYS Chatbot Arena

Real-world human preference ranking via blind pairwise comparisons, based on over a million votes.

Source: LMSYS Org

Weight in category: 30%

Math Models

GSM8K (Grade School Math 8K)

8,500 grade-school-level math word problems. Tests multi-step reasoning and arithmetic.

Source: Cobbe et al., 2021

Weight in category: 40%

MATH

12,500 competition mathematics problems with step-by-step solutions. Tests advanced mathematical reasoning.

Source: Hendrycks et al., 2021

Weight in category: 40%

AIME (Math Competition)

American Invitational Mathematics Examination problems. High-school competition level.

Weight in category: 20%

Research Models

MMLU (General Knowledge)

The same benchmark as in Reasoning, weighted here for breadth of knowledge.

Weight in category: 35%

MMMU (Massive Multi-discipline Multimodal Understanding)

Multimodal questions requiring visual reasoning and document understanding.

Source: Yue et al., 2024

Weight in category: 30%

Citation Accuracy

Manual testing of fact-checking and source attribution.

Weight in category: 35%

Learning/Coding Models

HumanEval

164 hand-written programming problems. Tests code generation and correctness.

Source: Chen et al., 2021

Weight in category: 40%

SWE-bench Verified

Real-world software engineering tasks drawn from actual GitHub issues. Tests the ability to fix bugs and write production code.

Source: Jimenez et al., 2024 (Verified subset curated by OpenAI)

Weight in category: 40%

Adaptive Performance

Testing context retention and learning from feedback.

Weight in category: 20%
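
Put together, the weights above form a small configuration table. Here is a minimal sketch of how it can be encoded, with a sanity check that each category's weights sum to 100% (illustrative Python, not our production code):

# Category weights as listed above; the names are illustrative.
CATEGORY_WEIGHTS = {
    "reasoning": {"MMLU": 0.40, "GPQA": 0.30, "Arena Elo": 0.30},
    "math": {"GSM8K": 0.40, "MATH": 0.40, "AIME": 0.20},
    "research": {"MMLU": 0.35, "MMMU": 0.30, "Citation Accuracy": 0.35},
    "coding": {"HumanEval": 0.40, "SWE-bench Verified": 0.40, "Adaptive Performance": 0.20},
}

# Sanity check: each category's weights must sum to 1.0 (100%).
for category, weights in CATEGORY_WEIGHTS.items():
    assert abs(sum(weights.values()) - 1.0) < 1e-9, category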

Scoring Formula

Category Score Calculation

Category Score = Σ (Benchmark Score × Weight)

Example for Reasoning:

= (MMLU × 0.40) + (GPQA × 0.30) + (Arena Elo × 0.30)
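
In code, the calculation is a plain weighted sum. Here is a minimal sketch with hypothetical numbers; it assumes every input, including Arena Elo (which is not natively a 0-100 value), has first been rescaled to a common 0-100 range:

def category_score(scores, weights):
    """Weighted sum of benchmark scores, each assumed to be on a 0-100 scale."""
    return sum(scores[name] * weight for name, weight in weights.items())

# Hypothetical numbers, for illustration only:
reasoning = category_score(
    scores={"MMLU": 86.0, "GPQA": 60.0, "Arena Elo": 75.0},  # Elo already rescaled
    weights={"MMLU": 0.40, "GPQA": 0.30, "Arena Elo": 0.30},
)
print(round(reasoning, 1))  # (86 * 0.40) + (60 * 0.30) + (75 * 0.30) = 74.9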

Overall Score

The overall score starts from the weighted benchmark aggregate, then applies a few modifiers (sketched in code after this list):

  • Privacy rating (+5% for high privacy)
  • Open source (+3% bonus for open models)
  • Recency (newer benchmarks weighted slightly higher)
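
Roughly, those modifiers compose like this. An illustrative sketch, not our production code: it reads the percentages as multiplicative bonuses, caps the result at 100, and leaves out the recency adjustment because that is applied to the benchmark weights themselves:

def overall_score(benchmark_score, high_privacy, open_source):
    """Apply the bonus modifiers above to the aggregated benchmark score."""
    score = benchmark_score
    if high_privacy:
        score *= 1.05  # +5% for high privacy
    if open_source:
        score *= 1.03  # +3% bonus for open models
    return min(score, 100.0)  # cap at 100; the cap is illustrative

print(round(overall_score(74.9, high_privacy=True, open_source=False), 1))  # 78.6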

Privacy Rating

High Privacy

No training on user data, clear data retention policies, GDPR compliant, allows data deletion.

Medium Privacy

May train on user data with opt-out, 30-day retention, some data sharing with partners.

Low Privacy

Trains on user data by default, unclear retention, extensive data collection.
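
The tiers boil down to a few questions about a model's data practices. A rough sketch, assuming four boolean flags capture the criteria above (the real assessment weighs more factors than this):

def privacy_tier(trains_on_user_data, has_opt_out, clear_retention, gdpr_compliant):
    """Roughly map a model's data practices to one of the three tiers above."""
    if not trains_on_user_data and clear_retention and gdpr_compliant:
        return "High"
    if has_opt_out:
        return "Medium"
    return "Low"

print(privacy_tier(trains_on_user_data=True, has_opt_out=True,
                   clear_retention=False, gdpr_compliant=False))  # Medium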

Update Schedule

  • Monthly: Update with new published benchmark results
  • Immediately: Add new models when major releases occur
  • Quarterly: Review and adjust category weights based on community feedback

Data Sources

  • Official model release papers and technical reports
  • LMSYS Chatbot Arena leaderboard (updated continuously)
  • Papers with Code leaderboards
  • Hugging Face Open LLM Leaderboard
  • Independent third-party evaluations (when available)

Our Commitments

No paid placements: Rankings are based solely on benchmark performance.

Open methodology: This page explains exactly how we calculate scores.

Source everything: Every claim links to published research.

Community input: User voting will complement (not replace) benchmark scores.

Limitations & Caveats

  • Benchmarks aren't perfect: They test specific capabilities, not all real-world performance.
  • Scores change: Models get updated, new benchmarks emerge.
  • Context matters: The "best" model depends on your use case.
  • Gaming is possible: Labs can optimize for benchmarks. We use diverse tests to minimize this.

Questions or Feedback?

Think we're missing an important benchmark? Disagree with our weighting? Found an error?

Email us: rankings@topclanker.com