Our Approach
TopClanker aggregates scores from established, peer-reviewed benchmarks published by the AI research community. We don't make up numbers. Every score links back to published results.
Think of it like Rotten Tomatoes for AI:
- 🎯 Benchmark Score: our "Tomatometer" - a weighted aggregate of published academic benchmarks
- 🍿 Community Score: our "Popcorn Meter" - user voting (coming soon)
Benchmarks We Use
Reasoning Models
MMLU (Massive Multitask Language Understanding)
57 subjects covering STEM, humanities, and social sciences. Multiple-choice questions from elementary to professional level.
Source: Hendrycks et al., 2021
Weight in category: 40%
GPQA (Graduate-Level Google-Proof Q&A)
Graduate-level questions in biology, physics, and chemistry (Diamond subset). Tests expert-level reasoning.
Source: Rein et al., 2023
Weight in category: 30%
LMSYS Chatbot Arena
Real-world human preference ranking via blind pairwise comparisons. Over 1 million votes cast.
Source: LMSYS Org
Weight in category: 30%
Math Models
GSM8K (Grade School Math 8K)
8,500 grade-school math word problems. Tests multi-step reasoning and arithmetic.
Source: Cobbe et al., 2021
Weight in category: 40%
MATH
12,500 competition mathematics problems with step-by-step solutions. Tests advanced mathematical reasoning.
Source: Hendrycks et al., 2021
Weight in category: 40%
AIME (Math Competition)
American Invitational Mathematics Examination problems. High-school competition level.
Source: MAA competition problems
Weight in category: 20%
Research Models
MMLU (General Knowledge)
Same benchmark as in the Reasoning category, weighted here for breadth of knowledge.
Weight in category: 35%
MMMU (Massive Multi-discipline Multimodal Understanding)
Multimodal questions requiring visual reasoning and document understanding.
Source: Yue et al., 2024
Weight in category: 30%
Citation Accuracy
Manual testing of fact-checking and source attribution.
Source: Internal testing
Weight in category: 35%
Learning/Coding Models
HumanEval
164 hand-written programming problems. Tests code generation and correctness.
Source: Chen et al., 2021
Weight in category: 40%
SWE-bench Verified
Real-world software engineering tasks drawn from GitHub issues. Tests the ability to fix bugs and write production code.
Source: Jimenez et al., 2024
Weight in category: 40%
Adaptive Performance
Tests context retention and learning from feedback.
Source: Internal testing
Weight in category: 20%
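Taken together, the per-category weights above amount to a small configuration table. Here is a minimal sketch of how they could be written down; the structure and names are illustrative rather than TopClanker's actual code, and only the weights come from this page:

```python
# Benchmark weights per category, copied from the lists above.
# Structure and naming are illustrative only.
CATEGORY_WEIGHTS = {
    "reasoning": {"MMLU": 0.40, "GPQA": 0.30, "Arena Elo": 0.30},
    "math": {"GSM8K": 0.40, "MATH": 0.40, "AIME": 0.20},
    "research": {"MMLU": 0.35, "MMMU": 0.30, "Citation Accuracy": 0.35},
    "learning_coding": {"HumanEval": 0.40, "SWE-bench Verified": 0.40,
                        "Adaptive Performance": 0.20},
}

# Each category's weights sum to 1.0, so a category score stays on the
# same scale as the underlying benchmark scores.
for category, weights in CATEGORY_WEIGHTS.items():
    assert abs(sum(weights.values()) - 1.0) < 1e-9, category
```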
Scoring Formula
Category Score Calculation
Category Score = Σ (Benchmark Score × Weight)
Example for Reasoning:
= (MMLU × 0.40) + (GPQA × 0.30) + (Arena Elo × 0.30)
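A minimal sketch of that calculation in Python, assuming every benchmark score has already been normalized to a common 0-100 scale (this page does not say how Arena Elo is normalized, and the numbers in the example are made up):

```python
def category_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted aggregate of benchmark scores for one category.

    Assumes every score is on the same 0-100 scale and that a score exists
    for every weighted benchmark; how missing scores or Arena Elo
    normalization are handled is not specified on this page.
    """
    return sum(scores[name] * weight for name, weight in weights.items())

# Reasoning example with hypothetical benchmark scores.
reasoning = category_score(
    scores={"MMLU": 86.0, "GPQA": 59.0, "Arena Elo": 95.0},   # made-up values
    weights={"MMLU": 0.40, "GPQA": 0.30, "Arena Elo": 0.30},  # weights from above
)
print(round(reasoning, 1))  # (86.0 * 0.40) + (59.0 * 0.30) + (95.0 * 0.30) = 80.6
```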
Overall Score (Coming Soon)
The overall TopClanker score will average across all categories where a model has been evaluated, with additional weighting for the following (a rough sketch appears after the list):
- Privacy rating (+5% for high privacy)
- Open source (+3% bonus for open models)
- Recency (newer benchmarks weighted slightly higher)
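Because the overall score is still marked "coming soon", the exact formula is not published. The sketch below shows one possible reading, treating the privacy and open-source bonuses as multiplicative adjustments and omitting the recency weighting (no figure is given for it); everything except the +5% and +3% values is an assumption:

```python
def overall_score(category_scores: dict[str, float],
                  high_privacy: bool = False,
                  open_source: bool = False) -> float:
    """One possible reading of the planned overall TopClanker score.

    Averages the categories a model has actually been evaluated in, then
    applies the bonuses listed above. Treating the bonuses as multipliers
    is an assumption, not a published rule.
    """
    if not category_scores:
        raise ValueError("model has not been evaluated in any category")
    base = sum(category_scores.values()) / len(category_scores)
    bonus = 1.0
    if high_privacy:
        bonus += 0.05  # +5% for a high privacy rating
    if open_source:
        bonus += 0.03  # +3% for open models
    return base * bonus

# Hypothetical open model with a high privacy rating, evaluated in two categories.
print(round(overall_score({"reasoning": 80.6, "math": 74.0},
                          high_privacy=True, open_source=True), 1))  # ≈ 83.5
```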
Privacy Rating
Privacy ratings are based on publicly available information about data handling:
High Privacy
No training on user data, clear data retention policies, GDPR compliant, allows data deletion.
Examples: Claude (no training), most open-source models (self-hosted)
Medium Privacy
May train on user data with opt-out, 30-day retention, some data sharing with partners.
Examples: GPT-4 (opt-out available), Gemini (Google integration)
Low Privacy
Trains on user data by default, unclear retention, extensive data collection.
Note: Very few major models fall into this category as of 2024
Update Schedule
- Monthly: Update with new published benchmark results
- Immediately: Add new models when major releases occur
- Quarterly: Review and adjust category weights based on community feedback
- As needed: Methodology updates (with transparency reports)
Data Sources
All benchmark data comes from:
- Official model release papers and technical reports
- LMSYS Chatbot Arena leaderboard (updated continuously)
- Papers with Code leaderboards
- Hugging Face Open LLM Leaderboard
- Independent third-party evaluations (when available)
Every score on TopClanker includes a source link to the original benchmark data.
Our Commitments
✓ No paid placements: Rankings are based solely on benchmark performance.
✓ Open methodology: This page explains exactly how we calculate scores.
✓ Source everything: Every claim links to published research.
✓ Update transparently: Methodology changes are documented and dated.
✓ Community input: User voting will complement (not replace) benchmark scores.
Limitations & Caveats
We're transparent about what our rankings can and can't tell you:
- Benchmarks aren't perfect: They test specific capabilities, not all real-world performance.
- Scores change: Models get updated and new benchmarks emerge. Our rankings reflect current data.
- Context matters: The "best" model depends on your use case. A high overall score doesn't mean it's best for YOUR task.
- Gaming is possible: Labs can optimize for benchmarks. We try to use diverse tests to minimize this.
- Not all models are tested equally: Some models have more published benchmarks than others.
Questions or Feedback?
Think we're missing an important benchmark? Disagree with our weighting? Found an error?
Email us: [email protected]