Methodology, sources and neutrality
This page is the basis for trusting the rankings. The production release must publish the prompt sets, scoring rules, sampling parameters, sources, update cadence, benchmark budgets and the open-source runner repository.
Open methodology
Each task bucket describes prompts, sampling, scoring and update cadence.
Visible freshness
Every data point carries update time, source type and benchmark version.
Cost control
Benchmark jobs reserve fields for budget caps, caching, sampling and monthly cost.
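As a rough illustration of how a budget cap and a result cache could work together, here is a minimal Python sketch. The benchmark runner does not exist yet, so every name here (BenchmarkBudget, BudgetExceeded, monthly_cap_usd, est_cost_usd, the call argument) is hypothetical and not part of any published runner.

```python
import hashlib
import json


class BudgetExceeded(RuntimeError):
    """Raised when a job would push spending past the monthly cap."""


class BenchmarkBudget:
    # Hypothetical helper: tracks monthly cost and caches identical requests.
    def __init__(self, monthly_cap_usd):
        self.monthly_cap_usd = monthly_cap_usd
        self.monthly_cost_usd = 0.0
        self._cache = {}

    @staticmethod
    def cache_key(model, prompt, sampling):
        # Same model, prompt and sampling parameters map to the same key.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "sampling": sampling},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, model, prompt, sampling, est_cost_usd, call):
        key = self.cache_key(model, prompt, sampling)
        if key in self._cache:
            # Cache hit: no new spend.
            return self._cache[key]
        if self.monthly_cost_usd + est_cost_usd > self.monthly_cap_usd:
            raise BudgetExceeded(f"cap of {self.monthly_cap_usd} USD would be exceeded")
        result = call(model, prompt, sampling)
        self.monthly_cost_usd += est_cost_usd
        self._cache[key] = result
        return result
```

In this sketch a cached result costs nothing, and a job that would exceed the cap is rejected before the model is called. A real runner would probably also want to persist the cache and reset the cost counter each month.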
Task buckets
- Coding: Repository edits, bug fixes, unit-test reasoning and API usage.
- Writing: Product copy, long-form prose, editing and tone control.
- Translation: Bidirectional English/Chinese translation with terminology consistency.
- Math: Word problems, algebraic reasoning and structured calculation.
- Tool Calling: JSON output, function selection and multi-step tool use.
- Chinese: Chinese writing, knowledge, instructions and domestic model coverage.
- Reasoning: Multi-step logic, planning and hard instruction following.
- Support: Low-cost, fast, polite support replies at volume.
- Summarization: Long-context summarization and evidence-preserving compression.
- Low Latency: Interactive user-facing flows where TTFT matters most.
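Since each bucket is meant to describe its prompts, sampling, scoring and update cadence, a completeness check over a published methodology file might look like the sketch below. The bucket names come from the list above; the key names (prompts, sampling, scoring, update_cadence) and the function missing_bucket_fields are assumptions made for illustration.

```python
# Bucket names as listed on this page.
BUCKETS = (
    "Coding", "Writing", "Translation", "Math", "Tool Calling",
    "Chinese", "Reasoning", "Support", "Summarization", "Low Latency",
)

# Hypothetical keys for the four things each bucket should describe.
REQUIRED_KEYS = ("prompts", "sampling", "scoring", "update_cadence")


def missing_bucket_fields(methodology):
    """Return {bucket: [missing keys]} for buckets that are incomplete."""
    gaps = {}
    for bucket in BUCKETS:
        entry = methodology.get(bucket, {})
        # A key counts as missing if it is absent or empty.
        missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
        if missing:
            gaps[bucket] = missing
    return gaps
```

An empty result would mean every bucket documents all four items; anything else names the gaps to fill before publishing.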
Required fields for every data point
- Last updated time and timezone
- Source type: official pricing page, API, crawl, measured or pending
- Model version, region, sampling parameters and prompt-set version
- Affiliate disclosure; affiliate relationships never affect rank
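One possible shape for such a record is sketched below in Python. The field names, the DataPoint class and the problems method are hypothetical; only the list of required fields and the five source types are taken from this page.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class SourceType(Enum):
    OFFICIAL_PRICING_PAGE = "official pricing page"
    API = "API"
    CRAWL = "crawl"
    MEASURED = "measured"
    PENDING = "pending"


@dataclass
class DataPoint:
    # Hypothetical field names for the required fields listed above.
    updated_at: datetime
    source_type: SourceType
    model_version: str
    region: str
    sampling_params: dict
    prompt_set_version: str
    affiliate_disclosure: str

    def problems(self):
        """Return a list of human-readable problems; empty means complete."""
        found = []
        # A naive datetime has no timezone, which the page requires.
        if self.updated_at.utcoffset() is None:
            found.append("updated_at needs a timezone")
        for name in ("model_version", "region", "prompt_set_version", "affiliate_disclosure"):
            if not getattr(self, name).strip():
                found.append(f"{name} is empty")
        if not self.sampling_params:
            found.append("sampling_params is empty")
        return found
```

Constructing the timestamp with an explicit timezone, for example datetime.now(timezone.utc), keeps the "time and timezone" requirement checkable instead of implied.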
Open-source status
Phase 1 reserves the fields, but the benchmark runner repository has not been created yet. Current methodology version: methodology-draft-v0.1.