Methodology, sources and neutrality

This page is the basis for trusting the data shown on the site. The production release must publish the prompt sets, scoring rules, sampling parameters, sources, update cadence, benchmark budgets and the open-source runner repository.

Open methodology

Each task bucket documents its prompts, sampling parameters, scoring rules and update cadence.

Visible freshness

Every data point carries update time, source type and benchmark version.
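As an illustration only, the three pieces of freshness metadata could be rendered next to a data point as a single label. The function name, the label layout and the 30-day staleness threshold below are assumptions made for this sketch; the methodology does not define them.

```python
from datetime import datetime, timedelta, timezone


def freshness_label(
    updated_at: datetime,
    now: datetime,
    source_type: str,
    benchmark_version: str,
    stale_after: timedelta = timedelta(days=30),  # assumed threshold, not from the methodology
) -> str:
    """Build a one-line freshness label for a single data point."""
    if updated_at.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    age = now - updated_at
    status = "stale" if age > stale_after else "fresh"
    # Normalise to UTC so labels are comparable across regions.
    stamp = updated_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return f"{stamp} | {source_type} | {benchmark_version} | {status}"
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the label reproducible in tests.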

Cost control

Benchmark jobs reserve fields for budget caps, caching, sampling and monthly cost.
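A minimal sketch of how a budget cap, a result cache and prompt sampling might work together in a runner. Since the runner repository does not exist yet, every name here (`BenchmarkBudget`, `BudgetExceeded`, `sample_prompts`) and the use of US dollars as the unit are hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class BudgetExceeded(Exception):
    """Raised when a job would push spend past its cap."""


@dataclass
class BenchmarkBudget:
    cap_usd: float                 # hard cap for the billing period (assumed unit)
    spent_usd: float = 0.0         # running total, e.g. for a monthly cost field
    cache: Dict[str, str] = field(default_factory=dict)

    def run(self, key: str, est_cost_usd: float, call: Callable[[], str]) -> str:
        # A cached result costs nothing, so it is served even at the cap.
        if key in self.cache:
            return self.cache[key]
        if self.spent_usd + est_cost_usd > self.cap_usd:
            raise BudgetExceeded(f"{key}: cap of {self.cap_usd} USD would be exceeded")
        result = call()
        self.spent_usd += est_cost_usd
        self.cache[key] = result
        return result


def sample_prompts(prompts: List[str], k: int, seed: int) -> List[str]:
    """Pick a reproducible subset of prompts to limit the cost of a run."""
    rng = random.Random(seed)      # fixed seed so the same subset can be re-run
    return rng.sample(prompts, min(k, len(prompts)))
```

Checking the cap before the call, rather than after, means a job is refused instead of being billed and then discarded.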

Task buckets

Coding: Repository edits, bug fixes, unit-test reasoning and API usage.
Writing: Product copy, long-form prose, editing and tone control.
Translation: Bidirectional English/Chinese translation with terminology consistency.
Math: Word problems, algebraic reasoning and structured calculation.
Tool Calling: JSON output, function selection and multi-step tool use.
Chinese: Chinese-language writing, knowledge, instruction following and coverage of domestic (Chinese) models.
Reasoning: Multi-step logic, planning and hard instruction following.
Support: Low-cost, fast, polite support replies at volume.
Summarization: Long-context summarization and evidence-preserving compression.
Low Latency: Interactive user-facing flows where TTFT matters most.

Required fields for every data point

Last updated time and timezone
Source type: official pricing page, API, crawl, measured or pending
Model version, region, sampling parameters and prompt-set version
Affiliate disclosure; affiliate status never affects rank
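The required fields above could be captured in a record type with a validation step, sketched below. The class name, field names and the spelling of the source-type identifiers are assumptions for illustration; only the list of fields comes from this page.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Source types named on this page; the identifier spellings are assumed.
SOURCE_TYPES = {"official_pricing_page", "api", "crawl", "measured", "pending"}


@dataclass(frozen=True)
class DataPoint:
    value: Optional[float]      # None while the source type is "pending"
    updated_at: datetime        # must carry a timezone
    source_type: str
    model_version: str
    region: str
    sampling_params: dict
    prompt_set_version: str
    affiliate_link: bool        # disclosure flag only; ranking code should not read it


def validate(point: DataPoint) -> List[str]:
    """Return a list of problems; an empty list means the point can be published."""
    problems = []
    if point.updated_at.tzinfo is None or point.updated_at.utcoffset() is None:
        problems.append("updated_at has no timezone")
    if point.source_type not in SOURCE_TYPES:
        problems.append(f"unknown source_type: {point.source_type!r}")
    if point.value is None and point.source_type != "pending":
        problems.append("value missing for a non-pending source")
    for name in ("model_version", "region", "prompt_set_version"):
        if not getattr(point, name):
            problems.append(f"{name} is empty")
    return problems


# Illustrative values only; none of these identifiers come from real data.
example = DataPoint(
    value=1.5,
    updated_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
    source_type="measured",
    model_version="example-model-1",
    region="us",
    sampling_params={"temperature": 0.0},
    prompt_set_version="prompts-v0",
    affiliate_link=False,
)
```

Keeping `affiliate_link` as a plain disclosure flag on the record, separate from any scoring input, is one way to make the "never affects rank" rule easy to audit.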

Open-source status

Phase 1 reserves the fields, but the benchmark runner repository has not been created yet. Current methodology version: methodology-draft-v0.1.