Reasoning with small models

The motivating question is unusually crisp: can orchestration make weak models reason like a strong one? Rather than fine-tune anything, the project keeps every worker model at or below 1B parameters (a 0.6B model emerged as the strongest local reasoner) and asks how much of the gap to a frontier model can be recovered through orchestration alone. Discipline is enforced by three fixed anchors, all run on the identical seeded sample (default n=50): a single-model greedy floor, the system under test, and the frontier model with full chain-of-thought. Floor, system, and ceiling are therefore always comparable. The benchmark is GSM8K, split into a standard tier and a hardest-problems tier.
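The three-anchor discipline can be sketched as follows. This is a minimal illustration, not the project's runner: the names `seeded_sample`, `accuracy`, and `compare`, the problem dictionaries with `question` and `answer` keys, and exact-match scoring are all assumptions made for the example.

```python
import random
from typing import Callable, Sequence

Solver = Callable[[str], str]  # hypothetical interface: question text in, final answer out


def seeded_sample(problems: Sequence[dict], n: int = 50, seed: int = 0) -> list:
    """Draw the same n problems every time the same seed is used."""
    rng = random.Random(seed)
    return rng.sample(list(problems), min(n, len(problems)))


def accuracy(solver: Solver, sample: Sequence[dict]) -> float:
    """Fraction of problems whose predicted answer matches exactly (assumed scoring rule)."""
    correct = sum(solver(p["question"]) == p["answer"] for p in sample)
    return correct / len(sample)


def compare(anchors: dict, problems: Sequence[dict], n: int = 50, seed: int = 0) -> dict:
    """Score every anchor (floor, system, ceiling) on one shared seeded sample."""
    sample = seeded_sample(problems, n, seed)
    return {name: accuracy(solver, sample) for name, solver in anchors.items()}
```

Because the sample is drawn once and passed to every anchor, a difference between two scores reflects the solvers rather than the luck of the draw.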

The method is a ladder of self-contained experiments, each with its own architecture note, config, runner, and results; an overview chart is regenerated after every run. The trail spans self-consistency voting, multi-model ensembles, mixture-of-agents, verifier-gated voting, program-aided reasoning, all-to-all debate, problem decomposition, and token-level logit fusion. The winning architecture was the simplest in spirit: ask the single strongest model to solve each problem under several different reasoning framings (list-quantities-first, restate, re-check, solve-two-ways), then take a majority vote across the pooled samples.
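A minimal sketch of the framing-diverse vote, under stated assumptions: the prompt wording in `FRAMINGS`, the `generate` callable (prompt in, one sampled completion out), the last-number answer extraction, and the default of four samples per framing are all illustrative and not taken from the project.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Hypothetical prompt templates for the four framings named above.
FRAMINGS = {
    "list_quantities_first": "List every quantity in the problem, then solve it.\n\n{q}",
    "restate": "Restate the problem in your own words, then solve it.\n\n{q}",
    "re_check": "Solve the problem, then re-check each step before answering.\n\n{q}",
    "solve_two_ways": "Solve the problem two different ways and compare the results.\n\n{q}",
}


def extract_last_number(text: str) -> Optional[str]:
    """Treat the last number in a completion as its final answer (assumed convention)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def framed_vote(
    question: str,
    generate: Callable[[str], str],
    samples_per_framing: int = 4,
) -> Optional[str]:
    """Sample the same model under every framing and majority-vote the pooled answers."""
    votes = Counter()
    for template in FRAMINGS.values():
        prompt = template.format(q=question)
        for _ in range(samples_per_framing):
            answer = extract_last_number(generate(prompt))
            if answer is not None:  # unparseable completions cast no vote
                votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None
```

The vote is taken over one pooled counter rather than per framing, so a framing that is systematically wrong on a problem can be outvoted by the other three.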

Notably, the project does not just report the win; it explains every loss mechanistically. Self-consistency of the best model (+18 points) beat every multi-agent scheme, because you cannot bootstrap reliability from sub-1B judges that share the generator's blind spots. Plain self-consistency saturates because a single framing makes the same systematic mistake in every sample; framing diversity de-correlates those errors so the vote can outweigh them (standard tier 56% to 84%, and the win generalizes to the hard tier, 32% to 48%). Debate produced quantified groupthink, decomposition by weak models collapsed below baseline, and program-aided tool use hurt because the bottleneck on GSM8K is comprehension, not arithmetic.
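The saturation argument can be illustrated with a toy simulation. Everything in it is an assumption chosen for illustration, not a measurement from the project: each framing has a "blind spot" on a problem with probability `p_blind`, samples are correct with probability `p_bad` inside a blind spot and `p_ok` outside it, and every wrong sample from a framing repeats that framing's own mistake.

```python
import random
from collections import Counter


def vote(answers):
    """Return the most common answer in the pool."""
    return Counter(answers).most_common(1)[0][0]


def draw(rng, framing, is_blind, p_ok, p_bad):
    """One sampled answer: correct, or the framing's own systematic mistake."""
    p_correct = p_bad if is_blind else p_ok
    return "correct" if rng.random() < p_correct else f"wrong-{framing}"


def simulate(num_problems=500, framings=4, total_samples=16,
             p_blind=0.3, p_ok=0.9, p_bad=0.2, seed=0):
    """Compare one framing sampled 16 times with 4 framings sampled 4 times each."""
    rng = random.Random(seed)
    single_hits = diverse_hits = 0
    per_framing = total_samples // framings
    for _ in range(num_problems):
        # blind[f] is True when framing f systematically misreads this problem.
        blind = [rng.random() < p_blind for _ in range(framings)]
        single = [draw(rng, 0, blind[0], p_ok, p_bad) for _ in range(total_samples)]
        diverse = [draw(rng, f, blind[f], p_ok, p_bad)
                   for f in range(framings) for _ in range(per_framing)]
        single_hits += vote(single) == "correct"
        diverse_hits += vote(diverse) == "correct"
    return single_hits / num_problems, diverse_hits / num_problems


single_acc, diverse_acc = simulate()
print(f"single framing: {single_acc:.2f}  diverse framings: {diverse_acc:.2f}")
```

In this toy model, adding samples to a single framing cannot help on problems in its blind spot, because the extra samples mostly repeat the same mistake. Spreading the same sample budget across framings helps only to the extent that their blind spots fall on different problems.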

Honest context: this is a private research harness, not a product, and the claims are scoped to GSM8K and to locally available sub-1B models. A mechanistic probe traced decomposition failure to low effective depth and a sharp drop in the model's confidence in a fact once it is buried in problem context. The practical vote ceiling is about 84% on the standard tier; an oracle union across methods reaches roughly 92%, but capturing it would require a trustworthy per-problem selector that no sub-1B model provides. That is an acknowledged open problem rather than a solved one.
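The gap between a method's own accuracy and the oracle union can be made concrete with a short sketch. The data layout (one list of per-problem booleans per method), the function names, and the toy values are hypothetical; they are not the project's results.

```python
from typing import Callable


def oracle_union_accuracy(results: dict) -> float:
    """Fraction of problems solved by at least one method."""
    solved = [any(row) for row in zip(*results.values())]
    return sum(solved) / len(solved)


def best_single_accuracy(results: dict) -> float:
    """Accuracy of the best individual method."""
    return max(sum(r) / len(r) for r in results.values())


def selector_accuracy(results: dict, choose: Callable[[int], str]) -> float:
    """Accuracy when choose(i) picks which method to trust on problem i."""
    num_problems = len(next(iter(results.values())))
    hits = sum(results[choose(i)][i] for i in range(num_problems))
    return hits / num_problems


# Toy per-problem correctness for two hypothetical methods (illustrative only).
toy = {
    "framed_vote": [True, False, False, True],
    "program_aided": [False, True, False, True],
}
print(oracle_union_accuracy(toy))   # 0.75
print(best_single_accuracy(toy))    # 0.5
```

The oracle figure assumes a selector that always picks a method that got the problem right. Any real `choose` function scores between its weakest choice and that oracle bound, which is why the selector's reliability is the limiting factor.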