The Model Proposes, the Machine Disposes

The premise most people start with is wrong in a useful way. Large language models are not bad at arithmetic. They are not doing arithmetic at all. A transformer generating "347 times 89 is 30,883" is not multiplying. It is sampling the most probable continuation of a string, one token at a time, from a distribution shaped by every product it saw during training. Sometimes the sample is correct. Roughly one time in eight on a hard benchmark it is confidently, silently wrong. For a chatbot that tail is an annoyance. For anything auditable, a ledger, a fire-control solution, a compliance calculation, a silent wrong number is worse than a loud failure, because nothing in the system knows it happened.

So the question is not "how do we make the model better at math." The question is where to put the boundary between the thing that is good at intent and the thing that is correct by construction. That boundary is the whole design.

Arithmetic failure is structural, not a prompting problem

Three properties of the architecture guarantee the failure. None of them yields to a better prompt.

Tokenization destroys digit alignment. Byte-pair encoding was built to compress natural language, and it splits numbers by training frequency, not by place value. Depending on the tokenizer, "1234" is one token and "12345" is two, chopped somewhere in the middle. Addition is a positional, digit-aligned operation with carry propagation, and the model never sees the digits in fixed positional slots to begin with. Right-to-left digit grouping and single-digit tokenization in newer models help the low end, but they do not touch the deeper problem underneath.
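The misalignment is easy to see with a toy. The sketch below is not real byte-pair encoding: it is a greedy longest-match tokenizer over a made-up vocabulary (`toyVocab` and `toyTokenize` are hypothetical names), and it assumes only that some multi-digit chunks were frequent enough to earn their own token.

```typescript
// Toy greedy longest-match tokenizer. NOT real BPE: the vocabulary is
// invented purely to show where chunk boundaries can fall.
const toyVocab: string[] = ["1234", "123", "12", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"];

function toyTokenize(text: string, vocab: string[] = toyVocab): string[] {
  const tokens: string[] = [];
  let i = 0;
  while (i < text.length) {
    // Pick the longest vocabulary entry that matches at position i.
    let best = "";
    for (const entry of vocab) {
      if (entry.length > best.length && text.startsWith(entry, i)) best = entry;
    }
    if (best === "") throw new Error(`no token covers "${text[i]}"`);
    tokens.push(best);
    i += best.length;
  }
  return tokens;
}

const shortNumber = toyTokenize("1234");  // one chunk: ["1234"]
const longNumber = toyTokenize("12345");  // two chunks: ["1234", "5"]
```

In this toy the ones digit of the first number is buried at the end of a four-digit chunk, while the ones digit of the second is a chunk of its own. Nothing in the token sequence lines the two up by place value.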

A fixed-depth forward pass sits in the wrong complexity class. A transformer of fixed depth and polynomial width, computing at logarithmic precision, is bounded by a class of constant-depth threshold circuits, TC-zero. Inherently serial problems, such as solving general linear equalities, are conjectured to live outside it. A single forward pass, however wide, is therefore not expected to carry out an arbitrarily long chain of dependent steps as one shot of computation. This is why chain-of-thought helps at all: it externalizes intermediate state into the token stream and trades depth for sequence length, buying serial steps the architecture cannot take internally. It raises the accuracy number. It does not change what the model is. Each emitted step is still a probabilistic sample, and carry errors compound across a long product with no mechanism to catch them.

There is no register and no rollback. An arithmetic logic unit has registers, deterministic state transitions, and a carry bit that is either set or not. A transformer's only working memory during generation is the growing context, and with chain-of-thought it is simulating the algorithm in probability space rather than executing it. At temperature zero the fourteenth digit of a product is still an argmax over a fuzzy distribution. One wrong carry and the remainder is wrong, delivered with the same fluent confidence as a correct answer.
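To make the carry argument concrete, here is a minimal sketch of schoolbook addition over digit strings, with an optional injected fault. The function name `addDigits` and the `dropCarryAt` parameter are illustrative inventions; the fault simply forces one carry to zero, which is a stand-in for a single bad sample in a chain of emitted steps.

```typescript
// Schoolbook addition on decimal digit strings, right to left.
// dropCarryAt (hypothetical knob) forces the carry out of one position to 0,
// simulating a single wrong step in an otherwise correct procedure.
function addDigits(a: string, b: string, dropCarryAt: number = -1): string {
  const width = Math.max(a.length, b.length);
  const x = a.padStart(width, "0");
  const y = b.padStart(width, "0");
  let carry = 0;
  let out = "";
  for (let pos = 0; pos < width; pos++) {
    const idx = width - 1 - pos; // pos 0 is the ones digit
    const sum = Number(x[idx]) + Number(y[idx]) + carry;
    out = String(sum % 10) + out;
    carry = sum >= 10 ? 1 : 0;
    if (pos === dropCarryAt) carry = 0; // injected fault
  }
  return carry === 1 ? "1" + out : out;
}

const correctSum = addDigits("999999", "1");    // "1000000"
const faultySum = addDigits("999999", "1", 2);  // "999000"
```

One dropped carry at the hundreds place leaves every higher digit wrong, and the faulty string looks exactly as well-formed as the correct one. A deterministic adder cannot drop the carry; a sampler can.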

Prompting, fine-tuning, and scale all move the benchmark number. They do not move the category. You are asking a statistical sequence model to be a calculator, and the residual error stays silent.

What this means (plain English): A language model does not calculate the way a calculator does. It predicts what the next chunk of text should probably be, based on patterns it absorbed from reading enormous amounts of writing. It is autocomplete with a PhD. When it "adds," it is guessing what a correct answer tends to look like, not running the addition. That guess is usually right and occasionally wrong — and the model has no way to know which, so a wrong number comes out sounding exactly as confident as a right one.

The boundary principle

The correct architecture stops trying to make the model good at the middle of the computation. It uses the model for the two things it is genuinely, uniquely good at (turning ambiguous natural-language intent into a precise formal specification, and composing steps into a plan) and hands the computation itself to deterministic software.

State it as one line: the model decides what to compute, software decides the result. Intent parsing and program synthesis on the neural side. Execution on the deterministic side. Explanation back on the neural side. The model brackets the computation on both ends and never touches the number in the middle.

There is a clean experiment that shows this is the right cut. The program-aided language model (PAL) work compared two conditions with the same model and the same generated code. In one, a real interpreter runs the program. In the other, the model is asked to predict what its own program would output, to mentally execute it. Running the code lands in the low seventies on the target benchmark. Asking the model to simulate its own correct code collapses to the low twenties. Same reasoning. Same program. The only variable is whether a deterministic machine executed it. That gap is the entire thesis in a single table.

What this means (plain English): Use each part for what it is good at. The AI is brilliant at understanding a messy human request — "split this bill three ways after a 20% tip" — and turning it into a precise instruction. It is bad at doing the resulting sum. So let it write the instruction, hand that to an ordinary reliable calculator to actually run, and then let the AI explain the answer in plain words. The AI touches both ends and never the number in the middle. In the experiment above, the same AI writing the same steps got the right answer far more often when a real calculator ran them than when it tried to run them in its own head.
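The bill-splitting request can be sketched end to end. Everything named here is an assumption for illustration: the `SplitBillSpec` shape stands in for whatever the model would emit, amounts are integer cents, the tip is rounded half up, and leftover cents go to the first payers. The point is only that the function in the middle contains no model.

```typescript
// Hypothetical spec the model would emit for
// "split this bill three ways after a 20% tip". Amounts are integer cents.
interface SplitBillSpec {
  op: "split_bill";
  totalCents: number;
  tipPercent: number;
  ways: number;
}

// Deterministic side: spec in, exact integer shares out.
function splitBill(spec: SplitBillSpec): number[] {
  const { totalCents, tipPercent, ways } = spec;
  const allIntegers = [totalCents, tipPercent, ways].every((n) => Number.isSafeInteger(n));
  if (!allIntegers || totalCents < 0 || tipPercent < 0 || ways < 1) {
    throw new Error("invalid spec");
  }
  // Assumed policy: round the tip half up to the nearest cent.
  const tipCents = Math.floor((totalCents * tipPercent + 50) / 100);
  const grandTotal = totalCents + tipCents;
  const base = Math.floor(grandTotal / ways);
  const leftover = grandTotal % ways;
  // Assumed policy: the first `leftover` payers each cover one extra cent.
  const shares: number[] = [];
  for (let i = 0; i < ways; i++) shares.push(i < leftover ? base + 1 : base);
  return shares;
}

// Format integer cents without going through floating point.
function formatCents(cents: number): string {
  return `${Math.floor(cents / 100)}.${String(cents % 100).padStart(2, "0")}`;
}

const evenShares = splitBill({ op: "split_bill", totalCents: 9000, tipPercent: 20, ways: 3 });
const unevenShares = splitBill({ op: "split_bill", totalCents: 10000, tipPercent: 20, ways: 7 });
// The explanation step only wraps a value it was handed.
const explanation = `Each person pays ${formatCents(evenShares[0])}.`;
```

The shares always sum back to the grand total, which is the kind of invariant a sampled answer cannot promise.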

The pattern landscape, briefly

Everyone converging on this has landed on a few shapes:

Program-of-Thoughts and program-aided generation. The model emits code, an interpreter runs it, the model reads the result. Offloads the arithmetic and the logic to the runtime.
Code execution tools. The productized form: a sandboxed Python or JavaScript environment the model calls. General compute, not only arithmetic.
Computer algebra offload. SymPy, Wolfram, for exact symbolic algebra, calculus, and rational arithmetic where a float is the wrong instrument.
MCP compute servers. Deterministic compute exposed as tools any agent can dispatch to. The interoperable, composable version, and the one that fits an operating system rather than a single chat loop.
Verifiers and self-consistency. Sample many solutions, vote or check. Raises the floor. Still probabilistic, still no guarantee.
Neurosymbolic verification. A theorem prover in the loop. The model proposes, Lean or an equivalent checks. DeepMind's AlphaProof and AlphaGeometry reached International Mathematical Olympiad medal level this way, silver-medal performance in 2024, by pairing a language model with a symbolic engine and a proof checker. The answer is not "probably right." It is machine-checked.
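The difference between voting and checking in the list above is worth seeing in miniature. This is a minimal sketch of self-consistency as a plain majority vote; `majorityVote` is a hypothetical helper and the sample strings are invented, not outputs from any real model.

```typescript
// Majority vote over sampled answers. Returns the most frequent answer and
// the fraction of samples that agreed with it.
function majorityVote(samples: string[]): { answer: string; share: number } {
  if (samples.length === 0) throw new Error("no samples to vote on");
  const counts = new Map<string, number>();
  for (const s of samples) counts.set(s, (counts.get(s) ?? 0) + 1);
  let answer = "";
  let best = 0;
  counts.forEach((n, candidate) => {
    if (n > best) {
      best = n;
      answer = candidate;
    }
  });
  return { answer, share: best / samples.length };
}

// Invented samples for 347 * 89. The true product is 30883.
const goodRound = majorityVote(["30883", "30883", "30783", "30883", "30893"]);
// The same mechanism, when the common mistake happens to outnumber the truth.
const badRound = majorityVote(["30783", "30783", "30883"]);
```

The vote raises the floor when errors are scattered and fails when they are correlated. Neither round involved anything that actually multiplied.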

The shape of the benchmark evidence is monotonic and it points one way. MATH moves from the low forties on chain-of-thought alone to the mid-eighties with code execution plus verification. GSM8K shows the same jump under program-aided generation. The more of the computation you move onto a deterministic engine, and the more of the output you machine-check, the higher the score climbs. More important than the score, the more the result can be trusted.

What this means (plain English): The whole industry has quietly reached the same conclusion: do not trust the AI to compute; make it hand the work to something that computes reliably. The strongest version does not just calculate the answer, it proves it: a separate program checks the AI's reasoning step by step, the way a referee checks a math proof. That is how AI systems reached medal-level scores on International Mathematical Olympiad problems: not by being sure, but by being checked.

Building it into an agent OS

Tool use inside a chat product is one decision. A deterministic compute substrate inside a multi-agent kernel is a different one, with these load-bearing facets.

Routing: who decides to dispatch

Two schools. In model-decided routing, the model chooses whether to call the calculator. Simple, but the model can decline, at which point you are back to hallucinated arithmetic wearing a confident tone. In orchestrator-enforced routing, the kernel intercepts: any output typed as a computed quantity has to come from the deterministic path or it does not get emitted. For a sovereign or auditable posture, the second is the only defensible default. The model is not trusted to self-report whether it did the math correctly, because it cannot tell.

The sharp formulation is a type-system one. A computed quantity is a type, and values of that type can only be constructed by the compute layer. The model can propose the computation. It can never mint the value. The neural layer simply lacks the capability to produce a verified number, the same way an unprivileged process lacks the capability to write outside its own memory.
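One way to sketch the type-system formulation in TypeScript is a class with a private constructor, so that only the compute layer's own static methods can mint a value. The names `ComputedQuantity`, `sum`, and `emitToUser` are hypothetical, and an in-language guard like this is an illustration of the idea rather than a hard capability boundary; a real kernel would presumably enforce it across a process or sandbox boundary as well.

```typescript
// A computed quantity that only the compute layer can construct.
class ComputedQuantity {
  readonly cents: number;
  readonly expression: string;

  // Private constructor: code outside this class cannot call `new`.
  private constructor(cents: number, expression: string) {
    this.cents = cents;
    this.expression = expression;
  }

  // Hypothetical compute-layer entry point: exact integer sum in cents.
  static sum(termsInCents: number[]): ComputedQuantity {
    for (const t of termsInCents) {
      if (!Number.isSafeInteger(t)) throw new Error("terms must be exact integers");
    }
    const total = termsInCents.reduce((acc, t) => acc + t, 0);
    return new ComputedQuantity(total, termsInCents.join(" + "));
  }
}

// The emit path accepts the type, and re-checks at runtime for callers that
// bypassed the type checker with a cast.
function emitToUser(value: ComputedQuantity): string {
  if (!(value instanceof ComputedQuantity)) {
    throw new Error("refusing to emit a number the compute layer did not produce");
  }
  return `${value.expression} = ${value.cents} cents`;
}

const emitted = emitToUser(ComputedQuantity.sum([1999, 501]));

// A model-authored look-alike is rejected even when forced past the compiler.
let forgedRejected = false;
try {
  emitToUser({ cents: 2500, expression: "trust me" } as unknown as ComputedQuantity);
} catch {
  forgedRejected = true;
}
```

Without the cast, the forged object literal does not compile at all, because a class with a private member is not structurally assignable from a plain object. The model can ask for `sum`. It has no path to `new`.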

What this means (plain English): Someone has to decide when to reach for the calculator. If you leave that decision to the AI, it will sometimes skip the calculator and wing it — confidently. So the safer design does not ask: the system simply refuses to let any number reach the user unless a trusted calculator produced it. The AI is allowed to request a calculation. It is never allowed to be the answer.

Where the boundary actually sits

Neural: natural language into a formal spec, which quantity, which units, which operation, which constraints. Deterministic: spec into value. Neural again: value into explanation. The model wraps the computation on both sides and the arithmetic core stays sealed in the middle, untouched by anything probabilistic.

Sandboxing, which is where sovereign posture earns its keep

Running model-generated code is running untrusted code, and the threat model is prompt-injection-to-remote-execution: adversarial input talks the model into emitting code that exfiltrates or pivots. The isolation tiers, in ascending order of both safety and cost:

In-process restricted evaluators. A math grammar over a locked-down abstract syntax tree. No arbitrary code, only the expression language you exposed. Microseconds. Tight. Limited to what the grammar allows, which for the ninety-five percent arithmetic case is the point.
isolated-vm. Real V8 isolates in Node, with no Node built-ins exposed by default. Note that vm2, the thing people reach for first, was deprecated after shipping sandbox-escape CVEs. Do not use it. isolated-vm is the current in-process answer.
WASM runtimes. Python via Pyodide, or any WASM target. Memory-safe, capability-gated, portable.
gVisor. A userspace kernel that intercepts syscalls, container-grade isolation with a far smaller attack surface than raw runc.
Firecracker microVMs. Hardware virtualization, boot on the order of a hundred-and-some milliseconds, one disposable VM per execution. The strongest practical isolation for untrusted code at scale, which is why it sits under Lambda.

The trade is latency and memory against blast radius. A bounded expression evaluator is a warm few microseconds and holds nothing. A microVM is a heavier, slower, hard-walled box you throw away after one use. Route by risk. The arithmetic grammar handles the common case; escalate to a microVM only when arbitrary code is genuinely required, and never run untrusted code anywhere softer than the risk warrants.
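The lowest tier is small enough to show whole. What follows is a minimal sketch of an in-process restricted evaluator, assuming a deliberately tiny grammar: non-negative integer literals, `+`, `-`, `*`, unary minus, and parentheses. The function name `evaluateArithmetic` is hypothetical, and a production tier would more likely sit on an existing expression library than on a hand-written parser. There is no `eval` and no path from input text to arbitrary code.

```typescript
// Recursive-descent evaluator over a fixed arithmetic grammar:
//   expr   := term (('+' | '-') term)*
//   term   := factor ('*' factor)*
//   factor := '-' factor | INTEGER | '(' expr ')'
function evaluateArithmetic(source: string): number {
  // Any non-whitespace character outside the grammar becomes its own token
  // and is rejected by parseFactor or the trailing-token check.
  const tokens: string[] = source.match(/\d+|[-+*()]|\S/g) ?? [];
  let pos = 0;

  // Refuse results that a double cannot represent exactly.
  const checked = (n: number): number => {
    if (!Number.isSafeInteger(n)) throw new Error("value outside exact integer range");
    return n;
  };

  function parseExpr(): number {
    let value = parseTerm();
    while (tokens[pos] === "+" || tokens[pos] === "-") {
      const op = tokens[pos++];
      const rhs = parseTerm();
      value = checked(op === "+" ? value + rhs : value - rhs);
    }
    return value;
  }

  function parseTerm(): number {
    let value = parseFactor();
    while (tokens[pos] === "*") {
      pos++;
      value = checked(value * parseFactor());
    }
    return value;
  }

  function parseFactor(): number {
    const tok = tokens[pos++];
    if (tok === undefined) throw new Error("unexpected end of input");
    if (tok === "-") return checked(-parseFactor());
    if (tok === "(") {
      const inner = parseExpr();
      if (tokens[pos++] !== ")") throw new Error("expected ')'");
      return inner;
    }
    if (/^\d+$/.test(tok)) return checked(Number(tok));
    throw new Error(`token not in grammar: ${tok}`);
  }

  const result = parseExpr();
  if (pos !== tokens.length) throw new Error(`unexpected token: ${tokens[pos]}`);
  return result;
}

const product = evaluateArithmetic("347 * 89");         // 30883
const nested = evaluateArithmetic("(12 + 30) * -2");    // -84
```

An input like `process.exit(1)` fails at the first character, because identifiers are not in the grammar. The attack surface is the grammar, and the grammar is five operators wide.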

What this means (plain English): If the AI is going to write and run little programs, assume some of those programs are booby-trapped — a malicious user can trick the AI into writing harmful code. So you run that code in a locked room with no windows. There is a range of locked rooms, from a lightweight cage that only allows arithmetic (fast, very safe, very limited) up to a throwaway virtual computer that gets destroyed after a single use (slower, but nothing can escape it). You match the strength of the cage to how dangerous the code could be — and you never put risky code in a flimsy one.

Determinism, reproducibility, auditability

For regulated or defense work the compute path has to be three things. Deterministic: identical inputs produce identical outputs, bit for bit, which means pinning the runtime, pinning library versions, and controlling floating-point mode. Auditable: every computed value carries provenance (the spec that produced it, the exact expression executed, the runtime version, the inputs, and a hash), enough to replay it cold. In a system built this way the model's sentence is a view over an auditable computation, not the source of truth. And float-disciplined: 0.1 plus 0.2 is not exactly 0.3 in IEEE 754 binary floating point, and for money or any long sum that is the difference between a reproducible ledger and a rounding incident nobody can reconstruct. Reach for arbitrary-precision decimal and exact rationals wherever the domain allows.
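The float point takes four lines to demonstrate. The sketch below contrasts binary floating point with integer minor units; `toCents` and `formatCents` are hypothetical helpers, and integer cents are used here as a simple stand-in for the arbitrary-precision decimal types the paragraph recommends.

```typescript
// Binary floating point: the familiar surprise.
const floatSum = 0.1 + 0.2;
const floatMatches = floatSum === 0.3; // false

// Ten dimes, summed as floats, do not make a dollar.
let tenDimesFloat = 0;
for (let i = 0; i < 10; i++) tenDimesFloat += 0.1;

// Parse a decimal money string into exact integer cents (no float math).
function toCents(amount: string): number {
  const m = /^(-?)(\d+)(?:\.(\d{1,2}))?$/.exec(amount);
  if (m === null) throw new Error(`not a money amount: ${amount}`);
  const whole = Number(m[2]);
  const frac = Number((m[3] ?? "").padEnd(2, "0"));
  const cents = whole * 100 + frac;
  if (!Number.isSafeInteger(cents)) throw new Error("amount too large for exact cents");
  return m[1] === "-" ? -cents : cents;
}

// Format integer cents back to a decimal string.
function formatCents(cents: number): string {
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  return `${sign}${Math.floor(abs / 100)}.${String(abs % 100).padStart(2, "0")}`;
}

const exactSum = formatCents(toCents("0.10") + toCents("0.20")); // "0.30"

let tenDimesCents = 0;
for (let i = 0; i < 10; i++) tenDimesCents += toCents("0.10");
const exactDollar = formatCents(tenDimesCents); // "1.00"
```

The integer path gives the same answer on every run and every machine, which is the property the ledger needs.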

What this means (plain English): For banking, defense, or anything regulated, "the AI said so" is not good enough. Every number needs a paper trail: what was asked, exactly what was calculated, which version of the software ran it, and a fingerprint that lets anyone re-run it later and get the identical result. This is also why the calculator, not the AI, holds the truth — you cannot get a bit-for-bit reproducible answer out of a system that is guessing. (And a quirk most people never learn: computers get 0.1 + 0.2 slightly wrong by default. Fine for a science graph, a disaster for a ledger — so the money math uses tools built for exactness.)
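A provenance record of that kind might look like the sketch below. The field names and the `computeWithProvenance` and `replayMatches` helpers are assumptions for illustration. The fingerprint uses a 32-bit FNV-1a hash only to keep the example dependency-free; it is not collision-resistant, and a real audit trail would use a cryptographic hash such as SHA-256.

```typescript
// Everything needed to replay one computed value.
interface Provenance {
  spec: string;
  expression: string;
  inputs: number[];
  runtimeVersion: string;
  value: number;
  fingerprint: string;
}

// FNV-1a, 32-bit. Illustrative stand-in only: NOT a cryptographic hash.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Hypothetical compute step: an integer sum that records how it was produced.
function computeWithProvenance(spec: string, inputs: number[], runtimeVersion: string): Provenance {
  const value = inputs.reduce((a, b) => a + b, 0);
  const expression = inputs.join(" + ");
  const fingerprint = fnv1a(JSON.stringify([spec, expression, inputs, runtimeVersion, value]));
  return { spec, expression, inputs, runtimeVersion, value, fingerprint };
}

// Replay cold: recompute from the recorded inputs and compare.
function replayMatches(record: Provenance): boolean {
  const again = computeWithProvenance(record.spec, record.inputs, record.runtimeVersion);
  return again.value === record.value && again.fingerprint === record.fingerprint;
}

const ledgerEntry = computeWithProvenance("sum of invoice lines", [1999, 501, 250], "kernel-0.1.0");
const replayOk = replayMatches(ledgerEntry);
// Any edit to the stored value is caught on replay.
const tamperedOk = replayMatches({ ...ledgerEntry, value: ledgerEntry.value + 1 });
```

The model's sentence about the total can be regenerated at will. The record cannot be argued with.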

The compiler is the reasoning oracle

When the compute layer rejects the model's program, a syntax error, a type error, a unit mismatch, a violated constraint, that rejection is not a failure to swallow. It is a specification handed back. The loop runs: model emits program, the runtime and type checker and unit checker evaluate it, and on failure the structured error becomes the next constraint the model has to satisfy. A units library that refuses to add meters to seconds is teaching the model the spec through its own error messages. This is test-driven development with the model as the thing under test: the failing check is the spec, and the model's job is to make it pass. Treat the type system and the validators as ground truth and let their errors drive the repair.
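The loop described above can be sketched with a stub in place of the model. Everything here is hypothetical scaffolding: `scriptedModel` is a two-line script rather than a real model call, the program language is a single `add` operation, and the only check is a unit match. The structure is the part that matters: the checker's structured error is the next input to the proposer.

```typescript
type Unit = "m" | "s";
interface Quantity { value: number; unit: Unit }
interface AddProgram { op: "add"; left: Quantity; right: Quantity }
interface CheckError { code: "UNIT_MISMATCH"; message: string }
type CheckResult =
  | { kind: "ok"; value: Quantity }
  | { kind: "rejected"; error: CheckError };

// Deterministic side: run the program, or hand back a structured rejection.
function runChecked(p: AddProgram): CheckResult {
  if (p.left.unit !== p.right.unit) {
    return {
      kind: "rejected",
      error: { code: "UNIT_MISMATCH", message: `cannot add ${p.left.unit} to ${p.right.unit}` },
    };
  }
  return { kind: "ok", value: { value: p.left.value + p.right.value, unit: p.left.unit } };
}

// The proposer sees the previous rejection (or null on the first attempt).
type Proposer = (feedback: CheckError | null) => AddProgram;

function repairLoop(propose: Proposer, maxAttempts: number): { result: Quantity; attempts: number } {
  let feedback: CheckError | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const outcome = runChecked(propose(feedback));
    if (outcome.kind === "ok") return { result: outcome.value, attempts: attempt };
    feedback = outcome.error; // the failing check becomes the next constraint
  }
  throw new Error(`no valid program after ${maxAttempts} attempts`);
}

// Stub standing in for a model: wrong first, repaired once it sees the error.
const scriptedModel: Proposer = (feedback) =>
  feedback === null
    ? { op: "add", left: { value: 120, unit: "m" }, right: { value: 30, unit: "s" } }
    : { op: "add", left: { value: 120, unit: "m" }, right: { value: 30, unit: "m" } };

const repaired = repairLoop(scriptedModel, 3);
```

The attempt cap matters as much as the feedback. A proposer that never converges should end in a loud failure, not in a number.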

What this means (plain English): When the AI writes a program that does not add up — mixing units, breaking a rule — the error message the computer spits back is not a dead end. It is free tutoring. The system hands the error straight back to the AI and says "fix this," and the AI tries again. A tool that refuses to add "meters" to "seconds" is quietly teaching the AI the rules, with no human in the loop.

Structured I/O across the seam

Numbers have to cross the neural boundary without being re-tokenized into fuzz. Constrained decoding and structured outputs, JSON-schema-shaped or grammar-constrained generation, mean the model emits a spec object rather than prose you have to parse back out. On the return trip the value is injected as structured data. A schema validates the spec before it reaches the compute layer and validates the result before it re-enters the model's context. The model should never be transcribing a fourteen-digit number by hand. It references a value by handle and lets the substrate hold the digits.
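A minimal sketch of the handle idea follows, with a hand-written type guard standing in for a schema library so the example has no dependencies. `ValueStore`, `isSumSpec`, `executeSpec`, and `render` are hypothetical names, and the `val_N` handle format and `{{handle}}` placeholder syntax are invented for the illustration.

```typescript
// The substrate holds the digits; everything else passes handles.
class ValueStore {
  private values = new Map<string, number>();
  private nextId = 1;

  put(value: number): string {
    const handle = `val_${this.nextId++}`;
    this.values.set(handle, value);
    return handle;
  }

  get(handle: string): number {
    const v = this.values.get(handle);
    if (v === undefined) throw new Error(`unknown handle: ${handle}`);
    return v;
  }
}

// Spec the model emits: operands are handles, never literal digits.
interface SumSpec { op: "sum"; operands: string[] }

// Validate the spec before it reaches the compute layer.
function isSumSpec(candidate: unknown): candidate is SumSpec {
  if (typeof candidate !== "object" || candidate === null) return false;
  const c = candidate as Record<string, unknown>;
  const operands = c["operands"];
  return (
    c["op"] === "sum" &&
    Array.isArray(operands) &&
    operands.length > 0 &&
    operands.every((h) => typeof h === "string")
  );
}

// Spec in, handle out. The result is validated before it is stored.
function executeSpec(rawSpec: unknown, store: ValueStore): string {
  if (!isSumSpec(rawSpec)) throw new Error("spec failed validation");
  const total = rawSpec.operands.reduce((acc, h) => acc + store.get(h), 0);
  if (!Number.isSafeInteger(total)) throw new Error("result failed validation");
  return store.put(total);
}

// The substrate, not the model, fills digits into the final text.
function render(template: string, store: ValueStore): string {
  return template.replace(/\{\{(val_\d+)\}\}/g, (_match, h: string) => String(store.get(h)));
}

const store = new ValueStore();
const a = store.put(12345678901234);
const b = store.put(98765432109876);
const modelSpec: unknown = JSON.parse(`{"op":"sum","operands":["${a}","${b}"]}`);
const resultHandle = executeSpec(modelSpec, store);
const finalText = render(`The combined total is {{${resultHandle}}}.`, store);
```

At no point does a fourteen-digit number pass through the model's output as digits. It writes `{{val_3}}` and the substrate does the rest.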

What this means (plain English): Numbers have to pass between the AI and the calculator without getting garbled in translation. The rule is: never make the AI retype a long number by hand. It passes numbers around like sealed envelopes it is not allowed to open, and a strict inspector checks everything going in and coming back out. The AI handles the meaning; the machinery handles the digits.

The TypeScript stack, concretely

math.js for expression parsing, matrices, units, and a restricted evaluator as the default bounded tier. decimal.js and big.js for arbitrary-precision decimal, which is where money and exact sums live. fraction.js for exact rationals. For symbolic work, shell out to SymPy through the sandbox or hit a Wolfram endpoint. isolated-vm for the in-process JavaScript isolation tier, Pyodide over WASM for portable Python, gVisor and Firecracker for the strong tiers. Zod at the seam, spec in and value out, on both sides of every crossing. And each tier wrapped as an MCP compute server so every agent in the kernel dispatches through the same audited path rather than each one inventing its own way to be wrong.

A staged build

The deterministic kernel. A math, units, and decimal core, pure, deterministic, exact by default, no model anywhere near it. This is the trusted computing base for numbers, and it is worth building first because everything else leans on it.
Sandboxed compute as MCP. Wrap the kernel and a code-execution tier behind MCP, isolated-vm at the base, escalating to gVisor and Firecracker by risk. Structured spec in, structured value plus provenance out.
The audit and repair loop. Orchestrator-enforced routing so computed values come only from the compute path, structured-error feedback so the type and unit checkers drive model repair, provenance on every value that leaves the substrate.
The verification tier. For claims that must be proven rather than merely computed, a Lean-backed check where the model proposes and the prover disposes. Expensive. Reserve it for the high-assurance subset that earns it.

The part you already knew

If you have read this far you have already arrived at the conclusion, which is that the intelligence in a well-built agent OS was never the arithmetic. It is knowing what to compute, expressing it precisely enough that a deterministic machine can execute it, and checking the result against a spec that cannot be argued out of its answer. The model's fluency with numbers was never the asset worth having. Its fluency with intent is. Build the boundary so the model proposes and the machine disposes, seal the computation in a deterministic core the model can call but never author, and the silent-error tail closes, because the model is no longer the thing doing the math.

That is the design. Everything else is which sandbox tier, and how much you are willing to pay in milliseconds for how much you refuse to pay in trust.

The claims in this piece were verified against primary sources: the program-aided ablation (a real interpreter versus the model simulating its own code) is from the PAL work of Gao et al. (arXiv 2211.10435), where GSM8K runs at 72.0% executed and collapses to 23.2% self-simulated; the MATH progression from the low forties to the mid-eighties with code execution and self-verification is from Zhou et al. (arXiv 2308.07921); the TC-zero circuit-class ceiling is Merrill and Sabharwal (arXiv 2207.00729); and the IMO silver-medal result is DeepMind's AlphaProof and AlphaGeometry 2 (Nature, 2025), which scored 28/42 at IMO 2024. Model-scoreboard numbers move fast. Verify the latest against primary model cards before citing them in a design document.