Formal verification just got a lot more accessible. Mistral AI released Leanstral 1.5 on July 2, 2026, an open-weights model under the Apache 2.0 license, purpose-built for Lean 4, the proof assistant capable of expressing complex mathematical objects and software specifications including properties of Rust programs. It’s part of the Mistral Small 4 family and available free during beta through Mistral’s Labs API.
What the model actually is
The architecture is confirmed via the Hugging Face model card: a Mixture of Experts design with 128 experts, 4 of them active per token, totaling 119B parameters with 6.5B activated per token (roughly 5.5% of the weights). The context window is 256,000 tokens. The model accepts text and image input and produces text output. The recommended temperature is 1.0, and Mistral advises against using the reasoning mode for simple prompts while recommending it for complex ones. These specifics come from the model card itself, not from marketing copy.
Why it matters
Formal verification has stayed a specialist discipline for a simple reason: it requires proof engineers fluent in theorem-proving languages. Most engineering teams don’t have them. A capable open-weights model that speaks Lean 4, carries an Apache 2.0 license, and runs free during beta removes a major barrier to evaluation: access cost. Teams that couldn’t justify a dedicated proof engineer can now run a trial without budget approval.
According to Mistral AI’s own testing, Leanstral 1.5 identified 5 previously unknown bugs across 57 open-source repositories. That’s a practitioner signal worth taking seriously, even though it comes from the vendor. Bug discovery in real codebases is a harder, messier test than any benchmark, and it’s the one that actually maps to what engineering teams need.
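For readers who have not seen Lean 4, the kind of artifact a proof engineer writes looks roughly like the sketch below: a definition, a specification stated as a theorem, and a proof that the Lean checker verifies. The names `double` and `double_eq_two_mul` are hypothetical and are not taken from Mistral’s materials. The `omega` tactic is assumed to be available, as it is in recent Lean 4 toolchains.

```lean
-- Hypothetical illustration; not from Mistral's announcement or model card.
-- A small function whose behavior we want to pin down.
def double (n : Nat) : Nat := n + n

-- Specification: `double` agrees with multiplication by two.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double   -- goal becomes: n + n = 2 * n
  omega           -- linear arithmetic over Nat closes the goal
```

The point of the format is that the proof is checked mechanically: if the tactic script does not establish the stated property, the file does not compile.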
The benchmark story, read carefully
The benchmarks are self-reported, so read them carefully. According to Mistral AI, the model scores 100% on the miniF2F formal mathematics benchmark and solves 587 of 672 problems (about 87%) on PutnamBench, a competition mathematics dataset maintained by researchers at UT Austin. Mistral AI also reports 87% on its own FATE-H benchmark and 34% on FATE-X. None of these results has been independently evaluated. The coverage repeating these figures all traces back to Mistral’s announcement, and no Epoch AI evaluation exists yet. The miniF2F and PutnamBench results describe performance on structured mathematical proof tasks, which is legitimately relevant to the model’s use case, but “100% on miniF2F” means the benchmark has hit its ceiling, not that the model is infallible in production.
What to watch
Independent evaluation is the next gate. Until Epoch AI or an academic group reproduces the PutnamBench and FATE results, treat all performance claims as vendor-reported. The FATE benchmark is Mistral’s own framework, so it doesn’t have the same standing as PutnamBench, which is hosted academically at UT Austin. Watch for community testing via the Hugging Face model card, where developers will begin logging real-world results. If the bug-discovery rate holds up across broader repository testing, that’s the metric that will matter to engineering teams.
TJS synthesis
Don’t migrate production verification pipelines on vendor benchmarks alone. The Apache 2.0 license and free beta make the evaluation cost essentially zero, so run the model against your own Lean 4 specifications before drawing conclusions. The architecture is solid and the confirmed facts are genuinely interesting. What isn’t confirmed yet is whether the performance holds outside Mistral’s test conditions. Wait for independent benchmarks before committing to workflow changes.
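One way to frame such a trial, sketched here under assumptions, is to fix a theorem statement from your own codebase and leave the proof open for the model to fill in. The `clamp` function and the `clamp_ge_lo` property below are hypothetical stand-ins for a team’s real specification. How the file is sent to the model is deliberately left out, since that depends on the API.

```lean
-- Hypothetical trial task; substitute a specification from your own project.
def clamp (lo hi x : Nat) : Nat := max lo (min hi x)

-- The statement is fixed by the team; the proof is what the model must supply.
theorem clamp_ge_lo (lo hi x : Nat) : lo ≤ clamp lo hi x := by
  sorry  -- placeholder: Lean accepts the file but warns that the proof is missing
```

A candidate proof counts only if the file compiles with no remaining `sorry`, which gives the evaluation a pass/fail criterion that does not depend on vendor-reported scores.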
Sources: Huggingface, Nyu, Mistral AI.