MiniMax M3 is available now. The model launched June 1 via the MiniMax API and OpenRouter, with a 1,048,576-token context window built on the MiniMax Sparse Attention (MSA) architecture and support for text, image, and video inputs. Launch pricing is $0.30 per million input tokens and $1.20 per million output tokens, a 50% discount from the standard rate of $0.60/$2.40. Even that standard rate is well below what GPT-5.5 and Gemini 3.1 Pro charge for comparable context sizes.
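To make the per-million-token rates concrete, here is a minimal sketch of the arithmetic for a single request. It assumes flat per-token billing with no long-context tiering or caching discounts, which the announcement does not confirm; the `Rate` class, the `request_cost` helper, and the example token counts are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per million tokens, as quoted in the article."""
    input_per_million: float
    output_per_million: float


LAUNCH = Rate(input_per_million=0.30, output_per_million=1.20)
STANDARD = Rate(input_per_million=0.60, output_per_million=2.40)


def request_cost(rate: Rate, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, assuming flat per-token billing."""
    total = (input_tokens * rate.input_per_million
             + output_tokens * rate.output_per_million)
    return total / 1_000_000


# Hypothetical request: a full 1,048,576-token context and 4,000 output tokens.
launch_cost = request_cost(LAUNCH, 1_048_576, 4_000)
standard_cost = request_cost(STANDARD, 1_048_576, 4_000)
print(f"launch: ${launch_cost:.4f}  standard: ${standard_cost:.4f}")
```

Under these assumptions, filling the entire context window once costs about $0.32 at launch pricing and about $0.64 at the standard rate.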
BenchLM.ai’s independent evaluation places M3 at #29 of 119 models on the provisional leaderboard, with an overall score of 76/100, and #12 of 28 on the verified leaderboard. Its category breakdowns give M3 an agentic score of 82.4/100 and a coding score of 87.4/100. BenchLM is an independent third-party evaluator, so these are not vendor-reported figures.
On coding benchmarks, MiniMax claims M3 scores approximately 59% on SWE-bench Pro, which the company says edges out GPT-5.5 at 58.6% and Gemini 3.1 Pro at 54.2%. Those figures originate from MiniMax’s own announcement, amplified through social posts, not from a replication study, and independent replication is pending. Don’t treat the head-to-head numbers as confirmed until Epoch AI or a comparable independent evaluator weighs in.
MiniMax describes M3 as the first open-weight model to combine native image, video, and computer-use capabilities with a 1M-token context window. That claim hasn’t been verified against prior releases.
[Figure: API pricing, input / output, per million tokens]
Why this matters for your stack
The pricing gap is the real story here. At $0.30/$1.20 per million tokens on launch pricing, M3 costs roughly 5-10% of comparable closed-model API rates, according to VentureBeat’s analysis. For teams running high-volume agentic workflows, where context window usage and token throughput compound quickly, that cost differential makes M3 worth evaluating even before the open weights drop. The 1M context window is already available via API, so you don’t have to wait.
The agentic result (BenchLM #13, score 82.4/100) comes from an independent evaluator, not the vendor. For teams evaluating models for tool use and multi-step reasoning, BenchLM’s agentic category is one of the more rigorous assessment frameworks available.
The catch is open-weight timing. Company leadership committed to releasing weights on Hugging Face within 10 days of June 1. That’s a commitment, not a completed action. If you’re making architecture decisions contingent on local or self-hosted deployment, wait for the actual release before finalizing.
What to watch
Three things remain unresolved: independent replication of the SWE-bench Pro figures, the actual open-weight release on Hugging Face (expected by June 11), and the technical report that would detail M3’s architecture, training data, and context handling at scale. Epoch AI evaluation is pending. The BenchLM ranking is solid, but leaderboard scores don’t capture how the model behaves at production scale: latency, throughput, and cost per completed task.
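Cost per completed task is something you can measure from your own logs. The sketch below shows one way to compute it, assuming you record token counts and a pass/fail outcome for each attempt. The `Attempt` record, the sample data, and the flat launch-rate billing are all assumptions for illustration, not measurements of M3.

```python
from dataclasses import dataclass
from typing import Optional

# Launch rates quoted in the article, in USD per million tokens.
INPUT_RATE = 0.30
OUTPUT_RATE = 1.20


@dataclass(frozen=True)
class Attempt:
    """One logged task attempt (hypothetical record shape)."""
    input_tokens: int
    output_tokens: int
    succeeded: bool


def cost_per_completed_task(attempts: list[Attempt]) -> Optional[float]:
    """Total spend across all attempts divided by the number that succeeded.

    Failed attempts still cost money, so they raise the per-success figure.
    Returns None when nothing succeeded, since the ratio is undefined.
    """
    total_usd = sum(
        a.input_tokens * INPUT_RATE + a.output_tokens * OUTPUT_RATE
        for a in attempts
    ) / 1_000_000
    completed = sum(1 for a in attempts if a.succeeded)
    if completed == 0:
        return None
    return total_usd / completed


# Made-up log: four attempts, one of which failed.
attempts = [
    Attempt(200_000, 5_000, True),
    Attempt(200_000, 5_000, False),
    Attempt(100_000, 10_000, True),
    Attempt(500_000, 10_000, True),
]
print(cost_per_completed_task(attempts))
```

With this made-up log, total spend is $0.336 across three successes, or roughly $0.112 per completed task. The same calculation run against a competing model’s logs gives a like-for-like comparison that a leaderboard score cannot.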
TJS synthesis
M3’s BenchLM rankings are real. The pricing is real. The SWE-bench Pro head-to-head isn’t independently confirmed, so price and context window are the evaluation anchors right now. Wait for independent benchmarks before migrating production agentic workloads. If you’re cost-sensitive and can tolerate self-reported benchmark uncertainty, the API is worth testing against your own evaluation suite this week; the weights aren’t a prerequisite for that.
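For teams without an existing harness, a minimal sketch of what "your own evaluation suite" can look like follows. The model is passed in as a plain callable so that any client can be plugged in; `stub_model` is a placeholder, not a real MiniMax or OpenRouter call, and the two test cases are purely illustrative.

```python
from typing import Callable, NamedTuple


class Case(NamedTuple):
    """One evaluation case: a prompt and a checker for the model's reply."""
    prompt: str
    check: Callable[[str], bool]


def run_suite(model: Callable[[str], str], cases: list[Case]) -> float:
    """Return the fraction of cases whose reply passes its checker."""
    if not cases:
        raise ValueError("the suite needs at least one case")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Stand-in for a real API call; replace with your own client code."""
    return "4" if "2 + 2" in prompt else ""


cases = [
    Case("What is 2 + 2? Answer with a number only.",
         lambda reply: reply.strip() == "4"),
    Case("Name the capital of France.",
         lambda reply: "paris" in reply.lower()),
]

pass_rate = run_suite(stub_model, cases)
print(f"pass rate: {pass_rate:.0%}")
```

Keeping the model behind a single callable means the same cases can be run against M3 and against whatever you use today, which makes the comparison about your workload instead of someone else’s benchmark.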