Open Source AI Models: VibeThinker-3B Claims Frontier Reasoning at 6.7GB VRAM

June 17, 2026 3 min read arXiv Qualified Moderate

Tech Jacks Solutions AI News Coverage

WeiboAI released VibeThinker-3B this week, a 3.1-billion-parameter open-source model that the authors claim matches or exceeds much larger frontier models on verifiable reasoning tasks. The weights are MIT-licensed and live on Hugging Face, runnable on a single consumer GPU.



Key Takeaways

VibeThinker-3B (3.1B parameters, MIT license) claims 94.3 on AIME26 and 80.2 on LiveCodeBench v6. All figures are self-reported in the authors' arXiv technical report and have not been independently verified.
The model runs at FP16 in 6.7GB of VRAM, which is consumer GPU territory. Weights are on Hugging Face and training code is on GitHub.
The post-training pipeline uses curriculum SFT, multi-domain RL, and offline self-distillation. The authors argue that verifiable reasoning compresses into small parameter counts differently than factual recall does.
Independent evaluation is pending. Hold production decisions until third-party benchmarks confirm or contest the self-reported scores.

Model Release

VibeThinker-3B

Organization: WeiboAI (Sina Weibo AI Division)

Type: Open Source LLM

Parameters: 3.1B

Benchmark: [SELF-REPORTED] AIME26: 94.3 (97.1 w/ test-time scaling); LiveCodeBench v6: 80.2; IFEval: 93.4

Availability: Hugging Face (WeiboAI/VibeThinker-3B) + GitHub (WeiboAI/VibeThinker), MIT license

Verification

Qualified. Vendor arXiv technical report (arXiv:2606.16140), author-affiliated. No independent benchmark evaluation available. Epoch AI or third-party reproduction pending.

The model fits in 6.7GB of VRAM at FP16. That’s a gaming-class GPU. And according to the authors’ technical report on arXiv, it scores 94.3 on AIME26, improving to 97.1 with test-time scaling. Until recently, only models with hundreds of billions of parameters claimed scores in that range.
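The 6.7GB figure can be sanity-checked with simple arithmetic. The sketch below assumes FP16 stores 2 bytes per parameter and that the figure is in decimal gigabytes (1 GB = 10^9 bytes). The article does not say how the 6.7GB was measured, so the "overhead" here is only what is left over after the weights, not a measured quantity.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 3.1e9           # parameter count quoted in the article
REPORTED_VRAM_GB = 6.7   # FP16 figure from the model card (unconfirmed)

# FP16 uses 2 bytes per parameter.
weights_gb = weight_memory_gb(PARAMS, bytes_per_param=2)

# Assumption: whatever remains covers KV cache, activations, and runtime buffers.
overhead_gb = REPORTED_VRAM_GB - weights_gb

print(f"FP16 weights:     {weights_gb:.1f} GB")
print(f"Implied overhead: {overhead_gb:.1f} GB")
```

The weights alone account for most of the reported footprint, so long generations or large batches could push actual usage above the headline number.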

Take those numbers carefully. They’re self-reported. VibeThinker-3B is a product of Sina Weibo’s AI division (WeiboAI), and the benchmarks come from the authors’ own technical report, not an independent evaluation lab. The paper is titled “VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models,” authored by Sen Xu, Shixi Liu, Wei Wang, and six colleagues, submitted to arXiv June 15, 2026.

The weights and training code are publicly available under the MIT license: Hugging Face for weights, GitHub for code. That openness matters because independent researchers can reproduce the benchmarks. Until they do, treat the numbers as a hypothesis worth testing, not a specification to build around.

What the training recipe actually does is interesting regardless of where the final scores land. The post-training pipeline combines three stages: curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. The authors call this the “Spectrum-to-Signal” paradigm. The core claim is that verifiable logical reasoning, the kind you can check against a correct answer, can be compressed into small parameter counts in ways that open-domain factual recall cannot. The paper describes this as the “Parametric Compression-Coverage Hypothesis.” That framing is the authors’ own, and the term could not be confirmed from the abstract text alone, but it names something practitioners have intuited for a while: not all capabilities scale the same way.
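To make "checkable against a correct answer" concrete, here is a minimal sketch of an exact-match grader for AIME-style problems, whose answers are integers from 0 to 999. The `ANSWER:` marker, the function names, and the sample completions are all hypothetical. This is not the authors' evaluation harness, which the article does not describe.

```python
import re
from typing import Optional, Sequence

# Hypothetical output convention: the model ends with "ANSWER: <integer>".
ANSWER_PATTERN = re.compile(r"ANSWER:\s*(\d{1,3})\b")


def extract_answer(completion: str) -> Optional[int]:
    """Return the last marked answer in the completion, or None if absent."""
    found = ANSWER_PATTERN.findall(completion)
    return int(found[-1]) if found else None


def is_correct(completion: str, reference: int) -> bool:
    """Exact match against the reference; a missing answer counts as wrong."""
    return extract_answer(completion) == reference


def accuracy(completions: Sequence[str], references: Sequence[int]) -> float:
    """Percentage of completions whose extracted answer matches the reference."""
    graded = [is_correct(c, r) for c, r in zip(completions, references)]
    return 100.0 * sum(graded) / len(graded)


# Made-up completions and references, for illustration only.
sample_completions = [
    "Summing the cases gives 204. ANSWER: 204",
    "So the result is ANSWER: 17",
    "I could not finish the computation.",
]
sample_references = [204, 71, 5]

print(f"accuracy: {accuracy(sample_completions, sample_references):.1f}%")
```

Because grading reduces to an equality check, any third party with the problem set can rerun it, which is what makes independent reproduction of scores like these practical.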

VibeThinker-3B Specs (Self-Reported)

Attribute	Value	Source
Parameters	3.1B	arXiv:2606.16140
Base model	Qwen2.5-Coder-3B	arXiv:2606.16140
VRAM (FP16)	6.7GB	Model card (unconfirmed)
Context window	32,000 tokens	arXiv:2606.16140
License	MIT	arXiv:2606.16140
AIME26	94.3 (97.1 w/ TTS)	Self-reported
LiveCodeBench v6	80.2	Self-reported
IFEval	93.4	Self-reported
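The table lists a higher AIME26 score "w/ TTS" (test-time scaling), but the article does not say which method the authors used. One common form is majority voting over several sampled answers, sometimes called self-consistency. The sketch below shows that idea only, as an assumption and not as the report's procedure. The function name and the sample votes are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[int]]) -> Optional[int]:
    """Pick the most frequent non-missing answer across sampled completions.

    Ties go to the answer that appeared first, because Counter keeps
    insertion order for equal counts.
    """
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None  # every sample failed to produce an answer
    return votes.most_common(1)[0][0]


# Made-up answers extracted from five samples of the same problem.
sampled_answers = [204, 17, 204, None, 204]
print(majority_vote(sampled_answers))
```

Voting of this kind multiplies inference cost by the number of samples, so a score obtained with it is not directly comparable to a single-pass score.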

The catch is the task specificity. VibeThinker-3B targets verifiable reasoning: math proofs, code correctness, structured logic. Those are domains where you can score outputs objectively. The model also reports 80.2 on LiveCodeBench v6 and 93.4 on IFEval, per the same technical report. General-domain tasks are a different story. If your production workload involves open-ended synthesis, retrieval over messy corpora, or multimodal inputs, a 3B specialist doesn’t replace a frontier generalist. Know what the benchmark measures before you plan a migration.

The broader pattern here matters. This model drops the same week that Epoch AI’s FrontierMath v2 audit revealed errors in a substantial proportion of benchmark problems. Two stories, one lesson: benchmark scores on verifiable tasks are only as good as the benchmark’s integrity and the evaluator’s independence. VibeThinker’s AIME26 number is worth watching precisely because AIME problems are checkable, but only if a third party runs the check.

The part nobody mentions in small-model announcements: inference cost at volume. With a 6.7GB VRAM footprint and an MIT license, the per-query cost on consumer or mid-tier cloud hardware should be substantially lower than frontier API pricing. For teams running high-throughput verifiable reasoning pipelines (automated code review, structured data extraction, logic verification), that cost gap is significant even before the benchmark question is settled.
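A rough way to frame that cost gap is to compare amortized GPU rental against per-token API pricing. The sketch below is only a calculator. Every number passed in (GPU hourly rate, throughput, token counts, API prices) is a placeholder assumption, not a measurement or a quoted price, and should be replaced with your own figures.

```python
def self_hosted_cost_per_query(gpu_hourly_usd: float, queries_per_hour: float) -> float:
    """Amortized GPU cost per query, ignoring engineering and idle time."""
    return gpu_hourly_usd / queries_per_hour


def api_cost_per_query(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Per-query cost under per-million-token API pricing."""
    total = input_tokens * usd_per_million_input + output_tokens * usd_per_million_output
    return total / 1_000_000


# PLACEHOLDER inputs: substitute real prices and measured throughput.
local = self_hosted_cost_per_query(gpu_hourly_usd=0.50, queries_per_hour=600)
remote = api_cost_per_query(
    input_tokens=1_000,
    output_tokens=4_000,
    usd_per_million_input=3.0,
    usd_per_million_output=15.0,
)

DAILY_QUERIES = 10_000  # the volume named in the article's open questions
print(f"self-hosted per day: ${local * DAILY_QUERIES:,.2f}")
print(f"API per day:         ${remote * DAILY_QUERIES:,.2f}")
```

The comparison is only as good as the throughput estimate. Reasoning models emit long outputs, so measured queries per hour on your own hardware matters more than any headline VRAM number.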

Unanswered Questions

Do the self-reported benchmark scores hold under independent third-party evaluation?
How does latency compare to frontier APIs at production query volumes?
Does the verifiable-reasoning advantage extend to domain-specific tasks (legal, scientific, financial)?
What's the cost-per-query differential vs. frontier model APIs at 10K+ daily queries?

Watch for independent reproductions on AIME26 and LiveCodeBench in the coming weeks. If the scores hold under third-party evaluation, VibeThinker-3B becomes a serious infrastructure decision for teams running reasoning-specific workloads on constrained budgets.

Don’t migrate production workloads on self-reported benchmarks. Wait for independent evaluation, and when FrontierMath v2’s updated leaderboard data becomes available, cross-check whether VibeThinker’s verifiable reasoning scores hold on that corrected benchmark set.
