What Is Meta Llama? The Open-Weight AI Powering the World
More than 3.27 billion users (according to Meta's Q1 2026 earnings) can access Meta Llama today through the apps already on their phones — WhatsApp, Instagram, Messenger, Facebook — without paying a subscription or downloading anything extra. That reach makes Llama arguably the most widely distributed AI model family on the planet. Yet unlike AI assistants locked behind closed APIs, Llama's weights are available for anyone to download, run, and modify.
Meta Llama is a family of open-weights foundation models developed by Meta AI. Since February 2023, it has grown from a research-focused language model into a multi-generation, multimodal ecosystem powering enterprise data pipelines, on-device mobile assistants, and consumer AI products at global scale. The April 2025 release of Llama 4 marked a fundamental architectural shift — away from dense transformers and toward Mixture-of-Experts (MoE) with native multimodal processing.
What Is Meta Llama?
Meta Llama is Meta AI's family of open-weights foundation models. They power Meta AI — the assistant built into WhatsApp, Instagram, Messenger, and Facebook — and are freely available for developers to download, fine-tune, and deploy on their own infrastructure.
The term "open-weights" is deliberate: Llama is not open-source under the Open Source Initiative (OSI) definition. Meta releases the model weights but does not publish the training data or the full training methodology, so "open-weights" is the accurate framing.
What open-weights means in practice: you can download the model files, run them on your own hardware, modify the architecture, fine-tune on your own data, and deploy without paying per-call API fees to Meta. That economics model — no metered billing for self-hosted inference — is a core reason enterprises in regulated industries have adopted Llama heavily.
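As a concrete illustration, the sketch below loads a small Llama model with Hugging Face Transformers and runs inference entirely on local hardware. It is a minimal sketch under assumptions: you have accepted Meta's license for the gated repository on Hugging Face, and the transformers and accelerate packages are installed. The model ID is the 1B instruct model from the 3.2 release.

```python
# Minimal self-hosted inference sketch with Hugging Face Transformers.
# Assumes access to the gated meta-llama repo has been granted.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Chat-style prompt using the model's built-in chat template.
messages = [{"role": "user", "content": "Summarize open-weights models in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generation runs entirely on local hardware: no API calls, no per-token billing.
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```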
The Llama family covers a wide parameter range: from 1B-parameter models designed for mobile devices to the 400B-parameter Maverick built for server-grade workloads. The fourth generation introduces Mixture-of-Experts architecture, early-fusion native multimodality, and context windows up to 10 million tokens.
The Llama 4 Revolution: MoE and Early-Fusion Multimodality
Llama 4 represents the most significant architectural departure in the family's history. Every prior Llama generation used a dense transformer, in which every parameter activates for every token. Llama 4 switches to Mixture-of-Experts (MoE), where only a subset of the model's parameters activates per token. This makes dramatically larger total parameter counts feasible while keeping per-token inference cost low, because only a fraction of the parameters is computed in each forward pass.
What MoE Means for Llama 4
Llama 4 Scout has 109 billion total parameters but activates only 17 billion per token across 16 experts. Llama 4 Maverick has 400 billion total parameters and also activates 17 billion per token, routing across 128 experts. The result: Maverick posts benchmark results competitive with GPT-4o and Claude 3.5 Sonnet while computing only the equivalent of a 17B dense model per forward pass.
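To make routing concrete, here is a toy top-1 MoE layer in PyTorch. This is an illustrative sketch under simplified assumptions, not Llama 4's actual implementation: production MoE layers add load-balancing losses, capacity limits, and shared experts.

```python
# Toy Mixture-of-Experts layer illustrating per-token routing.
# Illustrative sketch only, NOT Llama 4's implementation.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Pick the single best expert per token (top-1).
        scores = self.router(x)           # (tokens, num_experts)
        best = scores.argmax(dim=-1)      # (tokens,)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():
                out[mask] = expert(x[mask])  # only the routed tokens hit this expert
        return out

# 16 experts exist, but each token pays for only one expert's FLOPs:
# the same principle that lets Maverick hold 400B total parameters
# while activating ~17B per token.
layer = ToyMoE(d_model=64, d_ff=256, num_experts=16)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```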
Early-Fusion Native Multimodality
Llama 3 introduced multimodal capability in the 3.2 generation via separate vision-enabled models. Llama 4 takes a different approach: early-fusion architecture, where text, image, and video inputs are processed jointly in the same model from the first layer forward. There is no separate vision encoder bolted onto a language backbone. The model is natively multimodal, enabling richer cross-modal reasoning compared to architectures where visual tokens are projected into a language model's embedding space as a late addition.
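A toy sketch of the difference, with stand-in tensors (none of this is Llama 4 code): in early fusion, image patch embeddings and text token embeddings join a single sequence before the first transformer layer.

```python
# Toy illustration of early fusion (stand-in tensors, not Llama 4 code).
import torch

d_model = 64
text_tokens = torch.randn(12, d_model)    # embedded text tokens
image_patches = torch.randn(48, d_model)  # embedded image patches

# Early fusion: one mixed sequence enters the FIRST transformer layer,
# so every layer can attend across modalities from the start.
fused_sequence = torch.cat([image_patches, text_tokens], dim=0)  # (60, d_model)

# Late fusion, by contrast, runs a separate vision encoder to completion and
# only then projects its outputs into the language model's embedding space.
print(fused_sequence.shape)
```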
Model Lineup: From Llama 3 to Llama 4
Llama 3 Series
Llama 3 (April 2024) was a dense transformer family. The 3.1 generation (July 2024) expanded to 8B, 70B, and 405B parameter sizes, pre-trained on 15 trillion tokens with a 128K token context window. The 405B variant was, at launch, one of the largest publicly available open-weights models.
Llama 3.2 (September 2024) added two directions: lightweight edge models at 1B and 3B parameters for mobile devices, and multimodal models at 11B and 90B with vision capability. The 1B and 3B models run on iPhone and Android in offline mode — no network calls required.
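For a sense of what offline deployment looks like in practice, here is a minimal sketch using llama-cpp-python with a community-quantized GGUF file. The file path is a placeholder: it assumes you have already downloaded a quantized build of the 1B instruct model to disk, after which no network access is needed.

```python
# Fully offline inference with a quantized Llama 3.2 1B model via llama-cpp-python.
# Sketch only: the GGUF path below is a placeholder for a community-quantized
# file you have already downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.2-1b-instruct-q4_k_m.gguf",  # placeholder path
    n_ctx=4096,  # context length to allocate
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three uses for an offline assistant."}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```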
Llama 4 Scout
- 109B total parameters, 17B active per token (16 experts)
- 10 million token context window — largest among open-weight models at launch
- Best suited for long-document analysis, multi-document reasoning, and extended agentic tasks
- Outperforms Gemini 2.0 Flash and comparable Mistral models on several benchmarks, at a fraction of the inference cost when self-hosted
Llama 4 Maverick
- 400B total parameters, 17B active per token (128 experts)
- 1 million token context window
- General-purpose workhorse — designed for broad reasoning, coding, complex instruction following
- Competitive with GPT-4o and Claude 3.5 Sonnet on standard benchmarks
Llama 4 Behemoth (In Training — Not Yet Available)
Behemoth has approximately 2 trillion total parameters with 288 billion active per token. As of April 2026, it had not been released and was still in training. Its primary role in the Llama ecosystem is as a teacher model, generating training signals that distill capability into Scout and Maverick. Do not treat Behemoth as currently available.
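For readers unfamiliar with teacher-model distillation, the sketch below shows the generic objective: the student is trained to match the teacher's softened output distribution. This is a textbook distillation loss, not Meta's actual training recipe.

```python
# Generic knowledge-distillation step: a student matches a teacher's soft outputs.
# Textbook formulation, NOT Meta's training recipe.
import torch
import torch.nn.functional as F

teacher_logits = torch.randn(4, 32000)                       # stand-in teacher outputs
student_logits = torch.randn(4, 32000, requires_grad=True)   # stand-in student outputs

T = 2.0  # temperature softens both distributions
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T  # standard T^2 scaling keeps gradient magnitude comparable

loss.backward()  # gradients flow to the student only
```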
[Charts: Scout vs. Gemini 2.0 Flash relative performance; Maverick vs. GPT-4o general reasoning; context window comparison among open-weight models.]
Benchmark data from Meta AI Blog (Apr 2025). Relative bars are illustrative; consult source benchmarks for precise scores. The context window comparison covers open-weight models only. Benchmark comparators (GPT-4o, Claude 3.5 Sonnet) reflect models available at Llama 4's April 2025 release — check LMSYS Chatbot Arena and Papers With Code for current standings.
Licensing: Community vs. Commercial
One of the most misunderstood aspects of Meta Llama is its licensing. The model is not free for all commercial use without restriction — and the distinction matters before you deploy at scale.
Llama 4 Community License
The Llama 4 Community License allows free commercial use for deployments up to 700 million monthly active users. For the overwhelming majority of companies — including large enterprises — this threshold is never reached. A team serving 10 million users, a hospital running an internal summarization tool, or a developer building B2B SaaS are all well inside the 700M MAU ceiling.
Companies exceeding 700M MAU — realistically, only the largest consumer technology platforms on earth — must negotiate a separate commercial license with Meta. The Community License does NOT mean "completely free for all commercial use." The 700M MAU threshold is a real limit. Read the license before scaling a consumer product that might approach it.
The Open-Weights Distinction
Both the Community and commercial licenses cover model weights — what you download and run. Meta does not release training data or detailed training methodology under either license. This is why "open-weights" is more accurate than "open-source." If your organization requires full auditability of training data (certain regulated sectors impose this requirement), Llama's open-weight license does not provide it.
Deployment Options: Where Can You Run Meta Llama?
Meta does not operate a first-party API for external developers. To call Llama via an API, you use a third-party provider. This is a deliberate design: Meta distributes the model, and the infrastructure ecosystem builds around it. You have four main deployment paths:
- Self-hosting on your own GPU infrastructure, with full control and zero per-call costs
- Managed cloud services such as AWS Bedrock
- Specialized inference API providers such as Groq
- On-device deployment of the lightweight 1B and 3B models
Additional third-party API providers include Together AI and Fireworks AI, both offering developer-friendly access with competitive pricing. For teams choosing cloud inference over self-hosting, comparing latency and per-token costs across these providers is worthwhile before committing to a stack.
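Many of these providers expose OpenAI-compatible endpoints, which keeps switching costs low. Here is a minimal sketch; the base URL, model identifier, and environment variable are placeholders, so consult your provider's documentation for the real values.

```python
# Calling a Llama model through a third-party, OpenAI-compatible endpoint.
# Sketch only: base_url, model name, and env var are placeholders; each
# provider (Groq, Together AI, Fireworks AI, etc.) publishes its own values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # placeholder endpoint
    api_key=os.environ["PROVIDER_API_KEY"],          # placeholder env var
)

response = client.chat.completions.create(
    model="llama-4-scout",  # placeholder model identifier
    messages=[{"role": "user", "content": "Give one use case for a 10M-token context window."}],
)
print(response.choices[0].message.content)
```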
Why Meta Llama Matters for the AI Ecosystem
The open-weights distribution model has produced structural advantages that closed-API alternatives cannot replicate. Three stand out for enterprise decision-makers.
Cost Structure at Scale
Self-hosted Llama eliminates per-call inference costs. For high-volume workloads — customer service automation, document processing, content generation at scale — the economics of self-hosting can be dramatically cheaper than closed-API alternatives at equivalent quality levels. GPU infrastructure has its own cost, but for mature engineering teams running millions of daily requests, the math frequently favors self-hosting.
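A back-of-envelope sketch of that math, with every number an explicit assumption rather than a quoted price:

```python
# Back-of-envelope comparison of metered API spend vs. self-hosted GPU cost.
# Every number below is an ASSUMPTION for illustration -- substitute real
# API pricing and fully loaded GPU costs for your own workload.
requests_per_day = 2_000_000
tokens_per_request = 1_000               # prompt + completion, combined
api_price_per_million_tokens = 5.00      # assumed blended $/1M tokens

monthly_tokens = requests_per_day * tokens_per_request * 30
api_monthly_cost = monthly_tokens / 1_000_000 * api_price_per_million_tokens

gpu_count = 8                            # assumed cluster size for self-hosting
gpu_hourly_cost = 4.00                   # assumed $/GPU-hour, fully loaded
selfhost_monthly_cost = gpu_count * gpu_hourly_cost * 24 * 30

print(f"API:       ${api_monthly_cost:,.0f}/month")       # $300,000/month
print(f"Self-host: ${selfhost_monthly_cost:,.0f}/month")  # $23,040/month
```

Under these assumed numbers the gap is more than an order of magnitude, which is why the break-even analysis is worth running before committing to either path.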
Data Privacy and Sovereignty
When Llama runs on your hardware, your data never leaves your infrastructure. Healthcare organizations processing patient records, legal firms handling privileged communications, financial institutions managing proprietary trading data — these are the sectors where on-premises Llama deployment has grown fastest. No third-party data processing agreement required. No risk of prompts or responses being used to train a vendor's future models.
Fine-Tuning and Customization
Because the weights are downloadable, organizations can fine-tune Llama on proprietary data. A hospital can fine-tune on clinical notes. A law firm can fine-tune on case law. An e-commerce company can fine-tune on product catalogs. The resulting model reflects institutional knowledge that no off-the-shelf API can match. Thousands of fine-tuned Llama variants are available on Hugging Face, and the community adapts to new model releases within days.
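A minimal sketch of what that customization looks like using LoRA via the peft library. It assumes access to the gated weights, and the hyperparameters are illustrative rather than recommendations.

```python
# Parameter-efficient fine-tuning (LoRA) on downloadable Llama weights.
# Sketch under assumptions: hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA trains small adapter matrices instead of all base weights,
# which is why fine-tuning is feasible without a large GPU cluster.
config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
# From here, train with your usual Trainer/TRL loop on proprietary data.
```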
Typical adopters and their recommended models:
- Maverick / Scout — Teams running high-volume inference who need zero per-call costs and full data control. Often deploying Maverick on internal GPU clusters for internal tooling or customer-facing products.
- All sizes — Academics and industry researchers studying model behavior, alignment, or architecture. Open weights enable experiments not possible with closed models: ablations, probing, architecture modifications.
- Llama 3.2 / Scout — Application builders who need domain-specific performance: customer support bots, code assistants, document extractors. Fine-tune on proprietary data for specialized capability.
- On-premises — Regulated industries where data sovereignty is non-negotiable. On-premises Llama keeps sensitive data inside the organization's security perimeter, with no external data processing agreements required.

Meta AI Integration: Llama at Consumer Scale
Beyond developer use, Meta Llama is the engine behind Meta AI — the AI assistant integrated across Meta's consumer platforms. The integration scope is broader than most AI assistants:
- WhatsApp — conversational AI directly in personal and group chats
- Instagram — visual search, caption assistance, DM support
- Messenger — group chat assistance and message summarization
- Facebook — feed-level assistance and search enhancement
- Ray-Ban Meta smart glasses — voice-activated AI with camera vision
- Meta Quest VR — spatial AI assistance in virtual environments
According to Meta's Q1 2026 earnings, Meta AI has surpassed 600 million weekly active users (a vendor-reported figure). Through Meta's platform reach, it is accessible to 3.27 billion people globally, giving it the largest potential addressable user base of any AI assistant currently available.
The multimodal capability in Meta AI — generating images, analyzing photos, processing voice — runs on Llama 4's early-fusion architecture, which handles these modalities natively rather than through separate model pipelines stitched together after the fact.
Llama vs. Closed-Weight Models: Key Trade-offs
Choosing between Meta Llama and closed-weight models like GPT-4o or Claude 3.5 Sonnet is not purely a capability question. The decision turns on operational requirements, team size, and how much infrastructure ownership you can absorb.
Where Llama wins outright:
- High-volume inference where per-call API costs accumulate to significant monthly spend
- Regulated environments requiring data to never leave the organization's infrastructure
- Teams with engineering capacity to manage GPU clusters and model serving
- Use cases requiring domain-specific fine-tuning on proprietary data
Where closed APIs have an edge:
- Teams without dedicated MLOps capability — self-hosting Maverick at 400B parameters is not a one-person project
- Rapid prototyping where infrastructure investment is not yet justified
- Use cases requiring absolute frontier capability, where top closed models remain the benchmark
The practical middle path: Llama via a third-party API provider (Groq, Together AI, AWS Bedrock) delivers much of Llama's cost advantage over OpenAI or Anthropic APIs, without the infrastructure burden of self-hosting. Full self-hosting becomes the right call once inference volume justifies it.
The open-weights ecosystem advantages also compound over time. Each Llama release generates community improvements — quantized versions, optimized serving configurations, fine-tuned variants — within days of the weights dropping. Closed models do not produce this community momentum because the weights are inaccessible.