
How to Run Qwen Locally: Complete Setup Guide (2026)

Running Qwen locally means zero API costs, full data privacy, and, on the right hardware, inference speeds that can beat hosted endpoints. This guide covers three deployment paths (Ollama for instant setup, llama.cpp for hardware-level control, and vLLM for production throughput), plus how to expose a local OpenAI-compatible API, connect your IDE, and integrate with Claude Code. Every command in this guide was checked against official Qwen documentation and community testing as of May 2026. If you are new to the platform, our overview of what Qwen is covers the model background first. To check whether local really beats a hosted API for your volume, our open-source vs frontier TCO calculator compares the all-in costs.

Prerequisites: Pick Your Path to Run Qwen Locally

Hardware Quick Reference (4-bit Q4_K_M quantization)

5-6GB: VRAM for Qwen3-8B (RTX 3060 / M1 16GB)
20GB: VRAM for Qwen3-32B (RTX 4090)
24GB: VRAM for Qwen3-30B-A3B MoE (RTX 4090)
25 t/s: tokens/second on RTX 4090 (30B-A3B MoE)

Before choosing a deployment method, match your hardware to a model size. The table below uses 4-bit quantization (Q4_K_M) as the baseline: the standard tradeoff that preserves most quality while fitting larger models into available VRAM. FP16 (full precision) requires roughly three to four times the VRAM shown, and 8-bit roughly doubles it. The MoE (Mixture-of-Experts) models in the table, Qwen3-30B-A3B and Qwen3.5-397B-A17B, run unusually fast for their stated parameter count: Qwen3-30B-A3B has 30 billion total parameters but activates only ~3 billion per token, so it generates quickly on a single RTX 4090. All of the weights still have to sit in memory, which is why its VRAM requirement is close to that of the dense 32B model. For a breakdown of every size and release, see our guide to the Qwen3 model family.

Model	VRAM (Q4)	Recommended Hardware	Use Case
Qwen3-0.6B	~1GB	Any 4GB GPU / Jetson Nano	Edge, classification, glue tasks
Qwen3-8B	~5-6GB	RTX 3060 12GB / M1 16GB	Coding, chat, agents
Qwen3-14B	~10GB	RTX 4070 12GB / Mac 16GB	Strong reasoning, multilingual
Qwen3-32B	~20GB	RTX 4090 (24GB)	Best single-GPU quality
Qwen3-30B-A3B MoE	19-24GB	RTX 3090/4090	25 tok/s coding — top local choice
Qwen3.5-397B-A17B	~214GB (4-bit)	M3 Ultra 256GB / H200 + 240GB RAM	Frontier-tier, server-class
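
The VRAM column can be roughly reproduced with simple arithmetic: parameter count times bits per weight. The sketch below assumes about 4.85 bits per weight for Q4_K_M, which is an approximation rather than a figure from the Qwen or llama.cpp documentation, and it ignores KV cache and runtime overhead. The helper name estimate_weight_gb is hypothetical.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of the model weights in decimal gigabytes.

    bits_per_weight=4.85 is an ASSUMED average for Q4_K_M; pass 16 for FP16.
    KV cache and runtime buffers are not included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for name, size_b in [("Qwen3-8B", 8), ("Qwen3-32B", 32), ("Qwen3-30B-A3B", 30)]:
        print(f"{name}: ~{estimate_weight_gb(size_b):.1f} GB of weights at Q4")
```

For an MoE model, pass the total parameter count (30 for Qwen3-30B-A3B), not the active count, since every expert has to be resident in memory.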

Prerequisites Checklist

Identify target model size and confirm VRAM availability (use table above)

For GPU acceleration: NVIDIA GPU with up-to-date CUDA drivers (12.x+), or Apple Silicon Mac

Terminal access and basic command-line familiarity

Python 3.9+ (for llama.cpp downloads and vLLM) — or skip to Ollama if you want zero setup

For 397B: at least 192GB system RAM (3-bit) or 256GB (4-bit) unified/combined memory

Choosing your method: Ollama is the right starting point for most developers. It installs with a single command, pulls models with one command, and exposes an OpenAI-compatible API automatically. Use llama.cpp if you need granular control over GPU layer offloading or need to run models that Ollama doesn't yet support. Use vLLM when you are serving a team or running batch workloads that need maximum throughput.

Method 1: Run Qwen Locally with Ollama (Recommended)

Ollama Setup — 5 Steps

1. Install Ollama: one command installs the Ollama service, CLI, and GPU detection automatically.
2. Pull a Qwen model: select the model tag that fits your VRAM. Pin the exact tag for reproducibility.
3. Run interactively: start the interactive session and optionally toggle thinking mode.
4. Create a custom Modelfile: bake in system prompts, context length, and temperature for repeatable agent sessions.
5. Use the local API: point any OpenAI SDK client at http://localhost:11434/v1/ to use Qwen programmatically.

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows (PowerShell)
irm https://ollama.com/install.ps1 | iex

Confirm the install worked: ollama --version. If NVIDIA GPU drivers are installed, Ollama auto-detects and uses the GPU. No CUDA configuration required.

Step 2: Pull a Model

Choose the model tag that fits your VRAM. On a 24GB GPU, qwen3:32b or qwen3:30b-a3b (MoE, fastest) are the best starting points; on smaller cards, start with qwen3:8b:

# RTX 3060 / M1 16GB
ollama pull qwen3:8b

# RTX 4090 — best quality single GPU
ollama pull qwen3:32b

# RTX 3090/4090 — fastest inference (MoE, 25 tok/s)
ollama pull qwen3:30b-a3b

# Explicit quantization tag (recommended for reproducibility)
ollama pull qwen3:32b-q4_K_M

Pin your tags. ollama pull qwen3:8b resolves to the library's current latest alias — that alias moves when new versions publish. For tooling and scripts, always pin the explicit tag (qwen3:8b-q4_K_M) so your environment stays reproducible.

Step 3: Run Interactively and Toggle Thinking Mode

Qwen models have a dual-mode reasoning engine: thinking mode runs full chain-of-thought before answering; non-thinking mode returns direct answers with lower latency. Toggle it during a session using slash commands:

ollama run qwen3:32b

# Inside the session:
/set think     # Enable chain-of-thought reasoning
/set nothink   # Disable — direct, low-latency answers

You can also pass thinking mode tokens inline in your prompt: prepend /think or /no_think as the first token in your message. Use thinking for multi-step coding, math, and logic; use non-thinking for quick lookups, translations, and high-throughput pipelines.
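
For scripted use, the inline switch can be applied to a chat message list before it is sent. This is a minimal sketch that follows the placement described above (tag first in the user message); with_thinking is a hypothetical helper name, not part of any SDK.

```python
from typing import Dict, List

Message = Dict[str, str]


def with_thinking(messages: List[Message], enabled: bool) -> List[Message]:
    """Return a copy of messages with /think or /no_think prepended to the last user turn."""
    tag = "/think" if enabled else "/no_think"
    out = [dict(m) for m in messages]  # copy each message so the caller's list is untouched
    for msg in reversed(out):
        if msg.get("role") == "user":
            msg["content"] = f"{tag} {msg['content']}"
            break
    return out


if __name__ == "__main__":
    chat = [{"role": "user", "content": "Translate 'good morning' to French."}]
    print(with_thinking(chat, enabled=False))
```

The returned list can be passed as the messages argument of any chat request in this guide.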

Step 4: Create a Custom Modelfile

A Modelfile bakes in your preferred system prompt, context window, and sampling parameters so you don't reconfigure every session. This is especially useful for coding agents that need consistent behavior:

FROM qwen3:32b
PARAMETER num_ctx 32768
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER repeat_penalty 1.0
SYSTEM """You are an expert senior software engineer. Provide concise, efficient code solutions. Avoid unnecessary explanations unless asked. Always use modern syntax and best practices."""

ollama create qwen-coder -f Modelfile
ollama run qwen-coder
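
If you maintain several variants (different context sizes or system prompts), generating the Modelfile text from code keeps them consistent. The sketch below only emits the FROM, PARAMETER, and SYSTEM lines shown above; render_modelfile is a hypothetical helper, and it assumes the system prompt contains no triple quotes.

```python
from typing import Mapping


def render_modelfile(base: str, params: Mapping[str, object], system: str = "") -> str:
    """Build Modelfile text from a base tag, PARAMETER values, and an optional SYSTEM prompt."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    if system:
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "qwen3:32b",
        {"num_ctx": 32768, "temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.0},
        system="You are an expert senior software engineer.",
    )
    print(text)  # save as a file named Modelfile, then: ollama create qwen-coder -f Modelfile
```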

Important: set repeat_penalty 1.0 for Qwen. Ollama's default repeat penalty of 1.1 degrades code generation quality across the Qwen family (Qwen-Next, Qwen3, and Qwen3.5 variants alike). Setting it to 1.0 (disabled) restores expected output quality.

Context window note: Qwen3.5 models advertise 256K context, but Ollama's runtime default depends on your available VRAM: under 24GB = 4K default; 24-48GB = 32K default; 48GB+ = 256K. The model maximum and the Ollama runtime default are different values. Always set num_ctx explicitly in your Modelfile or API call.

Method 2: Run Qwen Locally with llama.cpp

llama.cpp gives you direct control over CUDA layer offloading, context window allocation, and GGUF quantization selection. GGUF (GPT-Generated Unified Format) is the file format for quantized models — it packages weights and metadata in a single portable file that llama.cpp loads directly, without requiring a Python environment or model hub download manager. Use llama.cpp when you need to tune hardware utilization beyond what Ollama exposes, or when running very large models via CPU+GPU hybrid offloading.

1. Build llama.cpp with CUDA Support

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # Use -DGGML_CUDA=OFF for Mac/CPU-only
cmake --build build --config Release

2. Download GGUF Weights

Use the HuggingFace CLI with hf_transfer for significantly faster downloads on large files:

pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

# Download Qwen3-32B Q4_K_M GGUF
huggingface-cli download unsloth/Qwen3-32B-GGUF \
  Qwen3-32B-Q4_K_M.gguf --local-dir .

# For 397B (4-bit dynamic quant, ~214GB):
huggingface-cli download unsloth/Qwen3.5-397B-A17B-GGUF \
  --include "UD-Q4_K_XL*" --local-dir ./models/Qwen3.5

3. Start the Inference Server

Qwen3 and Qwen3.5 use a Jinja template argument to control thinking mode. Pass it via --chat-template-kwargs:

# Linux / macOS (bash), thinking enabled
./build/bin/llama-server \
  -m Qwen3-32B-Q4_K_M.gguf \
  --port 8080 \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --chat-template-kwargs '{"enable_thinking":true}'

# Windows (PowerShell), thinking disabled
.\build\bin\llama-server.exe `
  -m Qwen3-32B-Q4_K_M.gguf `
  --port 8080 --ctx-size 32768 --n-gpu-layers 99 `
  --chat-template-kwargs "{\"enable_thinking\":false}"

Qwen3.5 small model note (0.8B/2B/4B/9B): Thinking is disabled by default on these sizes. You must explicitly pass enable_thinking: true to activate it. For Qwen3.6 models, thinking is enabled by default.

For models that exceed your GPU VRAM, reduce --n-gpu-layers below 99. Layers kept in system RAM run slower than layers on the GPU, but the model loads without an out-of-memory error. Experiment: start at 60 layers on an RTX 4090 for a 32B model, then increase until you hit your VRAM ceiling. For multi-GPU setups, add --split-mode row to split each layer's tensors by rows across the available cards; the default, --split-mode layer, assigns whole layers to each card.
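
A first guess for --n-gpu-layers can be computed instead of found purely by trial and error. The sketch below assumes every layer is the same size and reserves a fixed amount of VRAM for the KV cache and CUDA buffers; both are simplifications, and gpu_layers_that_fit is a hypothetical helper. Treat the result as a starting point for the experiment described above.

```python
import math


def gpu_layers_that_fit(model_gb: float, n_layers: int, free_vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Rough starting value for --n-gpu-layers.

    ASSUMES equal-sized layers (model_gb / n_layers) and keeps reserve_gb
    free for the KV cache and runtime buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Illustrative numbers: a 20 GB GGUF with an assumed 64 layers on a 16 GB card.
    print(gpu_layers_that_fit(model_gb=20.0, n_layers=64, free_vram_gb=16.0))
```

Raise reserve_gb when you request a large --ctx-size, since the KV cache grows with context length.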


Method 3: Run Qwen Locally with vLLM (Production)

vLLM is the right choice for team-serving scenarios: multiple concurrent users, batch inference, or automated pipelines that require maximum throughput. It natively supports Qwen's hybrid Gated DeltaNet attention architecture and Multi-Token Prediction (MTP) speculative decoding, which increases generation speed beyond standard autoregressive inference.

pip install "vllm>=0.19.0" # Required for Qwen3.6 support

Standard Multi-GPU Serving

vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 4 \
  --max-model-len 262144

The API endpoint is OpenAI-compatible at http://localhost:8000/v1.

Production Configuration Flags

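# --enable-auto-tool-choice: enable tool/function calling
# --tool-call-parser qwen3_coder: Qwen-specific tool parser
# --num-scheduler-steps 4: MTP speculative decoding
# --limit-mm-per-prompt image=0,video=0: text-only, skip the vision encoder
# (Comments sit above the command because a comment after a trailing backslash breaks line continuation.)
vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --num-scheduler-steps 4 \
  --limit-mm-per-prompt image=0,video=0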

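The --num-scheduler-steps 4 flag is listed here as the switch for Multi-Token Prediction (MTP), which produces several tokens per decoding step rather than one and so raises throughput. vLLM has also used --num-scheduler-steps for multi-step scheduling and --speculative-config for speculative decoding, so confirm the flag's meaning with vllm serve --help on your installed version. The --limit-mm-per-prompt image=0,video=0 flag skips loading the vision encoder entirely, freeing significant VRAM for a larger text KV cache when vision capabilities are not needed.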

vLLM's internal memory manager automatically tunes logical block sizes so the linear attention (DeltaNet) layers and full attention layers share identical physical GPU memory footprints, avoiding fragmentation under heavy concurrent load.

Step 5: Use the OpenAI-Compatible API

Once Ollama (or llama.cpp/vLLM) is running, every OpenAI SDK or HTTP client you already own can point at your local server without modification. If you would rather call Alibaba's hosted endpoint instead of a local server, our Qwen API guide covers the cloud setup. Ollama exposes a drop-in REST endpoint at http://localhost:11434/v1/ that accepts the same request schema as the OpenAI Chat Completions API.

No adapter code is needed to point an existing OpenAI SDK integration at local Qwen via Ollama: change base_url and api_key, nothing else. (Source: Ollama OpenAI-compatible API documentation, 2026.)

Python — OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",   # required by SDK, ignored by local endpoint
)

response = client.chat.completions.create(
    model="qwen3:32b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain MoE architecture in one paragraph."}
    ],
    temperature=0.7,
    top_p=0.8,
    extra_body={"options": {"repeat_penalty": 1.0}},  # CRITICAL for Qwen
)

print(response.choices[0].message.content)

The api_key="ollama" value is a placeholder required by the SDK's validation logic — the local Ollama endpoint does not authenticate or validate it. Any non-empty string works.

Node.js — OpenAI SDK

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1/",
  apiKey: "ollama",
});

const completion = await client.chat.completions.create({
  model: "qwen3:8b",
  messages: [{ role: "user", content: "Write a Python quicksort." }],
  temperature: 0.6,
  top_p: 0.95,
});

console.log(completion.choices[0].message.content);

Raw HTTP — cURL

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7
  }'

Context Window Caveats

Ollama defaults ≠ model maximum

Ollama's runtime context depends on available VRAM, not the model's stated maximum: <24 GB VRAM → 4K tokens; 24–48 GB → 32K; 48 GB+ → 256K. Set num_ctx explicitly in a Modelfile or via the API options object to override.

Override context per-request via API

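To set context length per API call without a Modelfile rebuild, add "options": {"num_ctx": 32768} to the body of a native Ollama API request (/api/chat or /api/generate). This bypasses the VRAM-tiered runtime default for that request only. The OpenAI Chat Completions schema has no context-size field, so for clients that talk to /v1/ set num_ctx in a Modelfile instead.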
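
A per-request override can be sketched with only the standard library. The example assumes Ollama's native chat endpoint at /api/chat on the default port and a non-streaming response; build_chat_request and chat are hypothetical helper names. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # native endpoint, not /v1


def build_chat_request(model: str, prompt: str, num_ctx: int = 32768) -> urllib.request.Request:
    """Build (but do not send) a native Ollama chat request with explicit options."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "options": {"num_ctx": num_ctx, "repeat_penalty": 1.0},
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, num_ctx: int = 32768) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["message"]["content"]


# Usage, with the Ollama server running:
#   print(chat("qwen3:8b", "Hello", num_ctx=8192))
```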

Both /v1/chat/completions and /v1/responses are supported. The /v1/responses endpoint is non-stateful — you must include the full conversation history in each request. For llama.cpp server use http://localhost:8080/v1/; for vLLM use http://localhost:8000/v1/. Practically, this means you can swap any of the three backends mid-project by changing one variable — useful when testing whether a larger model improves output quality before committing to the hardware cost.
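
The "one variable" swap can be made explicit with a small lookup table. This sketch uses the three base URLs listed above and assumes none of the servers was started with its own API-key requirement; client_kwargs is a hypothetical helper name.

```python
BACKENDS = {
    "ollama": "http://localhost:11434/v1/",
    "llamacpp": "http://localhost:8080/v1/",
    "vllm": "http://localhost:8000/v1/",
}


def client_kwargs(backend: str) -> dict:
    """Return keyword arguments for openai.OpenAI(...) for the chosen local backend."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {sorted(BACKENDS)}")
    # The key is a placeholder; "ollama" mirrors the value used elsewhere in this guide.
    return {"base_url": BACKENDS[backend], "api_key": "ollama"}


# Usage (requires `pip install openai` and a running server):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs("vllm"))
```

The model argument still has to match the name each server registered, for example an Ollama tag such as qwen3:32b versus the Hugging Face name passed to vllm serve.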

Step 6: IDE and Tool Integrations

Once Ollama or llama.cpp is running on localhost, point these tools at it. No vendor-specific adapters, no account setup, no rate limits. The three integrations below cover the workflows that matter most for developers running Qwen locally: terminal agents, IDE code assistants, and Alibaba's own CLI. You can browse the rest of our Qwen coverage on the Qwen tools hub.

Claude Code (Terminal Agent)

Qwen3.7-Max natively supports the Anthropic API protocol, making it a drop-in for Claude Code's CLI. Set two environment variables and pass the model with --model. No plugin, adapter, or config file is required:

# Hosted: Alibaba Cloud endpoint
export ANTHROPIC_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export ANTHROPIC_API_KEY="your-alibaba-cloud-key"

claude --model qwen-max   # --model selects the model

# Local: Ollama endpoint
export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
export ANTHROPIC_API_KEY="ollama"

claude --model qwen3:32b   # --model selects the model

Claude Code's full feature set (tool use, file editing, multi-file context, subagent orchestration) operates through these two variables plus the --model flag. No adapter layer is involved. Anthropic's multi-agent orchestration logic is tuned for Claude models, so reasoning quality on complex agentic workflows may differ from the native Claude experience.

Continue Extension (VS Code / JetBrains)

Continue is an open-source AI coding assistant that supports Ollama natively. Add Qwen to your ~/.continue/config.json:

{
  "models": [
    {
      "title": "Qwen3-32B (Local)",
      "provider": "ollama",
      "model": "qwen3:32b",
      "apiBase": "http://localhost:11434",
      "contextLength": 32768
    },
    {
      "title": "Qwen3-8B (Local — Fast)",
      "provider": "ollama",
      "model": "qwen3:8b",
      "apiBase": "http://localhost:11434",
      "contextLength": 4096
    }
  ]
}

After saving the config, reload VS Code or JetBrains and select the model from the Continue panel. The Ollama server must be running before the extension sends its first request. Continue's autocomplete feature works with the same Ollama provider — set the tabAutocompleteModel field to a smaller, faster model like qwen3:8b to keep latency low.

Qwen Code CLI

Alibaba's native Qwen Code CLI supports VS Code, Zed, and JetBrains via native extensions. Configure the endpoint in ~/.qwen/settings.json:

{
  "model": "qwen3:32b",
  "endpoint": "http://localhost:11434/v1/",
  "apiKey": "ollama"
}

Run qwen serve to start the CLI in daemon mode, keeping the local context warm between invocations. Refer to the official Qwen Code documentation for the latest IDE extension installation steps, as the distribution method may change across releases.

Sampling Parameters Reference

Qwen's recommended sampling presets differ by task type. Using mismatched parameters, especially a repetition penalty other than 1.0, degrades output quality. Keep repetition_penalty (named repeat_penalty in Ollama) at 1.0 for every preset.

Recommended Sampling Presets (from official Qwen documentation)

Coding (Thinking On)

temp=0.6 · top_p=0.95 · top_k=20 · min_p=0.0 · presence_penalty=0.0 · repetition_penalty=1.0

General (Thinking On)

temp=1.0 · top_p=0.95 · top_k=20 · min_p=0.0 · presence_penalty=1.5 · repetition_penalty=1.0

Instruct (Thinking Off)

temp=0.7 · top_p=0.8 · top_k=20 · min_p=0.0 · presence_penalty=1.5 · repetition_penalty=1.0

Always 1.0: set repetition_penalty to 1.0 for ALL Qwen tasks. Ollama's default of 1.1 degrades code generation quality, so override it explicitly in every Modelfile or API request.
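
The presets above can be kept in one place and turned into an Ollama-style options object. This is a sketch: the preset keys (coding_thinking and so on) and the ollama_options helper are hypothetical names, and the values are copied from the list above.

```python
PRESETS = {
    "coding_thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                        "min_p": 0.0, "presence_penalty": 0.0},
    "general_thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20,
                         "min_p": 0.0, "presence_penalty": 1.5},
    "instruct_no_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                             "min_p": 0.0, "presence_penalty": 1.5},
}


def ollama_options(preset: str) -> dict:
    """Return an options dict for the preset, always with repeat_penalty set to 1.0."""
    options = dict(PRESETS[preset])  # copy so the shared preset is never mutated
    options["repeat_penalty"] = 1.0  # Ollama's name for repetition_penalty
    return options


if __name__ == "__main__":
    print(ollama_options("coding_thinking"))
```

Forcing repeat_penalty inside the helper means a caller cannot forget the override when adding a new preset.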

Troubleshooting

The most common local Qwen issues fall into four categories: sampling misconfiguration, context window misunderstanding, thinking mode behavior on small models, and GPU memory management. Each entry below maps a symptom to its grounded fix.

Common Issues & Fixes

Degraded code or long-form output with default Ollama settings

Cause: The default Ollama repeat_penalty is 1.1. Qwen is trained to self-regulate repetition internally, and the external penalty interferes with this mechanism, causing quality degradation on code and long-form generation tasks.

Fix: Explicitly set repeat_penalty 1.0 in your Modelfile, or pass "options": {"repeat_penalty": 1.0} in each API request body.

FROM qwen3:32b
PARAMETER repeat_penalty 1.0

Context window is much smaller than the model's advertised maximum

Cause: Ollama sets context length at startup based on available VRAM, not the model's stated maximum. Under 24 GB VRAM, the runtime default is 4K tokens. This is a resource management decision, not a model limitation.

Fix: Set num_ctx explicitly. Add PARAMETER num_ctx 32768 to your Modelfile, or include "options": {"num_ctx": 32768} in each API call. Ensure your VRAM can accommodate the requested context: the KV cache grows linearly with num_ctx, at roughly 2 × layers × KV heads × head dimension × bytes per element for each token, which adds up to gigabytes at 32K context on the larger models.
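
To check whether a requested context fits, the KV cache can be estimated from the model's attention shape. The sketch below uses the standard size formula for an unquantized cache (keys plus values for every layer and token); the layer and head counts in the example are illustrative assumptions, so read the real values from the model's config.json. kv_cache_bytes is a hypothetical helper name.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Size of an unquantized KV cache: keys + values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


if __name__ == "__main__":
    # ASSUMED shape for a 32B-class dense model: 64 layers, 8 KV heads, head_dim 128.
    size = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, n_tokens=32768)
    print(f"{size / 1024**3:.1f} GiB")  # prints "8.0 GiB" for these inputs at FP16
```

Add this figure to the weight size when planning VRAM; if the total exceeds the card, lower num_ctx or offload fewer layers.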

Thinking mode does not activate on Qwen3.5 small models

Cause: Qwen3.5 small models ship with thinking disabled by default, the opposite of Qwen3 models. The /set think command or /think inline token must be sent explicitly to activate chain-of-thought.

Fix for llama.cpp: Pass --chat-template-kwargs '{"enable_thinking":true}' when launching the server. This is required for Qwen3.5 small models but not for standard Qwen3 models.

llama-server -m Qwen3.5-9B-Q4_K_M.gguf \
  --port 8080 \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --chat-template-kwargs '{"enable_thinking":true}'

Out-of-memory errors when loading a model

Cause: The model in its current quantization does not fit entirely in GPU VRAM. Most common when loading a 32B model on a 20–22 GB VRAM card.

Options, in priority order:

Switch to a smaller quantization — Q8 to Q4_K_M roughly halves VRAM usage
Use partial offloading — reduce --n-gpu-layers below 99 to keep some layers in system RAM
Consider Qwen3-30B-A3B (MoE): it fits in 19–24 GB VRAM on an RTX 4090 and stays fast because only ~3B parameters activate per token
For llama.cpp multi-GPU: add --split-mode row to distribute the model across available GPU cards

--chat-template-kwargs JSON fails to parse on Windows PowerShell

Cause: PowerShell parses double-quoted strings differently from bash. Single-quote wrapping used in bash examples fails on Windows.

Fix: Use PowerShell's escaped inner-quote syntax:

llama-server -m model.gguf --port 8080 `
  --chat-template-kwargs "{\"enable_thinking\":false}"

vLLM fails to load Qwen3.6 models

Cause: Qwen3.6-series models require vLLM 0.19.0 or later. Earlier releases do not support the hybrid Gated DeltaNet attention architecture used in Qwen3.6-35B-A3B and the 397B flagship. Earlier Qwen3 models may work with older vLLM versions.

Fix: Upgrade vLLM and verify the installed version:

pip install "vllm>=0.19.0"
python -c "import vllm; print(vllm.__version__)"  # verify

Frequently Asked Questions

What hardware do I need to run Qwen locally?

It depends on the model size. Qwen3-8B runs on 5–6 GB VRAM, which means an RTX 3060 12 GB or an Apple M-series Mac with 16 GB unified memory. Qwen3-32B needs approximately 20 GB VRAM, putting it in RTX 4090 territory. Qwen3-30B-A3B (MoE architecture) is the efficiency sweet spot: it fits in 19–24 GB VRAM on an RTX 4090 and delivers up to 25 tokens per second because only ~3 billion parameters activate per token despite a 30B total parameter count.

The 397B flagship requires 192–256 GB of unified or system-addressable memory — an Apple M3 Ultra with maximum RAM, or an NVIDIA H200 paired with 240 GB of system RAM. For most practitioners, the 8B or 32B range is the practical target for local use.

How do I run Qwen locally with Ollama?

Install Ollama using the one-line installer: curl -fsSL https://ollama.com/install.sh | sh on Mac and Linux, or irm https://ollama.com/install.ps1 | iex in Windows PowerShell. Then pull and run in two commands:

ollama pull qwen3:8b
ollama run qwen3:8b

In the interactive session, use /set think to enable chain-of-thought reasoning or /set nothink for faster direct answers. For scripted use, Ollama's OpenAI-compatible API is available at http://localhost:11434/v1/ as soon as the server starts.

Can I use Qwen locally with the OpenAI SDK?

Yes. Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1/. Point any OpenAI SDK client to this base URL with api_key="ollama", a required-but-ignored placeholder. Both /v1/chat/completions and /v1/responses endpoints are supported, and the same approach works for llama.cpp (http://localhost:8080/v1/) and vLLM (http://localhost:8000/v1/).

How do I use Qwen locally with Claude Code?

Qwen3.7-Max supports the Anthropic API protocol natively, making it a drop-in for Claude Code with two environment variables, the --model flag, and no adapter layer:

export ANTHROPIC_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export ANTHROPIC_API_KEY="your-key"
claude --model qwen-max

Claude Code's tool-use, file-edit, and multi-file context features all work through this configuration. For self-hosted Qwen via Ollama, substitute ANTHROPIC_BASE_URL="http://localhost:11434/v1" and ANTHROPIC_API_KEY="ollama", then run claude --model qwen3:32b. Complex multi-agent orchestration tasks may produce different results than native Claude, as Anthropic's orchestration logic is optimized for Claude models.



All hardware requirements, CLI commands, and integration patterns in this article are grounded in official Qwen documentation, the Ollama installation guide, the llama.cpp project repository, vLLM release documentation, and the Continue extension documentation. VRAM figures represent 4-bit (Q4_K_M) quantization; FP16 requires roughly three to four times the listed VRAM. Ollama context window defaults reflect runtime behavior at publication (May 2026) and may change with future Ollama releases. Verify current defaults at ollama.com.

Qwen and Qwen3 are trademarks of Alibaba Group. Ollama is a product of Ollama Inc. llama.cpp is an open-source project by Georgi Gerganov and contributors. vLLM is developed by the vLLM team and open-source contributors. NVIDIA CUDA and RTX are trademarks of NVIDIA Corporation. Apple, Mac, M-series, and Apple Silicon are trademarks of Apple Inc. Claude Code and Anthropic are trademarks of Anthropic PBC. Continue is developed by Continue Dev, Inc. Tech Jacks Solutions is independent and not affiliated with any of the above companies. All trademarks are the property of their respective owners.
