How to Run Qwen Locally: Complete Setup Guide (2026)
Running Qwen locally means zero API costs, full data privacy, and inference speeds that beat hosted endpoints on the right hardware. This guide covers three deployment paths — Ollama for instant setup, llama.cpp for hardware-level control, and vLLM for production throughput — plus how to expose a local OpenAI-compatible API, connect your IDE, and integrate with Claude Code. Every command in this guide is verified from official Qwen documentation and community testing as of May 2026.
Prerequisites: Pick Your Path for Running Qwen Locally
Before choosing a deployment method, match your hardware to a model size. The table below uses 4-bit quantization (Q4_K_M) as the baseline, the standard tradeoff that preserves most quality while fitting larger models into available VRAM. FP16 (half precision) stores 16 bits per weight instead of roughly 4 to 5, so it needs about three to four times the VRAM shown. The MoE (Mixture-of-Experts) entries, such as Qwen3-30B-A3B, still hold all of their weights in memory, so their VRAM footprint matches a dense model of the same total size. What they save is compute: of 30 billion total parameters, only ~3 billion activate per token, which is why a 30B-class reasoning model generates quickly on a single RTX 3090 or 4090.
| Model | VRAM (Q4) | Recommended Hardware | Use Case |
|---|---|---|---|
| Qwen3-0.6B | ~1GB | Any 4GB GPU / Jetson Nano | Edge, classification, glue tasks |
| Qwen3-8B | ~5-6GB | RTX 3060 12GB / M1 16GB | Coding, chat, agents |
| Qwen3-14B | ~10GB | RTX 4070 12GB / Mac 16GB | Strong reasoning, multilingual |
| Qwen3-32B | ~20GB | RTX 4090 (24GB) | Best single-GPU quality |
| Qwen3-30B-A3B MoE | 19-24GB | RTX 3090/4090 | 25 tok/s coding — top local choice |
| Qwen3.5-397B-A17B | ~214GB (4-bit) | M3 Ultra 256GB / H200 + 240GB RAM | Frontier-tier, server-class |
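As a rough cross-check on the table, weight memory can be estimated from parameter count and bits per weight. The sketch below is a back-of-envelope calculation, not an official sizing tool. The 4.8 bits-per-weight figure for Q4_K_M is an assumed average, the parameter counts are nominal, and real usage adds KV cache and runtime overhead on top.

```python
# Assumed average bits per weight for Q4_K_M; real GGUF files vary by tensor mix.
Q4_K_M_BITS = 4.8


def estimate_weight_gb(params_billions: float, bits_per_weight: float = Q4_K_M_BITS) -> float:
    """Approximate memory for the model weights alone, in decimal GB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal sizes taken from the model names in the table above.
for name, size_b in [("Qwen3-8B", 8), ("Qwen3-32B", 32), ("Qwen3-30B-A3B", 30)]:
    print(f"{name}: ~{estimate_weight_gb(size_b):.1f} GB of weights at Q4_K_M")
```

The MoE entry is computed from its total parameter count, which lines up with the 19-24GB row in the table.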
Choosing your method: Ollama is the right starting point for most developers. It installs with a single command, pulls models with one command, and exposes an OpenAI-compatible API automatically. Use llama.cpp if you need granular CUDA layer control or need to run models that Ollama doesn't yet support. Use vLLM when you are serving a team or running batch workloads that need maximum throughput.
Method 1: Run Qwen Locally with Ollama (Recommended)
Once Ollama is running, point any OpenAI-compatible client at http://localhost:11434/v1/ to use Qwen programmatically.
Step 1: Install Ollama
macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows (PowerShell):
irm https://ollama.com/install.ps1 | iex
Confirm the install worked: ollama --version. If NVIDIA GPU drivers are installed, Ollama auto-detects and uses the GPU. No CUDA configuration required.
Step 2: Pull a Model
Choose the model tag that fits your VRAM. For most developers with a 24GB GPU, qwen3:32b or qwen3:30b-a3b (MoE, fastest) are the best starting points. On smaller cards, start with qwen3:8b:
# RTX 3060 / M1 16GB
ollama pull qwen3:8b
# RTX 4090 — best quality single GPU
ollama pull qwen3:32b
# RTX 3090/4090 — fastest inference (MoE, 25 tok/s)
ollama pull qwen3:30b-a3b
# Explicit quantization tag (recommended for reproducibility)
ollama pull qwen3:32b-q4_K_M
Pin your tags. ollama pull qwen3:8b resolves to the library's current latest alias, and that alias moves when new versions publish. For tooling and scripts, always pin the explicit tag (qwen3:8b-q4_K_M) so your environment stays reproducible.
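A small guard in setup scripts can catch unpinned tags before they reach a pull command. This is a minimal sketch: the helper name and the suffix pattern are hypothetical, and the pattern only recognizes the tag style shown in this guide (for example -q4_K_M).

```python
import re

# Hypothetical pattern: a tag counts as pinned when its variant ends in a
# quantization suffix such as -q4_K_M, -q8_0, -fp16, or -bf16.
_QUANT_SUFFIX = re.compile(r"-(q\d+_[A-Za-z0-9_]+|fp16|bf16)$", re.IGNORECASE)


def is_pinned(tag: str) -> bool:
    """Return True if an Ollama model tag carries an explicit quantization suffix."""
    _, _, variant = tag.partition(":")
    return bool(variant) and bool(_QUANT_SUFFIX.search(variant))


for tag in ["qwen3:8b", "qwen3:8b-q4_K_M"]:
    status = "pinned" if is_pinned(tag) else "floating alias"
    print(f"{tag}: {status}")
```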
Step 3: Run Interactively and Toggle Thinking Mode
Qwen models have a dual-mode reasoning engine: thinking mode runs full chain-of-thought before answering; non-thinking mode returns direct answers with lower latency. Toggle it during a session using slash commands:
ollama run qwen3:32b
# Inside the session:
/set think # Enable chain-of-thought reasoning
/set nothink # Disable: direct, low-latency answers
You can also pass thinking mode tokens inline in your prompt: prepend /think or /no_think as the first token in your message. Use thinking for multi-step coding, math, and logic; use non-thinking for quick lookups, translations, and high-throughput pipelines.
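For scripted use, the inline tokens can be applied with a small helper before the message is sent. The function names below are illustrative, not part of any SDK; the sketch only builds the message list and leaves the API call to whichever client you use.

```python
def with_thinking(prompt: str, think: bool) -> str:
    """Prepend the inline mode token described above to a user prompt."""
    token = "/think" if think else "/no_think"
    return f"{token} {prompt}"


def build_messages(prompt: str, think: bool) -> list[dict]:
    """Build a chat message list whose user turn starts with the mode token."""
    return [{"role": "user", "content": with_thinking(prompt, think)}]


print(build_messages("Translate 'good morning' to French.", think=False))
```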
Step 4: Create a Custom Modelfile
A Modelfile bakes in your preferred system prompt, context window, and sampling parameters so you don't reconfigure every session. This is especially useful for coding agents that need consistent behavior:
FROM qwen3:32b
PARAMETER num_ctx 32768
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER repeat_penalty 1.0
SYSTEM """You are an expert senior software engineer. Provide concise, efficient code solutions. Avoid unnecessary explanations unless asked. Always use modern syntax and best practices."""
Build and run the custom model:
ollama create qwen-coder -f Modelfile
ollama run qwen-coder
Important: set repeat_penalty 1.0 for Qwen. Ollama's default repeat penalty of 1.1 causes quality degradation on code generation tasks for the Qwen-Next model family. Setting it to 1.0 (disabled) restores expected output quality. This applies to all Qwen3 and Qwen3.5 variants.
Context window note: Qwen3.5 models advertise 256K context, but Ollama's runtime default depends on your available VRAM: under 24GB = 4K default; 24-48GB = 32K default; 48GB+ = 256K. The model maximum and the Ollama runtime default are different values. Always set num_ctx explicitly in your Modelfile or API call.
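The tiers above can be written down as a lookup so scripts can warn when they would silently fall back to a short context. This sketch encodes the thresholds as described in this guide; it does not query Ollama, and the function name is hypothetical.

```python
def default_num_ctx(vram_gb: float) -> int:
    """Runtime default context length for a given amount of VRAM, per the tiers above."""
    if vram_gb < 24:
        return 4096      # 4K
    if vram_gb < 48:
        return 32768     # 32K
    return 262144        # 256K


for vram in (12, 24, 80):
    print(f"{vram} GB VRAM -> default num_ctx {default_num_ctx(vram)}")
```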
Method 2: Run Qwen Locally with llama.cpp
llama.cpp gives you direct control over CUDA layer offloading, context window allocation, and GGUF quantization selection. GGUF (GPT-Generated Unified Format) is the file format for quantized models — it packages weights and metadata in a single portable file that llama.cpp loads directly, without requiring a Python environment or model hub download manager. Use llama.cpp when you need to tune hardware utilization beyond what Ollama exposes, or when running very large models via CPU+GPU hybrid offloading.
1. Build llama.cpp with CUDA Support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # Use -DGGML_CUDA=OFF for Mac/CPU-only
cmake --build build --config Release
2. Download GGUF Weights
Use the HuggingFace CLI with hf_transfer for significantly faster downloads on large files:
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
# Download Qwen3-32B Q4_K_M GGUF
huggingface-cli download unsloth/Qwen3-32B-GGUF \
Qwen3-32B-Q4_K_M.gguf --local-dir .
# For 397B (4-bit dynamic quant, ~214GB):
huggingface-cli download unsloth/Qwen3.5-397B-A17B-GGUF \
--include "UD-Q4_K_XL*" --local-dir ./models/Qwen3.5
3. Start the Inference Server
Qwen3 and Qwen3.5 use a Jinja template argument to control thinking mode. Pass it via --chat-template-kwargs:
./build/bin/llama-server \
-m Qwen3-32B-Q4_K_M.gguf \
--port 8080 \
--ctx-size 32768 \
--n-gpu-layers 99 \
--chat-template-kwargs '{"enable_thinking":true}'
Windows (PowerShell):
.\build\bin\llama-server.exe `
-m Qwen3-32B-Q4_K_M.gguf `
--port 8080 --ctx-size 32768 --n-gpu-layers 99 `
--chat-template-kwargs "{\"enable_thinking\":false}"
Qwen3.5 small model note (0.8B/2B/4B/9B): Thinking is disabled by default on these sizes. You must explicitly pass enable_thinking: true to activate it. For Qwen3.6 models, thinking is enabled by default.
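Because the JSON argument is quoted differently in bash and PowerShell, one way to sidestep shell quoting is to launch the server from Python with an argument list. The sketch below uses only the flags shown above; the helper names and default values are illustrative.

```python
import json
import subprocess


def llama_server_args(binary: str, model_path: str, enable_thinking: bool,
                      port: int = 8080, ctx_size: int = 32768,
                      n_gpu_layers: int = 99) -> list[str]:
    """Build the llama-server argument list without any shell quoting."""
    template_kwargs = json.dumps({"enable_thinking": enable_thinking},
                                 separators=(",", ":"))
    return [
        binary,
        "-m", model_path,
        "--port", str(port),
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--chat-template-kwargs", template_kwargs,
    ]


def launch(args: list[str]) -> subprocess.Popen:
    """Start the server as a child process; the caller is responsible for stopping it."""
    return subprocess.Popen(args)
```

Passing a list to subprocess means each element reaches llama-server as one argument on every platform, so the same call works from bash, PowerShell, or cmd. For example, launch(llama_server_args("./build/bin/llama-server", "Qwen3-32B-Q4_K_M.gguf", True)) starts the server with thinking enabled.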
For models that exceed your GPU VRAM, reduce --n-gpu-layers below 99. Layers kept in system RAM run slower than layers on the GPU, but the model loads without an out-of-memory error. Experiment: start at 60 layers on an RTX 4090 for a 32B model, then increase until you hit your VRAM ceiling. For multi-GPU setups, llama.cpp spreads whole layers across the available cards by default (--split-mode layer); add --split-mode row to split individual tensors by row across cards instead.
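A starting value for --n-gpu-layers can be guessed from the file size and layer count before experimenting. This is a rough heuristic under stated assumptions: layers are treated as equal in size, and reserve_gb is a hypothetical allowance for KV cache and buffers. The layer count in the example is illustrative, so check your model's metadata.

```python
import math


def estimate_gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Rough number of layers that fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Illustrative numbers: a 20 GB model file with an assumed 64 layers.
for vram in (16, 24):
    layers = estimate_gpu_layers(20.0, 64, vram)
    print(f"{vram} GB VRAM -> try --n-gpu-layers {layers}")
```

Treat the result as a first guess, then adjust up or down while watching actual VRAM usage.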
Method 3: Run Qwen Locally with vLLM (Production)
vLLM is the right choice for team-serving scenarios: multiple concurrent users, batch inference, or automated pipelines that require maximum throughput. It natively supports Qwen's hybrid Gated DeltaNet attention architecture and Multi-Token Prediction (MTP) speculative decoding, which increases generation speed beyond standard autoregressive inference.
pip install "vllm>=0.19.0" # Required for Qwen3.6 support
Standard Multi-GPU Serving
vllm serve Qwen/Qwen3.6-35B-A3B \
--tensor-parallel-size 4 \
--max-model-len 262144
The API endpoint is OpenAI-compatible at http://localhost:8000/v1.
Production Configuration Flags
# Flag notes (a comment after a trailing backslash breaks the line continuation,
# so the notes are kept above the command):
#   --enable-auto-tool-choice              enable tool/function calling
#   --tool-call-parser qwen3_coder         Qwen-specific tool parser
#   --num-scheduler-steps 4                MTP speculative decoding
#   --limit-mm-per-prompt image=0,video=0  text-only: skip vision encoder
vllm serve Qwen/Qwen3.6-35B-A3B \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--num-scheduler-steps 4 \
--limit-mm-per-prompt image=0,video=0
The --num-scheduler-steps 4 flag activates Multi-Token Prediction (MTP), which generates multiple tokens per decoding step rather than one and so raises throughput. The --limit-mm-per-prompt image=0,video=0 flag skips loading the vision encoder entirely, freeing significant VRAM for a larger text KV cache when vision capabilities are not needed.
vLLM's internal memory manager automatically tunes logical block sizes so the linear attention (DeltaNet) layers and full attention layers share identical physical GPU memory footprints, avoiding fragmentation under heavy concurrent load.
Step 5: Use the OpenAI-Compatible API
Once Ollama (or llama.cpp/vLLM) is running, every OpenAI SDK or HTTP client you already own can point at your local server without modification. Ollama exposes a drop-in REST endpoint at http://localhost:11434/v1/ that accepts the same request schema as the OpenAI Chat Completions API.
Change two settings, base_url and api_key, and nothing else.
Python — OpenAI SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1/",
api_key="ollama", # required by SDK, ignored by local endpoint
)
response = client.chat.completions.create(
model="qwen3:32b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain MoE architecture in one paragraph."}
],
temperature=0.7,
top_p=0.8,
extra_body={"options": {"repeat_penalty": 1.0}}, # CRITICAL for Qwen
)
print(response.choices[0].message.content)
The api_key="ollama" value is a placeholder required by the SDK's validation logic. The local Ollama endpoint does not authenticate or validate it, so any non-empty string works.
Node.js — OpenAI SDK
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:11434/v1/",
apiKey: "ollama",
});
const completion = await client.chat.completions.create({
model: "qwen3:8b",
messages: [{ role: "user", content: "Write a Python quicksort." }],
temperature: 0.6,
top_p: 0.95,
});
console.log(completion.choices[0].message.content);
Raw HTTP — cURL
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [{"role": "user", "content": "Hello"}],
"temperature": 0.7
}'
Ollama's runtime context depends on available VRAM, not the model's stated maximum: <24 GB VRAM → 4K tokens; 24–48 GB → 32K; 48 GB+ → 256K. Set num_ctx explicitly in a Modelfile or via the API options object to override.
To set context length per API call without a Modelfile rebuild, add "options": {"num_ctx": 32768} to your request body. This bypasses the VRAM-tiered runtime default for that request only.
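A per-request options object can be sketched with the standard library alone. One assumption to flag: this example targets Ollama's native /api/chat route rather than the /v1 routes used elsewhere in this guide, because the native API documents the options field directly. The helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama's native chat route on the default port.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, prompt: str, num_ctx: int = 32768,
                       repeat_penalty: float = 1.0) -> dict:
    """Request body with per-call runtime options."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx, "repeat_penalty": repeat_penalty},
    }


def chat(model: str, prompt: str, **options) -> str:
    """Send one chat request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(model, prompt, **options)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["message"]["content"]
```

With the server running, chat("qwen3:8b", "Hello", num_ctx=8192) sends a single request with an 8K context for that call only.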
Both /v1/chat/completions and /v1/responses are supported. The /v1/responses endpoint is non-stateful — you must include the full conversation history in each request. For llama.cpp server use http://localhost:8080/v1/; for vLLM use http://localhost:8000/v1/. Practically, this means you can swap any of the three backends mid-project by changing one variable — useful when testing whether a larger model improves output quality before committing to the hardware cost.
Step 6: IDE and Tool Integrations
Once Ollama or llama.cpp is running on localhost, point these tools at it. No vendor-specific adapters, no account setup, no rate limits. The three integrations below cover the workflows that matter most for developers running Qwen locally: terminal agents, IDE code assistants, and Alibaba's own CLI.
Claude Code (Terminal Agent)
Qwen3.7-Max natively supports the Anthropic API protocol, making it a drop-in for Claude Code's CLI. Set two environment variables and pass the model name with --model. No plugin, adapter, or config file is required:
export ANTHROPIC_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export ANTHROPIC_API_KEY="your-alibaba-cloud-key"
claude --model qwen-max # --model selects the model
For self-hosted Qwen via Ollama, substitute the local endpoint:
export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
export ANTHROPIC_API_KEY="ollama"
claude --model qwen3:32b # --model selects the model
Claude Code's full feature set (tool use, file editing, multi-file context, subagent orchestration) operates through this configuration. No adapter layer is involved. Anthropic's multi-agent orchestration logic is tuned for Claude models, so reasoning quality on complex agentic workflows may differ from the native Claude experience.
Continue Extension (VS Code / JetBrains)
Continue is an open-source AI coding assistant that supports Ollama natively. Add Qwen to your ~/.continue/config.json:
{
"models": [
{
"title": "Qwen3-32B (Local)",
"provider": "ollama",
"model": "qwen3:32b",
"apiBase": "http://localhost:11434",
"contextLength": 32768
},
{
"title": "Qwen3-8B (Local — Fast)",
"provider": "ollama",
"model": "qwen3:8b",
"apiBase": "http://localhost:11434",
"contextLength": 4096
}
]
}
After saving the config, reload VS Code or JetBrains and select the model from the Continue panel. The Ollama server must be running before the extension sends its first request. Continue's autocomplete feature works with the same Ollama provider. Set the tabAutocompleteModel field to a smaller, faster model like qwen3:8b to keep latency low.
Qwen Code CLI
Alibaba's native Qwen Code CLI supports VS Code, Zed, and JetBrains via native extensions. Configure the endpoint in ~/.qwen/settings.json:
{
"model": "qwen3:32b",
"endpoint": "http://localhost:11434/v1/",
"apiKey": "ollama"
}
Run qwen serve to start the CLI in daemon mode, keeping the local context warm between invocations. Refer to the official Qwen Code documentation for the latest IDE extension installation steps, as the distribution method may change across releases.
Sampling Parameters Reference
Qwen's recommended sampling presets differ by task type. Using mismatched parameters, especially the wrong repetition penalty, measurably degrades output quality. The repetition penalty (repeat_penalty in Ollama and llama.cpp, repetition_penalty in vLLM) must be 1.0 for all Qwen tasks regardless of preset.
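The sampling values that appear in this guide's examples can be collected in one place so every request uses a consistent set. In this sketch the numbers are copied from the Modelfile and Python examples above, while the mapping of each set to a mode name is an assumption rather than an official preset table.

```python
# Values copied from this guide's examples; the preset names are assumed labels.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8},
}


def ollama_options(preset: str) -> dict:
    """Return sampling options for a preset, always pinning repeat_penalty to 1.0."""
    options = dict(SAMPLING_PRESETS[preset])  # copy so the preset table stays unchanged
    options["repeat_penalty"] = 1.0
    return options


print(ollama_options("thinking"))
```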
Troubleshooting
The most common local Qwen issues fall into four categories: sampling misconfiguration, context window misunderstanding, thinking mode behavior on small models, and GPU memory management. Each entry below maps a symptom to its grounded fix.
Symptom: Repetitive or degraded output, especially on code and long-form generation.
Cause: The default Ollama repeat_penalty is 1.1. Qwen is trained to self-regulate repetition internally, and the external penalty interferes with this mechanism, causing quality degradation on code and long-form generation tasks.
Fix: Explicitly set repeat_penalty 1.0 in your Modelfile, or pass "options": {"repeat_penalty": 1.0} in each API request body.
FROM qwen3:32b
PARAMETER repeat_penalty 1.0
Symptom: Long prompts or conversations are cut off well below the model's advertised context length.
Cause: Ollama sets context length at startup based on available VRAM, not the model's stated maximum. Under 24 GB VRAM, the runtime default is 4K tokens. This is a resource management decision, not a model limitation.
Fix: Set num_ctx explicitly. Add PARAMETER num_ctx 32768 to your Modelfile, or include "options": {"num_ctx": 32768} in each API call. Ensure your VRAM can accommodate the requested context. The KV cache sits on top of the model weights and grows linearly with num_ctx: each token costs 2 × layers × KV heads × head dimension × bytes per element, independent of how the weights are quantized.
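To check whether a requested context fits, the KV cache size can be estimated with the standard per-token formula for attention layers: keys plus values, per layer, per KV head. The architecture numbers in the example are illustrative assumptions, so read the real values from the model's config.json. The sketch covers standard attention layers only and assumes a 16-bit cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_element: int = 2) -> int:
    """KV cache size: 2 (keys and values) x layers x KV heads x head dim x bytes x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * n_tokens


# Illustrative configuration (assumed, not read from a real model file):
# 64 layers, 8 KV heads, head dimension 128, 16-bit cache, 32K context.
size = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, n_tokens=32768)
print(f"{size / 2**30:.1f} GiB")  # prints "8.0 GiB"
```

Add this figure to the weight size from the hardware table before choosing num_ctx.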
Symptom: A Qwen3.5 small model answers directly and never produces chain-of-thought output.
Cause: Qwen3.5 small models ship with thinking disabled by default, the opposite of Qwen3 models. The /set think command or /think inline token must be sent explicitly to activate chain-of-thought.
Fix for llama.cpp: Pass --chat-template-kwargs '{"enable_thinking":true}' when launching the server. This is required for Qwen3.5 small models but not for standard Qwen3 models.
llama-server -m Qwen3.5-9B-Q4_K_M.gguf \
--port 8080 \
--ctx-size 32768 \
--n-gpu-layers 99 \
--chat-template-kwargs '{"enable_thinking":true}'
Symptom: Out-of-memory error while loading the model.
Cause: The model in its current quantization does not fit entirely in GPU VRAM. Most common when loading a 32B model on a 20–22 GB VRAM card.
Options, in priority order:
- Switch to a smaller quantization. Going from Q8 to Q4_K_M roughly halves VRAM usage.
- Use partial offloading. Reduce --n-gpu-layers below 99 to keep some layers in system RAM.
- Consider Qwen3-30B-A3B (MoE). It needs 19–24 GB VRAM at Q4, and because only ~3B parameters activate per token it generates quickly for its size.
- For llama.cpp multi-GPU: add --split-mode row to distribute the model across available GPU cards.
Symptom: llama-server rejects the --chat-template-kwargs JSON when launched from PowerShell.
Cause: PowerShell parses double-quoted strings differently from bash. Single-quote wrapping used in bash examples fails on Windows.
Fix: Use PowerShell's escaped inner-quote syntax:
llama-server -m model.gguf --port 8080 `
--chat-template-kwargs "{\"enable_thinking\":false}"
Symptom: vLLM fails to load a Qwen3.6-series model.
Cause: Qwen3.6-series models require vLLM 0.19.0 or later. Earlier releases do not support the hybrid Gated DeltaNet attention architecture used in Qwen3.6-35B-A3B and the 397B flagship. Earlier Qwen3 models may work with older vLLM versions.
Fix: Upgrade vLLM, then verify the installed version:
pip install "vllm>=0.19.0"
python -c "import vllm; print(vllm.__version__)" # verify
Frequently Asked Questions
What hardware do I need to run Qwen locally?
It depends on the model size. Qwen3-8B runs on 5–6 GB VRAM, which means an RTX 3060 12 GB or an Apple M-series Mac with 16 GB unified memory. Qwen3-32B needs approximately 20 GB VRAM, putting it in RTX 4090 territory. Qwen3-30B-A3B (MoE architecture) is the efficiency sweet spot: it fits in 19–24 GB VRAM on an RTX 4090 and delivers up to 25 tokens per second, because only ~3 billion parameters activate per token despite a 30B total parameter count.
The 397B flagship requires 192–256 GB of unified or system-addressable memory — an Apple M3 Ultra with maximum RAM, or an NVIDIA H200 paired with 240 GB of system RAM. For most practitioners, the 8B or 32B range is the practical target for local use.
How do I run Qwen with Ollama?
Install Ollama using the one-line installer: curl -fsSL https://ollama.com/install.sh | sh on Mac and Linux, or irm https://ollama.com/install.ps1 | iex in Windows PowerShell. Then pull and run in two commands:
ollama pull qwen3:8b
ollama run qwen3:8b
In the interactive session, use /set think to enable chain-of-thought reasoning or /set nothink for faster direct answers. For scripted use, Ollama's OpenAI-compatible API is available at http://localhost:11434/v1/ as soon as the server starts.
Can I use a local Qwen model with the OpenAI SDK?
Yes. Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1/. Point any OpenAI SDK client to this base URL with api_key="ollama", a required-but-ignored placeholder. Both /v1/chat/completions and /v1/responses endpoints are supported, and the same approach works for llama.cpp (http://localhost:8080/v1/) and vLLM (http://localhost:8000/v1/).
Can I use Qwen with Claude Code?
Qwen3.7-Max supports the Anthropic API protocol natively, making it a drop-in for Claude Code with two environment variables and no adapter layer:
export ANTHROPIC_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export ANTHROPIC_API_KEY="your-key"
claude
Claude Code's tool-use, file-edit, and multi-file context features all work through this configuration. For self-hosted Qwen via Ollama, substitute ANTHROPIC_BASE_URL="http://localhost:11434/v1" and ANTHROPIC_API_KEY="ollama", then run claude --model qwen3:32b. Complex multi-agent orchestration tasks may produce different results than native Claude, as Anthropic's orchestration logic is optimized for Claude models.