How to Run Qwen Locally: Complete Setup Guide (2026)

Running Qwen locally means zero API costs, full data privacy, and, on the right hardware, inference speeds that can match or beat hosted endpoints. This guide covers three deployment paths (Ollama for instant setup, llama.cpp for hardware-level control, and vLLM for production throughput), plus how to expose a local OpenAI-compatible API, connect your IDE, and integrate with Claude Code. Every command in this guide was checked against official Qwen documentation and community testing as of May 2026.


Prerequisites: Pick Your Path to Running Qwen Locally

Hardware Quick Reference (4-bit Q4_K_M quantization)

  • 5-6GB VRAM: Qwen3-8B (RTX 3060 / M1 16GB)
  • 20GB VRAM: Qwen3-32B (RTX 4090)
  • 24GB VRAM: Qwen3-30B-A3B MoE (RTX 4090)
  • 25 t/s: tokens per second on an RTX 4090 (30B-A3B MoE)

Before choosing a deployment method, match your hardware to a model size. The table below uses 4-bit quantization (Q4_K_M) as the baseline: the standard tradeoff that preserves most quality while fitting larger models into available VRAM. Unquantized FP16 weights need roughly three to four times the VRAM shown, since FP16 stores 16 bits per weight against roughly 4.5-5 bits for Q4_K_M. Qwen3-30B-A3B is a Mixture-of-Experts (MoE) model: it activates only ~3 billion of its 30 billion parameters per token. All 30 billion still have to sit in memory, so its VRAM footprint is close to that of the dense 32B model, but the lower per-token compute is what lets it reach about 25 tokens per second on a single RTX 4090.

| Model | VRAM (Q4) | Recommended Hardware | Use Case |
| --- | --- | --- | --- |
| Qwen3-0.6B | ~1GB | Any 4GB GPU / Jetson Nano | Edge, classification, glue tasks |
| Qwen3-8B | ~5-6GB | RTX 3060 12GB / M1 16GB | Coding, chat, agents |
| Qwen3-14B | ~10GB | RTX 4070 12GB / Mac 16GB | Strong reasoning, multilingual |
| Qwen3-32B | ~20GB | RTX 4090 (24GB) | Best single-GPU quality |
| Qwen3-30B-A3B MoE | 19-24GB | RTX 3090/4090 | 25 tok/s coding, top local choice |
| Qwen3.5-397B-A17B | ~214GB (4-bit) | M3 Ultra 256GB / H200 + 240GB RAM | Frontier-tier, server-class |

Prerequisites Checklist

  • Identify target model size and confirm VRAM availability (use table above)
  • For GPU acceleration: NVIDIA GPU with up-to-date CUDA drivers (12.x+), or Apple Silicon Mac
  • Terminal access and basic command-line familiarity
  • Python 3.9+ (for llama.cpp downloads and vLLM), or skip to Ollama if you want zero setup
  • For 397B: at least 192GB system RAM (3-bit) or 256GB (4-bit) unified/combined memory
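
As a rough cross-check on the VRAM figures, weight memory can be approximated as parameter count times bits per weight. The sketch below is illustrative only: the helper name `estimate_weight_gb` and the 4.85 bits-per-weight average for Q4_K_M are assumptions, and real usage adds KV cache and runtime overhead on top of the weights.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed average bits per weight. Q4_K_M mixes block types, so this is approximate.
Q4_K_M_BITS = 4.85
FP16_BITS = 16.0

for name, size_b in [("Qwen3-8B", 8), ("Qwen3-32B", 32)]:
    q4 = estimate_weight_gb(size_b, Q4_K_M_BITS)
    fp16 = estimate_weight_gb(size_b, FP16_BITS)
    print(f"{name}: Q4_K_M ~{q4:.1f} GB, FP16 ~{fp16:.1f} GB (weights only)")
```

Nominal parameter counts are used here; the exact counts in each model card differ slightly.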

Choosing your method: Ollama is the right starting point for most developers. It installs with a single command, pulls models with another, and exposes an OpenAI-compatible API automatically. Use llama.cpp if you need granular control over GPU layer offloading or need to run models that Ollama doesn't yet support. Use vLLM when you are serving a team or running batch workloads that need maximum throughput.


Method 1: Run Qwen Locally with Ollama (Recommended)

Ollama Setup: 5 Steps

  1. Install Ollama. One command installs the Ollama service, CLI, and GPU detection automatically.
  2. Pull a Qwen model. Select the model tag that fits your VRAM. Pin the exact tag for reproducibility.
  3. Run interactively. Start the interactive session and optionally toggle thinking mode.
  4. Create a custom Modelfile. Bake in system prompts, context length, and temperature for repeatable agent sessions.
  5. Use the local API. Point any OpenAI SDK client at http://localhost:11434/v1/ to use Qwen programmatically.

Step 1: Install Ollama

macOS / Linux
    curl -fsSL https://ollama.com/install.sh | sh

Windows (PowerShell)
    irm https://ollama.com/install.ps1 | iex

Confirm the install worked: ollama --version. If NVIDIA GPU drivers are installed, Ollama auto-detects and uses the GPU. No CUDA configuration required.

Step 2: Pull a Model

Choose the model tag that fits your VRAM. On a 24GB GPU, qwen3:32b or qwen3:30b-a3b (MoE, fastest) are the best starting points; on a 12-16GB system, start with qwen3:8b:

bash
    # RTX 3060 / M1 16GB
    ollama pull qwen3:8b

    # RTX 4090: best quality on a single GPU
    ollama pull qwen3:32b

    # RTX 3090/4090: fastest inference (MoE, 25 tok/s)
    ollama pull qwen3:30b-a3b

    # Explicit quantization tag (recommended for reproducibility)
    ollama pull qwen3:32b-q4_K_M

Pin your tags. ollama pull qwen3:8b resolves to the library's current latest alias — that alias moves when new versions publish. For tooling and scripts, always pin the explicit tag (qwen3:8b-q4_K_M) so your environment stays reproducible.
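
One way to enforce pinning in tooling is a small check that rejects tags without an explicit quantization suffix. This is a hypothetical helper, not part of Ollama: the name `is_pinned` and the suffix pattern are assumptions based on the tag shapes shown above.

```python
import re

# Assumed rule: a tag is "pinned" when its variant ends in an explicit
# quantization suffix such as "-q4_K_M", "-q8_0", or "-fp16".
_QUANT_SUFFIX = re.compile(r"-(q\d+_[A-Za-z0-9_]+|fp16|bf16)$")


def is_pinned(tag: str) -> bool:
    """Return True if an Ollama-style tag names an explicit quantization."""
    name, _, variant = tag.partition(":")
    return bool(name) and bool(_QUANT_SUFFIX.search(variant))


for tag in ["qwen3:8b", "qwen3:8b-q4_K_M", "qwen3:32b-q4_K_M"]:
    status = "pinned" if is_pinned(tag) else "floating alias"
    print(f"{tag}: {status}")
```

A check like this could run in CI before a deployment script calls ollama pull.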

Step 3: Run Interactively and Toggle Thinking Mode

Qwen models have a dual-mode reasoning engine: thinking mode runs full chain-of-thought before answering; non-thinking mode returns direct answers with lower latency. Toggle it during a session using slash commands:

bash: interactive session
    ollama run qwen3:32b

    # Inside the session:
    # Enable chain-of-thought reasoning
    /set think
    # Disable it for direct, low-latency answers
    /set nothink

You can also pass thinking mode tokens inline in your prompt: prepend /think or /no_think as the first token in your message. Use thinking for multi-step coding, math, and logic; use non-thinking for quick lookups, translations, and high-throughput pipelines.
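
For scripted use, the inline switch and the reasoning output can be handled with two small helpers. This is a sketch under assumptions: `with_mode` and `split_reasoning` are hypothetical names, and the code assumes the raw model output wraps its reasoning in a `<think>...</think>` block (some runtimes return reasoning in a separate field instead).

```python
import re

_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def with_mode(prompt: str, think: bool) -> str:
    """Prepend the /think or /no_think switch to a user message."""
    switch = "/think" if think else "/no_think"
    return f"{switch} {prompt}"


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from output that may contain a <think> block."""
    match = _THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print(with_mode("What is 2 + 2?", think=True))
print(answer)
```

Keeping the reasoning separate makes it easy to log it for debugging while showing only the answer to users.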

25 tokens per second: throughput on an RTX 4090 with Qwen3-30B-A3B (MoE). The MoE architecture activates only ~3B parameters per token, delivering near-32B quality at a fraction of the compute cost, which makes it the top local deployment choice for coding tasks.

Step 4: Create a Custom Modelfile

A Modelfile bakes in your preferred system prompt, context window, and sampling parameters so you don't reconfigure every session. This is especially useful for coding agents that need consistent behavior:

Modelfile: coding agent configuration
    FROM qwen3:32b
    PARAMETER num_ctx 32768
    PARAMETER temperature 0.6
    PARAMETER top_p 0.95
    PARAMETER top_k 20
    PARAMETER repeat_penalty 1.0
    SYSTEM """You are an expert senior software engineer. Provide concise, efficient code solutions. Avoid unnecessary explanations unless asked. Always use modern syntax and best practices."""

bash: build and run
    ollama create qwen-coder -f Modelfile
    ollama run qwen-coder
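
If you maintain several agent profiles, the Modelfile can be generated rather than hand-edited. The sketch below is a minimal illustration: `render_modelfile` is a hypothetical helper, and it only emits the FROM, PARAMETER, and SYSTEM instructions used in the example above.

```python
def render_modelfile(base: str, params: dict, system: str) -> str:
    """Render a Modelfile with FROM, PARAMETER, and SYSTEM instructions."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


coder_params = {
    "num_ctx": 32768,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.0,
}

modelfile_text = render_modelfile(
    "qwen3:32b", coder_params, "You are an expert senior software engineer."
)
print(modelfile_text)
```

Write the result to a file named Modelfile and build it with ollama create as shown above.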

Important: set repeat_penalty 1.0 for Qwen. Ollama's default repeat penalty of 1.1 causes quality degradation on code generation tasks for the Qwen-Next model family. Setting it to 1.0 (disabled) restores expected output quality. This applies to all Qwen3 and Qwen3.5 variants.

Context window note: Qwen3.5 models advertise 256K context, but Ollama's runtime default depends on your available VRAM: under 24GB = 4K default; 24-48GB = 32K default; 48GB+ = 256K. The model maximum and the Ollama runtime default are different values. Always set num_ctx explicitly in your Modelfile or API call.
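
The tiering described above can be written down as a lookup, which is handy for warning users when a script relies on the default. This is a sketch of the tiers as stated in this guide, not Ollama's actual implementation; the function names are hypothetical.

```python
from typing import Optional


def ollama_default_ctx(vram_gb: float) -> int:
    """Runtime default context (tokens) for the VRAM tiers quoted in this guide."""
    if vram_gb < 24:
        return 4 * 1024
    if vram_gb < 48:
        return 32 * 1024
    return 256 * 1024


def effective_ctx(vram_gb: float, num_ctx: Optional[int] = None) -> int:
    """An explicit num_ctx always wins over the tiered default."""
    return num_ctx if num_ctx is not None else ollama_default_ctx(vram_gb)


for vram in (12, 24, 48):
    print(f"{vram} GB VRAM -> default {ollama_default_ctx(vram)} tokens")
```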


Method 2: Run Qwen Locally with llama.cpp

llama.cpp gives you direct control over CUDA layer offloading, context window allocation, and GGUF quantization selection. GGUF (GPT-Generated Unified Format) is the file format for quantized models — it packages weights and metadata in a single portable file that llama.cpp loads directly, without requiring a Python environment or model hub download manager. Use llama.cpp when you need to tune hardware utilization beyond what Ollama exposes, or when running very large models via CPU+GPU hybrid offloading.

1. Build llama.cpp with CUDA Support

bash
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    # Use -DGGML_CUDA=OFF for Mac/CPU-only builds
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release

2. Download GGUF Weights

Use the HuggingFace CLI with hf_transfer for significantly faster downloads on large files:

bash
    pip install huggingface_hub hf_transfer
    export HF_HUB_ENABLE_HF_TRANSFER=1

    # Download Qwen3-32B Q4_K_M GGUF
    huggingface-cli download unsloth/Qwen3-32B-GGUF \
      Qwen3-32B-Q4_K_M.gguf --local-dir .

    # For 397B (4-bit dynamic quant, ~214GB):
    huggingface-cli download unsloth/Qwen3.5-397B-A17B-GGUF \
      --include "UD-Q4_K_XL*" --local-dir ./models/Qwen3.5

3. Start the Inference Server

Qwen3 and Qwen3.5 use a Jinja template argument to control thinking mode. Pass it via --chat-template-kwargs:

bash: thinking enabled
    ./build/bin/llama-server \
      -m Qwen3-32B-Q4_K_M.gguf \
      --port 8080 \
      --ctx-size 32768 \
      --n-gpu-layers 99 \
      --chat-template-kwargs '{"enable_thinking":true}'

PowerShell (Windows, escape quotes)
    .\build\bin\llama-server.exe `
      -m Qwen3-32B-Q4_K_M.gguf `
      --port 8080 --ctx-size 32768 --n-gpu-layers 99 `
      --chat-template-kwargs "{\"enable_thinking\":false}"

Qwen3.5 small model note (0.8B/2B/4B/9B): Thinking is disabled by default on these sizes. You must explicitly pass enable_thinking: true to activate it. For Qwen3.6 models, thinking is enabled by default.

For models that exceed your GPU VRAM, reduce --n-gpu-layers below 99. Keeping some layers in system RAM is slower than full GPU offloading, but the model runs without an out-of-memory error. Experiment: start at 60 layers on an RTX 4090 for a 32B model, then increase until you hit your VRAM ceiling. For multi-GPU setups, llama.cpp assigns whole layers to each card by default (--split-mode layer); --split-mode row instead splits individual tensors by rows across the available cards.
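
A starting value for --n-gpu-layers can be estimated before experimenting. The sketch below assumes every layer is the same size and reserves a fixed amount of VRAM for the KV cache and runtime buffers; both are simplifications, and the 64-layer count used for a 32B model is an assumption to check against the model's config.

```python
import math


def gpu_layers_that_fit(model_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical example: a ~20 GB Q4 model with 64 layers on a 16 GB card.
print(gpu_layers_that_fit(model_gb=20, n_layers=64, vram_gb=16))
```

Treat the result as a first guess, then adjust up or down while watching actual VRAM usage.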


Method 3: Run Qwen Locally with vLLM (Production)

vLLM is the right choice for team-serving scenarios: multiple concurrent users, batch inference, or automated pipelines that require maximum throughput. It natively supports Qwen's hybrid Gated DeltaNet attention architecture and Multi-Token Prediction (MTP) speculative decoding, which increases generation speed beyond standard autoregressive inference.

bash — install
pip install "vllm>=0.19.0" # Required for Qwen3.6 support

Standard Multi-GPU Serving

bash: serve Qwen3.6-35B-A3B across 4 GPUs
    vllm serve Qwen/Qwen3.6-35B-A3B \
      --tensor-parallel-size 4 \
      --max-model-len 262144

The API endpoint is OpenAI-compatible at http://localhost:8000/v1.

Production Configuration Flags

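bash: full production configuration
    # A comment after a trailing backslash breaks the line continuation,
    # so the flag notes are listed here instead of inline:
    #   --enable-auto-tool-choice              enable tool/function calling
    #   --tool-call-parser qwen3_coder         Qwen-specific tool parser
    #   --num-scheduler-steps 4                MTP speculative decoding
    #   --limit-mm-per-prompt image=0,video=0  text-only: skip vision encoder
    vllm serve Qwen/Qwen3.6-35B-A3B \
      --tensor-parallel-size 4 \
      --max-model-len 262144 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --num-scheduler-steps 4 \
      --limit-mm-per-prompt image=0,video=0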

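The --num-scheduler-steps 4 flag is intended here to activate Multi-Token Prediction (MTP), which generates multiple tokens per decoding step rather than one. Confirm the flag against the documentation for your vLLM release: in earlier releases --num-scheduler-steps controlled multi-step scheduling, and speculative decoding methods such as MTP were configured through --speculative-config. The --limit-mm-per-prompt image=0,video=0 flag skips loading the vision encoder entirely, freeing significant VRAM for a larger text KV cache when vision capabilities are not needed.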

vLLM's internal memory manager automatically tunes logical block sizes so the linear attention (DeltaNet) layers and full attention layers share identical physical GPU memory footprints, avoiding fragmentation under heavy concurrent load.
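
When the serve command is launched from a script, building the argument list in code keeps each flag documented without shell-continuation pitfalls. This is a minimal sketch: `build_vllm_command` is a hypothetical helper, and it includes only the tensor-parallel, context-length, and tool-calling flags from the configuration above.

```python
import shlex
from typing import List, Optional


def build_vllm_command(model: str, tensor_parallel: int, max_model_len: int,
                       tool_parser: Optional[str] = None) -> List[str]:
    """Assemble a `vllm serve` argument list from explicit settings."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel),
        "--max-model-len", str(max_model_len),
    ]
    if tool_parser:
        # Tool calling needs both the switch and a parser name.
        args += ["--enable-auto-tool-choice", "--tool-call-parser", tool_parser]
    return args


cmd = build_vllm_command("Qwen/Qwen3.6-35B-A3B", 4, 262144, "qwen3_coder")
print(shlex.join(cmd))
```

The list can be passed directly to subprocess.run, which avoids shell quoting entirely.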


Step 5: Use the OpenAI-Compatible API

Once Ollama (or llama.cpp/vLLM) is running, every OpenAI SDK or HTTP client you already own can point at your local server without modification. Ollama exposes a drop-in REST endpoint at http://localhost:11434/v1/ that accepts the same request schema as the OpenAI Chat Completions API.

0 lines of adapter code are needed to point an existing OpenAI SDK integration at local Qwen via Ollama. Change base_url and api_key, nothing else.

Python — OpenAI SDK

python: OpenAI SDK pointed at local Ollama
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1/",
        api_key="ollama",  # required by SDK, ignored by local endpoint
    )

    response = client.chat.completions.create(
        model="qwen3:32b",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain MoE architecture in one paragraph."},
        ],
        temperature=0.7,
        top_p=0.8,
        extra_body={"options": {"repeat_penalty": 1.0}},  # CRITICAL for Qwen
    )
    print(response.choices[0].message.content)

The api_key="ollama" value is a placeholder required by the SDK's validation logic — the local Ollama endpoint does not authenticate or validate it. Any non-empty string works.

Node.js — OpenAI SDK

javascript: OpenAI SDK (Node.js)
    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "http://localhost:11434/v1/",
      apiKey: "ollama",
    });

    const completion = await client.chat.completions.create({
      model: "qwen3:8b",
      messages: [{ role: "user", content: "Write a Python quicksort." }],
      temperature: 0.6,
      top_p: 0.95,
    });
    console.log(completion.choices[0].message.content);

Raw HTTP — cURL

bash: curl
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Hello"}],
        "temperature": 0.7
      }'

Context Window Caveats

Ollama defaults ≠ model maximum

Ollama's runtime context depends on available VRAM, not the model's stated maximum: <24 GB VRAM → 4K tokens; 24–48 GB → 32K; 48 GB+ → 256K. Set num_ctx explicitly in a Modelfile or via the API options object to override.

Override context per-request via API

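To set context length per call without a Modelfile rebuild, add "options": {"num_ctx": 32768} to the request body of Ollama's native /api/chat or /api/generate endpoint. This bypasses the VRAM-tiered runtime default for that request only. The options object is part of Ollama's native API; the OpenAI-compatible /v1/ endpoint may not honor it, in which case set num_ctx in a Modelfile.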
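
A per-request override might look like the following body, assuming Ollama's native /api/chat endpoint, which accepts an options object. The helper name `native_chat_body` is hypothetical; only the payload is built here, so nothing is sent until you post it to the server.

```python
import json


def native_chat_body(model: str, prompt: str, num_ctx: int) -> str:
    """Build a JSON body for Ollama's native chat endpoint with a context override."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Per-request runtime options; repeat_penalty follows this guide's advice.
        "options": {"num_ctx": num_ctx, "repeat_penalty": 1.0},
    }
    return json.dumps(body)


print(native_chat_body("qwen3:8b", "Hello", 32768))
```

Post the result to http://localhost:11434/api/chat with a Content-Type of application/json.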

Both /v1/chat/completions and /v1/responses are supported. The /v1/responses endpoint is non-stateful — you must include the full conversation history in each request. For llama.cpp server use http://localhost:8080/v1/; for vLLM use http://localhost:8000/v1/. Practically, this means you can swap any of the three backends mid-project by changing one variable — useful when testing whether a larger model improves output quality before committing to the hardware cost.
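
The one-variable backend swap can be sketched with the standard library alone. The base URLs are the ones listed above; the `BASE_URLS` table and the `build_chat_request` and `send` helpers are hypothetical names, and the placeholder bearer token assumes the local server ignores it.

```python
import json
import urllib.request

# Default local endpoints for the three backends described in this guide.
BASE_URLS = {
    "ollama": "http://localhost:11434/v1/",
    "llama.cpp": "http://localhost:8080/v1/",
    "vllm": "http://localhost:8000/v1/",
}


def build_chat_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completions request for the chosen backend without sending it."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        BASE_URLS[backend] + "chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer local"},  # placeholder key
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]


request = build_chat_request("ollama", "qwen3:8b", "Hello")
print(request.full_url)
```

Changing the first argument from "ollama" to "vllm" retargets the same request at the other server; the model name must match what that backend serves.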


Step 6: IDE and Tool Integrations

Once Ollama or llama.cpp is running on localhost, point these tools at it. No vendor-specific adapters, no account setup, no rate limits. The three integrations below cover the workflows that matter most for developers running Qwen locally: terminal agents, IDE code assistants, and Alibaba's own CLI.

Claude Code (Terminal Agent)

Qwen3.7-Max natively supports the Anthropic API protocol, making it a drop-in for Claude Code's CLI. Set two environment variables and pass the model name with --model. No plugin, adapter, or config file is required:

bash: Claude Code with Qwen3.7-Max via Alibaba Cloud API
    export ANTHROPIC_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
    export ANTHROPIC_API_KEY="your-alibaba-cloud-key"
    claude --model qwen-max    # --model selects the model

bash: Claude Code with local Qwen via Ollama (self-hosted)
    export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
    export ANTHROPIC_API_KEY="ollama"
    claude --model qwen3:32b   # --model selects the model

Claude Code's full feature set (tool use, file editing, multi-file context, subagent orchestration) operates through this configuration. No adapter layer is involved. Anthropic's multi-agent orchestration logic is tuned for Claude models, so reasoning quality on complex agentic workflows may differ from the native Claude experience.

Continue Extension (VS Code / JetBrains)

Continue is an open-source AI coding assistant that supports Ollama natively. Add Qwen to your ~/.continue/config.json:

json: ~/.continue/config.json
    {
      "models": [
        {
          "title": "Qwen3-32B (Local)",
          "provider": "ollama",
          "model": "qwen3:32b",
          "apiBase": "http://localhost:11434",
          "contextLength": 32768
        },
        {
          "title": "Qwen3-8B (Local, Fast)",
          "provider": "ollama",
          "model": "qwen3:8b",
          "apiBase": "http://localhost:11434",
          "contextLength": 4096
        }
      ]
    }

After saving the config, reload VS Code or JetBrains and select the model from the Continue panel. The Ollama server must be running before the extension sends its first request. Continue's autocomplete feature works with the same Ollama provider — set the tabAutocompleteModel field to a smaller, faster model like qwen3:8b to keep latency low.

Qwen Code CLI

Alibaba's native Qwen Code CLI supports VS Code, Zed, and JetBrains via native extensions. Configure the endpoint in ~/.qwen/settings.json:

json: ~/.qwen/settings.json
    {
      "model": "qwen3:32b",
      "endpoint": "http://localhost:11434/v1/",
      "apiKey": "ollama"
    }

Run qwen serve to start the CLI in daemon mode, keeping the local context warm between invocations. Refer to the official Qwen Code documentation for the latest IDE extension installation steps, as the distribution method may change across releases.

Sampling Parameters Reference

Qwen's recommended sampling presets differ by task type. Using mismatched parameters — especially the wrong repetition_penalty — causes measurably degraded output quality. Note that repetition_penalty must always be 1.0 for all Qwen tasks regardless of preset.

Recommended Sampling Presets (from official Qwen documentation)

  • Coding (Thinking On): temp=0.6 · top_p=0.95 · top_k=20 · min_p=0.0 · presence_penalty=0.0 · repetition_penalty=1.0
  • General (Thinking On): temp=1.0 · top_p=0.95 · top_k=20 · min_p=0.0 · presence_penalty=1.5 · repetition_penalty=1.0
  • Instruct (Thinking Off): temp=0.7 · top_p=0.8 · top_k=20 · min_p=0.0 · presence_penalty=1.5 · repetition_penalty=1.0

Always 1.0: repetition_penalty for ALL Qwen tasks. Ollama's default 1.1 degrades code generation quality. Override explicitly in every Modelfile or API request.
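
The presets can be kept in one table and converted into a runtime options object on demand. This is a sketch: the `PRESETS` keys and the `ollama_options` helper are hypothetical, and it assumes Ollama's option name repeat_penalty corresponds to the repetition_penalty listed above.

```python
# Sampling presets as listed in this guide, keyed by task type.
PRESETS = {
    "coding_thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                        "min_p": 0.0, "presence_penalty": 0.0},
    "general_thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20,
                         "min_p": 0.0, "presence_penalty": 1.5},
    "instruct": {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                 "min_p": 0.0, "presence_penalty": 1.5},
}


def ollama_options(preset: str) -> dict:
    """Return a copy of the preset with the repetition penalty pinned to 1.0."""
    options = dict(PRESETS[preset])
    options["repeat_penalty"] = 1.0
    return options


print(ollama_options("coding_thinking"))
```

Returning a copy keeps the shared table unchanged when a caller tweaks one request.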

Troubleshooting

The most common local Qwen issues fall into four categories: sampling misconfiguration, context window misunderstanding, thinking mode behavior on small models, and GPU memory management. Each entry below maps a symptom to its grounded fix.

Common Issues & Fixes

Issue: Degraded or repetitive output on code and long-form generation

Cause: The default Ollama repeat_penalty is 1.1. Qwen is trained to self-regulate repetition internally, and the external penalty interferes with this mechanism, causing quality degradation on code and long-form generation tasks.

Fix: Explicitly set repeat_penalty 1.0 in your Modelfile, or pass "options": {"repeat_penalty": 1.0} in each API request body.

Modelfile fix
    FROM qwen3:32b
    PARAMETER repeat_penalty 1.0

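Issue: Context is capped at 4K tokens even though the model advertises a much larger window

Cause: Ollama sets context length at startup based on available VRAM, not the model's stated maximum. Under 24 GB VRAM, the runtime default is 4K tokens. This is a resource management decision, not a model limitation.

Fix: Set num_ctx explicitly. Add PARAMETER num_ctx 32768 to your Modelfile, or include "options": {"num_ctx": 32768} in each API call. Ensure your VRAM can accommodate the requested context: the KV cache grows linearly with context length, and its per-token size depends on the model's layer count, KV-head count, and head dimension rather than on the weight quantization. For a dense 32B-class model with an FP16 cache, budget on the order of hundreds of MB per 1K tokens.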
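
KV cache size depends heavily on the model's shape, so it is worth computing rather than relying on a single rule of thumb. The sketch below uses the standard formula of two tensors (keys and values) per layer. The shape values for a 32B-class model (64 layers, 8 KV heads, head dimension 128) are assumptions to verify against the model's config.json, and the cache is assumed to be stored in FP16.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Total KV cache in GiB for a given context length."""
    per_token = kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value)
    return context_tokens * per_token / 2**30


# Assumed shape for a dense 32B-class model; check config.json for real values.
shape = {"n_layers": 64, "n_kv_heads": 8, "head_dim": 128}
print(kv_cache_bytes_per_token(**shape))   # bytes per token
print(kv_cache_gib(32768, **shape))        # GiB at a 32K context
```

Under these assumptions a 32K context needs about 8 GiB for the cache alone, which is why a 20GB model on a 24GB card may still run out of memory at long contexts.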

Issue: No chain-of-thought output on Qwen3.5 small models

Cause: Qwen3.5 small models ship with thinking disabled by default, the opposite of Qwen3 models. The /set think command or /think inline token must be sent explicitly to activate chain-of-thought.

Fix for llama.cpp: Pass --chat-template-kwargs '{"enable_thinking":true}' when launching the server. This is required for Qwen3.5 small models but not for standard Qwen3 models.

bash: enable thinking on Qwen3.5 small models
    llama-server -m Qwen3.5-9B-Q4_K_M.gguf \
      --port 8080 \
      --ctx-size 32768 \
      --n-gpu-layers 99 \
      --chat-template-kwargs '{"enable_thinking":true}'

Issue: Out-of-memory error when loading the model

Cause: The model in its current quantization does not fit entirely in GPU VRAM. Most common when loading a 32B model on a 20–22 GB VRAM card.

Options, in priority order:

  1. Switch to a smaller quantization. Going from Q8 to Q4_K_M roughly halves VRAM usage
  2. Use partial offloading: reduce --n-gpu-layers below 99 to keep some layers in system RAM
  3. Consider Qwen3-30B-A3B (MoE). It fits in 19–24 GB VRAM on an RTX 4090, and its ~3B active parameters per token keep inference fast
  4. For llama.cpp multi-GPU: spread the model across cards with --split-mode (layer, the default, assigns whole layers to each GPU; row splits tensors by rows)

Issue: The --chat-template-kwargs JSON fails to parse in PowerShell

Cause: PowerShell parses double-quoted strings differently from bash. Single-quote wrapping used in bash examples fails on Windows.

Fix: Use PowerShell's escaped inner-quote syntax:

PowerShell: correct escape syntax
    llama-server -m model.gguf --port 8080 `
      --chat-template-kwargs "{\"enable_thinking\":false}"

Issue: vLLM fails to load Qwen3.6 models

Cause: Qwen3.6-series models require vLLM 0.19.0 or later. Earlier releases do not support the hybrid Gated DeltaNet attention architecture used in Qwen3.6-35B-A3B and the 397B flagship. Earlier Qwen3 models may work with older vLLM versions.

bash
    pip install "vllm>=0.19.0"
    python -c "import vllm; print(vllm.__version__)"   # verify
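
A startup script can perform the same version check before launching the server. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the parser deliberately ignores pre-release suffixes, so treat it as a coarse guard rather than a full version comparison.

```python
import re
from importlib import metadata


def version_tuple(text: str) -> tuple:
    """Parse the first three numeric components, padding with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str = "0.19.0") -> bool:
    """True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_vllm_ok() -> bool:
    """Check the locally installed vLLM, returning False if it is absent."""
    try:
        return meets_minimum(metadata.version("vllm"))
    except metadata.PackageNotFoundError:
        return False


print(meets_minimum("0.18.2"), meets_minimum("0.19.0"))
```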

Frequently Asked Questions

What hardware do I need to run Qwen locally?

It depends on the model size. Qwen3-8B runs on 5–6 GB VRAM: an RTX 3060 12 GB or an Apple M-series Mac with 16 GB unified memory. Qwen3-32B needs approximately 20 GB VRAM, putting it in RTX 4090 territory. Qwen3-30B-A3B (MoE architecture) is the efficiency sweet spot: it fits in 19–24 GB VRAM on an RTX 4090 and delivers up to 25 tokens per second, because only ~3 billion of its 30 billion parameters activate per token.

The 397B flagship requires 192–256 GB of unified or system-addressable memory — an Apple M3 Ultra with maximum RAM, or an NVIDIA H200 paired with 240 GB of system RAM. For most practitioners, the 8B or 32B range is the practical target for local use.

How do I run Qwen with Ollama?

Install Ollama using the one-line installer: curl -fsSL https://ollama.com/install.sh | sh on Mac and Linux, or irm https://ollama.com/install.ps1 | iex in Windows PowerShell. Then pull and run in two commands:

bash
    ollama pull qwen3:8b
    ollama run qwen3:8b

In the interactive session, use /set think to enable chain-of-thought reasoning or /set nothink for faster direct answers. For scripted use, Ollama's OpenAI-compatible API is available at http://localhost:11434/v1/ as soon as the server starts.

Can I use the OpenAI SDK with a local Qwen model?

Yes. Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1/. Point any OpenAI SDK client to this base URL with api_key="ollama", a required-but-ignored placeholder. Both /v1/chat/completions and /v1/responses endpoints are supported, and the same approach works for llama.cpp (http://localhost:8080/v1/) and vLLM (http://localhost:8000/v1/).

Can I use Qwen with Claude Code?

Qwen3.7-Max supports the Anthropic API protocol natively, making it a drop-in for Claude Code with two environment variables and no adapter layer:

bash
    export ANTHROPIC_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
    export ANTHROPIC_API_KEY="your-key"
    claude

Claude Code's tool-use, file-edit, and multi-file context features all work through this configuration. For self-hosted Qwen via Ollama, substitute ANTHROPIC_BASE_URL="http://localhost:11434/v1" and ANTHROPIC_API_KEY="ollama", then run claude --model qwen3:32b. Complex multi-agent orchestration tasks may produce different results than native Claude, as Anthropic's orchestration logic is optimized for Claude models.

All hardware requirements, CLI commands, and integration patterns in this article are grounded in official Qwen documentation, the Ollama installation guide, the llama.cpp project repository, vLLM release documentation, and the Continue extension documentation. VRAM figures represent 4-bit (Q4_K_M) quantization; unquantized FP16 weights require roughly three to four times the listed VRAM. Ollama context window defaults reflect runtime behavior at publication (May 2026) and may change with future Ollama releases. Verify current defaults at ollama.com.
Qwen and Qwen3 are trademarks of Alibaba Group. Ollama is a product of Ollama Inc. llama.cpp is an open-source project by Georgi Gerganov and contributors. vLLM is developed by the vLLM team and open-source contributors. NVIDIA CUDA and RTX are trademarks of NVIDIA Corporation. Apple, Mac, M-series, and Apple Silicon are trademarks of Apple Inc. Claude Code and Anthropic are trademarks of Anthropic PBC. Continue is developed by Continue Dev, Inc. Tech Jacks Solutions is independent and not affiliated with any of the above companies. All trademarks are the property of their respective owners.