Qwen API Guide: Cloud Integration & SDK Reference (2026)
Qwen's API surface is wider than most developers realize — a direct Alibaba Cloud endpoint (OpenAI-compatible or Anthropic-protocol), six third-party inference providers, a native OpenAI SDK integration, thinking mode parameters that differ by framework, and a Model Context Protocol layer that connects your tools directly to the model. This guide covers every integration path with grounded code examples and the specific caveats that will save you debugging time.
Prerequisites & Integration Paths
Before connecting, decide which integration path fits your use case. The five paths below correspond to five distinct setups — you only need to complete the steps for your chosen approach.
pip install openai (Python) or npm install openai (Node.js): the OpenAI SDK works unchanged
Path 1: Alibaba Cloud Model Studio (Direct)
Alibaba Cloud Model Studio — also called DashScope, Alibaba's AI API platform — is the primary endpoint for Qwen. It exposes two distinct API protocols at different base URLs — use the one that matches your existing SDK or tool.
Endpoint URLs
# OpenAI-compatible protocol — domestic/Beijing (China users)
OPENAI_BASE_URL_CN = "https://dashscope.aliyuncs.com/compatible-mode/v1"
# OpenAI-compatible protocol — international/Singapore (all other regions — use this)
OPENAI_BASE_URL_INT = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
# Anthropic-compatible protocol (Claude Code, Anthropic SDK) — international domain
ANTHROPIC_BASE_URL = "https://dashscope-intl.aliyuncs.com/apps/anthropic"
# Using the wrong regional domain returns 401; verify your region in the DashScope console
Authentication Setup
Generate a DashScope API key from the Alibaba Cloud Model Studio console. For high-volume use, the Alibaba Cloud Coding Plan offers a fixed monthly fee with significantly higher daily quotas than per-token billing — useful for teams where predictable costs matter more than pay-as-you-go flexibility.
export DASHSCOPE_API_KEY="sk-..." # or store in .env
# Verify connectivity:
curl https://dashscope-intl.aliyuncs.com/compatible-mode/v1/models \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" | python -m json.tool
# CN users: replace with dashscope.aliyuncs.com
Python: OpenAI SDK (Alibaba Cloud)
from openai import OpenAI
import os
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # CN users: dashscope.aliyuncs.com
)
response = client.chat.completions.create(
    model="qwen-max",  # or qwen-plus, qwen-turbo; verify IDs in the DashScope console
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key differences between MoE and dense models."},
    ],
    temperature=0.7,
    top_p=0.8,
    extra_body={"top_k": 20},  # top_k is not an OpenAI SDK argument, so it goes in extra_body
    max_tokens=4096,
)
print(response.choices[0].message.content)
The sampling parameters used here are the Alibaba-recommended instruct-mode defaults: temperature=0.7 (controls randomness), top_p=0.8 (nucleus sampling threshold), and top_k=20, which restricts sampling to the 20 most likely tokens at each step. top_k is not in the OpenAI spec, and the OpenAI Python SDK rejects it as a direct keyword argument, so pass it through extra_body; providers that don't support it will silently ignore it. When thinking mode is enabled (enable_thinking: True), use temperature=0.6 and top_p=0.95 instead. These are the Alibaba-recommended values for the reasoning-heavy mode.
Model ID strings (like qwen-max, qwen-plus, qwen-turbo) map to specific Qwen model versions in the DashScope backend. Verify current mappings in the Alibaba Cloud Model Studio console — they update as new model versions release.
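One way to keep the two recommended sampling presets from this section in a single place is a small helper like the sketch below. The function name sampling_kwargs is hypothetical, and keeping top_k=20 in thinking mode is an assumption, since only temperature and top_p are stated for that mode.

```python
def sampling_kwargs(thinking: bool) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    if thinking:
        temperature, top_p = 0.6, 0.95  # values quoted above for thinking mode
    else:
        temperature, top_p = 0.7, 0.8   # values quoted above for instruct mode
    return {
        "temperature": temperature,
        "top_p": top_p,
        # top_k is not an OpenAI SDK argument, so it travels in extra_body.
        # Assumption: top_k=20 is kept in both modes.
        "extra_body": {"top_k": 20},
    }


print(sampling_kwargs(thinking=False))
print(sampling_kwargs(thinking=True))
```

The returned dictionary can be unpacked into a request with **sampling_kwargs(False). For thinking mode, enable_thinking still has to be added to extra_body separately.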
Anthropic Protocol (Claude Code Integration)
Qwen3.7-Max natively implements the Anthropic API protocol. Point Claude Code at Alibaba Cloud with two environment variables and the --model flag, with no adapter code:
export ANTHROPIC_BASE_URL="https://dashscope-intl.aliyuncs.com/apps/anthropic"
export ANTHROPIC_API_KEY="$DASHSCOPE_API_KEY"
claude --model qwen-max # the --model flag selects the model for this session
Path 2: Third-Party Providers & OpenRouter
If you want one API key that routes to multiple Qwen providers with automatic failover, or you need a lower per-token rate than Alibaba Cloud's direct pricing, third-party inference platforms are the answer. If you only need OpenRouter, skip ahead to the OpenRouter code example below the table.
The table compares providers as of May 2026. "Blended" pricing is a weighted average of input and output costs across a typical request mix — useful for rough cost comparison when input/output ratios vary. "TTFT" is time-to-first-token: the delay between sending your request and receiving the first streamed token. For interactive applications, TTFT matters more than throughput; for batch jobs, throughput matters more. A "—" in the TTFT column means the figure wasn't available from benchmarks used for this guide.
| Provider | Model | Input / Output ($/1M) | TTFT | Func Calling | Best For |
|---|---|---|---|---|---|
| Alibaba Cloud | qwen-max (Qwen3.7-Max) | $2.50 / $7.50 | — | Yes + JSON | Full feature support, prompt caching, Anthropic protocol |
| DeepInfra (FP8) | Qwen3.5-397B-A17B | $0.54 / $3.40 | 0.67s ★ | Yes (no JSON mode) | Lowest cost + lowest latency for 397B |
| OpenRouter | qwen/qwen3.6-35b-a3b | $0.15 / $1.00 | varies | Yes + JSON | One API key, automatic failover, lowest cost for 35B |
| Clarifai | Qwen3.5-397B | $1.35 blended | — | Yes + JSON | Highest throughput (268 t/s) — batch-heavy workloads |
| Together AI | Qwen3.5-397B-A17B | — | 46.5s ⚠ | Yes + JSON | Batch/background only — also hosts Qwen3.7-Max (1M context, normal TTFT) |
Together AI's 46.5-second time-to-first-token for Qwen3.5-397B makes it unsuitable for any user-facing or real-time application. Once processing completes, throughput is reasonable (100 t/s). Use it only for offline batch tasks where latency is not a constraint. Their Qwen3.7-Max endpoint (1M context) has normal latency.
DeepInfra's FP8 quantized serving of Qwen3.5-397B offers the lowest cost and latency but does not support JSON mode (structured outputs). Applications that require reliable JSON schema enforcement must use Clarifai, OpenRouter, or Alibaba Cloud instead.
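To compare the per-token rows of the table with the blended ones, a blended figure can be approximated from the input and output prices. The sketch below assumes a 3:1 input-to-output token mix, which is an assumption of this example rather than a published provider formula; the function name blended_price is hypothetical.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted average of input and output prices in $ per 1M tokens."""
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("input_parts + output_parts must be positive")
    weighted_sum = input_per_m * input_parts + output_per_m * output_parts
    return weighted_sum / total_parts


# DeepInfra row from the table above: $0.54 input, $3.40 output
deepinfra = blended_price(0.54, 3.40)
print(f"DeepInfra blended estimate: ${deepinfra:.3f} per 1M tokens")
```

Under that assumed mix, DeepInfra's pricing works out to about $1.255 per million tokens, in line with the $1.25/M blended figure quoted for DeepInfra in the FAQ. A workload with a different input/output ratio can pass its own weights.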
OpenRouter Code Example
from openai import OpenAI
client = OpenAI(
    api_key="your-openrouter-key",
    base_url="https://openrouter.ai/api/v1",
)
response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",  # $0.15/M input, 262K context
    messages=[
        {"role": "user", "content": "Explain gradient descent in 3 sentences."}
    ],
    temperature=0.7,
    top_p=0.8,
    extra_headers={
        "HTTP-Referer": "https://yourapp.com",  # optional, for OpenRouter leaderboard
        "X-Title": "Your App Name",  # optional
    },
)
print(response.choices[0].message.content)
OpenRouter normalizes requests and responses across its provider network, so the same code works whether the request routes to io.net, Parasail, AkashML, or SiliconFlow. Automatic failover means uptime is higher than relying on any single provider directly.
OpenAI SDK Integration Patterns
Because every provider above uses an OpenAI-compatible API, the same Python SDK call works for all of them — only base_url, api_key, and model change. Swap those three variables and your code works against Alibaba Cloud direct, OpenRouter, or any self-hosted vLLM instance.
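The three values that change between providers can be kept in one lookup table, as in the sketch below. The DashScope and OpenRouter base URLs and model IDs come from this guide; the local vLLM address, the OPENROUTER_API_KEY and VLLM_API_KEY variable names, and the helper name provider_settings are assumptions made for illustration.

```python
import os

PROVIDERS = {
    "dashscope-intl": {
        "base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        "api_key_env": "DASHSCOPE_API_KEY",
        "model": "qwen-max",
    },
    "openrouter": {
        "base_url": "https://openrouter.ai/api/v1",
        "api_key_env": "OPENROUTER_API_KEY",  # assumed variable name
        "model": "qwen/qwen3.6-35b-a3b",
    },
    "local-vllm": {
        "base_url": "http://localhost:8000/v1",  # assumed local address
        "api_key_env": "VLLM_API_KEY",           # assumed variable name
        "model": "qwen3-32b",
    },
}


def provider_settings(name: str, env=os.environ) -> dict:
    """Return the base_url, api_key, and model for one provider."""
    try:
        entry = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None
    return {
        "base_url": entry["base_url"],
        "api_key": env.get(entry["api_key_env"], ""),
        "model": entry["model"],
    }


settings = provider_settings("openrouter", env={"OPENROUTER_API_KEY": "sk-or-example"})
print(settings["base_url"], settings["model"])
```

The base_url and api_key values go to the OpenAI(...) constructor and model goes to chat.completions.create(...), so switching providers becomes a one-word change.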
Streaming Responses
For interactive applications, stream tokens as they arrive rather than waiting for the full response. The Qwen API returns server-sent events in the same format as the OpenAI streaming spec:
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # CN users: use dashscope.aliyuncs.com
    api_key=os.environ["DASHSCOPE_API_KEY"],
)
stream = client.chat.completions.create(
    model="qwen-max",
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
    stream=True,
    max_tokens=32768,  # set explicitly: the API default may differ by provider
)
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. a trailing usage chunk) carry no choices
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()  # newline at end
Function Calling
Qwen supports parallel tool use with JSON-schema tool definitions. Conceptually: you define a list of tools the model is allowed to call. When the model decides a tool is needed, it returns a structured tool_calls response — not a final answer. Your code executes the actual function, then sends the result back in a second API call. Only then does the model produce the final user-facing answer. This is always a two-call pattern:
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name, e.g. 'Singapore'"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]
messages = [{"role": "user", "content": "What's the weather in Singapore?"}]

# Call 1: the model decides whether to use the tool (returns tool_calls, not a final answer)
response = client.chat.completions.create(
    model="qwen-max",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
msg = response.choices[0].message
if msg.tool_calls:
    # Append the model's tool_calls message once, before any tool results
    messages.append(msg)
    for call in msg.tool_calls:
        # Execute your actual function here; this result is a hard-coded placeholder
        result = {"temperature": 31, "condition": "Sunny", "humidity": "78%"}
        # One tool message per call, matched to the call by its id
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    # Call 2: the model uses the tool results to generate the final answer
    final = client.chat.completions.create(
        model="qwen-max",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)
else:
    # The model answered directly without calling a tool
    print(msg.content)
Multi-Turn Conversations: Strip Thinking Blocks from History
When building multi-turn conversations with thinking mode enabled, the official recommendation is to strip <think> blocks from historical assistant messages before sending them back. Including raw reasoning traces in prior turns inflates token costs and can confuse the model in subsequent turns:
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks from assistant history."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

history = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="qwen-max",
        messages=history,
        extra_body={"enable_thinking": True},
        max_tokens=32768,
    )
    raw_reply = response.choices[0].message.content
    clean_reply = strip_thinking(raw_reply)
    # Store the CLEAN reply in history (no think blocks)
    history.append({"role": "assistant", "content": clean_reply})
    return raw_reply  # return the full response to the user

This pattern keeps your conversation context efficient. The model re-reasons from the clean history rather than re-reading its own prior reasoning traces. For long agentic sessions with dozens of turns, this significantly reduces both token cost and prompt latency.
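For histories that already contain raw replies, the same idea can be applied to a whole message list. The sketch below is a minimal, standard-library version; it assumes the reasoning is delimited by <think>...</think> tags inside the message content, and clean_history is a hypothetical helper name.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> in message content.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def clean_history(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with think blocks removed from assistant turns."""
    cleaned = []
    for message in messages:
        content = message.get("content")
        if message.get("role") == "assistant" and isinstance(content, str):
            # Build a new dict so the caller's original message is not mutated
            message = {**message, "content": THINK_BLOCK.sub("", content).strip()}
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>\nSimple sum.\n</think>\n\nThe answer is 4."},
    {"role": "user", "content": "Quote the tag <think> literally."},
]
for message in clean_history(history):
    print(message["role"], "->", message["content"])
```

Only assistant turns are rewritten, so a user message that happens to mention the tag is left alone, and the original list is untouched in case the full replies are still needed for display.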
Thinking Mode & MCP Integration
Two features set Qwen apart from most OpenAI-compatible APIs: a toggleable thinking mode that controls whether the model reasons before responding, and native MCP server support in Qwen Code for agentic tool use.
Toggling Thinking Mode
What thinking mode does: When enabled, Qwen generates an internal chain of thought (wrapped in <think> tags in the raw output) before producing its final answer. This improves accuracy on complex reasoning, coding, and multi-step tasks, at the cost of higher token usage and added latency. For Qwen3 models, thinking is on by default; you explicitly disable it for tasks where speed matters more than reasoning depth.
The parameter name is the same but the syntax differs depending on where you run the model. Using the wrong syntax results in the parameter being silently ignored without any error — the model continues with its default behavior.
Note also that the /think and /no_think inline tokens (prepend to any message to toggle mode) work only in the base Qwen3 series. They are not available in Qwen3.5 or Qwen3.6 — use the enable_thinking API parameter for those series.
The extra_body parameter in the OpenAI SDK passes arbitrary JSON fields beyond the official spec — it is the standard way to send Qwen-specific parameters that the OpenAI API doesn't define natively. Keys in extra_body are forwarded as-is to the provider backend; if the backend ignores a key, no error is raised.
# Alibaba Cloud Model Studio (DashScope): top-level parameter
response = client.chat.completions.create(
    model="qwen-max",
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
    extra_body={
        "enable_thinking": False  # Alibaba API: top-level in extra_body
    },
    temperature=0.7,
    top_p=0.8,
)
# Local frameworks use chat_template_kwargs, NOT a top-level param
response = client.chat.completions.create(
    model="qwen3-32b",
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}  # wrapped in kwargs
    },
    temperature=0.7,
    top_p=0.8,
)
Alibaba Cloud Model Studio: pass extra_body={"enable_thinking": False} directly. Wrapping it in chat_template_kwargs will be ignored.
Local vLLM or llama.cpp: pass extra_body={"chat_template_kwargs": {"enable_thinking": False}}. A top-level enable_thinking is not recognized by the local server.
Preserve Thinking (Qwen3.6 Series Only)
The preserve_thinking parameter retains reasoning traces from all historical messages across a multi-turn session. Instead of re-reasoning from scratch on each turn, the model uses prior reasoning context — this reduces redundant computation, improves decision consistency across long conversations, and optimizes KV cache utilization. This parameter is specific to the Qwen3.6 model series:
# Alibaba Cloud — top-level in extra_body
response = client.chat.completions.create(
    model="qwen3.6-235b-a22b",  # Qwen3.6 model required
    messages=conversation_history,
    extra_body={
        "enable_thinking": True,
        "preserve_thinking": True  # retains reasoning from prior turns
    },
)
# Local vLLM: wrapped in chat_template_kwargs
response = client.chat.completions.create(
    model="qwen3.6-235b-a22b",
    messages=conversation_history,
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": True,
            "preserve_thinking": True
        }
    },
)
When to use it: Multi-step agentic tasks where earlier reasoning informs later decisions, such as planning sessions, debugging workflows, and code refactoring across many files. For single-turn tasks or chatbots, the overhead isn't worth it.
MCP Server Configuration
What is Qwen Code? Qwen Code (github.com/QwenLM/qwen-code) is a terminal-based AI coding agent — similar to Claude Code or GitHub Copilot CLI — that you run locally and configure via a settings file. Install it via npm: npm install -g @qwen-code/cli, then run qwen auth to authenticate. MCP server connections extend what Qwen Code can access (databases, APIs, file systems) and are configured per-machine in a single JSON file.
Qwen Code reads MCP server definitions from ~/.qwen/settings.json. Each entry in the mcpServers object defines one server. Three transport mechanisms are supported: Stdio (subprocess), SSE (Server-Sent Events), and Streamable HTTP:
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/projects"],
      "trust": true
    },
    "github-sse": {
      "url": "https://mcp.github.example.com/sse",
      "headers": {
        "Authorization": "Bearer ${GITHUB_TOKEN}"
      }
    },
    "internal-api": {
      "httpUrl": "https://api.internal.example.com/mcp",
      "includeTools": ["search_docs", "create_ticket"],
      "excludeTools": ["delete_project"]
    }
  }
}
Key configuration options: trust: true bypasses per-call confirmation prompts; use this only for servers you control. includeTools and excludeTools whitelist or blacklist specific tools from a server. When two servers expose a tool with the same name, the first registered server keeps the short name; subsequent servers get serverName__toolName prefixed names. The default timeout per MCP tool call is 600,000 ms (10 minutes).
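The filtering and naming rules described above can be illustrated with a short simulation. This is a sketch of the behavior as described in this guide, not Qwen Code's actual implementation: the register_tools function and the "tools" key (standing in for whatever a server advertises) are hypothetical, and letting excludeTools win over includeTools is an assumption.

```python
def register_tools(servers: dict[str, dict]) -> dict[str, tuple[str, str]]:
    """Map exposed tool names to (server, tool) in registration order."""
    registry = {}
    for server, config in servers.items():
        include = config.get("includeTools")  # None means "all tools"
        exclude = set(config.get("excludeTools", []))
        for tool in config["tools"]:
            if include is not None and tool not in include:
                continue
            if tool in exclude:  # assumption: exclusion wins over inclusion
                continue
            # The first server to register a name keeps the short form;
            # later servers get the serverName__toolName form.
            name = tool if tool not in registry else f"{server}__{tool}"
            registry[name] = (server, tool)
    return registry


servers = {
    "docs": {"tools": ["search_docs", "create_ticket"]},
    "internal-api": {
        "tools": ["search_docs", "create_ticket", "delete_project"],
        "includeTools": ["search_docs", "create_ticket", "delete_project"],
        "excludeTools": ["delete_project"],
    },
}
for name, origin in register_tools(servers).items():
    print(name, "->", origin)
```

In this example the docs server registers first and keeps the short names, the internal-api duplicates are exposed with the internal-api__ prefix, and delete_project is never registered.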
Lower the timeout field if your tools complete faster.
Troubleshooting
These are the most common integration failures when connecting to Qwen APIs for the first time, along with verified fixes.
Check three things in order: (1) Confirm the API key was copied in full from Alibaba Cloud Model Studio — keys are long strings, truncation is common. (2) Verify base_url matches your region. International users must use dashscope-intl.aliyuncs.com; domestic China users use dashscope.aliyuncs.com. Crossing regions with the wrong domain returns 401. (3) Confirm the key has the correct permissions enabled in Model Studio — a key restricted to a specific model won't authenticate for other models.
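The first two checks can be automated before a request is ever sent. The sketch below is a local sanity check only: the hostnames come from the endpoint list in this guide, the sk- prefix is assumed from the example key shown earlier, and check_dashscope_config is a hypothetical helper name. Key permissions (the third check) can only be confirmed in the Model Studio console.

```python
from urllib.parse import urlparse

# Hosts taken from the endpoint list in this guide.
REGION_HOSTS = {
    "intl": "dashscope-intl.aliyuncs.com",
    "cn": "dashscope.aliyuncs.com",
}


def check_dashscope_config(api_key: str, base_url: str, region: str) -> list[str]:
    """Return warnings; an empty list means nothing obvious is wrong."""
    warnings = []
    if not api_key or api_key != api_key.strip():
        warnings.append("API key is empty or has surrounding whitespace")
    elif not api_key.startswith("sk-"):  # assumption: keys use the sk- prefix
        warnings.append("API key does not start with 'sk-'")

    expected = REGION_HOSTS.get(region)
    host = urlparse(base_url).hostname
    if expected is None:
        warnings.append(f"unknown region {region!r}; expected one of {sorted(REGION_HOSTS)}")
    elif host != expected:
        warnings.append(f"base_url host is {host!r} but region {region!r} expects {expected!r}")
    return warnings


for warning in check_dashscope_config(
    api_key="sk-example ",  # trailing space, as if pasted carelessly
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    region="intl",
):
    print("WARNING:", warning)
```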
The enable_thinking syntax differs between Alibaba Cloud and local frameworks. For Alibaba Cloud, pass it as a top-level key inside extra_body: extra_body={"enable_thinking": False}. For vLLM or llama.cpp running locally, wrap it in chat_template_kwargs: extra_body={"chat_template_kwargs": {"enable_thinking": False}}. Using the wrong form causes the parameter to be silently ignored — the model will not return an error.
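Because the wrong shape fails silently, it can help to build the extra_body value in one place. The sketch below encodes the two forms described above; the helper name thinking_extra_body and the runtime labels are hypothetical.

```python
def thinking_extra_body(runtime: str, enabled: bool) -> dict:
    """Build the extra_body value that toggles thinking for a given runtime."""
    if runtime == "dashscope":
        # Alibaba Cloud: top-level key inside extra_body
        return {"enable_thinking": enabled}
    if runtime in ("vllm", "llama.cpp"):
        # Local servers: nested under chat_template_kwargs
        return {"chat_template_kwargs": {"enable_thinking": enabled}}
    # Fail loudly here, since the API itself would not report a mistake
    raise ValueError(f"unknown runtime: {runtime!r}")


print(thinking_extra_body("dashscope", False))
print(thinking_extra_body("vllm", False))
```

Raising on an unknown runtime is a deliberate choice in this sketch: it turns a silent misconfiguration into an immediate error on the client side.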
The Qwen3.5-397B model on Together AI has an independently benchmarked time-to-first-token of 46.5 seconds as of May 2026 — not a client-side issue, and subject to change if Together AI upgrades their infrastructure. Set your HTTP timeout to at least 120 seconds when using Together AI for 397B. For interactive applications where response latency matters, use DeepInfra (0.67s TTFT) or Clarifai instead. Together AI at this size is viable only for batch jobs where throughput is more important than interactivity.
The DeepInfra FP8 variant of Qwen3.5-397B does not support JSON mode (response_format: {"type": "json_object"}) or structured output parameters. If your application needs guaranteed JSON output, switch to a provider that does support it — Clarifai and OpenRouter both support JSON mode on Qwen models. Alternatively, prompt the model to output JSON and parse the response text manually, but this is less reliable.
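For the manual-parsing fallback, a tolerant extractor is usually needed because models often wrap JSON in prose or code fences. The sketch below uses only the standard library and scans for the first decodable object or array; extract_json is a hypothetical helper name, and the approach gives no schema guarantee.

```python
import json


def extract_json(text: str):
    """Return the first JSON object or array found in text, or raise ValueError."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            value, _end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
        return value
    raise ValueError("no JSON object or array found in response text")


reply = 'Sure, here it is:\n```json\n{"city": "Singapore", "temp_c": 31}\n```'
print(extract_json(reply))
```

A bracketed fragment in the surrounding prose, such as a footnote marker, would be returned first if it happens to be valid JSON, so validating the result against the expected keys is still advisable.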
The Qwen OAuth free tier was discontinued on April 15, 2026. The daily quota was reduced from 1,000 to 100 requests on April 13, then fully shut down two days later. Applications that relied on OAuth-based authentication need to migrate to DashScope API keys from Alibaba Cloud Model Studio. If you're using Qwen Code, run qwen auth (or /auth inside a session) to set up the new key-based authentication. The ~/.qwen/settings.json security.auth.selectedType field should be set to "apiKey".
DashScope rate limits are applied per model and per key, not just globally. If you're hitting limits, check which model the key is scoped to in Model Studio — a key provisioned for qwen-turbo counts separately from one for qwen-max. For sustained high-volume workloads, Alibaba Cloud's Coding Plan subscription provides higher quotas under a fixed monthly fee rather than per-token billing. Contact Alibaba Cloud support to upgrade quota tiers for API keys.
Frequently Asked Questions
Create a DashScope API key at Alibaba Cloud Model Studio. Set it as the environment variable DASHSCOPE_API_KEY and pass it as the api_key parameter when initializing the OpenAI client with the DashScope base_url. The Qwen OAuth free tier was discontinued on April 15, 2026 — API key authentication is now the only method for programmatic access.
DeepInfra is currently the most cost-effective option for Qwen3.5-397B at $0.54/M input and $3.40/M output tokens ($1.25/M blended), with the lowest time-to-first-token (0.67 seconds) in independent benchmarks. The caveat: the DeepInfra FP8 variant does not support JSON mode. If your application requires structured output, Clarifai ($1.35/M blended) is the next best option and also holds the throughput record at 268 tokens/second.
The syntax depends on where the model is hosted. For Alibaba Cloud Model Studio, set enable_thinking: False as a top-level key inside extra_body. For local deployments running vLLM or llama.cpp, wrap it in chat_template_kwargs: extra_body={"chat_template_kwargs": {"enable_thinking": False}}. Using the wrong form for your runtime results in the parameter being silently ignored.
Yes. All major Qwen access points (Alibaba Cloud Model Studio, OpenRouter, DeepInfra, and local vLLM) expose OpenAI-compatible APIs. Initialize the standard OpenAI client with base_url set to your provider's endpoint and api_key set to that provider's key, and pass that provider's model ID in each request. No other changes to existing OpenAI SDK code are required. Swap providers by changing those three values; everything else stays the same.