How do I authenticate with the Qwen API?

Create an account on Alibaba Cloud Model Studio and generate a DashScope API key. Set it as DASHSCOPE_API_KEY in your environment, or pass it directly to the OpenAI SDK as api_key. The Qwen OAuth free tier was discontinued April 15, 2026 — only DashScope API keys and the Alibaba Cloud Coding Plan are available now.

What is the cheapest way to access Qwen models via API?

For the Qwen3.5-397B flagship, DeepInfra offers the lowest cost at $0.54/M input and $3.40/M output (blended $1.25/M) with the lowest latency (0.67s TTFT). For Qwen3.6-35B-A3B, OpenRouter routes to multiple providers at $0.15/M input and $1.00/M output. Direct Alibaba Cloud access costs $2.50/M input for Qwen3.7-Max.

How do I toggle thinking mode in the Qwen API?

For Alibaba Cloud Model Studio: pass enable_thinking=False in the request body. For local frameworks (llama.cpp, vLLM): use chat_template_kwargs={"enable_thinking": False}. Note: the syntax differs between Alibaba API and local inference frameworks. For Qwen3.6 series, you can also enable preserve_thinking=True to retain reasoning context across multi-turn conversations.

Can I use Qwen with the OpenAI SDK?

Yes. All Qwen cloud endpoints (Alibaba Cloud, OpenRouter, DeepInfra, Together AI) expose OpenAI-compatible APIs. Set base_url to your provider's endpoint, api_key to your provider key, and model to that provider's model ID. The same Python or Node.js openai SDK code then works across all providers with no other changes.


Qwen API Guide: Cloud Integration & SDK Reference (2026)

Qwen's API surface is wider than most developers realize: a direct Alibaba Cloud endpoint (OpenAI-compatible or Anthropic-protocol), six third-party inference providers, a native OpenAI SDK integration, thinking mode parameters that differ by framework, and a Model Context Protocol layer that connects your tools directly to the model. This guide covers every integration path with code examples and the specific caveats that will save you debugging time. If you would rather avoid the hosted API entirely, our guide to running Qwen locally covers self-hosting on your own hardware.

Prerequisites & Integration Paths

Qwen API at a Glance (as of May 2026)

2

API protocols on Alibaba Cloud: OpenAI-compatible and Anthropic-compatible, with no adapter needed for either

$0.15

Per 1M input tokens for Qwen3.6-35B-A3B via OpenRouter — the lowest cost frontier-class inference option

1M

Token context window on Qwen3.7-Max (stated ceiling; full-window retrieval reliability not independently confirmed), compared with 262K on Qwen3.6-35B-A3B

65K

Max output tokens per request with extended thinking enabled — plan context budgets accordingly

Before connecting, decide which integration path fits your use case. The five paths below correspond to five distinct setups — you only need to complete the steps for your chosen approach. If you are still choosing a model, our breakdown of the Qwen3 model family compares the tiers.

5 Integration Paths

Alibaba Cloud Direct

DashScope API key + direct endpoint — OpenAI or Anthropic protocol. Best for full feature access, prompt caching, and enterprise quotas.

OpenRouter (Aggregator)

One API key, automatic failover across providers, OpenAI-compatible. Best for lowest per-token cost on Qwen3.6-35B-A3B.

OpenAI SDK (Any Provider)

Drop-in OpenAI Python/Node.js SDK — change base_url and api_key only. Works with Alibaba Cloud, OpenRouter, DeepInfra, Together AI.

Thinking Mode Control

Toggle reasoning on/off per-request and preserve thinking context across turns. Syntax differs between Alibaba Cloud API and local frameworks.

MCP Server Integration

Connect external tools (databases, APIs, file systems) via Model Context Protocol — configured in ~/.qwen/settings.json.

Prerequisites Checklist

Alibaba Cloud account — required for DashScope API keys (dashscope.aliyuncs.com)

Python 3.8+ or Node.js 18+ for SDK examples

pip install openai (Python) or npm install openai (Node.js) — OpenAI SDK works unchanged

API key from your chosen provider: DashScope (Alibaba), OpenRouter, DeepInfra, or Together AI

Note: Qwen OAuth free tier was discontinued April 15, 2026 — only paid API keys and Coding Plan are available

Path 1: Alibaba Cloud Model Studio (Direct)

Alibaba Cloud Model Studio — also called DashScope, Alibaba's AI API platform — is the primary endpoint for Qwen. It exposes two distinct API protocols at different base URLs — use the one that matches your existing SDK or tool.

Endpoint URLs

# OpenAI-compatible protocol — domestic/Beijing (China users)
OPENAI_BASE_URL_CN  = "https://dashscope.aliyuncs.com/compatible-mode/v1"
# OpenAI-compatible protocol — international/Singapore (all other regions — use this)
OPENAI_BASE_URL_INT = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"

# Anthropic-compatible protocol (Claude Code, Anthropic SDK) — international domain
ANTHROPIC_BASE_URL  = "https://dashscope-intl.aliyuncs.com/apps/anthropic"

# Using the wrong regional domain returns 401 — verify your region in the DashScope console
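
A small helper can make the regional choice explicit instead of leaving it to a copied string. The sketch below is an illustration only and not part of any Alibaba SDK: the function name and the "cn" / "intl" labels are hypothetical, and the two URLs are the ones listed above.

```python
# Base URLs are the two OpenAI-compatible endpoints listed in this guide.
# The region labels ("cn", "intl") are hypothetical names chosen for this sketch.
DASHSCOPE_BASE_URLS = {
    "cn": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "intl": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
}


def dashscope_base_url(region: str) -> str:
    """Return the OpenAI-compatible base URL for a region label ("cn" or "intl")."""
    key = region.strip().lower()
    if key not in DASHSCOPE_BASE_URLS:
        raise ValueError(
            f"unknown region {region!r}; expected one of {sorted(DASHSCOPE_BASE_URLS)}"
        )
    return DASHSCOPE_BASE_URLS[key]


print(dashscope_base_url("intl"))
```

Failing loudly on an unknown label is a deliberate choice here: a silent default would hide exactly the wrong-region mistake that produces the 401.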

Authentication Setup

Generate a DashScope API key from the Alibaba Cloud Model Studio console. For high-volume use, the Alibaba Cloud Coding Plan offers a fixed monthly fee with significantly higher daily quotas than per-token billing — useful for teams where predictable costs matter more than pay-as-you-go flexibility.

export DASHSCOPE_API_KEY="sk-..."   # or store in .env

# Verify connectivity:
curl https://dashscope-intl.aliyuncs.com/compatible-mode/v1/models \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" | python -m json.tool
# CN users: replace with dashscope.aliyuncs.com

Python — OpenAI SDK (Alibaba Cloud)

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # CN users: dashscope.aliyuncs.com
)

response = client.chat.completions.create(
    model="qwen-max",   # or qwen-plus, qwen-turbo — verify IDs in DashScope console
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key differences between MoE and dense models."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=4096,
    extra_body={"top_k": 20},   # top_k is not an OpenAI SDK argument; send it via extra_body
)

print(response.choices[0].message.content)

The sampling parameters used here are the Alibaba-recommended instruct-mode defaults: temperature=0.7 (controls randomness), top_p=0.8 (nucleus sampling threshold), and top_k=20, a Qwen-specific parameter that limits sampling to the 20 most likely tokens at each step. top_k is not in the OpenAI spec, so the OpenAI Python SDK rejects it as a direct keyword argument; pass it through extra_body instead. Providers that don't support it may silently ignore it. When thinking mode is enabled (enable_thinking: True), use temperature=0.6 and top_p=0.95 instead; these are the Alibaba-recommended values for the reasoning-heavy mode.
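
One way to keep the two sets of recommended values from drifting apart is to centralize them. This is a minimal sketch, assuming the Alibaba Cloud request shape (enable_thinking at the top level of extra_body); the function name is hypothetical, and top_k is only included for instruct mode because that is the only mode the text above gives a value for.

```python
def sampling_params(thinking: bool) -> dict:
    """Return keyword arguments for chat.completions.create for the chosen mode.

    Values are the recommended defaults quoted in this guide. The extra_body
    layout assumes the Alibaba Cloud endpoint, not a local vLLM/llama.cpp server.
    """
    if thinking:
        return {
            "temperature": 0.6,
            "top_p": 0.95,
            "extra_body": {"enable_thinking": True},
        }
    return {
        "temperature": 0.7,
        "top_p": 0.8,
        "extra_body": {"enable_thinking": False, "top_k": 20},
    }


print(sampling_params(thinking=False))
```

The result can be unpacked into a request, for example client.chat.completions.create(model="qwen-max", messages=messages, **sampling_params(thinking=True)).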

Model ID strings (like qwen-max, qwen-plus, qwen-turbo) map to specific Qwen model versions in the DashScope backend. Verify current mappings in the Alibaba Cloud Model Studio console — they update as new model versions release.

Anthropic Protocol (Claude Code Integration)

Qwen3.7-Max natively implements the Anthropic API protocol. Point Claude Code at Alibaba Cloud with two environment variables and the --model flag; no adapter code is needed:

export ANTHROPIC_BASE_URL="https://dashscope-intl.aliyuncs.com/apps/anthropic"
export ANTHROPIC_API_KEY="$DASHSCOPE_API_KEY"

claude --model qwen-max   # --model selects the model for this session

Path 2: Third-Party Providers & OpenRouter

If you want one API key that routes to multiple Qwen providers with automatic failover, or you need a lower per-token rate than Alibaba Cloud's direct pricing, third-party inference platforms are the answer. If you only need OpenRouter, skip ahead to the OpenRouter code example below the table.

The table compares providers as of May 2026. "Blended" pricing is a weighted average of input and output costs across a typical request mix — useful for rough cost comparison when input/output ratios vary. "TTFT" is time-to-first-token: the delay between sending your request and receiving the first streamed token. For interactive applications, TTFT matters more than throughput; for batch jobs, throughput matters more. A "—" in the TTFT column means the figure wasn't available from benchmarks used for this guide.

Provider	Model	Input / Output ($/1M)	TTFT	Func Calling	Best For
Alibaba Cloud	qwen-max (Qwen3.7-Max)	$2.50 / $7.50	—	Yes + JSON	Full feature support, prompt caching, Anthropic protocol
DeepInfra (FP8)	Qwen3.5-397B-A17B	$0.54 / $3.40	0.67s ★	Yes (no JSON mode)	Lowest cost + lowest latency for 397B
OpenRouter	qwen/qwen3.6-35b-a3b	$0.15 / $1.00	varies	Yes + JSON	One API key, automatic failover, lowest cost for 35B
Clarifai	Qwen3.5-397B	$1.35 blended	—	Yes + JSON	Highest throughput (268 t/s) — batch-heavy workloads
Together AI	Qwen3.5-397B-A17B	—	46.5s ⚠	Yes + JSON	Batch/background only — also hosts Qwen3.7-Max (1M context, normal TTFT)
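
The blended figures can be sanity-checked with simple arithmetic. The sketch below is an assumption-laden illustration: the guide does not state which input/output weighting was used, and the 3:1 input-to-output ratio here is a guess that happens to land within a cent of the quoted $1.25 for DeepInfra. The function names are hypothetical.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted average $/1M tokens. The 3:1 default weighting is an assumption."""
    total_parts = input_parts + output_parts
    return (input_per_m * input_parts + output_per_m * output_parts) / total_parts


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request given per-1M-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# DeepInfra row: $0.54 input, $3.40 output
print(f"blended: ${blended_price(0.54, 3.40):.3f} per 1M tokens")

# OpenRouter row: 10,000 input tokens and 2,000 output tokens at $0.15 / $1.00
print(f"one request: ${request_cost(10_000, 2_000, 0.15, 1.00):.4f}")
```

For a workload whose real ratio is far from 3:1 (for example long prompts with short answers), request_cost on measured token counts is a better comparison than any blended number.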

Provider Caveats

Together AI: 46.5s TTFT on 397B (interactive use)

Together AI's 46.5-second time-to-first-token for Qwen3.5-397B makes it unsuitable for any user-facing or real-time application. Once processing completes, throughput is reasonable (100 t/s). Use it only for offline batch tasks where latency is not a constraint. Their Qwen3.7-Max endpoint (1M context) has normal latency.

DeepInfra FP8: no JSON mode

DeepInfra's FP8 quantized serving of Qwen3.5-397B offers the lowest cost and latency but does not support JSON mode (structured outputs). Applications that require reliable JSON schema enforcement must use Clarifai, OpenRouter, or Alibaba Cloud instead.

OpenRouter Code Example

from openai import OpenAI

client = OpenAI(
    api_key="your-openrouter-key",
    base_url="https://openrouter.ai/api/v1",
)

response = client.chat.completions.create(
    model="qwen/qwen3.6-35b-a3b",   # $0.15/M input, 262K context
    messages=[
        {"role": "user", "content": "Explain gradient descent in 3 sentences."}
    ],
    temperature=0.7,
    top_p=0.8,
    extra_headers={
        "HTTP-Referer": "https://yourapp.com",    # optional, for OpenRouter leaderboard
        "X-Title": "Your App Name",                # optional
    },
)
print(response.choices[0].message.content)

OpenRouter normalizes requests and responses across its provider network — the same code works whether the request routes to io.net, Parasail, AkashML, or SiliconFlow. Automatic failover means uptime is higher than relying on any single provider directly.


OpenAI SDK Integration Patterns

Because every provider above uses an OpenAI-compatible API, the same Python SDK call works for all of them — only base_url, api_key, and model change. Swap those three variables and your code works against Alibaba Cloud direct, OpenRouter, or any self-hosted vLLM instance.
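
The three-value swap can be captured in a small lookup table. This is a sketch under stated assumptions: the base URLs and model IDs are the ones used elsewhere in this guide, while the Provider class, the client_settings helper, and the OPENROUTER_API_KEY variable name are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Provider:
    base_url: str
    api_key_env: str   # name of the environment variable holding the key
    model: str


PROVIDERS = {
    "alibaba": Provider(
        "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        "DASHSCOPE_API_KEY",
        "qwen-max",
    ),
    "openrouter": Provider(
        "https://openrouter.ai/api/v1",
        "OPENROUTER_API_KEY",   # hypothetical variable name
        "qwen/qwen3.6-35b-a3b",
    ),
}


def client_settings(name: str, env: Mapping[str, str] = os.environ) -> Tuple[dict, str]:
    """Return (kwargs for OpenAI(...), model ID) for the named provider."""
    provider = PROVIDERS[name]
    kwargs = {"base_url": provider.base_url, "api_key": env[provider.api_key_env]}
    return kwargs, provider.model


# Usage (requires the openai package and a real key in the environment):
#   from openai import OpenAI
#   kwargs, model = client_settings("openrouter")
#   client = OpenAI(**kwargs)
#   client.chat.completions.create(model=model, messages=[...])
```

Reading the key by environment variable name keeps secrets out of the table, and a missing variable raises KeyError before any request is sent.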

Streaming Responses

For interactive applications, stream tokens as they arrive rather than waiting for the full response. The Qwen API returns server-sent events in the same format as the OpenAI streaming spec:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # CN users: use dashscope.aliyuncs.com
    api_key=os.environ["DASHSCOPE_API_KEY"],
)

stream = client.chat.completions.create(
    model="qwen-max",
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
    stream=True,
    max_tokens=32768,   # set explicitly — API default may differ by provider
)

for chunk in stream:
    if not chunk.choices:   # some providers send chunks with no choices (e.g. usage-only)
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()  # newline at end
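
When the full text is also needed after streaming (for logging or for appending to history), the deltas can be accumulated as they arrive. The sketch below is a hypothetical helper, tested here against stand-in objects that mimic the chunk shape used above (choices[0].delta.content) rather than against a live endpoint.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:          # skip chunks that carry no choices
            continue
        content = chunk.choices[0].delta.content
        if content:                    # skip None and empty deltas
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK chunk object, for demonstration only."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Hel"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("lo")]
print(collect_stream(demo))  # Hello
```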

Function Calling

Qwen supports parallel tool use with JSON-schema tool definitions. Conceptually: you define a list of tools the model is allowed to call. When the model decides a tool is needed, it returns a structured tool_calls response instead of a final answer. Your code executes the actual function, then sends the result back in a second API call. Only then does the model produce the final user-facing answer. Whenever a tool is invoked, this is a two-call pattern:

import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name, e.g. 'Singapore'"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [{"role": "user", "content": "What's the weather in Singapore?"}]

# Call 1 — model decides to use the tool (returns tool_calls, not final answer)
response = client.chat.completions.create(
    model="qwen-max",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

msg = response.choices[0].message
if msg.tool_calls:
    # Append the model's tool_calls message once, before any tool results
    messages.append(msg)

    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)   # arguments arrive as a JSON string
        # Execute your actual function here using args; this result is a placeholder
        result = {"temperature": 31, "condition": "Sunny", "humidity": "78%"}

        # One "tool" message per call, matched to the call by its id
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })

    # Call 2: model uses the tool results to generate the final answer
    final = client.chat.completions.create(
        model="qwen-max",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)
else:
    # The model answered directly without calling a tool
    print(msg.content)
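
The "execute your actual function" step is usually a lookup from tool name to a local function. The sketch below shows one hypothetical way to do that; TOOL_REGISTRY, run_tool_call, and the placeholder get_weather body are all illustrative, and the demo uses a stand-in object with the same attributes as an SDK tool call (id, function.name, function.arguments).

```python
import json
from types import SimpleNamespace


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Placeholder implementation; a real version would call a weather service."""
    return {"location": location, "unit": unit, "temperature": 31}


# Maps the tool names declared in `tools` to the functions that implement them.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call) -> dict:
    """Execute one tool call and return the "tool" message to append to messages."""
    func = TOOL_REGISTRY.get(call.function.name)
    if func is None:
        result = {"error": f"unknown tool {call.function.name!r}"}
    else:
        try:
            result = func(**json.loads(call.function.arguments))
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments that don't match the function signature
            result = {"error": str(exc)}
    return {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}


demo_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Singapore"}'),
)
print(run_tool_call(demo_call))
```

Returning errors as tool results, rather than raising, lets the model see what went wrong and retry or explain on the second call.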

32,768

tokens — recommended max output for standard tasks; increase to 81,920 for complex math or multi-file coding benchmarks

Qwen3.6 Deployment Guide, qwen.readthedocs.io, 2026

Multi-Turn Conversations: Strip Thinking Blocks from History

When building multi-turn conversations with thinking mode enabled, the official recommendation is to strip <think>...</think> blocks from historical assistant messages before sending them back. Including raw reasoning traces in prior turns inflates token costs and can confuse the model in subsequent turns:

import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks from assistant history."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

history = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="qwen-max",
        messages=history,
        extra_body={"enable_thinking": True},
        max_tokens=32768,
    )

    raw_reply = response.choices[0].message.content
    clean_reply = strip_thinking(raw_reply)

    # Store CLEAN reply in history (no think blocks)
    history.append({"role": "assistant", "content": clean_reply})
    return raw_reply  # return full response to the user

This pattern keeps your conversation context efficient. The model re-reasons from the clean history rather than re-reading its own prior reasoning traces. For long agentic sessions with dozens of turns, this significantly reduces both token cost and prompt latency.
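
If the reasoning should be kept for logging while only the answer goes back into history, the reply can be split rather than stripped. This is a minimal sketch, assuming the reasoning arrives inline in the message content wrapped in <think> tags as described above; split_thinking is a hypothetical helper name.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_thinking(text: str):
    """Return (reasoning, answer): text inside <think> blocks, and text outside them."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


raw = "<think>User wants a sum.\n2 + 2 = 4.</think>\nThe answer is 4."
reasoning, answer = split_thinking(raw)
print(answer)  # The answer is 4.
```

Only answer would be appended to history; reasoning can be written to a log or shown in a collapsible UI element.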

Thinking Mode & MCP Integration

Two features set Qwen apart from most OpenAI-compatible APIs: a toggleable thinking mode that controls whether the model reasons before responding, and native MCP server support in Qwen Code for agentic tool use. You can find the rest of our Qwen guides on the Qwen tools hub.

Toggling Thinking Mode

What thinking mode does: When enabled, Qwen generates an internal chain of thought (wrapped in <think>...</think> tags in the raw output) before producing its final answer. This improves accuracy on complex reasoning, coding, and multi-step tasks, at the cost of higher token usage and added latency. For Qwen3 models, thinking is on by default; you explicitly disable it for tasks where speed matters more than reasoning depth.

The parameter name is the same but the syntax differs depending on where you run the model. Using the wrong syntax results in the parameter being silently ignored without any error — the model continues with its default behavior.

Note also that the /think and /no_think inline tokens (prepend to any message to toggle mode) work only in the base Qwen3 series. They are not available in Qwen3.5 or Qwen3.6 — use the enable_thinking API parameter for those series.

The extra_body parameter in the OpenAI SDK passes arbitrary JSON fields beyond the official spec — it is the standard way to send Qwen-specific parameters that the OpenAI API doesn't define natively. Keys in extra_body are forwarded as-is to the provider backend; if the backend ignores a key, no error is raised.

# Alibaba Cloud Model Studio (DashScope) — top-level parameter
response = client.chat.completions.create(
    model="qwen-max",
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
    extra_body={
        "enable_thinking": False   # Alibaba API: top-level in extra_body
    },
    temperature=0.7,
    top_p=0.8,
)

# Local frameworks use chat_template_kwargs, NOT a top-level param
response = client.chat.completions.create(
    model="qwen3-32b",
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}  # wrapped in kwargs
    },
    temperature=0.7,
    top_p=0.8,
)

Alibaba Cloud

Use extra_body={"enable_thinking": False} directly. Wrapping in chat_template_kwargs will be ignored.

Local vLLM / llama.cpp

Use extra_body={"chat_template_kwargs": {"enable_thinking": False}}. A top-level enable_thinking key is not recognized by the local server.
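
Because the wrong shape fails silently, it can help to build extra_body in one place. The sketch below is a hypothetical helper; the "alibaba" and "local" runtime labels are names chosen for this example, and the two shapes are the ones described above.

```python
def thinking_extra_body(runtime: str, enabled: bool) -> dict:
    """Build the extra_body dict that toggles thinking mode for a given runtime.

    runtime: "alibaba" for Alibaba Cloud Model Studio, "local" for vLLM / llama.cpp.
    """
    flags = {"enable_thinking": enabled}
    if runtime == "alibaba":
        return flags                              # top-level key
    if runtime == "local":
        return {"chat_template_kwargs": flags}    # wrapped for the chat template
    raise ValueError(f"unknown runtime {runtime!r}; expected 'alibaba' or 'local'")


print(thinking_extra_body("alibaba", False))
print(thinking_extra_body("local", False))
```

An unknown runtime raises immediately, which turns a silent misconfiguration into a visible error at request-building time.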

Preserve Thinking (Qwen3.6 Series Only)

The preserve_thinking parameter retains reasoning traces from all historical messages across a multi-turn session. Instead of re-reasoning from scratch on each turn, the model uses prior reasoning context — this reduces redundant computation, improves decision consistency across long conversations, and optimizes KV cache utilization. This parameter is specific to the Qwen3.6 model series:

# Alibaba Cloud — top-level in extra_body
response = client.chat.completions.create(
    model="qwen3.6-235b-a22b",   # Qwen3.6 model required
    messages=conversation_history,
    extra_body={
        "enable_thinking": True,
        "preserve_thinking": True   # retains reasoning from prior turns
    },
)

# Local vLLM — wrapped in chat_template_kwargs
response = client.chat.completions.create(
    model="qwen3.6-235b-a22b",
    messages=conversation_history,
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": True,
            "preserve_thinking": True
        }
    },
)

When to use it: Multi-step agentic tasks where earlier reasoning informs later decisions — planning sessions, debugging workflows, code refactoring across many files. For single-turn tasks or chatbots, the overhead isn't worth it.

MCP Server Configuration

What is Qwen Code? Qwen Code (github.com/QwenLM/qwen-code) is a terminal-based AI coding agent, similar to Claude Code or GitHub Copilot CLI, that you run locally and configure via a settings file. Install it via npm: npm install -g @qwen-code/qwen-code, then run qwen auth to authenticate. MCP server connections extend what Qwen Code can access (databases, APIs, file systems) and are configured per-machine in a single JSON file.

Qwen Code reads MCP server definitions from ~/.qwen/settings.json. Each entry in the mcpServers object defines one server. Three transport mechanisms are supported: Stdio (subprocess), SSE (Server-Sent Events), and Streamable HTTP:

{
  "mcpServers": {

    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/projects"],
      "trust": true
    },

    "github-sse": {
      "url": "https://mcp.github.example.com/sse",
      "headers": {
        "Authorization": "Bearer ${GITHUB_TOKEN}"
      }
    },

    "internal-api": {
      "httpUrl": "https://api.internal.example.com/mcp",
      "includeTools": ["search_docs", "create_ticket"],
      "excludeTools": ["delete_project"]
    }

  }
}

Key configuration options: trust: true bypasses per-call confirmation prompts — use this only for servers you control. includeTools and excludeTools whitelist or blacklist specific tools from a server. When two servers expose a tool with the same name, the first registered server keeps the short name; subsequent servers get serverName__toolName prefixed names. The default timeout per MCP tool call is 600,000 ms (10 minutes).
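
The filtering and naming rules above can be modeled in a few lines, which is useful for predicting what tool names an agent will see before starting it. This is a sketch of the described behavior, not Qwen Code's actual implementation: the function names and the offered tool lists are hypothetical, the transport is inferred from which key is present (command, url, httpUrl), and giving excludeTools precedence over includeTools is an assumption.

```python
import json

SETTINGS_JSON = """
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/projects"]
    },
    "internal-api": {
      "httpUrl": "https://api.internal.example.com/mcp",
      "includeTools": ["search_docs", "create_ticket"],
      "excludeTools": ["delete_project"]
    }
  }
}
"""


def transport_of(server: dict) -> str:
    """Infer the transport from the key used in the server entry."""
    if "command" in server:
        return "stdio"
    if "httpUrl" in server:
        return "streamable-http"
    if "url" in server:
        return "sse"
    raise ValueError("server entry needs one of: command, url, httpUrl")


def visible_tools(server: dict, offered: list) -> list:
    """Apply includeTools / excludeTools (exclusion wins; that precedence is assumed)."""
    include = server.get("includeTools")
    exclude = set(server.get("excludeTools", []))
    return [t for t in offered if (include is None or t in include) and t not in exclude]


def register_names(servers: dict, offered_tools: dict) -> dict:
    """Map registered name -> (server, tool); later duplicates get a server prefix."""
    registry = {}
    for server_name, config in servers.items():
        for tool in visible_tools(config, offered_tools.get(server_name, [])):
            name = tool if tool not in registry else f"{server_name}__{tool}"
            registry[name] = (server_name, tool)
    return registry


servers = json.loads(SETTINGS_JSON)["mcpServers"]
# Hypothetical tool lists; a real client discovers these from each server.
offered = {
    "filesystem": ["read_file", "search_docs"],
    "internal-api": ["search_docs", "create_ticket", "delete_project"],
}
registry = register_names(servers, offered)
for name, origin in registry.items():
    print(name, "->", origin)
```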

600,000 ms

default MCP tool call timeout — 10 minutes per invocation. Override per-server with the timeout field if your tools complete faster.

Qwen Code MCP Documentation, github.com/QwenLM/qwen-code, 2026

Troubleshooting

These are the most common integration failures when connecting to Qwen APIs for the first time, along with verified fixes.

401 Unauthorized or authentication errors

Check three things in order: (1) Confirm the API key was copied in full from Alibaba Cloud Model Studio. Keys are long strings, and truncation is common. (2) Verify base_url matches your region. International users must use dashscope-intl.aliyuncs.com; domestic China users use dashscope.aliyuncs.com. Crossing regions with the wrong domain returns 401. (3) Confirm the key has the correct permissions enabled in Model Studio. A key restricted to a specific model won't authenticate for other models.

enable_thinking has no effect

The enable_thinking syntax differs between Alibaba Cloud and local frameworks. For Alibaba Cloud, pass it as a top-level key inside extra_body: extra_body={"enable_thinking": False}. For vLLM or llama.cpp running locally, wrap it in chat_template_kwargs: extra_body={"chat_template_kwargs": {"enable_thinking": False}}. Using the wrong form causes the parameter to be silently ignored; the model will not return an error.

Requests to Together AI time out on Qwen3.5-397B

The Qwen3.5-397B model on Together AI has an independently benchmarked time-to-first-token of 46.5 seconds as of May 2026. This is not a client-side issue, and it may change if Together AI upgrades their infrastructure. Set your HTTP timeout to at least 120 seconds when using Together AI for 397B. For interactive applications where response latency matters, use DeepInfra (0.67s TTFT) or Clarifai instead. Together AI at this size is viable only for batch jobs where throughput is more important than interactivity.

JSON mode fails on DeepInfra

The DeepInfra FP8 variant of Qwen3.5-397B does not support JSON mode (response_format: {"type": "json_object"}) or structured output parameters. If your application needs guaranteed JSON output, switch to a provider that does support it; Clarifai and OpenRouter both support JSON mode on Qwen models. Alternatively, prompt the model to output JSON and parse the response text manually, but this is less reliable.
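
For the manual-parsing fallback, the reply often contains prose or a code fence around the JSON. The sketch below is one hedged way to pull the first JSON object out of such text using only the standard library; extract_json is a hypothetical helper, it handles objects (not top-level arrays), and it does no schema validation.

```python
import json


def extract_json(text: str):
    """Return the first JSON object found in a model reply, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)   # try the next opening brace
    return None


reply = 'Sure, here is the result:\n{"city": "Singapore", "temp_c": 31}\nLet me know if you need more.'
print(extract_json(reply))
```

A None result is the signal to retry the request or fall back; that is the reliability gap relative to a provider-enforced JSON mode.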

Qwen OAuth login no longer works

The Qwen OAuth free tier was discontinued on April 15, 2026. The daily quota was reduced from 1,000 to 100 requests on April 13, then fully shut down two days later. Applications that relied on OAuth-based authentication need to migrate to DashScope API keys from Alibaba Cloud Model Studio. If you're using Qwen Code, run qwen auth (or /auth inside a session) to set up the new key-based authentication. The ~/.qwen/settings.json security.auth.selectedType field should be set to "apiKey".

Rate limit errors on DashScope

DashScope rate limits are applied per model and per key, not just globally. If you're hitting limits, check which model the key is scoped to in Model Studio; a key provisioned for qwen-turbo counts separately from one for qwen-max. For sustained high-volume workloads, Alibaba Cloud's Coding Plan subscription provides higher quotas under a fixed monthly fee rather than per-token billing. Contact Alibaba Cloud support to upgrade quota tiers for API keys.

Frequently Asked Questions

How do I authenticate with the Qwen API?

Create a DashScope API key at Alibaba Cloud Model Studio. Set it as the environment variable DASHSCOPE_API_KEY and pass it as the api_key parameter when initializing the OpenAI client with the DashScope base_url. The Qwen OAuth free tier was discontinued on April 15, 2026, so API key authentication is now the only method for programmatic access.

What is the cheapest way to access Qwen models via API?

DeepInfra is currently the most cost-effective option for Qwen3.5-397B at $0.54/M input and $3.40/M output tokens ($1.25/M blended), with the lowest time-to-first-token (0.67 seconds) in independent benchmarks. The caveat: the DeepInfra FP8 variant does not support JSON mode. If your application requires structured output, Clarifai ($1.35/M blended) is the next best option and also has the highest throughput in this comparison at 268 tokens/second.

How do I toggle thinking mode in the Qwen API?

The syntax depends on where the model is hosted. For Alibaba Cloud Model Studio, set enable_thinking: False as a top-level key inside extra_body. For local deployments running vLLM or llama.cpp, wrap it in chat_template_kwargs: extra_body={"chat_template_kwargs": {"enable_thinking": False}}. Using the wrong form for your runtime results in the parameter being silently ignored.

Can I use Qwen with the OpenAI SDK?

Yes. All major Qwen access points (Alibaba Cloud Model Studio, OpenRouter, DeepInfra, and local vLLM) expose OpenAI-compatible APIs. Initialize the standard OpenAI client with base_url set to your provider's endpoint and api_key set to that provider's key, and use that provider's model ID in each request. No other changes to existing OpenAI SDK code are required.