DeepSeek

How to Run DeepSeek V4 Cost-Effectively: API, Providers, and Self-Hosting (2026)

Q: What hardware do I need to self-host DeepSeek V4?

V4-Flash at INT4 needs roughly 140 to 158 GB of VRAM, which fits on a single H100, two A100s, or four RTX 4090s. V4-Flash at FP8 needs about 500 GB (two H100s). V4-Pro at FP8 needs around 2.4 TB, which means a multi-node cluster of roughly sixteen H100s. Serve with vLLM or SGLang for production.

Last verified: June 2026 · Format: Guide · Est. time: 12-16 min

DeepSeek V4 is one of the cheapest frontier-class large language models you can run in 2026, but the difference between a well-tuned setup and a careless one can be a 10x swing in your monthly bill. There are three realistic paths: the official DeepSeek API, a third-party hosting provider, or self-hosting the open weights on your own hardware. This guide walks through the cost levers for each, with verified pricing as of April 2026, so you can budget with real numbers instead of guesses.

The headline figures are striking. On the official API, deepseek-v4-flash processes input at $0.14 per million tokens on a cache miss and just $0.0028 per million on a cache hit, with output at $0.28 per million. The heavier deepseek-v4-pro tier lists at $1.74 input and $3.48 output per million, though a launch promotion cut that by 75% through May 31, 2026, a discount that has now ended. Because the weights are released under the MIT license, you can also skip the API entirely and run the model yourself. Each path has a different cost profile, latency story, and compliance footprint, and this guide helps you choose deliberately.

$0.14

V4-Flash input per 1M tokens (cache miss), official API

Source: api-docs.deepseek.com (Apr 2026)

$0.0028

V4-Flash input per 1M tokens on a cache hit

Source: api-docs.deepseek.com (Apr 2026)

Token context window; max output 384K tokens

Source: api-docs.deepseek.com (Apr 2026)

MIT

License: self-host the open weights with no per-token fee

Source: DeepSeek model cards (Apr 2026)

Before You Start: What You Need

Running DeepSeek V4 cost-effectively is mostly a series of decisions made before you write a line of code. The hosted API lives at platform.deepseek.com, and the full pricing and model reference is at api-docs.deepseek.com. Work through the checklist below before committing to a budget.

Prerequisites Checklist

✓

An API key from platform.deepseek.com (or a third-party provider account)

✓

A decision on model tier: V4-Flash for volume, V4-Pro for hard reasoning

✓

A chosen endpoint format: OpenAI-compatible or Anthropic-compatible

✓

Hosted vs self-host decision based on your latency and budget needs

✓

A compliance check: DeepSeek infrastructure is in China; self-host for sensitive data

0 of 5 complete

Pricing changes frequently: The figures in this guide are verified as of April 2026, but DeepSeek adjusts API rates and promotions often. Always confirm current pricing at api-docs.deepseek.com before you finalize a budget. Treat every number here as a planning estimate, not a quote.

Step 1: Understand Official API Pricing

The official DeepSeek API is the simplest starting point and, for most workloads, the cheapest hosted option. There are two V4 tiers, and the gap between them is large enough that picking the right one is your single biggest cost lever.

V4-Flash vs V4-Pro Pricing

Model	Input (cache miss) / 1M	Input (cache hit) / 1M	Output / 1M
V4-Flash	$0.14	$0.0028	$0.28
V4-Pro (list)	$1.74	$0.0145	$3.48
V4-Pro (launch promo, 75% off)	$0.435	$0.003625	$0.87

Source: api-docs.deepseek.com, official pricing as of April 2026. The V4-Pro launch promotion ran through May 31, 2026 and has now ended; plan for list pricing.

V4-Flash is built for volume. At $0.14 per million input tokens on a cache miss, it undercuts most Western providers by a wide margin, and a cache hit drops that to a fraction of a cent. V4-Pro targets harder reasoning tasks where the extra capability is worth the higher price. The launch promotion brought V4-Pro down to $0.435 input and $0.87 output per million through May 31, 2026; it has now ended, so plan for list pricing.

Tracker variance: Some third-party pricing trackers report different numbers. One tracker (NxCode) lists $0.30 input and $0.50 output per million for a generic "V4." Treat the official figures above as primary. Tracker discrepancies usually reflect a different model variant, an older snapshot, or rounded estimates. When the numbers disagree, the vendor documentation wins.

Endpoints, Context, and Deprecation

Both V4 tiers support a 1-million-token context window and a maximum output of 384K tokens. The API exposes both an OpenAI-compatible endpoint and an Anthropic-compatible endpoint, so most existing SDKs work by changing the base URL and key. Note one scheduled change: the legacy model IDs deepseek-chat and deepseek-reasoner currently route to V4-Flash but retire on July 24, 2026. Pin explicit V4 model names now to avoid a surprise when they are removed.

Step 2: Cut Costs with Caching and Off-Peak Batching

Once you are on the official API, two mechanisms reduce your bill further without changing your code's logic: prefix caching and the off-peak discount window.

Prefix Caching

DeepSeek charges a cache-hit input rate that is dramatically lower than the cache-miss rate. For V4-Flash, that is $0.0028 per million versus $0.14, roughly a 50x reduction on cached input. To maximize hits, structure your prompts so that stable content, such as system instructions, tool definitions, and shared context, appears first and changes rarely. Variable user input goes last. Repeated calls that share a long static prefix benefit the most.

The Off-Peak Discount Window

DeepSeek runs a daily off-peak window from 16:30 to 00:30 UTC with substantial discounts. Historically this has offered up to 50% off V3 and V4 models and up to 75% off R1. If your workload can tolerate scheduling, batching non-urgent jobs into this window is the largest single discount lever after caching.

Verify V4 inclusion: The off-peak discount has been confirmed historically for R1 and V3, and V4 has been included in the published schedule, but the exact tiers and percentages change. Before you build a batch pipeline around the window, confirm the current V4 discount on the official pricing page at api-docs.deepseek.com.

FREE TEMPLATE

AI Risk Management Template

Identify, assess, and mitigate AI deployment risks before you ship.

Download Free →

Step 3: Compare Third-Party Providers

If you need data residency outside China, an enterprise SLA, or a provider you already use, several platforms host DeepSeek models. Pricing varies, and so does which model variant each provider exposes. The table below summarizes verified figures as of April 2026.

Provider	What they host / price (per 1M)	Notes
Official API	V4-Flash $0.14 in / $0.28 out; V4-Pro promo $0.435 / $0.87	Cheapest for most workloads; infrastructure in China
OpenRouter	Matches official Flash and Pro promo; V3.2 $0.252 / $0.378; R1 $0.70 / $2.50	Routes across multiple backends; easy multi-model access
Together AI	V3/V4 roughly $0.30-0.50 in, $0.50-0.90 out; R1 roughly $7-8	Minimum $5 credit; data outside China
Fireworks	Similar to Together AI	Performance-focused hosting; data outside China
AWS Bedrock	DeepSeek V3.2 $0.62 / $1.85	Enterprise IAM, VPC, and compliance controls
Azure AI Foundry	Varies by region and SKU	Confirm current rates in the Azure portal
Novita	Listed V4-Pro provider; price not disclosed here	Check the provider directly for current rates
Hugging Face	Open weights; no per-token API price	Download and host yourself, or use inference partners

Source: official DeepSeek docs plus provider listings (OpenRouter, AWS Bedrock) and pricing trackers, accessed April to June 2026. Provider pricing changes often; confirm in each provider's console before committing.

For pure cost on standard workloads, the official API is usually hardest to beat. Providers earn their premium through data residency, enterprise contracts, and integration with infrastructure you already operate. If you are on AWS or Azure, the convenience of staying inside one bill and one IAM boundary often outweighs a few cents per million tokens.

Step 4: Self-Host the Open Weights

Because DeepSeek V4 ships under the MIT license, you can run it on your own hardware with no per-token fee and no data leaving your infrastructure. This is the clean compliance path for regulated or data-sensitive workloads, and at very high volume it can be cheaper than any hosted API. The trade-off is the hardware and operational burden.

Hardware Requirements by Variant

Variant	Quantization	Approx. VRAM	Example Hardware
V4-Flash	INT4	~140-158 GB	1x H100, 2x A100, or 4x RTX 4090
V4-Flash	FP8	~500 GB	2x H100
V4-Pro	FP8	~2.4 TB	16x H100 cluster (862B-param BF16 checkpoint)

Source: DeepSeek model cards and technical documentation (April 2026). VRAM figures are approximate and depend on context length, batch size, and serving framework.

V4-Flash at INT4 is the realistic entry point for self-hosting. It fits on a single H100, a pair of A100s, or a four-card RTX 4090 rig, which puts a frontier-class model within reach of a single well-equipped workstation. Stepping up to FP8 roughly triples the memory footprint. V4-Pro is a different category: at FP8 it needs around 2.4 TB of VRAM, which means a multi-node cluster of roughly sixteen H100s. For most teams, V4-Pro self-hosting only makes sense at scale or under strict data-control mandates.

Serving Frameworks

Several mature inference frameworks serve DeepSeek V4 efficiently. The commonly used options are vLLM, SGLang, LMDeploy, TensorRT-LLM, and LightLLM. For production deployments, vLLM and SGLang are the most widely documented choices, with strong support for FP8 and BF16 weights and for parallelism across multiple GPUs.

Compliance angle: DeepSeek's hosted infrastructure is operated in China, which can create latency and data-residency concerns for some organizations. Self-hosting the open weights keeps every token inside your own environment, making it the cleanest path for workloads bound by data-handling rules. Weigh the hardware cost against the value of full data control.

Step 5: Pick the Right Path for Your Workload

The cheapest option depends on three things: your monthly token volume, your latency tolerance, and your data-sensitivity requirements. Use the cards below as a quick decision guide.

Best default

Official API + V4-Flash

$0.14 / 1M input (cache miss)

Lowest cost for most workloads
Prefix caching drops input to $0.0028 / 1M
Off-peak window for batch discounts
Best for high-volume, non-sensitive tasks

Official API + V4-Pro

$0.435 / 1M input (promo)

For harder reasoning and planning tasks
Launch promo ended May 31, 2026
Plan for $1.74 / $3.48 list after promo
Use only where Flash quality falls short

Third-Party Provider

Varies / provider pricing

OpenRouter, Together AI, Fireworks
AWS Bedrock and Azure AI Foundry for enterprise
Data stays outside China
Best for compliance and existing cloud bills

Self-Host (MIT)

$0 / per-token (hardware cost only)

V4-Flash INT4 fits on 1x H100 / 4x RTX 4090
No data leaves your infrastructure
vLLM or SGLang for production serving
Best for sensitive data and very high volume

Troubleshooting Common Cost and Reliability Issues

I keep getting 503 errors during peak hours+

DeepSeek's hosted infrastructure is operated in China and can be saturated during global peak demand, which surfaces as 503 responses or elevated latency. Implement exponential backoff with retries in your client, and shift non-urgent jobs into the off-peak window (16:30 to 00:30 UTC). If reliability is critical, route through a third-party provider such as OpenRouter, AWS Bedrock, or Azure AI Foundry, or self-host.

How do I schedule jobs for the off-peak discount?+

Queue non-time-sensitive work (batch summarization, bulk classification, offline evaluation) and release it between 16:30 and 00:30 UTC, when DeepSeek has historically discounted V3 and V4 by up to 50% and R1 by up to 75%. Confirm the current V4 discount tiers on the official pricing page before you build the pipeline, since the percentages and model inclusion can change.

My cache hit rate is low and costs are higher than expected+

Cache hits require a stable prompt prefix. Put your system instructions, tool definitions, and shared context first and keep them identical across calls; place variable user input last. If you reshuffle the prompt order or inject changing content near the top, you invalidate the cache and pay the full $0.14 per million input rate for V4-Flash instead of $0.0028.

My integration broke after July 24, 2026+

The legacy model IDs deepseek-chat and deepseek-reasoner retire on July 24, 2026. Until then they route to V4-Flash, but after that date calls referencing them will fail. Update your code to pin explicit V4 model names (for example deepseek-v4-flash or deepseek-v4-pro) ahead of the deprecation date.

Which provider is actually cheapest for my workload?+

For standard, non-sensitive workloads, the official API with V4-Flash plus caching is usually the lowest cost. Providers like Together AI and Fireworks price V3/V4 higher (roughly $0.30-0.50 input) but keep data outside China. AWS Bedrock and Azure AI Foundry add enterprise controls at a premium. Estimate your monthly input and output token volume, then multiply against each provider's verified rate before deciding.

Frequently Asked Questions

How much does the DeepSeek V4 API cost?+

As of April 2026, V4-Flash costs $0.14 per million input tokens on a cache miss, $0.0028 on a cache hit, and $0.28 per million output. V4-Pro lists at $1.74 input and $3.48 output per million, with a 75% launch promotion through May 31, 2026 that lowers it to $0.435 input and $0.87 output. Always confirm current rates at api-docs.deepseek.com.

What is the cheapest way to run DeepSeek V4?+

For most workloads, the official API with V4-Flash plus prefix caching and off-peak batching is the cheapest hosted option, with cached input at $0.0028 per million. At very high volume or under strict data-control rules, self-hosting the MIT-licensed open weights eliminates per-token fees entirely, leaving only hardware and operating costs.

What hardware do I need to self-host DeepSeek V4?+

V4-Flash at INT4 needs roughly 140-158 GB of VRAM, which fits on a single H100, two A100s, or four RTX 4090s. V4-Flash at FP8 needs about 500 GB (two H100s). V4-Pro at FP8 needs around 2.4 TB, which means a multi-node cluster of roughly sixteen H100s. Serve with vLLM or SGLang for production.

Is the DeepSeek API OpenAI-compatible?+

Yes. DeepSeek exposes both an OpenAI-compatible endpoint and an Anthropic-compatible endpoint. Most existing SDKs work by changing the base URL and API key. Both V4 tiers support a 1-million-token context window and up to 384K tokens of output.

Should I worry about data residency with DeepSeek?+

DeepSeek's hosted infrastructure is operated in China, which can raise compliance and data-residency concerns for regulated workloads. If that applies to you, either use a third-party provider that hosts the model outside China (Together AI, Fireworks, AWS Bedrock, Azure AI Foundry) or self-host the open weights so no data leaves your environment.

Do third-party providers charge more than the official API?+

Usually yes for raw token cost. Together AI and Fireworks price V3/V4 at roughly $0.30-0.50 input and $0.50-0.90 output per million. AWS Bedrock lists DeepSeek V3.2 at $0.62 input and $1.85 output. OpenRouter matches the official Flash and Pro promo rates. The premium buys data residency, enterprise controls, and integration with infrastructure you already use.

Video Resources

▶

DeepSeek V4 API Pricing and Setup

YouTube Search

▶

Self-Hosting DeepSeek V4 with vLLM and SGLang

YouTube Search

▶

DeepSeek Prompt Caching for Cost Savings

YouTube Search

Gallery

Contacts

How to Run DeepSeek V4 Cost-Effectively: API, Providers, and Self-Hosting (2026)

Before You Start: What You Need

Step 1: Understand Official API Pricing

V4-Flash vs V4-Pro Pricing

Endpoints, Context, and Deprecation

Step 2: Cut Costs with Caching and Off-Peak Batching

Prefix Caching

The Off-Peak Discount Window

Step 3: Compare Third-Party Providers

Step 4: Self-Host the Open Weights

Hardware Requirements by Variant

Serving Frameworks

Step 5: Pick the Right Path for Your Workload

Troubleshooting Common Cost and Reliability Issues

Frequently Asked Questions

Services

Learn

Company