Hugging Face Inference API: Serverless, Endpoints & Providers in 2026
Hugging Face offers three distinct inference products, each built for a different stage of your ML pipeline. The free Serverless API handles prototyping. Inference Endpoints give you dedicated GPU instances with SLA guarantees. Inference Providers route requests to partner platforms at pass-through pricing. This guide walks through the setup, usage, and cost math for each product so you can pick the right one for your workload.
Prerequisites
Before making your first inference call, you need three things: a Hugging Face account, an API token, and the huggingface_hub Python library. The entire setup takes under five minutes.
A read token is sufficient for downloading models and running inference requests. You only need a write token if you plan to upload models, push datasets, or modify repositories on the Hub.
Three Inference Products
Hugging Face splits its inference stack into three tiers. Each targets a different scale and reliability requirement. Understanding the differences prevents you from over-provisioning at prototype stage or under-provisioning in production.
The Serverless API is for experimentation. Inference Endpoints are for workloads where latency and uptime matter. Inference Providers give you access to partner hardware (Together AI, SambaNova, Cerebras, Fal, Groq) through a single API key with no price premium from Hugging Face.
Account Setup and Authentication
Every inference call requires a valid User Access Token. Hugging Face uses these tokens for both rate-limit tracking (on the free API) and billing (on Endpoints and Providers).
Creating Your Token
Navigate to huggingface.co/settings/tokens and create a new token. For inference-only use, read permissions are sufficient. Name the token descriptively (e.g., "inference-prod" or "local-dev") so you can revoke specific tokens later without disrupting other integrations.
Global CLI Login
Running huggingface-cli login stores your token in ~/.huggingface/token so the InferenceClient class picks it up automatically. You can also pass the token directly to the client constructor or set it as the HF_TOKEN environment variable.
Security note: Never commit tokens to version control. Use environment variables or a secrets manager in CI/CD pipelines. Hugging Face tokens can be revoked instantly from the settings page if compromised.
AI Risk Management Template
Identify, assess, and mitigate AI deployment risks
Download Free →Using the Serverless API
The Serverless API accepts HTTPS requests to any model hosted on the Hub. Hugging Face handles model loading, GPU allocation, and scaling behind the scenes. You send a POST request, wait for the result, and pay nothing.
The primary interface is the InferenceClient class from the huggingface_hub library. Instantiate it, call the task-specific method (e.g., text_generation, text_to_image, automatic_speech_recognition), and the library handles serialization, retries, and error formatting.
Supported Tasks
The API covers four broad categories:
- Text: text-generation, classification, question-answering, NER, summarization, translation
- Vision: text-to-image, image-to-image, classification, object-detection, document QA
- Audio: speech-to-text (Whisper), text-to-audio, speaker identification
- Embeddings: semantic search, RAG pipelines, recommendation engines
Rate limits on the free tier are not published as fixed numbers. Hugging Face describes them as "rate-limited" without specifying exact thresholds, and the limits vary by model popularity and current load. For prototyping and internal tools, this is rarely a problem. For anything customer-facing, plan to move to Endpoints or Providers.
Inference Endpoints
Inference Endpoints give you dedicated, auto-scaling GPU instances. Unlike the Serverless API, you choose the hardware, the region, and the scaling policy. Billing is per minute with scale-to-zero support, so you only pay when the endpoint is actively processing requests.
Hardware Tiers
| Hardware | Cost/hr | Best For |
|---|---|---|
| CPU | $0.03 | Neural network classifiers, NER, embeddings |
| T4 / L4 | $0.40 - $0.80 | 7-13B chat models, small Whisper |
| A10G / L40S | $1.00 - $1.80 | 13-30B chat, Stable Diffusion 3.5, FLUX |
| A100 / H100 | $1.29 - $10.00 | 70B+ models, high-throughput RAG, video gen |
Endpoints are SOC 2 compliant with SLA guarantees. Available regions include US and EU cloud zones. Native APAC, MENA, and HIPAA regions are not currently available. Organizations with AI governance requirements should evaluate data residency options before deploying.
Serving Engines (2026)
Hugging Face supports three serving engines for Endpoints. The default for new deployments is vLLM. SGLang is recommended for RAG workloads. TGI (Text Generation Inference) entered maintenance mode in 2026; existing TGI deployments continue to work, but Hugging Face recommends migrating new workloads to vLLM or SGLang.
Inference Providers
Inference Providers route your API calls to partner platforms while keeping the Hugging Face SDK interface. You use the same InferenceClient, the same model IDs, and the same method signatures. The difference is that the request runs on partner infrastructure instead of Hugging Face-managed GPUs.
Current providers include Together AI, SambaNova, Cerebras, Fal, and Groq. The API follows an OpenAI-compatible format, so existing code that targets the OpenAI SDK can often switch to Hugging Face Providers with minimal changes.
Zero markup: Hugging Face passes through the exact pricing from each provider. There is no additional fee layered on top. For example, Llama 70B runs at roughly $0.26 per million tokens, and FLUX image generation costs about $0.01 per image.
For organizations already using cloud-native ML services, Hugging Face also integrates with AWS Bedrock and SageMaker, Google Vertex AI, and Azure AI Foundry through their respective cloud platforms.
- ✓Prerequisites complete
- ✓Account and token configured
- ✓First Serverless API call
- ✓Endpoints evaluated
- ✓Providers pricing reviewed
Pricing and Cost Thresholds
Hugging Face pricing follows a clear escalation path. The free tier covers prototyping. Pro accounts ($9/user/month) increase your Inference Providers quota by 20x and give you access to 10 ZeroGPU Spaces. Beyond that, Endpoints and Providers are pay-as-you-go.
| Product | Cost | Billing Model |
|---|---|---|
| Hub Free | $0 | Unlimited public repos, basic CPU Spaces |
| Pro | $9/user/mo | 1 TB private storage, 20x Provider quota |
| Endpoints (CPU) | $0.03/hr | Per minute, scale-to-zero |
| Endpoints (GPU) | $0.40 - $10.00/hr | Per minute, scale-to-zero |
| Providers | Varies by partner | Pay-per-token, zero HF markup |
| Enterprise Hub | Custom | SSO, audit logs, on-prem connectors |
Deployment Thresholds
These thresholds help you decide when to move between tiers:
- Under ~5M tokens/month: The free Serverless API covers prototyping and internal tools without spending anything.
- 10M to 100M+ tokens/month: Inference Endpoints become cost-effective. Scale-to-zero keeps idle costs near zero.
- 500M+ tokens/month: Self-hosting on your own infrastructure starts to make financial sense.
- ~11B tokens/month: This is approximately where the build-versus-buy break-even line falls for most organizations.
Limitations and Caveats
Every inference product has trade-offs. Understanding them before you commit prevents mid-project surprises.
Hub statistics (2M+ models, 13M+ users, 30%+ of Fortune 500) are vendor-reported metrics. Endpoint pricing ranges depend on region and configuration. Inference Provider rates are set by the partner platforms and can change independently of Hugging Face.
Troubleshooting
Common issues when getting started with Hugging Face inference, and how to resolve them.
Go Deeper
Resources from across Tech Jacks Solutions
FREEAI Risk Management Template
Identify, assess, and mitigate AI deployment risks
EU AI Act Guide
Check your compliance obligations under the EU AI Act
FREEAI Bias Assessment
Evaluate bias risks before deploying any AI system
What Is Agentic AI?
Understand the architecture behind autonomous AI agents
AI Career Paths
Explore roles that work with these tools daily