What is the Hugging Face Inference API?

The Hugging Face Inference API is a free, rate-limited serverless endpoint that lets you run inference on any of the 2M+ models hosted on the Hugging Face Hub via simple HTTPS requests, with no infrastructure to manage.

How much does Hugging Face Inference cost?

The free Serverless API costs $0 with rate limits. Inference Endpoints start at $0.03/hr for CPU and range up to $10.00/hr for H100 GPUs, billed per minute with scale-to-zero. Inference Providers pass through partner pricing at zero markup, with Llama 70B at roughly $0.26 per million tokens.

What Python version and libraries are required?

You need Python 3.8 or newer and the huggingface_hub library installed via pip. A free Hugging Face Hub account and a User Access Token with read permissions are also required.

When should I switch from the free API to Inference Endpoints?

The free Serverless API works well for prototypes and internal tools under roughly 5 million tokens per month. Beyond that threshold, Inference Endpoints offer dedicated GPU instances with SLA guarantees and scale-to-zero billing, making them cost-effective for 10M to 100M+ tokens per month.

Which serving engines does Hugging Face support in 2026?

Hugging Face supports vLLM as the production default, SGLang for RAG workloads, and TGI which is in maintenance mode as of 2026. New deployments should use vLLM or SGLang.

HUGGING FACE

Hugging Face Inference API: Serverless, Endpoints & Providers in 2026

Hugging Face offers three distinct inference products, each built for a different stage of your ML pipeline. The free Serverless API handles prototyping. Inference Endpoints give you dedicated GPU instances with SLA guarantees. Inference Providers route requests to partner platforms at pass-through pricing. This guide walks through the setup, usage, and cost math for each product so you can pick the right one for your workload.

2M+

Models on Hub

huggingface.co

Free API Cost

Rate-limited, no SLA

$0.03

Endpoint CPU/hr

Billed per minute

Provider Markup

Pass-through pricing

Prerequisites

Before making your first inference call, you need three things: a Hugging Face account, an API token, and the huggingface_hub Python library. The entire setup takes under five minutes.

Setup Checklist

✓

Python 3.8+ installed (verify with python --version)

✓

Free Hugging Face account created at huggingface.co/join

✓

User Access Token with read permissions from account settings

✓

huggingface_hub installed via pip install huggingface_hub

✓

CLI authenticated via huggingface-cli login

0 of 5 complete

A read token is sufficient for downloading models and running inference requests. You only need a write token if you plan to upload models, push datasets, or modify repositories on the Hub.

Three Inference Products

Hugging Face splits its inference stack into three tiers. Each targets a different scale and reliability requirement. Understanding the differences prevents you from over-provisioning at prototype stage or under-provisioning in production.

Free

Serverless API

Managed, rate-limited, zero config

Cost $0

Target <5M tok/mo

SLA None

Production

Inference Endpoints

Dedicated GPU, scale-to-zero, SOC 2

Cost $0.03-$10/hr

Target 10M-100M+ tok/mo

SLA Guaranteed

Multi-Provider

Inference Providers

OpenAI-compatible routing, zero markup

Cost Pay-per-token

Providers Together, Groq, +3

Markup 0%

The Serverless API is for experimentation. Inference Endpoints are for workloads where latency and uptime matter. Inference Providers give you access to partner hardware (Together AI, SambaNova, Cerebras, Fal, Groq) through a single API key with no price premium from Hugging Face.

Account Setup and Authentication

Every inference call requires a valid User Access Token. Hugging Face uses these tokens for both rate-limit tracking (on the free API) and billing (on Endpoints and Providers).

Creating Your Token

Navigate to huggingface.co/settings/tokens and create a new token. For inference-only use, read permissions are sufficient. Name the token descriptively (e.g., "inference-prod" or "local-dev") so you can revoke specific tokens later without disrupting other integrations.

Global CLI Login

Running huggingface-cli login stores your token in ~/.huggingface/token so the InferenceClient class picks it up automatically. You can also pass the token directly to the client constructor or set it as the HF_TOKEN environment variable.

Security note: Never commit tokens to version control. Use environment variables or a secrets manager in CI/CD pipelines. Hugging Face tokens can be revoked instantly from the settings page if compromised.

FREE TEMPLATE

AI Risk Management Template

Identify, assess, and mitigate AI deployment risks

Download Free →

Using the Serverless API

The Serverless API accepts HTTPS requests to any model hosted on the Hub. Hugging Face handles model loading, GPU allocation, and scaling behind the scenes. You send a POST request, wait for the result, and pay nothing.

The primary interface is the InferenceClient class from the huggingface_hub library. Instantiate it, call the task-specific method (e.g., text_generation, text_to_image, automatic_speech_recognition), and the library handles serialization, retries, and error formatting.

Supported Tasks

The API covers four broad categories:

Text: text-generation, classification, question-answering, NER, summarization, translation
Vision: text-to-image, image-to-image, classification, object-detection, document QA
Audio: speech-to-text (Whisper), text-to-audio, speaker identification
Embeddings: semantic search, RAG pipelines, recommendation engines

~5M

Tokens per month is the practical ceiling for the free Serverless API before rate limits become a bottleneck for production workloads.

Rate limits on the free tier are not published as fixed numbers. Hugging Face describes them as "rate-limited" without specifying exact thresholds, and the limits vary by model popularity and current load. For prototyping and internal tools, this is rarely a problem. For anything customer-facing, plan to move to Endpoints or Providers.

Inference Endpoints

Inference Endpoints give you dedicated, auto-scaling GPU instances. Unlike the Serverless API, you choose the hardware, the region, and the scaling policy. Billing is per minute with scale-to-zero support, so you only pay when the endpoint is actively processing requests.

Hardware Tiers

Hardware	Cost/hr	Best For
CPU	$0.03	Neural network classifiers, NER, embeddings
T4 / L4	$0.40 - $0.80	7-13B chat models, small Whisper
A10G / L40S	$1.00 - $1.80	13-30B chat, Stable Diffusion 3.5, FLUX
A100 / H100	$1.29 - $10.00	70B+ models, high-throughput RAG, video gen

Endpoints are SOC 2 compliant with SLA guarantees. Available regions include US and EU cloud zones. Native APAC, MENA, and HIPAA regions are not currently available. Organizations with AI governance requirements should evaluate data residency options before deploying.

Serving Engines (2026)

Hugging Face supports three serving engines for Endpoints. The default for new deployments is vLLM. SGLang is recommended for RAG workloads. TGI (Text Generation Inference) entered maintenance mode in 2026; existing TGI deployments continue to work, but Hugging Face recommends migrating new workloads to vLLM or SGLang.

Inference Providers

Inference Providers route your API calls to partner platforms while keeping the Hugging Face SDK interface. You use the same InferenceClient, the same model IDs, and the same method signatures. The difference is that the request runs on partner infrastructure instead of Hugging Face-managed GPUs.

Current providers include Together AI, SambaNova, Cerebras, Fal, and Groq. The API follows an OpenAI-compatible format, so existing code that targets the OpenAI SDK can often switch to Hugging Face Providers with minimal changes.

Zero markup: Hugging Face passes through the exact pricing from each provider. There is no additional fee layered on top. For example, Llama 70B runs at roughly $0.26 per million tokens, and FLUX image generation costs about $0.01 per image.

For organizations already using cloud-native ML services, Hugging Face also integrates with AWS Bedrock and SageMaker, Google Vertex AI, and Azure AI Foundry through their respective cloud platforms.

Your Progress

0 of 5 steps complete

✓Prerequisites complete
✓Account and token configured
✓First Serverless API call
✓Endpoints evaluated
✓Providers pricing reviewed

Pricing and Cost Thresholds

Hugging Face pricing follows a clear escalation path. The free tier covers prototyping. Pro accounts ($9/user/month) increase your Inference Providers quota by 20x and give you access to 10 ZeroGPU Spaces. Beyond that, Endpoints and Providers are pay-as-you-go.

Product	Cost	Billing Model
Hub Free	$0	Unlimited public repos, basic CPU Spaces
Pro	$9/user/mo	1 TB private storage, 20x Provider quota
Endpoints (CPU)	$0.03/hr	Per minute, scale-to-zero
Endpoints (GPU)	$0.40 - $10.00/hr	Per minute, scale-to-zero
Providers	Varies by partner	Pay-per-token, zero HF markup
Enterprise Hub	Custom	SSO, audit logs, on-prem connectors

Deployment Thresholds

These thresholds help you decide when to move between tiers:

Under ~5M tokens/month: The free Serverless API covers prototyping and internal tools without spending anything.
10M to 100M+ tokens/month: Inference Endpoints become cost-effective. Scale-to-zero keeps idle costs near zero.
500M+ tokens/month: Self-hosting on your own infrastructure starts to make financial sense.
~11B tokens/month: This is approximately where the build-versus-buy break-even line falls for most organizations.

Limitations and Caveats

Every inference product has trade-offs. Understanding them before you commit prevents mid-project surprises.

Inference Endpoints are available in US and EU cloud regions only. Organizations with data residency requirements in Asia-Pacific or healthcare workloads requiring HIPAA compliance need to evaluate cloud catalog integrations (Bedrock, Vertex, Azure) as alternatives.

The Serverless API is rate-limited with no uptime guarantee. Rate limit thresholds are not documented as fixed numbers and vary by model load. Do not build customer-facing products on the free tier.

Hub statistics (2M+ models, 13M+ users, 30%+ of Fortune 500) are vendor-reported metrics. Endpoint pricing ranges depend on region and configuration. Inference Provider rates are set by the partner platforms and can change independently of Hugging Face.

Troubleshooting

Common issues when getting started with Hugging Face inference, and how to resolve them.

Common Issues

"401 Unauthorized" on API calls+

Your token is missing, expired, or has insufficient permissions. Run huggingface-cli whoami to verify your active token. If the command fails, re-authenticate with huggingface-cli login. Check that the token has at least read permissions in your account settings.

"Model is loading" timeout+

Cold starts happen when a model has not been accessed recently. The Serverless API loads models on demand, and large models (30B+ parameters) can take 30-90 seconds. Retry the request after a brief wait. For latency-sensitive workloads, switch to Inference Endpoints where the model stays loaded.

Rate limit errors on the free tier+

The free Serverless API has rate limits that are not published as fixed numbers. If you are hitting limits consistently, reduce request frequency with exponential backoff. For sustained workloads above ~5M tokens/month, move to Inference Endpoints or Providers. A Pro account ($9/month) also increases your Provider quota by 20x.

Endpoint scale-to-zero not working+

Scale-to-zero requires an idle timeout configuration on the endpoint. Check the endpoint settings in the Hugging Face dashboard. Also verify that no health-check or keep-alive process is sending periodic requests, which would prevent the endpoint from scaling down.

Wrong output format or truncated responses+

Verify you are calling the correct task method on InferenceClient. Each task (text_generation, text_to_image, etc.) expects specific input formats and returns different output structures. Check the InferenceClient documentation for the method signature matching your use case.

Hugging Face Inference API Tutorial

YouTube Search

Current walkthrough covering InferenceClient setup and first API calls

Deploy Models with Inference Endpoints

YouTube Search

Step-by-step endpoint creation with GPU selection and scaling config

Hugging Face for Beginners

YouTube Search

Full course covering Hub basics, pipelines, and inference patterns

Go Deeper

Resources from across Tech Jacks Solutions

FREEAI Risk Management Template

Identify, assess, and mitigate AI deployment risks

EU AI Act Guide

Check your compliance obligations under the EU AI Act

FREEAI Bias Assessment

Evaluate bias risks before deploying any AI system

What Is Agentic AI?

Understand the architecture behind autonomous AI agents

AI Career Paths

Explore roles that work with these tools daily

Verified May 2026

Hugging Face, the Hugging Face logo, Inference API, Inference Endpoints, and Inference Providers are trademarks or registered trademarks of Hugging Face, Inc. This article is an independent publication by Tech Jacks Solutions and is not affiliated with or endorsed by Hugging Face, Inc.

Gallery

Contacts

Hugging Face Inference API: Serverless, Endpoints & Providers in 2026

Prerequisites

Three Inference Products

Account Setup and Authentication

Creating Your Token

Global CLI Login

Using the Serverless API

Supported Tasks

Inference Endpoints

Hardware Tiers

Serving Engines (2026)

Inference Providers

Pricing and Cost Thresholds

Deployment Thresholds

Limitations and Caveats

Troubleshooting

Go Deeper

Services

Learn

Company