Stable Audio 3 Open Weights Released: SAME Autoencoder Enables Long-Form 44.1 kHz Stereo Generation

May 27, 2026 3 min read arXiv (Stable Audio 3 Technical Paper) Partial Strong

Tech Jacks Solutions AI News Coverage

Stability AI has released open weights for Stable Audio 3's Small and Medium models on Hugging Face and restricted the Large model to enterprise licensing. The architecture is built on the SAME autoencoder, which reportedly compresses audio to a latent rate of around 10.76 Hz, enabling variable-length stereo generation at 44.1 kHz on consumer hardware.


Downsampling ratio: 4096×

Key Takeaways

Stable Audio 3 Small and Medium open weights are available on Hugging Face; Large is enterprise-licensed only
Architecture uses the SAME autoencoder with a reported 4096× downsampling factor and ~10.76 Hz latent frequency; these figures come from the arXiv paper (2605.17991), and human validation of the numerical specs is recommended
Inpainting (editing specific audio segments without regenerating the full file) addresses a production workflow gap that most audio AI tools do not cover
GPU memory and generation throughput at production scale are undisclosed; test your hardware configuration before committing to pipeline integration

Model Release

Stable Audio 3

Organization: Stability AI

Type: AI Tool Update, Audio Generation

Parameters: Not disclosed

Benchmark: Not disclosed (no independent benchmark data available at time of publication)

Availability: Small and Medium: open weights (Hugging Face). Large: enterprise license only.

Open weights. That’s the practical news. Stability AI has put Stable Audio 3 Small and Medium on Hugging Face, meaning developers can download and run audio generation locally without API dependency or per-call costs. The Large model requires an enterprise license. The split is the same tier structure Stability AI has used before: open weights for community development, gated access for commercial-scale capability.

The technical architecture is where this gets interesting. The Stable Audio 3 technical paper (arXiv:2605.17991, authored by Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons) describes a system built around the SAME autoencoder, Stability AI’s audio encoding approach, with a reported 4096× total downsampling factor composed of 256× patching and 16× Transformer Resampling. The result is a latent audio frequency of approximately 10.76 Hz per the paper’s specifications.
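The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article (44.1 kHz, 256×, 16×) and is not taken from the paper's code; the variable names are illustrative. The quotient comes out near 10.77 Hz, which is consistent with the reported ~10.76 Hz if that figure was truncated rather than rounded.

```python
# Figures as reported in this article (attributed to arXiv:2605.17991).
SAMPLE_RATE_HZ = 44_100
PATCH_FACTOR = 256      # reported patching stage
RESAMPLE_FACTOR = 16    # reported Transformer Resampling stage

# The two stages multiply to give the total downsampling factor.
total_downsampling = PATCH_FACTOR * RESAMPLE_FACTOR

# Latent frames per second = audio samples per second / total downsampling.
latent_rate_hz = SAMPLE_RATE_HZ / total_downsampling

print(f"total downsampling: {total_downsampling}x")
print(f"latent rate: {latent_rate_hz:.4f} Hz")
```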

Those numbers matter for practitioners. Audio encoded at that compression ratio can be processed with substantially less compute than audio handled natively at 44.1 kHz, while the decoding stage reconstructs full-quality stereo output. That’s the architecture’s core contribution: high-fidelity output from a computationally efficient latent space.
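To make the compute argument concrete, the following sketch compares how many values per channel a model would have to step through in raw audio versus in the latent space. It assumes the reported 4096× factor and assumes a clip is padded up to a whole latent frame; the padding rule and the `sequence_lengths` helper are illustrative assumptions, not details from the paper.

```python
import math

SAMPLE_RATE_HZ = 44_100
DOWNSAMPLING = 4096  # reported total downsampling factor

def sequence_lengths(duration_s: float) -> tuple[int, int]:
    """Return (raw samples per channel, latent frames) for a clip of this length."""
    samples = round(duration_s * SAMPLE_RATE_HZ)
    # Assumption: a partial final frame is padded up to a full latent frame.
    frames = math.ceil(samples / DOWNSAMPLING)
    return samples, frames

for seconds in (10, 60, 180):
    samples, frames = sequence_lengths(seconds)
    print(f"{seconds:>4d} s: {samples:>9,d} samples/channel -> {frames:>5,d} latent frames")
```

A three-minute clip that spans millions of samples per channel becomes a sequence of roughly two thousand latent frames under these assumptions, which is what makes long-form generation tractable for a sequence model.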

A flag for the production stack: the specific technical figures (44.1 kHz output, 4096× downsampling, the 256× and 16× component breakdown, and the ~10.76 Hz latent frequency) come from the arXiv paper. The paper’s existence and authorship are confirmed. The full PDF wasn’t parsed in text-readable form during verification. Human validation of the specific numerical specs against the actual paper is recommended before including these figures in technical documentation or evaluation summaries.

The inpainting capability is the feature most likely to matter to production audio teams. Variable-length generation exists in other audio AI tools. Inpainting, the ability to modify a specific segment of an existing audio file while preserving surrounding content, is less common and directly addresses a real workflow constraint: most audio AI tools require generating from scratch rather than editing in place. That said, the paper’s inpainting claims haven’t been independently benchmarked against alternatives; the assessment here rests on the paper’s own characterization.
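The article does not describe how Stable Audio 3 implements inpainting. As a generic illustration of the idea only, the sketch below maps a time range in seconds to latent frame indices and blends regenerated frames into the original everywhere the mask is set. The mask-and-blend approach, the helper names, and the one-value-per-frame latent representation are all assumptions made for illustration.

```python
import math

LATENT_RATE_HZ = 44_100 / 4096  # from the reported figures, about 10.77 frames/s

def inpaint_mask(num_frames: int, start_s: float, end_s: float) -> list[bool]:
    """True marks latent frames to regenerate; False marks frames to keep."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    # Round outward so the whole requested time span is covered.
    first = math.floor(start_s * LATENT_RATE_HZ)
    last = min(num_frames, math.ceil(end_s * LATENT_RATE_HZ))
    return [first <= i < last for i in range(num_frames)]

def blend(original: list[float], generated: list[float], mask: list[bool]) -> list[float]:
    """Take generated values where the mask is True, original values elsewhere."""
    return [g if m else o for o, g, m in zip(original, generated, mask)]

# Hypothetical example: regenerate seconds 2.0 to 4.0 of a 108-frame (about 10 s) clip.
mask = inpaint_mask(num_frames=108, start_s=2.0, end_s=4.0)
print(sum(mask), "of", len(mask), "frames selected for regeneration")
```

One practical consequence of a latent rate near 10.77 Hz is edit granularity: under these assumptions each latent frame covers roughly 93 ms of audio, so segment boundaries snap to that resolution unless the model handles finer alignment some other way.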

The part nobody mentions in open-weight audio AI releases: inference cost at production scale. Consumer hardware can run Small and Medium models for experimentation. What’s undisclosed is the GPU memory requirement and generation time per second of audio at 44.1 kHz stereo for each model tier. Before integrating Stable Audio 3 into a production pipeline, test generation latency at your specific hardware configuration. Don’t assume “runs on consumer hardware” means it runs at production throughput.
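One way to run that test is to measure a real-time factor: wall-clock seconds spent per second of audio generated. The harness below is a minimal sketch; `generate` is a hypothetical callable standing in for whatever wrapper you build around the model, and `fake_generate` exists only so the example runs without any model installed.

```python
import statistics
import time
from typing import Callable

def real_time_factor(generate: Callable[[float], object],
                     audio_seconds: float,
                     runs: int = 3) -> float:
    """Median wall-clock seconds spent per second of audio produced."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(audio_seconds)  # hypothetical call into your own model wrapper
        timings.append(time.perf_counter() - start)
    return statistics.median(timings) / audio_seconds

def fake_generate(audio_seconds: float) -> None:
    """Stand-in for a real model call; sleeps briefly so the harness runs anywhere."""
    time.sleep(0.01)

rtf = real_time_factor(fake_generate, audio_seconds=10.0)
print(f"real-time factor: {rtf:.4f} (below 1.0 means faster than real time)")
```

If the model runs under PyTorch on a CUDA device, calling `torch.cuda.synchronize()` before reading the clock keeps asynchronous kernels from skewing the timing, and `torch.cuda.max_memory_allocated()` reports peak memory for the run. Measure each model tier at the clip lengths you actually plan to ship.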

The Stability AI Hugging Face organization page confirms the company’s presence and open-weight distribution model. The specific Stable Audio 3 model repositories are accessible from that organization page.

Unanswered Questions

What is the GPU memory requirement per model tier (Small, Medium, Large) at inference?
What generation time per second of audio does each tier produce on representative consumer and server hardware?
How does the inpainting capability compare to AudioCraft and other open-weight alternatives on standardized audio editing benchmarks?

Stability AI’s tiered positioning (Small and Medium open, Large gated) reflects a strategy the company has used across image generation models. The Large model’s enterprise licensing suggests Stability AI sees commercial audio generation as a revenue path while using open weights to build ecosystem adoption. For teams evaluating audio generation tools, the open-weight Small and Medium tiers are the appropriate starting point for capability assessment before committing to enterprise licensing conversations for Large.

Wait for independent benchmark comparisons against MusicGen, AudioCraft, and comparable open-weight audio tools before treating Stable Audio 3 as the default choice for new audio AI pipelines. The arXiv paper is the right reference for architecture decisions.
