Databricks is a cloud-based Data Intelligence Platform that unifies data engineering, analytics, business intelligence, machine learning, and AI on a single lakehouse architecture. It was founded in 2013 by the creators of Apache Spark out of the UC Berkeley AMPLab and runs natively on AWS, Microsoft Azure, and Google Cloud. The platform processes data while leaving the files in the customer's own cloud storage in open formats such as Delta Lake and Apache Iceberg.

What is a lakehouse and how is it different from a data warehouse?

A lakehouse combines the structured, ACID-transaction reliability of a data warehouse with the open, low-cost flexibility of a data lake. A traditional warehouse stores data in a proprietary format inside the vendor's system; a lakehouse keeps data in open formats (Delta Lake, Apache Iceberg) in the customer's own cloud storage and decouples compute from storage. Independent analysts note this separation mitigates vendor lock-in and avoids egress fees.

What open-source projects power Databricks?

Four open-source projects form the foundation: Apache Spark (distributed compute), Delta Lake (open storage with ACID transactions), MLflow (the AI engineering platform for tracking, registry, evaluation, and deployment), and Unity Catalog (unified governance, open-sourced under Apache 2.0 in June 2024). Databricks was created by the Spark team and contributes heavily to all four.

How much does Databricks cost?

Databricks uses pay-as-you-go, per-second billing with no up-front cost, priced in DBUs (Databricks Units), a normalized unit of processing power. Vendor-reported starting per-DBU rates (verified 2026-06-09, varying by cloud and region) include $0.15/DBU for data engineering, $0.22/DBU for data warehousing, $0.40/DBU for interactive data science and ML, and $0.07/DBU for AI workloads. Storage and networking are billed separately by your cloud provider. On Azure, pricing is set and billed by Microsoft. Verify current pricing at databricks.com/product/pricing.

What Is Databricks? The Data Intelligence Platform Explained

Databricks is a cloud platform that puts data engineering, analytics, business intelligence, machine learning, and AI on one system. The company was founded in 2013 by the team that created Apache Spark out of the UC Berkeley AMPLab, and it has spent the years since pushing a single architectural idea: the lakehouse. Instead of copying your data into a proprietary warehouse, Databricks processes it where it already lives, in your own cloud storage, in open file formats. The company now brands this approach the Data Intelligence Platform.

If you have ever maintained a separate data lake for raw files and a separate data warehouse for analytics, then wired ETL jobs between them, the lakehouse is the pitch to collapse those two systems into one. This breakdown walks through what Databricks actually is from a practitioner's seat: how the lakehouse differs from a warehouse, the four open-source projects underneath it, what Mosaic AI adds for generative AI work, how it runs across AWS, Azure, and Google Cloud, and what it costs.

Founded: 2013, by the creators of Apache Spark (source: company history)

ARR: $5.4B as of January 2026, reported (source: independent reporting)

Clouds: AWS, Azure, GCP, with native deployment on each

Open-source foundations: Spark, Delta Lake, MLflow, Unity Catalog

What Is Databricks?

Databricks, Inc. is an American software company headquartered in San Francisco. The product is a managed cloud platform that unifies the full data lifecycle: ingesting data, transforming it, querying it for analytics and BI, training machine learning models, and building generative AI applications. You do not install it on your own servers in the traditional sense. You run it inside your own AWS, Azure, or Google Cloud account, and Databricks orchestrates the compute against data sitting in your cloud storage.

The detail that sets it apart is where your data stays. A conventional analytics stack pulls data into the warehouse vendor's proprietary storage and format. Databricks instead processes files in place, in open formats such as Delta Lake and Apache Iceberg, inside the customer's own object storage. Independent analysts point out that separating compute from storage this way mitigates vendor lock-in and avoids the egress fees you pay when data has to leave a closed system.

Practitioner note: "Databricks" is the platform, but the architecture is the lakehouse and the marketing name is the Data Intelligence Platform. When a colleague says "we run on Databricks," they usually mean their tables live as Delta Lake files in their cloud storage, governed by Unity Catalog, with Spark or serverless compute doing the work. Knowing that mental model up front makes the rest of the product line easy to place.

Beyond the core, the platform has grown a wide product surface: Lakeflow for data pipelines, Lakebase (a Postgres-compatible operational database aimed at AI agents), Databricks One as a no-code BI surface, the Genie assistant for natural-language querying, and Unity Catalog for governance. You do not need all of it on day one. Most teams start with data engineering or warehousing and add the AI pieces later.

Lakehouse vs Data Warehouse

To understand the lakehouse, it helps to remember the two systems it merges. A data warehouse gives you structure, fast SQL, and reliable ACID transactions, but historically it locks data into a proprietary format and charges you to get it back out. A data lake gives you cheap, open storage for any kind of file, but on its own it lacks transactions, schema enforcement, and governance, which is how lakes turned into "data swamps."

The lakehouse keeps the warehouse's reliability and the lake's openness. Delta Lake adds ACID transactions on top of open files, so concurrent reads and writes behave correctly. Unity Catalog adds the governance layer. Serverless compute spins up and tears down automatically. The result is one system that serves both the BI analyst running SQL and the data scientist training a model, against the same governed tables.
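
To make "concurrent reads and writes behave correctly" concrete, here is a toy sketch of an append-only commit log with optimistic concurrency. It is an illustration of the general idea only, not Delta Lake's actual implementation; the class and method names (ToyTableLog, commit, snapshot) are hypothetical.

```python
# Toy model of an append-only commit log with optimistic concurrency.
# Illustration only: this is NOT how Delta Lake is implemented internally.


class CommitConflict(Exception):
    """Raised when another writer already committed on top of our base version."""


class ToyTableLog:
    def __init__(self):
        # Each entry is one committed version: a tuple of rows added by that commit.
        self._commits = []

    def latest_version(self):
        # Versions start at 0; -1 means nothing has been committed yet.
        return len(self._commits) - 1

    def snapshot(self, version=None):
        # Readers see every row committed up to and including `version`.
        if version is None:
            version = self.latest_version()
        rows = []
        for commit in self._commits[: version + 1]:
            rows.extend(commit)
        return rows

    def commit(self, read_version, new_rows):
        # A writer may only commit on top of the version it read.
        # If someone else committed first, the whole write is rejected.
        if read_version != self.latest_version():
            raise CommitConflict(
                f"read version {read_version}, table is at {self.latest_version()}"
            )
        self._commits.append(tuple(new_rows))
        return self.latest_version()


log = ToyTableLog()
v0 = log.commit(read_version=-1, new_rows=[{"id": 1}, {"id": 2}])

# Two writers both start from version 0.
writer_a_base = log.latest_version()
writer_b_base = log.latest_version()

v1 = log.commit(writer_a_base, [{"id": 3}])  # writer A commits first

try:
    log.commit(writer_b_base, [{"id": 4}])  # writer B is stale and must retry
    conflict = False
except CommitConflict:
    conflict = True

# A reader pinned to version 0 still sees the same consistent view.
old_view = log.snapshot(version=v0)
current_view = log.snapshot()
```

The point of the sketch is the all-or-nothing commit: a stale writer never leaves half its rows behind, and a reader pinned to an older version is unaffected by later writes.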

1 system

The lakehouse collapses the separate data-lake and data-warehouse stacks into a single platform: open formats on your own cloud storage, with warehouse-grade transactions and governance layered on top.

The practical payoff is fewer copies of your data and fewer brittle pipelines moving it around. Because compute and storage are decoupled, you can scale query capacity up for a heavy job and back down again without touching where the data lives. The tradeoff is conceptual: you are buying into a platform and its governance model, and you still pay your cloud provider separately for the underlying storage.

Caution: You will see Databricks tutorials describe a "medallion" pattern of bronze, silver, and gold table layers. That is a common organizing convention, but this article does not cover its specifics. Treat it as a pattern to learn from the official documentation rather than a fixed rule, and confirm details at docs.databricks.com.

The Four Open-Source Foundations

Databricks did not start as a closed product with open-source marketing bolted on. The company grew out of an open-source project, and four major projects still form the technical backbone of the platform. You can use each of them outside Databricks, which is part of why the lock-in story is softer than with a fully proprietary warehouse.

Apache Spark: Distributed compute engine for large-scale data. Role: compute. Handles: semi-structured data. Schema: not required.

Delta Lake: Open storage with ACID transactions on the lake. Role: storage. Adds: ACID. Format: open.

MLflow: AI engineering platform, from tracking to deployment. License: Apache 2.0. Covers: registry, evaluation. Deploys to: Kubernetes, SageMaker.

Unity Catalog: Unified governance, open-sourced June 2024. License: Apache 2.0. Provides: access control, lineage. Governs: external models.

A word on each. Apache Spark is the distributed compute framework that lets you run analytical queries over semi-structured data without forcing a schema up front. Delta Lake is the open storage framework that adds ACID transactions to data lakes, which is what makes the "lake" reliable enough to behave like a warehouse. MLflow is the open-source AI engineering platform, covering experiment tracking, a model registry, evaluation with 50 or more metrics and LLM judges, prompt optimization, deployment to targets like Docker, Kubernetes, SageMaker, and Azure ML, and cost management; the project reports more than 60 million monthly downloads. Unity Catalog is the governance layer, open-sourced under Apache 2.0 in June 2024, providing access controls, AI guardrails, rate limits, and data lineage, and it can govern models hosted outside Databricks too.

Mosaic AI in Brief

Mosaic AI is the part of the platform aimed at generative AI and machine learning. It came from the $1.4 billion acquisition of MosaicML in June 2023, and Databricks now markets it as a unified platform for building agent systems. For a practitioner, the value is that the GenAI tooling sits on the same governed lakehouse as your data, rather than as a separate stack you have to integrate and secure on your own.

The major components are worth knowing by name:

Model Serving deploys, governs, queries, and monitors GenAI models, classical ML models, and agents through one interface.
Mosaic AI Training lets you pretrain custom large language models, fine-tune open-source models, and build classical ML on your own data.
Agent Bricks / Agent Framework is for building, deploying, and evaluating agents grounded in enterprise data, and it includes Genie Code.
AI / Vector Search is a vector database with real-time sync, the retrieval layer for RAG applications.
Agent Evaluation uses AI judges to score quality, catch regressions, and trace root causes.
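
As a rough illustration of what a retrieval layer does for a RAG application, the sketch below ranks stored document vectors by cosine similarity to a query vector. It is a plain-Python toy, not the Databricks Vector Search API; the function names, document IDs, and three-dimensional vectors are all made up for the example, and real embeddings would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    # Score every document against the query and keep the k best matches.
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical index: document ID -> embedding (tiny vectors for readability).
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-rate-limits": [0.0, 0.1, 0.9],
}

query = [0.8, 0.2, 0.0]  # stands in for the embedded user question
results = top_k(query, index, k=2)
```

In a RAG flow, the documents returned by the retrieval step are what get passed to the model as grounding context. A managed vector database replaces the brute-force loop here with an index that stays in sync with the source tables.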

Governance threads through all of it: the Unity AI Gateway governs every LLM and MCP call, which matters when you are routing prompts and enterprise data through third-party models. Databricks also publishes customer outcomes for these tools, which are useful as directional signals as long as you read them as vendor-reported rather than independently audited benchmarks.

Databricks cites outcomes such as FactSet reporting a 44% accuracy gain, Comcast a 10x cost reduction, Block roughly $10M in productivity, ICE 96% answer accuracy, and Reckitt 60% faster delivery. These come from the vendor and its customers, not from independent testing. Treat them as illustrative of what is possible, then validate against your own workloads before you build a business case on them.

Multi-Cloud: AWS, Azure, and GCP

Databricks runs natively on all three major clouds. The platform itself is functionally the same wherever you deploy it, but the commercial relationship differs in one important way that affects how you buy and bill.

On AWS, Databricks bills you directly on a pay-as-you-go basis tied to DBU consumption. You provision it from your AWS account and pay AWS separately for the underlying storage and networking.

Azure Databricks is a first-party Microsoft service, available since 2017. Pricing is set and billed by Microsoft under your Azure subscription terms, so it shows up on your Azure invoice rather than a separate Databricks one.

On GCP, as on AWS, Databricks bills directly on DBU consumption while Google Cloud charges for the storage and networking your workloads use.

The Databricks experience is consistent across clouds, so for most organizations the cloud decision is made for you: you run Databricks where your data and the rest of your infrastructure already sit. Where there is a real choice, it usually comes down to which billing relationship your finance team prefers. The Azure distinction is the one to flag to procurement early, because being billed by Microsoft rather than by Databricks changes which contract and which committed-spend agreement the cost lands under.
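
For a procurement conversation, the billing relationships described above can be written down as a small lookup. This is only a sketch of the article's own summary; the names BillingRelationship, BILLING, and invoices_for are hypothetical, and your actual contracts may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BillingRelationship:
    dbu_billed_by: str             # who invoices the Databricks usage
    infrastructure_billed_by: str  # who invoices storage and networking


# Summary of the per-cloud billing described in this section.
BILLING = {
    "aws": BillingRelationship("Databricks", "AWS"),
    "azure": BillingRelationship("Microsoft", "Microsoft"),
    "gcp": BillingRelationship("Databricks", "Google Cloud"),
}


def invoices_for(cloud):
    # Return the distinct vendors you should expect an invoice from.
    rel = BILLING[cloud.lower()]
    return sorted({rel.dbu_billed_by, rel.infrastructure_billed_by})


aws_invoices = invoices_for("aws")
azure_invoices = invoices_for("Azure")
gcp_invoices = invoices_for("gcp")
```

On Azure the lookup collapses to a single vendor, which is the practical reason the cost lands under a different contract there.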

How Databricks Pricing Works

Databricks pricing is pay-as-you-go with no up-front cost and per-second billing. The unit you are charged in is the DBU, or Databricks Unit, which the company describes as a normalized unit of processing capacity driven by processing metrics such as compute used and data processed. Storage and networking are billed separately by your cloud provider, so the DBU rate is only part of your total bill.

Rates vary by workload type, cloud, and region. The figures below are vendor-reported starting per-DBU rates, verified on 9 June 2026; treat them as a baseline rather than a quote.

Data Engineering (Lakeflow jobs and pipelines): from $0.15/DBU

Data Warehousing (SQL, classic and serverless): from $0.22/DBU

Interactive (data science and ML notebooks): from $0.40/DBU

Artificial Intelligence (model serving, AI search, agents): from $0.07/DBU

Two more line items round out the model: Genie, the AI assistant, is billed at $0.07/DBU beyond its free usage, and the Lakebase operational database is billed at $0.069 per CU (compute unit). Higher commitments earn discounts through Committed-Use Contracts. One thing the official pricing does not use is named "Standard / Premium / Enterprise" plan tiers in the way many SaaS products do; the model is consumption-based per workload, so be skeptical of any third-party summary that invents tier names.
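
A back-of-the-envelope estimate follows directly from the model: DBUs consumed multiplied by the per-DBU rate. The sketch below uses the vendor-reported starting rates listed above; the function name and the idea of a job consuming a fixed number of DBUs per hour are simplifying assumptions, and the result excludes the storage and networking your cloud provider bills separately.

```python
# Vendor-reported *starting* rates quoted in this article (USD per DBU).
# Real rates vary by cloud and region, so treat these as placeholders.
STARTING_RATE_PER_DBU = {
    "data_engineering": 0.15,
    "data_warehousing": 0.22,
    "interactive": 0.40,
    "ai": 0.07,
}


def estimate_dbu_cost(workload, dbus_per_hour, runtime_seconds):
    """Estimate the DBU portion of a bill under per-second billing.

    `dbus_per_hour` is an assumed consumption rate for the compute you run;
    it is a hypothetical input, not a number taken from the pricing page.
    """
    rate = STARTING_RATE_PER_DBU[workload]
    dbus_consumed = dbus_per_hour * runtime_seconds / 3600
    return round(dbus_consumed * rate, 4)


# Hypothetical pipeline: 8 DBUs per hour, running for 45 minutes.
job_cost = estimate_dbu_cost("data_engineering", dbus_per_hour=8, runtime_seconds=45 * 60)

# The same compute profile billed as interactive notebook usage.
notebook_cost = estimate_dbu_cost("interactive", dbus_per_hour=8, runtime_seconds=45 * 60)
```

Under these assumptions the job consumes 6 DBUs, so the same 45 minutes costs $0.90 as a data engineering job and $2.40 as interactive usage. Which workload type the compute runs under matters as much as how long it runs.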

Practitioner note: On Azure, none of the rates above apply directly, because Microsoft sets and bills Azure Databricks pricing. If you want to learn the platform before committing spend, there is a free Community Edition for learning Apache Spark, a Free Edition, and a free trial of the full platform where you still pay your cloud provider for compute. A 14-day trial with up to $400 in free credits is offered for the AI agent workflow. Always confirm the current numbers at databricks.com/product/pricing, since rates move and vary by region.

Who Databricks Is For

Databricks is built for organizations that have outgrown disconnected data tools and want one governed platform for everything from raw ingestion to production AI. It tends to make the most sense where there is enough data volume and enough cross-team collaboration to justify a unified system.

Data Engineering Teams

Teams building pipelines benefit from Lakeflow, Delta Lake reliability, and Spark at scale, with one place to land, transform, and govern data instead of stitching together separate lake and warehouse tooling.

Data Science & AI Teams

Practitioners get MLflow for the model lifecycle and Mosaic AI for building, serving, and evaluating GenAI applications and agents, all sitting directly on top of the governed data rather than in a separate environment.

Analysts & BI Users

SQL analysts and business users can query the same governed tables through Databricks SQL, Databricks One, and the Genie natural-language assistant, without a separate warehouse copy of the data.

Platform & Governance Owners

Unity Catalog gives platform teams centralized access control, lineage, and AI guardrails across clouds, which is the layer that makes a shared lakehouse safe to open up to many teams at once.

If your data fits comfortably in a single database and you run a handful of dashboards, a full lakehouse platform is more machinery than the job requires. The consumption-based pricing and the breadth of the product surface reward scale and shared use; they can feel heavy for a one-analyst shop.

Per-second DBU billing is flexible, but it means costs track usage rather than a fixed subscription. Without budgets, alerts, and someone owning cost governance, interactive notebooks and serverless queries can run up a larger bill than expected. Plan for monitoring from day one.
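
One minimal way to picture "budgets and alerts" is a check that compares month-to-date spend against a budget and reports which thresholds have been crossed. This is a generic sketch, not a Databricks feature or API; the function name, thresholds, and daily cost figures are hypothetical, and in practice the daily costs would come from your own billing export.

```python
def check_budget(daily_costs, monthly_budget, alert_thresholds=(0.5, 0.8, 1.0)):
    """Return month-to-date spend and the budget thresholds already crossed."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    spent = sum(daily_costs)
    fraction_used = spent / monthly_budget
    crossed = [t for t in alert_thresholds if fraction_used >= t]
    return {
        "spent": round(spent, 2),
        "fraction_used": round(fraction_used, 3),
        "alerts": crossed,
    }


# Hypothetical first four days of a month against a $1,000 budget.
status = check_budget([120.0, 95.5, 310.25, 280.0], monthly_budget=1000.0)
```

Here four days of usage have already consumed about 80% of the month's budget, which is exactly the situation usage-based billing makes possible and a fixed subscription does not.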

Frequently Asked Questions

What is Databricks used for?

Databricks is used to ingest, transform, govern, and query data, then build analytics, BI, machine learning, and generative AI on top of it, all on a single lakehouse platform. Teams use it for data engineering pipelines, SQL analytics, model training and serving through MLflow and Mosaic AI, and for building AI agents grounded in their own enterprise data.

Is Databricks the same as Apache Spark?

No. Apache Spark is an open-source distributed compute engine, and Databricks was founded by Spark's creators. Spark is one of the four open-source foundations the platform is built on, alongside Delta Lake, MLflow, and Unity Catalog. Databricks is the managed commercial platform that wraps Spark and those projects with governance, serverless compute, BI, and AI tooling.

What is the difference between Databricks and a data warehouse?

A traditional data warehouse stores data in a proprietary format inside the vendor's system. Databricks uses a lakehouse: data stays in open formats such as Delta Lake and Apache Iceberg in your own cloud storage, with warehouse-grade ACID transactions and Unity Catalog governance layered on top. The lakehouse aims to give you the reliability of a warehouse and the openness and lower lock-in of a data lake in one system.

What is Mosaic AI?

Mosaic AI is the generative AI and machine learning layer of the Databricks platform, originating from the June 2023 acquisition of MosaicML. It includes Model Serving, Mosaic AI Training, Agent Bricks for building and evaluating agents, AI / Vector Search for retrieval, and Agent Evaluation. Because it sits on the same governed lakehouse, your AI tooling shares the same data and access controls as the rest of your stack.

Does Databricks run on AWS, Azure, and Google Cloud?

Yes. Databricks runs natively on all three. The platform is functionally the same across clouds. The main difference is billing: on AWS and Google Cloud, Databricks bills you directly on DBU consumption, while on Azure the service is first-party and Microsoft sets and bills the pricing under your Azure subscription.

Video Resources

What Is Databricks? Platform Overview

High-level introduction to the Data Intelligence Platform and the lakehouse architecture for newcomers.

Lakehouse vs Data Warehouse Explained

Walkthrough of how the lakehouse merges the data lake and the data warehouse into one system.

Building AI Agents with Mosaic AI

Hands-on look at Mosaic AI for serving models and building agents grounded in enterprise data.

Fact-checked against vendor documentation and official sources, June 2026. Verify current pricing at databricks.com/product/pricing before purchasing.

Databricks, Mosaic AI, Delta Lake, Unity Catalog, and Lakehouse are trademarks of Databricks, Inc. Apache Spark, Apache Iceberg, and MLflow are projects of the Apache Software Foundation or are maintained as open-source projects. AWS is a trademark of Amazon. Azure is a trademark of Microsoft. Google Cloud is a trademark of Google. Snowflake is a trademark of Snowflake Inc. All other trademarks belong to their respective owners.
