What Is Databricks? The Data Intelligence Platform Explained
Databricks is a cloud platform that puts data engineering, analytics, business intelligence, machine learning, and AI on one system. The company was founded in 2013 by the team that created Apache Spark out of the UC Berkeley AMPLab, and it has spent the years since pushing a single architectural idea: the lakehouse. Instead of copying your data into a proprietary warehouse, Databricks processes it where it already lives, in your own cloud storage, in open file formats. The company now brands this approach the Data Intelligence Platform.
If you have ever maintained a separate data lake for raw files and a separate data warehouse for analytics, then wired ETL jobs between them, the lakehouse is the pitch to collapse those two systems into one. This breakdown walks through what Databricks actually is from a practitioner's seat: how the lakehouse differs from a warehouse, the four open-source projects underneath it, what Mosaic AI adds for generative AI work, how it runs across AWS, Azure, and Google Cloud, and what it costs.
What Is Databricks?
Databricks, Inc. is an American software company headquartered in San Francisco. The product is a managed cloud platform that unifies the full data lifecycle: ingesting data, transforming it, querying it for analytics and BI, training machine learning models, and building generative AI applications. You do not install it on your own servers in the traditional sense. You run it inside your own AWS, Azure, or Google Cloud account, and Databricks orchestrates the compute against data sitting in your cloud storage.
The detail that sets it apart is where your data stays. A conventional analytics stack pulls data into the warehouse vendor's proprietary storage and format. Databricks instead processes files in place, in open formats such as Delta Lake and Apache Iceberg, inside the customer's own object storage. Independent analysts point out that separating compute from storage this way mitigates vendor lock-in and avoids the egress fees you pay when data has to leave a closed system.
Practitioner note: "Databricks" is the platform, but the architecture is the lakehouse and the marketing name is the Data Intelligence Platform. When a colleague says "we run on Databricks," they usually mean their tables live as Delta Lake files in their cloud storage, governed by Unity Catalog, with Spark or serverless compute doing the work. Knowing that mental model up front makes the rest of the product line easy to place.
Beyond the core, the platform has grown a wide product surface: Lakeflow for data pipelines, Lakebase (a Postgres-compatible operational database aimed at AI agents), Databricks One as a no-code BI surface, the Genie assistant for natural-language querying, and Unity Catalog for governance. You do not need all of it on day one. Most teams start with data engineering or warehousing and add the AI pieces later.
Lakehouse vs Data Warehouse
To understand the lakehouse, it helps to remember the two systems it merges. A data warehouse gives you structure, fast SQL, and reliable ACID transactions, but historically it locks data into a proprietary format and charges you to get it back out. A data lake gives you cheap, open storage for any kind of file, but on its own it lacks transactions, schema enforcement, and governance, which is how lakes turned into "data swamps."
The lakehouse keeps the warehouse's reliability and the lake's openness. Delta Lake adds ACID transactions on top of open files, so concurrent reads and writes behave correctly. Unity Catalog adds the governance layer. Serverless compute spins up and tears down automatically. The result is one system that serves both the BI analyst running SQL and the data scientist training a model, against the same governed tables.
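One way to picture how transactions can sit on top of plain files is an append-only commit log: data files are never modified in place, and a table version is simply the set of files the log says are live. The sketch below is a toy model of that idea, assuming a local directory stands in for object storage. `ToyTableLog`, the `_log` directory, and the `add`/`remove` action shape are hypothetical names chosen for illustration, not Delta Lake's actual API or on-disk format.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Toy commit log over immutable data files (illustration only)."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Commit files are named 0.json, 1.json, ... in commit order.
        return sorted(
            int(name[:-5])
            for name in os.listdir(self.log_dir)
            if name.endswith(".json")
        )

    def latest_version(self):
        versions = self._versions()
        return versions[-1] if versions else -1

    def commit(self, expected_version, actions):
        """Try to publish `actions` as version expected_version + 1.

        Returns True on success, False if another writer already claimed
        that version (an optimistic-concurrency conflict).
        """
        path = os.path.join(self.log_dir, f"{expected_version + 1}.json")
        try:
            # Mode "x" raises if the file exists, so only one writer can
            # claim a given version. This toy ignores partially written files.
            with open(path, "x") as f:
                json.dump(actions, f)
        except FileExistsError:
            return False
        return True

    def snapshot(self, version=None):
        """Replay the log up to `version` and return the set of live files."""
        live = set()
        for v in self._versions():
            if version is not None and v > version:
                break
            with open(os.path.join(self.log_dir, f"{v}.json")) as f:
                for action in json.load(f):
                    if action["op"] == "add":
                        live.add(action["file"])
                    elif action["op"] == "remove":
                        live.discard(action["file"])
        return live


def demo():
    with tempfile.TemporaryDirectory() as root:
        log = ToyTableLog(root)
        log.commit(-1, [{"op": "add", "file": "part-0.parquet"}])
        reader_version = log.latest_version()
        # Two writers both saw version 0 and race to commit version 1.
        first = log.commit(0, [
            {"op": "remove", "file": "part-0.parquet"},
            {"op": "add", "file": "part-1.parquet"},
        ])
        second = log.commit(0, [{"op": "add", "file": "part-2.parquet"}])
        return first, second, log.snapshot(reader_version), log.snapshot()


if __name__ == "__main__":
    print(demo())
```

In `demo`, the second writer's commit is rejected rather than silently overwriting the first, and a reader pinned to the earlier version still sees a consistent set of files. A production system needs considerably more than this (atomic writes on object storage, conflict resolution, checkpoints), so read it as a mental model only.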
The practical payoff is fewer copies of your data and fewer brittle pipelines moving it around. Because compute and storage are decoupled, you can scale query capacity up for a heavy job and back down again without touching where the data lives. The tradeoff is commitment: you are buying into a platform and its governance model, and you still pay your cloud provider separately for the underlying storage.
Grounding caution: You will see Databricks tutorials describe a "medallion" pattern of bronze, silver, and gold table layers. That is a common organizing convention, but the specifics are not part of the core definitions covered here. Treat it as a pattern to learn from the official documentation rather than a fixed rule, and confirm details at docs.databricks.com.
The Four Open-Source Foundations
Databricks did not start as a closed product with open-source marketing bolted on. The company grew out of an open-source project, and four major projects still form the technical backbone of the platform. You can use each of them outside Databricks, which is part of why the lock-in story is softer than with a fully proprietary warehouse.
A word on each:
- Apache Spark is the distributed compute framework that lets you run analytical queries over semi-structured data without forcing a schema up front.
- Delta Lake is the open storage framework that adds ACID transactions to data lakes, which is what makes the "lake" reliable enough to behave like a warehouse.
- MLflow is the open-source AI engineering platform, covering experiment tracking, a model registry, evaluation with 50 or more metrics and LLM judges, prompt optimization, deployment to targets like Docker, Kubernetes, SageMaker, and Azure ML, and cost management; the project reports more than 60 million monthly downloads.
- Unity Catalog is the governance layer, open-sourced under Apache 2.0 in June 2024, providing access controls, AI guardrails, rate limits, and data lineage, and it can govern models hosted outside Databricks too.
Mosaic AI in Brief
Mosaic AI is the part of the platform aimed at generative AI and machine learning. It came from the acquisition of MosaicML, announced in June 2023 at approximately $1.3 billion, and Databricks now markets it as a unified platform for building agent systems. For a practitioner, the value is that the GenAI tooling sits on the same governed lakehouse as your data, rather than as a separate stack you have to integrate and secure on your own.
The major components are worth knowing by name:
- Model Serving deploys, governs, queries, and monitors GenAI models, classical ML models, and agents through one interface.
- Mosaic AI Training lets you pretrain custom large language models, fine-tune open-source models, and build classical ML on your own data.
- Agent Bricks / Agent Framework is for building, deploying, and evaluating agents grounded in enterprise data, and it includes Genie Code.
- AI / Vector Search is a vector database with real-time sync, the retrieval layer for RAG applications.
- Agent Evaluation uses AI judges to score quality, catch regressions, and trace root causes.
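To make the retrieval step behind the vector search component concrete, here is a minimal sketch of similarity search in plain Python. It assumes documents have already been turned into embedding vectors; the three-dimensional vectors, the `INDEX` dictionary, and the `top_k` helper are hypothetical stand-ins, and a managed vector database would use a real embedding model and an approximate nearest-neighbor index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
INDEX = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    # A query vector that points mostly along the first dimension.
    print(top_k([0.8, 0.3, 0.0], INDEX, k=2))
```

In a RAG application, the ids returned by a search like this are used to fetch the matching passages, which are then placed in the model's prompt as grounding context.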
Governance threads through all of it: the Unity AI Gateway governs every LLM and MCP call, which matters when you are routing prompts and enterprise data through third-party models. Databricks also publishes customer outcomes for these tools, which are useful as directional signals as long as you read them as vendor-reported rather than independently audited benchmarks.
Multi-Cloud: AWS, Azure, and GCP
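Databricks runs natively on all three major clouds. The platform itself is functionally the same wherever you deploy it, but the commercial relationship differs in one important way that affects how you buy and bill: on AWS and Google Cloud, Databricks bills you directly on DBU consumption, while Azure Databricks is a first-party Microsoft service that Microsoft prices and bills under your Azure subscription.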
For most organizations, the cloud decision is made for you: you run Databricks where your data and the rest of your infrastructure already sit. The Azure distinction is the one to flag to procurement early, because being billed by Microsoft rather than by Databricks changes which contract and which committed-spend agreement the cost lands under.
How Databricks Pricing Works
Databricks pricing is pay-as-you-go with no up-front cost and per-second billing. The unit you are charged in is the DBU, or Databricks Unit, which the company describes as a normalized unit of processing capacity driven by processing metrics such as compute used and data processed. Storage and networking are billed separately by your cloud provider, so the DBU rate is only part of your total bill.
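The arithmetic behind a DBU-based bill can be sketched in a few lines. This is a rough estimator under stated assumptions, not Databricks' billing logic: the idea that a given compute configuration consumes a fixed number of DBUs per hour, and every number in the example (8 DBU/hour, $0.20 per DBU, $1.50 of cloud charges), is a hypothetical placeholder rather than a published rate.

```python
def estimate_cost(runtime_seconds, dbu_per_hour, rate_per_dbu, cloud_cost=0.0):
    """Estimate one workload's cost under per-second, per-DBU billing.

    runtime_seconds: how long the compute ran
    dbu_per_hour:    DBUs the compute consumes per hour (assumed constant)
    rate_per_dbu:    price per DBU for this workload type, cloud, and region
    cloud_cost:      storage, VM, and networking charges billed separately
                     by the cloud provider
    """
    dbus = dbu_per_hour * runtime_seconds / 3600
    databricks_charge = dbus * rate_per_dbu
    return {
        "dbus": dbus,
        "databricks": databricks_charge,
        "cloud": cloud_cost,
        "total": databricks_charge + cloud_cost,
    }


if __name__ == "__main__":
    # Hypothetical 45-minute job; all figures are placeholders.
    bill = estimate_cost(
        runtime_seconds=45 * 60,
        dbu_per_hour=8,
        rate_per_dbu=0.20,
        cloud_cost=1.50,
    )
    for item, amount in bill.items():
        print(f"{item}: {amount:.2f}")
```

The point of separating `databricks` from `cloud` in the result is the same one the pricing model makes: the DBU charge and the cloud provider's charges arrive on different invoices, so comparing DBU rates alone understates the total.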
Rates vary by workload type, cloud, and region. The figures below are vendor-reported starting per-DBU rates, verified on 9 June 2026; treat them as a baseline rather than a quote.
Two more line items round out the model: Genie, the AI assistant, is billed at $0.07/DBU beyond its free usage, and the Lakebase operational database is billed at $0.069 per CU (compute unit). Higher commitments earn discounts through Committed-Use Contracts. One thing the official pricing does not use is named "Standard / Premium / Enterprise" plan tiers in the way many SaaS products do; the model is consumption-based per workload, so be skeptical of any third-party summary that invents tier names.
Practitioner note: On Azure, none of the rates above apply directly, because Microsoft sets and bills Azure Databricks pricing. If you want to learn the platform before committing spend, there is a free Community Edition for learning Apache Spark, a Free Edition, and a free trial of the full platform, during which you still pay your cloud provider for compute. A 14-day trial with up to $400 in free credits is offered for the AI agent workflow. Always confirm the current numbers at databricks.com/product/pricing, since rates move and vary by region.
Who Databricks Is For
Databricks is built for organizations that have outgrown disconnected data tools and want one governed platform for everything from raw ingestion to production AI. It tends to make the most sense where there is enough data volume and enough cross-team collaboration to justify a unified system.
Frequently Asked Questions
What is Databricks used for?
Databricks is used to ingest, transform, govern, and query data, then build analytics, BI, machine learning, and generative AI on top of it, all on a single lakehouse platform. Teams use it for data engineering pipelines, SQL analytics, model training and serving through MLflow and Mosaic AI, and for building AI agents grounded in their own enterprise data.
Is Databricks the same as Apache Spark?
No. Apache Spark is an open-source distributed compute engine, and Databricks was founded by Spark's creators. Spark is one of the four open-source foundations the platform is built on, alongside Delta Lake, MLflow, and Unity Catalog. Databricks is the managed commercial platform that wraps Spark and those projects with governance, serverless compute, BI, and AI tooling.
What is the difference between Databricks and a data warehouse?
A traditional data warehouse stores data in a proprietary format inside the vendor's system. Databricks uses a lakehouse: data stays in open formats such as Delta Lake and Apache Iceberg in your own cloud storage, with warehouse-grade ACID transactions and Unity Catalog governance layered on top. The lakehouse aims to give you the reliability of a warehouse and the openness and lower lock-in of a data lake in one system.
What is Mosaic AI?
Mosaic AI is the generative AI and machine learning layer of the Databricks platform, originating from the June 2023 acquisition of MosaicML. It includes Model Serving, Mosaic AI Training, Agent Bricks for building and evaluating agents, AI / Vector Search for retrieval, and Agent Evaluation. Because it sits on the same governed lakehouse, your AI tooling shares the same data and access controls as the rest of your stack.
Does Databricks run on AWS, Azure, and Google Cloud?
Yes. Databricks runs natively on all three. The platform is functionally the same across clouds. The main difference is billing: on AWS and Google Cloud, Databricks bills you directly on DBU consumption, while on Azure the service is first-party and Microsoft sets and bills the pricing under your Azure subscription.