Architecture · Data Lake · BigQuery · Snowflake

Data Warehouse vs Data Lake vs Data Lakehouse: Which One Do You Actually Need?

19 May 2026 · 8 min read

A startup CTO asks their first data hire to “set up a data lake.” A year later, nobody can find anything in it, queries take forever, and two analysts calculate revenue differently. Meanwhile, another company spent more on a data warehouse, has cleaner numbers, and their analysts are actually productive. The difference wasn't the team — it was the architecture decision made on day one.

The analogy: library, storage unit, and a smart acquisitions room

A data warehouse is like a public library. Everything is organised, catalogued, and shelved correctly before it's made available. You can find what you need in seconds. But you can't just throw anything in — raw, unprocessed material isn't allowed on the shelves.

A data lake is like a storage unit. You can dump anything in — raw files, JSON blobs, images, CSV exports, unformatted logs. There are no rules about what goes in or how it's organised. The problem: six months later, it's a mess. Nobody remembers what's processed and what's raw. Analysts query the wrong table and get wrong answers. This is called a data swamp, and it's how many ungoverned data lakes end up.

A data lakehouse is a library that also has a smart acquisitions room. Raw material arrives and is stored as-is (the lake part), but an open table format — Delta Lake, Apache Iceberg, or Apache Hudi — layers warehouse-grade structure on top of it. You get cheap object storage AND SQL performance AND ACID transactions.

What is a data warehouse?

A data warehouse stores structured, processed data in a columnar format optimised for SQL queries. It uses schema-on-write — you define the structure before loading data. Loading garbage in is hard because the schema enforces correctness at ingestion time.

Examples: BigQuery (Google), Snowflake, Amazon Redshift, Databricks SQL.

Warehouses are excellent for BI dashboards, SQL analytics, and business reporting. They are not designed for unstructured data, ML training datasets, or storing raw event logs at petabyte scale.

What is a data lake?

A data lake stores raw data in its native format on cheap object storage — Amazon S3, Google Cloud Storage, Azure Data Lake Storage. There is no schema enforcement. CSV, Parquet, JSON, Avro, images, video — all in the same place.

This flexibility is both the strength and the failure mode. A lake with no governance, no cataloguing, and no processing layer becomes a data swamp within months. Data sits there, untransformed, untrusted, unused.

Lakes work well as the landing zone for raw ingestion — a place to keep the original data before transforming it. They also work for ML training datasets and unstructured content that a warehouse cannot store efficiently.

The schema problem: write vs. read

The most fundamental difference between warehouses and lakes is when the schema is applied:

  • Schema-on-write (warehouse): you define columns, types, and constraints before loading. Bad data is rejected at the door. Queries are fast because the structure is already known.
  • Schema-on-read (lake): raw data is stored without a schema. The schema is interpreted at query time. Flexible, but slow and error-prone — you only discover data quality issues when you run a query.
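The contrast is easiest to see in code. Here is a minimal, purely illustrative Python sketch — the schema and field names are invented for this example — showing how schema-on-write rejects a bad row at ingestion, while schema-on-read accepts everything and only fails once a query touches the bad value:

```python
# Hypothetical sketch: schema-on-write vs schema-on-read.
SCHEMA = {"order_id": int, "amount": float}

def load_schema_on_write(table, row):
    """Warehouse-style: validate against the schema before accepting the row."""
    for col, typ in SCHEMA.items():
        if not isinstance(row.get(col), typ):
            raise ValueError(f"rejected at ingestion: {col!r} is not {typ.__name__}")
    table.append(row)

def query_schema_on_read(raw_rows):
    """Lake-style: raw rows went in untouched; types are interpreted at query time."""
    total = 0.0
    for row in raw_rows:
        total += float(row["amount"])  # a malformed row only blows up here, mid-query
    return total

warehouse = []
load_schema_on_write(warehouse, {"order_id": 1, "amount": 19.99})  # accepted
try:
    load_schema_on_write(warehouse, {"order_id": 2, "amount": "N/A"})
except ValueError as e:
    print(e)  # bad data is rejected at the door

lake = [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": "N/A"}]
# query_schema_on_read(lake) only raises now, at query time — possibly months later.
```

The failure mode is the same either way; the difference is *when* you find out, and schema-on-read means you find out at the worst possible moment.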

What is a data lakehouse?

The lakehouse emerged to solve the lake's quality problem without giving up its cost advantage. The key innovation: open table formats that sit on top of object storage and add warehouse features to it.

  • Delta Lake (from Databricks) — the most widely adopted. ACID transactions, time travel, schema enforcement, on top of Parquet files in S3 or GCS.
  • Apache Iceberg — open standard, supported by BigQuery, Snowflake, Spark, Trino. Strong on schema evolution and partition pruning.
  • Apache Hudi — optimised for streaming and upserts. Common in Uber-style high-velocity pipelines.

With these formats, a data lake gets ACID transactions (no more partial writes), time travel (query data as it was yesterday), schema evolution (add columns without rewriting), and query performance close to a managed warehouse.
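To make "time travel" concrete, here is a purely illustrative Python sketch of snapshot-based versioning — the core idea behind how Delta Lake and Iceberg track table state, though the real formats use snapshot metadata over Parquet files and are far more sophisticated. Every name here is invented for the example:

```python
# Illustrative sketch of snapshot versioning (the idea behind time travel,
# NOT the actual on-disk layout of Delta Lake or Iceberg).
class VersionedTable:
    def __init__(self):
        self.snapshots = []  # each commit appends a full immutable snapshot

    def commit(self, rows):
        """Atomic commit: readers see the old snapshot until the new one lands."""
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + list(rows))
        return len(self.snapshots) - 1  # version number of the new snapshot

    def read(self, version=None):
        """Read the latest snapshot by default, or any historical version."""
        if not self.snapshots:
            return []
        if version is None:
            version = len(self.snapshots) - 1
        return self.snapshots[version]

orders = VersionedTable()
v0 = orders.commit([{"id": 1, "amount": 10.0}])
orders.commit([{"id": 2, "amount": 5.0}])
assert len(orders.read(v0)) == 1  # query the table as it was at version 0
assert len(orders.read()) == 2    # the latest version sees both rows
```

Because commits are whole snapshots, a failed write never leaves readers looking at a half-written table — that is the ACID guarantee the open table formats bring to plain object storage.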

Which one should you actually use?

For most startups and growing companies, the answer is straightforward:

  • Default to a cloud data warehouse. BigQuery or Snowflake gives you SQL, performance, BI tool compatibility, and managed infrastructure with no setup overhead. You can load raw data into a raw schema and use dbt to transform it. This is ELT in practice — and the warehouse handles both the raw storage and the analytics layer.
  • Add a data lake when you have unstructured data or ML needs. If your team is training models on raw event logs, images, or audio, a lake (S3 or GCS) makes sense alongside your warehouse. Keep structured analytics in the warehouse; raw and unstructured in the lake.
  • Consider a lakehouse when you're at scale with multiple use cases. Databricks + Delta Lake is the most common lakehouse stack. It makes sense when a single team needs to run SQL analytics AND ML workloads on the same data at scale. This is not a day-one decision.

The practical rule: if your team's primary job is SQL analytics and BI dashboards, go to a warehouse and stay there. Lakes and lakehouses solve problems you likely don't have yet. Premature lake adoption is one of the most common expensive mistakes in early data infrastructure.

How this connects to dbt and your pipeline

Regardless of which storage architecture you choose, the transformation layer works the same way. dbt runs inside your warehouse or lakehouse compute, transforming raw data into business-ready models. The medallion architecture (bronze → silver → gold) maps naturally on top of any of these storage types.
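The bronze → silver → gold flow can be sketched as three plain functions. In practice these stages would be dbt models or Spark jobs rather than Python, but the shape of the work is the same; the field names below are made up for illustration:

```python
# Hypothetical medallion pipeline: raw events -> cleaned rows -> a business metric.
def bronze(raw_events):
    """Bronze: land raw records as-is — no filtering, no typing."""
    return list(raw_events)

def silver(bronze_rows):
    """Silver: clean and type — drop malformed rows, cast amounts."""
    cleaned = []
    for row in bronze_rows:
        try:
            cleaned.append({"user": row["user"], "amount": float(row["amount"])})
        except (KeyError, ValueError, TypeError):
            continue  # skip (or in practice, quarantine) bad records
    return cleaned

def gold(silver_rows):
    """Gold: business-ready aggregate — total revenue per user."""
    revenue = {}
    for row in silver_rows:
        revenue[row["user"]] = revenue.get(row["user"], 0.0) + row["amount"]
    return revenue

raw = [{"user": "a", "amount": "10.0"}, {"user": "a", "amount": "oops"},
       {"user": "b", "amount": 5}]
print(gold(silver(bronze(raw))))  # {'a': 10.0, 'b': 5.0}
```

Note how the two analysts from the opening anecdote can no longer "calculate revenue differently": the gold layer is the single agreed definition, and everything upstream is just preparation.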

The storage choice affects where raw data lands and how it's queried. The transformation layer is always the same job: clean, model, document, test.
