Data Warehouse Design: A Complete 2026 Guide (with examples and templates)

KKeboola Marketing TeamMay 19, 202613 min read

Key Takeaways

•Data warehouse design is four choices: where data lives, how it is modeled, how it is governed, and how it is consumed.
•Schema (star vs snowflake vs data vault) matters less than your team's ability to maintain it.
•In 2026, the semantic layer is mandatory. It is the only honest way to expose data to AI agents.
•Pick the philosophy (Inmon, Kimball, Data Vault) by your dominant constraint: speed, governance, or change rate.
•For CFO-led organizations, three principles matter most: verifiable AI, drill-down to the journal entry, EU audit-readiness.

Intro

Most data warehouse projects fail. Not because the technology is wrong. Because the design is.

Three weeks for a number that should take three minutes. AI agents generating plausible reports nobody can trace. Two ERPs naming the same metric differently. The spreadsheet swamp. The fire drill before every audit. These problems live in the warehouse layer, in how data is modeled, governed, and made available to the people and AI agents that read from it.

Short version: route Claude, Copilot, Gemini, or any AI agent through a properly designed semantic layer. You can tell when it is guessing versus when it is giving a verifiable answer grounded in your own metric definitions. That semantic layer is the glass box. Keboola exists to build and govern it.

Core components of a modern data warehouse

A modern data warehouse is rarely a single database. It is a layered architecture.

1. Source systems. ERPs, SaaS tools, operational databases, event streams, files. Keboola ships hundreds of connectors so the source layer is rarely the bottleneck.

2. Ingestion. Batch or streaming. The choice is rarely strict binary. Revenue events stream, finance close stays batch.

3. Staging / Raw layer. A landing zone where source data sits unmodified. The audit trail. If you ever need to ask "what did this look like on Tuesday?", staging is where you go.

4. Transformation layer. Business logic encoded in SQL, often orchestrated with dbt. Where you map seven account hierarchies onto one chart of accounts and reconcile timezones. Keboola treats transformations as first-class versioned objects with lineage attached automatically.

5. Modeled / Curated layer. The structured, analytics-ready output. Facts and dimensions if you follow Kimball; hubs, links, satellites if you follow Data Vault. This is what dashboards and AI agents read.

6. Semantic layer. The business glossary. "Active customer" means this. "Net revenue" excludes that. In 2026 this layer is no longer optional. It is the only honest way to expose data to AI agents and the core of Keboola's Financial Intelligence.

Financial Intelligence

One auditable financial truth across every entity you own

7. Consumption layer. BI tools, embedded analytics, reverse ETL, and AI agents querying via natural language or APIs.

The mistake most architects make is treating these as fixed boxes. A modern warehouse blurs lines: transformation may run in the warehouse engine (ELT), the semantic layer may live in dbt or the BI tool, staging may be an external lake. Components are real. Boundaries are negotiable.

Top-down vs bottom-up: the design philosophy choice

Before you draw a single table, you choose a philosophy.

Bill Inmon, top-down, enterprise-first. Build a corporate warehouse in third normal form first, then carve out data marts. Single source of truth, strong governance, durability. Cost: months before the first dashboard ships.

Ralph Kimball, bottom-up, business-process-first. Build dimensional marts (star schemas) around specific business processes. Conform dimensions so the marts can be combined. Faster time to value, easier to query. Cost: governance is a discipline, not a structure.

Data Vault, hybrid, scale-first. Dan Linstedt's approach separates raw business keys (hubs), relationships (links), and descriptive attributes (satellites). Designed for environments where sources change frequently and history matters. Common in regulated industries and organizations absorbing acquisitions.

For most companies in 2026 the answer is not pure. A typical pattern: stage raw data in a vault-ish structure for traceability, expose dimensional marts on top for analytics. Pick the philosophy that matches your dominant constraint: speed (Kimball), governance (Inmon), or change rate (Data Vault).

Schema design choices

Three schema options dominate.

Star schema. Central fact table surrounded by denormalized dimension tables. Optimized for query speed and analyst ergonomics. The default choice unless you have a reason to do something else.

Snowflake schema. A star schema where dimensions are normalized into sub-tables. Saves storage, loses query simplicity. Rarely wins in modern cloud warehouses where storage is cheap and joins are fast. See Star Schema vs Snowflake Schema.

Data Vault schema. Hubs, links, satellites. Designed for environments where source schemas change and you need to track every change. Heavier to build. Pays off when auditors arrive.

A common 2026 pattern: vault-ish raw storage with star-schema serving layers on top. The vault gives you history and auditability. The stars give your analysts a usable interface. The semantic layer hides both.

The "best" schema is the one your team can maintain. A perfect data vault nobody understands is worse than a slightly inefficient star schema your team can extend.

Modern data warehouse architecture (2026)

Cloud-first by default. AWS Redshift, Snowflake, BigQuery, Databricks, and Azure Synapse handle nearly all new warehouse builds. On-prem persists in central banks, defense, and certain healthcare environments. EU organizations increasingly prioritize providers with EU data sovereignty guarantees as a direct response to DORA and the EU AI Act. Keboola runs on the cloud of your choice and offers EU-based control planes for organizations that need them.

ELT, not ETL. Cloud warehouses are fast enough to run transformations themselves. Land raw data, then transform in SQL. ETL still has a role for compute-heavy work or sensitive data that should not land raw, but ELT is the default for finance, marketing, and operational analytics.

Lakehouse blurs the warehouse/lake divide. Open table formats (Iceberg, Delta Lake, Hudi) let you store data once and query it with multiple engines. The strict warehouse-vs-lake debate matters less than the question of which workloads benefit from open formats.

Three principles cut across these layers. Together they form the glass box: every AI agent's output, every dashboard number, traces back to a definition a human authored.

Verifiable AI. The semantic layer says exactly what each metric means. The agent's output traces back to a human-authored definition. Stop deciding by gut. Stop letting AI bluff.

Drill-down to the journal entry. Every group-level number traces back to the source transaction in the originating system. A CFO looking at "EU operations net revenue" should be able to click through to the specific journal entry in the specific ERP. Without that lineage, every AI-assisted analysis is a black box and every audit becomes an archaeology project.

EU audit-readiness. Data residency, sovereignty, audit trail. DORA. BCBS 239. EU AI Act. IFRS 17. Solvency II. A deployment topology decision made on day one, not a marketing claim.

Česká spořitelna runs one of the largest examples of this pattern in EU banking. Their Keboola-based warehouse serves 180+ active users across 70 teams and 100+ production projects.

An effective weapon in the fight against shadow IT.

How to design a data warehouse: 7 steps

From chaos to clarity. Typically 8 to 12 weeks. Here is the working sequence.

Step 1: Define the questions. Start with the decisions the warehouse needs to support. "Show monthly revenue by segment" is a question. "Be a data warehouse" is not. List 10 to 20 specific questions. The warehouse is graded on whether it can answer them.

Step 2: Inventory the sources. Catalog every system that produces relevant data. Who owns it, how often it changes, what its quality reputation is. This is where you discover "revenue" is computed differently in five places.

Step 3: Define the conformed dimensions. Before tables are built, agree on what "customer" means, what "active" means, what "fiscal Q1" means. Skipping this is how a warehouse ships with four different customer counts on four dashboards.

Step 4: Pick the philosophy and schema, then stand up ingestion. Inmon, Kimball, or Data Vault. Star, snowflake, or vault. Document the choice. Then decide what gets streamed, what gets batched, and where it lands. Keboola's connector library covers most of this without custom code.

Step 5: Build the transformation and curated layers. Encode business logic in SQL or dbt. One transformation per metric. Version controlled, tested, reviewed like code. Then expose tables grouped by business domain.

Step 6: Add the semantic layer. Each metric has one definition. The semantic layer is the contract between the warehouse and everything downstream, including AI agents.

Step 7: Wire governance and monitoring from day one. Data quality tests, lineage capture, access control, audit logs, freshness monitoring. None of this is optional in 2026. Ship it with the first pipeline, not as a retrofit. Then iterate: ship something usable in weeks, not months.

Best practices

Document business definitions before you write SQL. A metric without a written definition is a future argument.
Treat raw data as immutable. Never modify it. Always transform forward.
Build the semantic layer early. Once analysts write their own definitions across 14 dashboards, putting it back together is painful.
Capture lineage from day one. Keboola does this automatically; OpenLineage does it externally. Either way, do not skip it.
Plan for multi-entity from the start. Expansion is the default trajectory for any company that acquires, gets acquired, or expands geographically. Designing it in later is expensive.

Real-world examples

Two patterns from actual Keboola deployments.

Multi-country consumer finance (Home Credit International). Nine countries on day one.

Design challenge: harmonize chart of accounts across countries with different regulators, fiscal calendars, and ERPs.
Approach: unified semantic layer over country-specific staging zones, conformed dimensions enforced at the curated layer.
Outcomes: 70% reduction in FP&A reporting time, 25% of reporting fully automated, the same platform later supported risk, HR, and CRM use cases without redesign.
Lesson: design for the consolidation problem first; platform extensibility follows.

Banking governance at scale (Česká spořitelna). A retail bank with 180+ active warehouse users across 70 teams.

Design challenge: give every team self-service access without recreating shadow IT.
Approach: curated layer with strict ownership, semantic layer with business glossary, self-service gated by access control rather than restricted by gatekeepers.
Outcome: 100+ production projects, no shadow data factories, governance intact.
Lesson: scale comes from clear ownership, not from centralization.

Frequently asked questions

Q: How long does a data warehouse take to build? A: A usable first version should ship in 8 to 12 weeks on Keboola. A full enterprise-grade build takes 6 to 18 months depending on source complexity, regulatory requirements, and team experience. The danger sign is anything taking 12+ months with no production dashboard.

Q: Star schema or snowflake schema? A: Star schema for most cases. Cloud warehouse storage is cheap and joins are fast, so the normalization benefits of snowflake rarely outweigh the simplicity benefits of star. See our comparison for the trade-offs.

Q: How does the warehouse work with AI agents? A: Through the semantic layer. AI agents that query metrics directly from raw tables produce unreliable answers. AI agents that query through a semantic layer with defined metric meanings produce verifiable ones. This is the design choice that determines whether your AI deployment is a credibility risk or a productivity gain.

Q: ETL or ELT? A: ELT for most modern workloads. Cloud warehouses are fast enough to handle transformations. ETL still has a role for compute-heavy work or sensitive data that should not land raw. More on the ETL process.

What to do next

If you are starting from scratch: write the 20 business questions first, inventory your sources, then pick your philosophy. Resist the urge to choose tools before you have answered those three.

If you are redesigning an existing warehouse: start with the semantic layer. Most warehouse problems are not schema problems. They are definitional problems that the semantic layer is built to solve.

Keboola is the governed data foundation CFOs deploy before AI. We work with finance, data, and operations teams who deliver business outcomes across multiple ERPs, geographies, and regulatory contexts. Your team keeps the tools. We make them speak the same language. Read about Financial Intelligence on Keboola, or book a 20-minute call to see what your warehouse could look like in 8 weeks.

Data Engineering

Data Warehouse Design: A Complete 2026 Guide (with examples and templates)

KKeboola Marketing TeamMay 19, 202613 min read

Key Takeaways

•Data warehouse design is four choices: where data lives, how it is modeled, how it is governed, and how it is consumed.
•Schema (star vs snowflake vs data vault) matters less than your team's ability to maintain it.
•In 2026, the semantic layer is mandatory. It is the only honest way to expose data to AI agents.
•Pick the philosophy (Inmon, Kimball, Data Vault) by your dominant constraint: speed, governance, or change rate.
•For CFO-led organizations, three principles matter most: verifiable AI, drill-down to the journal entry, EU audit-readiness.

Intro

Most data warehouse projects fail. Not because the technology is wrong. Because the design is.

Core components of a modern data warehouse

A modern data warehouse is rarely a single database. It is a layered architecture.

1. Source systems. ERPs, SaaS tools, operational databases, event streams, files. Keboola ships hundreds of connectors so the source layer is rarely the bottleneck.

2. Ingestion. Batch or streaming. The choice is rarely strict binary. Revenue events stream, finance close stays batch.

3. Staging / Raw layer. A landing zone where source data sits unmodified. The audit trail. If you ever need to ask "what did this look like on Tuesday?", staging is where you go.

Financial Intelligence

One auditable financial truth across every entity you own

7. Consumption layer. BI tools, embedded analytics, reverse ETL, and AI agents querying via natural language or APIs.

Top-down vs bottom-up: the design philosophy choice

Before you draw a single table, you choose a philosophy.

Schema design choices

Three schema options dominate.

Star schema. Central fact table surrounded by denormalized dimension tables. Optimized for query speed and analyst ergonomics. The default choice unless you have a reason to do something else.

Data Vault schema. Hubs, links, satellites. Designed for environments where source schemas change and you need to track every change. Heavier to build. Pays off when auditors arrive.

The "best" schema is the one your team can maintain. A perfect data vault nobody understands is worse than a slightly inefficient star schema your team can extend.

Modern data warehouse architecture (2026)

Three principles cut across these layers. Together they form the glass box: every AI agent's output, every dashboard number, traces back to a definition a human authored.

Verifiable AI. The semantic layer says exactly what each metric means. The agent's output traces back to a human-authored definition. Stop deciding by gut. Stop letting AI bluff.

EU audit-readiness. Data residency, sovereignty, audit trail. DORA. BCBS 239. EU AI Act. IFRS 17. Solvency II. A deployment topology decision made on day one, not a marketing claim.

Česká spořitelna runs one of the largest examples of this pattern in EU banking. Their Keboola-based warehouse serves 180+ active users across 70 teams and 100+ production projects.

An effective weapon in the fight against shadow IT.

How to design a data warehouse: 7 steps

From chaos to clarity. Typically 8 to 12 weeks. Here is the working sequence.

Step 6: Add the semantic layer. Each metric has one definition. The semantic layer is the contract between the warehouse and everything downstream, including AI agents.

Best practices

Document business definitions before you write SQL. A metric without a written definition is a future argument.
Treat raw data as immutable. Never modify it. Always transform forward.
Build the semantic layer early. Once analysts write their own definitions across 14 dashboards, putting it back together is painful.
Capture lineage from day one. Keboola does this automatically; OpenLineage does it externally. Either way, do not skip it.
Plan for multi-entity from the start. Expansion is the default trajectory for any company that acquires, gets acquired, or expands geographically. Designing it in later is expensive.

Real-world examples

Two patterns from actual Keboola deployments.

Multi-country consumer finance (Home Credit International). Nine countries on day one.

Design challenge: harmonize chart of accounts across countries with different regulators, fiscal calendars, and ERPs.
Approach: unified semantic layer over country-specific staging zones, conformed dimensions enforced at the curated layer.
Outcomes: 70% reduction in FP&A reporting time, 25% of reporting fully automated, the same platform later supported risk, HR, and CRM use cases without redesign.
Lesson: design for the consolidation problem first; platform extensibility follows.

Banking governance at scale (Česká spořitelna). A retail bank with 180+ active warehouse users across 70 teams.

Design challenge: give every team self-service access without recreating shadow IT.
Approach: curated layer with strict ownership, semantic layer with business glossary, self-service gated by access control rather than restricted by gatekeepers.
Outcome: 100+ production projects, no shadow data factories, governance intact.
Lesson: scale comes from clear ownership, not from centralization.

Frequently asked questions

What to do next

If you are starting from scratch: write the 20 business questions first, inventory your sources, then pick your philosophy. Resist the urge to choose tools before you have answered those three.

If you are redesigning an existing warehouse: start with the semantic layer. Most warehouse problems are not schema problems. They are definitional problems that the semantic layer is built to solve.

Data Warehouse Design: A Complete 2026 Guide (with examples and templates)

Key Takeaways

Intro

Core components of a modern data warehouse

Top-down vs bottom-up: the design philosophy choice

Schema design choices

Modern data warehouse architecture (2026)

How to design a data warehouse: 7 steps

Best practices

Real-world examples

Frequently asked questions

What to do next

Ready to transform your data workflow?

Data Warehouse Design: A Complete 2026 Guide (with examples and templates)

Key Takeaways

Intro

Core components of a modern data warehouse

Top-down vs bottom-up: the design philosophy choice

Schema design choices

Modern data warehouse architecture (2026)

How to design a data warehouse: 7 steps

Best practices

Real-world examples

Frequently asked questions

What to do next

Ready to transform your data workflow?