Data mesh - the answer to the failures of centralized data architectures

May 26, 2021

Introduction to data mesh, its importance, and how to get started.

Data is the leverage that unlocks innovation and gets companies ahead of the competition.

This is reflected in heavy investment in analytics and data infrastructure: data engineers and data scientists are pushed to the top of hiring priorities, data warehouses are migrated to data lakes to unlock real-time data analytics, and data ops is on the agenda of the C-suite.

What companies often encounter is that the centralized data architecture leads to unfulfilled promises when scaling.

What is the problem of a centralized data architecture design?

This is the bird's-eye view of the centralized data architecture design:

Multiple data sources are ingested into an ETL (or ELT) pipeline. The ETL process takes care of ingestion or extraction, data transformation, and data storage into a single, centralized data repository. Usually, the data storage is a central data lake or data warehouse, which is the repository of all data used by BI tools and other data consumers.
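As a rough illustration of this design, the sketch below funnels every source through one shared extract-transform-load path into a single repository. All names here (the sample sources, `extract`, `transform`, `load`, and the in-memory `WAREHOUSE`) are hypothetical stand-ins for real connectors and storage, not part of any particular tool.

```python
from typing import Callable

# Hypothetical in-memory stand-in for the central data lake or warehouse.
WAREHOUSE: dict[str, list[dict]] = {}

# Hypothetical sources; in practice these would be databases, APIs, or files.
SOURCES: dict[str, Callable[[], list[dict]]] = {
    "sales": lambda: [{"id": 1, "amount": "19.90"}, {"id": 2, "amount": "5.00"}],
    "support": lambda: [{"id": 7, "amount": "0"}],
}


def extract(name: str) -> list[dict]:
    """Pull raw rows from one source."""
    return SOURCES[name]()


def transform(rows: list[dict]) -> list[dict]:
    """One shared transformation applied to every source: cast amounts to float."""
    return [{**row, "amount": float(row["amount"])} for row in rows]


def load(name: str, rows: list[dict]) -> None:
    """Store the transformed rows in the single central repository."""
    WAREHOUSE[name] = rows


def run_central_pipeline() -> None:
    # Every source goes through the same stages, in the same order.
    for name in SOURCES:
        load(name, transform(extract(name)))


run_central_pipeline()
```

Because every source shares `transform` and `WAREHOUSE`, a change made for one source affects all the others, which is the coupling discussed later in this article.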

The centralized data architecture has been praised for its role in data-driven transformation.

Unlike the legacy systems it replaced, which were composed of disparate and siloed data infrastructures, the centralized architecture unifies data in a single source of truth (the data warehouse or data lake), democratizes data access through that single data store, and relies on underlying technology (data warehouses, data lakes) that allows for scaling in data volume, variety, and velocity, unlocking the potential of big data analytics.

So, where is the problem?

Enterprises find that the current centralized architecture has a common design flaw: it does not scale well when new data pipelines are needed.

Enterprises add data pipelines when they ingest new data sources, harmonize existing data between themselves via data transformations, develop ad hoc analyses needing new aggregations not foreseen in the data lake design, or need novel datasets for machine learning experimentation.

The centralized model fails to scale when the data ecosystem grows in complexity.

The failure to scale can be seen in many shortcomings:

Lack of data ownership.

Multiple workers produce and consume data along the centralized pipeline, but the data platform lacks overall ownership.

Software engineers usually produce data, data engineers and database administrators ingest, transform, aggregate, and serve data, while consumers like data scientists and analysts slice and dice data to understand the story hidden in the information.

The misalignment in incentives causes friction between interested parties.

Software engineers have no incentives to provide correct, meaningful, or validated data - issues such as missing data are often noted downstream.

Data engineers have neither the scope nor the domain expertise to understand the intricacies of how data is produced upstream or what the consumers downstream need.

Analysts and data scientists are often left with unclear, missing, and broken data and have to chase upstream coworkers to get a working understanding of the data before they can put it to good use.

Lack of data quality. Because of the lack of overall ownership, data quality suffers.

Data engineers care about ingestion, transformations, and loads. Their KPIs and monitoring systems are tailored to check whether the right number of rows passes the test, not whether semantic changes in the production of data have caused a concept drift that affects the machine learning algorithm downstream.
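To make this gap concrete, here is a minimal sketch in which a row-count check passes while a simple distribution check flags a semantic change. The sample data, the `row_count_ok` and `mean_shift` helpers, and the threshold of 3 standard deviations are illustrative assumptions, not a recommended monitoring setup.

```python
from statistics import mean, stdev

# Hypothetical order values: upstream silently switches from dollars to cents.
last_week = [19.9, 25.0, 22.5, 18.0, 21.0]
this_week = [1990.0, 2500.0, 2250.0, 1800.0, 2100.0]


def row_count_ok(rows: list[float], expected: int) -> bool:
    """Typical pipeline-level check: did the expected number of rows arrive?"""
    return len(rows) == expected


def mean_shift(baseline: list[float], current: list[float]) -> float:
    """Distance between the two means, measured in baseline standard deviations."""
    return abs(mean(current) - mean(baseline)) / stdev(baseline)


# The load looks healthy: the right number of rows arrived.
print("row count ok:", row_count_ok(this_week, expected=5))

# The values no longer mean the same thing: the mean moved far outside the baseline.
print("drift suspected:", mean_shift(last_week, this_week) > 3.0)
```

A check like `mean_shift` is deliberately crude. The point is only that it looks at what the values mean, which a row count never does.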

The same misalignment of KPIs can be seen upstream (software engineers) and downstream (data scientists and analysts). Since it is no one’s responsibility to take care of the system end-to-end, the quality of data suffers.

Impeded organizational scaling. The more complex the system grows, the harder it is to scale.

The centralized data architecture is designed as a highly coupled system. The stages of data processing are linearly dependent: you always have to extract, transform, and load new data, and you have to do it in this order on the shared infrastructure.

Adding new transformations, like filtering out customers who are not relevant for a specific analysis or aggregating financial data under a new dimension not present in other reports, introduces hidden changes that need to play well with existing processes.

Enabling a single new feature requires data engineers to change all the components in the pipeline. This either breaks down the quality of data (point above) or slows down operations when scaling due to the bottlenecks.

All these problems present as small hiccups when the data platform is small. Handling one data change in the pipeline per week is not a massive challenge.

But as the number of new sources grows, new transformations are added, and the organization scales, the hiccups become time-eating problems halting data-driven decision-making.

What is the solution? Data mesh.

What is data mesh?

Coined by ThoughtWorks consultant Zhamak Dehghani in her seminal essay “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”, data mesh describes an architectural design pattern that breaks with the centralized data architecture model.

The inspiration for data mesh is derived from a similar paradigm in software engineering: the change from monolithic applications to the distributed microservice architecture.

In monolithic applications, all the logic and data are centralized in a single, tightly coupled system, causing the same problems as centralized data architectures.

The answer for software engineering was to break up the monolith architecture into multiple microservices. Each microservice is independent of the others, fulfills a specific function, is organized around business operations (not software division of labor), and is owned end-to-end by a responsible team.

Its data equivalent in data mesh would be independent end-to-end data pipelines, which extract, transform, load, and analyze or productize data within the boundaries of a self-contained domain.

Instead of having a single repository for all the customer support, manufacturing, marketing, sales, financials, and other data needs, we break them down into their respective operational data units.

But beware.

Unless certain principles are applied, distributed data pipelines do not serve to alleviate the problems of monolithic data architecture. Instead, the company reverts to the previous legacy systems of siloed and broken data operations.

The 3 principles of data mesh

To develop a data mesh paradigm (image below), three principles need to be observed.

Domain-driven distributed architecture

The centralized model organizes its units along a sequential line: data production > ETL (ingestion, transformation, storage) > data storage > optional analytics-oriented transformations such as cubes and other aggregations > machine learning, BI, and consumption of data assets.

The pipeline is universal for all business domains.

In the data mesh architecture, the distribution of data is domain-driven. Each domain takes ownership over the production and maintenance of its own data pipelines. In effect, this creates distributed pipelines under each domain.

What are domains? Domains are areas of common data needs and visions. Usually, data-driven domains map to business domains. For example, the finance department would have its data-driven domain, while the marketing department would have a separate one. Alternatively, domains map to products. For example, the Google Search team would be its own domain, while the Google Ads team would be a separate one.

How does this architecture differ from siloed data? Domain-driven distributed architecture is not siloed. As pictured above by lines connecting domains, domains converge, sync, and collaborate when:

The source data is shared among domains to avoid duplication.
The end product of a domain is shared among the domains. An example would be a recommender system for Google. Both the organic search domain and the advertising domain would develop their end-to-end distributed pipelines to inform their recommendations after a user has entered a search query, but their respective products need to be aligned before showing (organic and paid) search results to the user.
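One way to picture domains that own their pipelines yet still share data is the sketch below. The `Domain` class, the two example domains, and the sample customer rows are hypothetical and chosen only for illustration. Each domain runs its own pipeline over a single shared copy of the source data and publishes the result so other domains can consume it.

```python
from dataclasses import dataclass, field
from typing import Callable

Rows = list[dict]


@dataclass
class Domain:
    """A hypothetical domain team that owns its pipeline end to end."""
    name: str
    pipeline: Callable[[Rows], Rows]
    published: dict[str, Rows] = field(default_factory=dict)

    def run(self, product: str, source: Rows) -> None:
        # The domain transforms the source data and publishes the result.
        self.published[product] = self.pipeline(source)


# Shared source data: one copy of the customers, used by both domains.
customers = [
    {"id": 1, "country": "CZ", "spend": 120.0},
    {"id": 2, "country": "US", "spend": 40.0},
]

finance = Domain(
    "finance",
    pipeline=lambda rows: [{"id": r["id"], "spend": r["spend"]} for r in rows],
)
marketing = Domain(
    "marketing",
    pipeline=lambda rows: [{"id": r["id"], "country": r["country"]} for r in rows],
)

finance.run("revenue_by_customer", customers)
marketing.run("customer_regions", customers)

# Marketing consumes finance's published product instead of rebuilding it.
revenue = {r["id"]: r["spend"] for r in finance.published["revenue_by_customer"]}
targets = [
    r["id"] for r in marketing.published["customer_regions"] if revenue[r["id"]] > 100
]
```

The two pipelines never touch each other's code. The only contact point is the published product, which is what separates this arrangement from both a monolith and a silo.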

Data as a product

In the centralized model, data is thought of either as the units that flow through the data pipelines or as a by-product of a functioning data platform (the main engineering product).

In the data mesh architecture, data is a first-class citizen and is viewed as a product.

Each domain is tasked to take over the end-to-end responsibility of data products. The data products here include all the domain data assets: the machine learning outputs, the finalized datasets, as well as the data pipelines that produce the end results.

Data product owners are accountable for the data products. They need to establish the same mentality as other product owners. They ask themselves questions such as:

How satisfied are my customers (data consumers) with my data products?
How will my customer learn about my data product (accessibility and discoverability concerns, self-describing nature of data products)?
What is the quality and trustworthiness of my product?
How much are my data products used (monitoring and analytics)?

This shift in mentality leads to better data products and organizational change. Instead of having hyper-specialized experts at each node of the centralized system, domain-driven teams need to organize themselves as cross-functional teams, to take full ownership (and accountability) of their data products.

Domain-agnostic self-serve infrastructure as a service

The domain-driven teams develop their distributed data products with the use of a self-serve data infrastructure. The infrastructure is domain-agnostic, so it allows the same standards and governance across all teams. This standardization is fundamental for collaboration between distributed teams, as well as a unified cultural understanding of data product specifications for the data consumers.

The underlying technology of a self-service platform design has the following characteristics:

Interoperable - the infrastructure is domain-agnostic and can be used by any domain team without the team having to develop its own infrastructure.
Discoverable - each data product has a data catalog entry with meta information such as owner, date of last update, quality assessment and concerns, source of origin, data lineage, sample datasets, etc.
Self-describing - the infrastructure enforces or advises on semantics and syntax to make data products self-describing. Examples of self-describing data would be clear naming conventions for tables and fields, common patterns for describing edge cases and null values, clear information on schema migrations, data product versioning, etc.
Addressable - data products can easily be accessed programmatically or by users. This entails either building an entry point document for users to understand where (IP / schema) data products are, or exposing data products as APIs with full documentation explaining the access protocols.
Trustworthy - quality metrics and measures are put in place both to guarantee the quality of data and to signal potential mishaps and edge cases in the data products.
Secure - all data is encrypted at rest and in motion, global access control is implemented, and security is monitored across all distributed data domains.
Standardized - the same standards for data quality, self-description, addressability, discoverability, etc. are implemented across the platform, to lower the cross-domain collaboration friction.
Monitoring, alerting, and logging - each infrastructure element is instrumented with capabilities that monitor and log events, as well as alert the users or maintainers when errors occur or values get out of predefined bounds.
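Several of these characteristics can be pictured as fields and rules on a catalog entry. The sketch below is a minimal illustration under assumptions: the `DataProduct` and `Catalog` classes, their field names, and the example product are hypothetical and do not describe any specific platform's API.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataProduct:
    """Hypothetical catalog entry for one data product."""
    name: str
    owner: str
    address: str  # where consumers reach it, e.g. a schema or an API path
    last_updated: date
    lineage: list[str] = field(default_factory=list)
    quality_checks: dict[str, bool] = field(default_factory=dict)


class Catalog:
    """Hypothetical shared catalog used by every domain team."""

    def __init__(self) -> None:
        self._entries: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Standardized: every product must name an owner and an address.
        if not product.owner or not product.address:
            raise ValueError(f"{product.name}: owner and address are required")
        self._entries[product.name] = product

    def search(self, term: str) -> list[DataProduct]:
        # Discoverable: consumers find products by name.
        return [p for p in self._entries.values() if term.lower() in p.name.lower()]

    def trusted(self, name: str) -> bool:
        # Trustworthy (in this sketch): at least one check exists and all pass.
        checks = self._entries[name].quality_checks
        return bool(checks) and all(checks.values())


catalog = Catalog()
catalog.register(
    DataProduct(
        name="finance.revenue_by_customer",
        owner="finance-team",
        address="finance.revenue_by_customer_v1",
        last_updated=date(2021, 5, 26),
        lineage=["crm.customers", "billing.invoices"],
        quality_checks={"no_null_ids": True, "fresh_within_24h": True},
    )
)
```

Putting the ownership and address rules in `register` rather than in each team's code is the design choice that keeps the standard identical across domains.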

Practical considerations: How to implement data mesh?

Data mesh implementation necessitates two major components:

A paradigm shift in data operations. Companies need to rethink and restructure their workforce to align themselves with the principles of data as a product and domain-driven distributed data.
Development of a domain-agnostic and self-serve infrastructure as a service.

Keboola can help you accelerate your data mesh deployment with infrastructure as a service.

Keboola is an end-to-end data operations platform, which offers out of the box:

Data governance and enterprise-level security standards.
Data Catalog for discoverability.
Extensive monitoring: every event, job, and user interaction is monitored to the finest granularity, to offer users an overview of the platform’s functioning.
Scalability. Keboola connects to over 250 sources and destinations, without additional engineering or maintenance needed.
Domain-agnosticism. Teams can work together on common data pipelines or separately on their own distributed pipelines, and converge when necessary.

Try it for free. Keboola has an always-free, no-questions-asked plan, so you can explore all the power of the data mesh paradigm. Feel free to give it a go or reach out to us if you have any questions.