Data is the leverage that unlocks innovation and gets companies ahead of the competition.
This is reflected in heavy investment in analytics and data infrastructure: data engineers and data scientists are pushed to the top of hiring priorities, data warehouses are migrated to data lakes to unlock real-time analytics, and data ops is on the agenda of the C-suite.
Yet companies often find that a centralized data architecture fails to deliver on its promises as they scale.
This is a bird's-eye view of the centralized data architecture design:
Multiple data sources are ingested into an ETL (or ELT) pipeline. The ETL process takes care of ingestion or extraction, data transformation, and data storage into a single, centralized data repository. Usually, the data storage is a central data lake or data warehouse, which is the repository of all data used by BI tools and other data consumers.
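To make the flow above concrete, here is a minimal sketch of a centralized pipeline in Python. The source names (`crm`, `billing`), the lowercase-keys transformation, and the in-memory list standing in for the warehouse are all assumptions made for illustration, not a description of any particular product.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def extract(sources: Dict[str, Callable[[], Iterable[Record]]]) -> List[Record]:
    """Pull records from every source and tag each one with its origin."""
    rows: List[Record] = []
    for name, read in sources.items():
        for record in read():
            rows.append({**record, "source": name})
    return rows


def transform(rows: List[Record]) -> List[Record]:
    """Apply one shared transformation to all data: lowercase the column names."""
    return [{key.lower(): value for key, value in row.items()} for row in rows]


def load(rows: List[Record], warehouse: List[Record]) -> None:
    """Append the transformed rows to the single central store."""
    warehouse.extend(rows)


def run_pipeline(sources: Dict[str, Callable[[], Iterable[Record]]],
                 warehouse: List[Record]) -> None:
    """Run the stages in their fixed order: extract, then transform, then load."""
    load(transform(extract(sources)), warehouse)


# Hypothetical sources; in practice these would be databases, APIs, or files.
sources = {
    "crm": lambda: [{"Customer": "Ada", "Plan": "pro"}],
    "billing": lambda: [{"Customer": "Ada", "Amount": 49.0}],
}
warehouse: List[Record] = []  # stands in for the central data lake or warehouse
run_pipeline(sources, warehouse)
```

Every source passes through the same three functions and lands in the same store, which is the property the rest of this article examines.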
The centralized data architecture has been praised for its role in data-driven transformation.
Unlike the legacy systems it replaced, which were composed of disparate and siloed data infrastructures, the centralized architecture unifies data in a single source of truth (the data warehouse or data lake), democratizes data access through that single data store, and relies on underlying technology (data warehouses, data lakes) that allows for scaling in data volume, variety, and velocity, unlocking the potential of big data analytics.
So, where is the problem?
Enterprises find that the current centralized architecture has a common design flaw: it does not scale well when new data pipelines are needed.
Enterprises add data pipelines when they ingest new data sources, harmonize existing datasets with one another via data transformations, develop ad hoc analyses that need aggregations not foreseen in the data lake design, or need novel datasets for machine learning experimentation.
The centralized model fails to scale when the data ecosystem grows in complexity.
The failure to scale can be seen in many shortcomings:
Multiple workers produce and consume data along the centralized pipeline, but the data platform lacks overall ownership.
Software engineers usually produce data, data engineers and database administrators ingest, transform, aggregate, and serve data, while consumers like data scientists and analysts slice and dice data to understand the story hidden in the information.
The misalignment in incentives causes friction between interested parties.
Software engineers have no incentives to provide correct, meaningful, or validated data - issues such as missing data are often noted downstream.
Data engineers have neither the scope nor the domain expertise to understand the intricacies of how data is produced upstream or what the consumers downstream need.
Analysts and data scientists are often left with unclear, missing, and broken data and have to chase upstream coworkers to get a working understanding of the data, so they can put it to good use.
Data engineers care about ingestion, transformations, and loads. Their KPIs and monitoring systems are tailored to check whether the right number of rows passes the test, not whether semantic changes in the production of data have caused a concept drift that affects the machine learning algorithm downstream.
The same misalignment of KPIs can be seen upstream (software engineers) and downstream (data scientists and analysts). Since it is no one’s responsibility to take care of the system end-to-end, the quality of data suffers.
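The gap between a volume-based KPI and a semantic check can be sketched in a few lines of Python. The example is hypothetical: it assumes an upstream team silently switches order values from dollars to cents, and it uses a crude relative mean-shift test with an arbitrary 25% tolerance as a stand-in for real drift detection.

```python
from statistics import mean
from typing import Sequence


def row_count_check(batch: Sequence[float], expected_rows: int) -> bool:
    """The kind of check a load-oriented KPI covers: did the right number of rows arrive?"""
    return len(batch) == expected_rows


def mean_shift_check(reference: Sequence[float], batch: Sequence[float],
                     tolerance: float = 0.25) -> bool:
    """Return True when the batch mean stays within `tolerance` (relative) of the reference mean."""
    reference_mean = mean(reference)
    return abs(mean(batch) - reference_mean) <= tolerance * abs(reference_mean)


# Hypothetical data: the same four orders, before and after the unit change.
reference_order_values = [20.0, 35.0, 50.0, 45.0]       # dollars
new_order_values = [2000.0, 3500.0, 5000.0, 4500.0]     # cents

volume_ok = row_count_check(new_order_values, expected_rows=4)
semantics_ok = mean_shift_check(reference_order_values, new_order_values)
```

Here `volume_ok` is `True` while `semantics_ok` is `False`: the pipeline's own monitoring reports success, and the change in meaning only surfaces if someone who knows what the column represents is checking for it.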
The centralized data architecture is designed as a highly coupled system. The stages of data processing are linearly dependent: you always have to extract, transform, and load new data, and you have to do it in this order on the shared infrastructure.
Adding new transformations, like filtering out customers who are not relevant for a specific analysis or aggregating financial data under a new dimension not present in other reports, introduces hidden changes that need to play well with existing processes.
Enabling a single new feature requires data engineers to change all the components in the pipeline. This either degrades the quality of data (see the point above) or creates bottlenecks that slow down operations when scaling.
All these problems present as small hiccups when the data platform is small. Making one change to the pipeline per week is not a massive challenge.
But as the number of new sources grows, new transformations are added, and the organization scales, the hiccups become time-eating problems halting data-driven decision-making.
What is the solution? Data mesh.
Coined by ThoughtWorks consultant Zhamak Dehghani in her seminal essay “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”, data mesh describes an architectural design pattern that breaks with the centralized data architecture model.
The inspiration for data mesh is derived from a similar paradigm in software engineering: the change from monolithic applications to the distributed microservice architecture.
In monolithic applications, all the logic and data were centralized into a single, tightly coupled system, causing the same problems as centralized data architectures.
The answer for software engineering was to break up the monolithic architecture into multiple microservices. Each microservice is independent of the others, fulfills a specific function, is organized around business operations (not the software division of labor), and is owned end-to-end by a responsible team.
The equivalent in data mesh is a set of independent end-to-end data pipelines, which extract, transform, load, and analyze or productize data within the boundaries of a self-contained domain.
Instead of having a single repository for all the customer support, manufacturing, marketing, sales, financials, and other data needs, we break them down into their respective operational data units.
Unless certain principles are applied, distributed data pipelines do not serve to alleviate the problems of monolithic data architecture. Instead, the company reverts to the previous legacy systems of siloed and broken data operations.
To develop a data mesh paradigm (image below), three principles need to be observed.
The centralized model arranges its units along a sequential line: data production > ETL (ingestion, transformation, storage) > data storage > potentially additional analytics-oriented transformations such as cubes and other aggregations > machine learning, BI, and consumption of data assets.
The pipeline is universal for all business domains.
In the data mesh architecture, the distribution of data is domain-driven. Each domain takes ownership over the production and maintenance of its own data pipelines. In effect, this creates distributed pipelines under each domain.
What are domains? Domains are areas of common data needs and visions. Usually, data-driven domains map to business domains. For example, the finance department would have its data-driven domain, while the marketing department would have a separate one. Alternatively, domains map to products. For example, the Google Search team would be its own domain, while the Google Ads team would be a separate one.
How does this architecture differ from siloed data? Domain-driven distributed architecture is not siloed. As pictured above by lines connecting domains, domains converge, sync, and collaborate when:
In the centralized model, data is thought of either as the units that flow through the data pipelines or as a by-product of a functioning data platform (the main engineering product).
In the data mesh architecture, data is a first-class citizen and is viewed as a product.
Each domain takes end-to-end responsibility for its data products. The data products here include all the domain data assets: the machine learning outputs, the finalized datasets, as well as the data pipelines that produce the end results.
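One way to picture a data product is as a small, self-describing object that carries its owner and its contract alongside the data. The sketch below is an assumption made for illustration: the `DataProduct` class, its fields, and the type-name schema format are hypothetical and are not part of any data mesh specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProduct:
    """Hypothetical descriptor for a dataset that a domain owns end-to-end."""
    name: str
    domain: str
    owner: str                      # the accountable data product owner
    schema: Dict[str, str]          # column name -> expected Python type name
    pipeline_steps: List[str] = field(default_factory=list)

    def validate(self, record: Dict[str, object]) -> List[str]:
        """Return a list of contract violations for one record (empty if it conforms)."""
        problems: List[str] = []
        for column, type_name in self.schema.items():
            if column not in record:
                problems.append(f"missing column: {column}")
            elif type(record[column]).__name__ != type_name:
                actual = type(record[column]).__name__
                problems.append(f"{column}: expected {type_name}, got {actual}")
        return problems


# Hypothetical finance-domain product.
monthly_revenue = DataProduct(
    name="monthly_revenue",
    domain="finance",
    owner="finance-data-team",
    schema={"month": "str", "revenue": "float"},
    pipeline_steps=["extract_invoices", "aggregate_by_month", "publish"],
)

problems = monthly_revenue.validate({"month": "2024-01", "revenue": "n/a"})
```

Because the contract lives with the product, the owning team catches the malformed `revenue` value before any consumer sees it, rather than leaving it to be discovered downstream.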
Data product owners are accountable for the data products. They need to establish the same mentality as other product owners. They ask themselves questions such as:
This shift in mentality leads to better data products and organizational change. Instead of having hyper-specialized experts at each node of the centralized system, domain-driven teams need to organize themselves as cross-functional teams, to take full ownership (and accountability) of their data products.
The domain-driven teams develop their distributed data products with the use of a self-serve data infrastructure. The infrastructure is domain-agnostic, so the same standards and governance apply across all teams. This standardization is fundamental for collaboration between distributed teams, and it gives data consumers a unified understanding of data product specifications.
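As a rough sketch of what "domain-agnostic" can mean in practice, the Python below models a shared catalog that applies one set of publishing rules to every team. The `Catalog` class and the `REQUIRED_METADATA` fields are hypothetical; a real platform would define its own standards.

```python
from typing import Dict, List

# Hypothetical governance standard that every domain must meet to publish.
REQUIRED_METADATA = ("name", "domain", "owner", "description")


class Catalog:
    """A shared registry that knows nothing about any specific domain."""

    def __init__(self) -> None:
        self._products: Dict[str, Dict[str, str]] = {}

    def register(self, metadata: Dict[str, str]) -> str:
        """Publish a data product if it meets the shared standard; return its catalog key."""
        missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
        if missing:
            raise ValueError(f"missing required metadata: {', '.join(missing)}")
        product_key = f"{metadata['domain']}.{metadata['name']}"
        if product_key in self._products:
            raise ValueError(f"already registered: {product_key}")
        self._products[product_key] = dict(metadata)
        return product_key

    def find(self, domain: str) -> List[str]:
        """List the catalog keys of all products published by one domain."""
        return sorted(key for key, meta in self._products.items()
                      if meta["domain"] == domain)


catalog = Catalog()
finance_key = catalog.register({
    "name": "monthly_revenue",
    "domain": "finance",
    "owner": "finance-data-team",
    "description": "Revenue aggregated by calendar month.",
})
```

The same `register` call serves finance, marketing, or any other domain, so consumers can rely on every published product carrying an owner and a description regardless of which team built it.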
The underlying technology of a self-service platform design has the following characteristics:
Data mesh implementation necessitates two major components:
Keboola can help you accelerate your data mesh deployment with infrastructure as a service.
Keboola is an end-to-end data operations platform, which offers out of the box:
Try it for free. Keboola has an always-free, no-questions-asked plan, so you can explore all the power of the data mesh paradigm. Feel free to give it a go or reach out to us if you have any questions.