How do data scientists, data analysts, and data engineers experience the data mesh architecture?
We previously wrote about how the data mesh architecture rose as an answer to the problems of the monolithic centralized data model.
To recap, in centralized data models, ETL or ELT data pipelines collect data from various enterprise data sources and ingest it into a single central data lake or data warehouse. Data consumers and business intelligence tools access the data from the central storage to drive insights and inform decision-making. This monolithic organization regulates the enterprise data by defining the data governance rules, data quality standards, and data model schemas.
Unfortunately, the centralized architecture often causes bottlenecks in delivering business results.
Why?
Scalability becomes a problem. As the data ecosystem grows, it becomes progressively harder to add new data sources and keep the schema constraints intact. Data ownership is a problem, too. Data teams are organized around their skill sets (data engineering team, data science team, business intelligence team), not around the end-to-end data products they’re building (sales data team, e-commerce data team, customer retention data team, …). The responsibility for data is passed around like a hot potato.
The distributed data mesh approach tries to solve these issues.
Instead of a single data pipeline, it introduces a microservices architecture that builds data products for a business domain (e.g. “sales data” vs “data engineer’s data model”).
Domain teams of cross-functional individuals (engineers and scientists and analysts devoted to the same business domain) are combined to address domain problems and take data ownership over their field. All within a platform that offers tools for data management through self-serve data infrastructure as a service.
Now, this is great. But it is also extremely abstract.
To better understand how the data mesh architecture addresses analytical use cases, let’s look at it through the eyes of different stakeholders.
As you log into your computer, you see an email from the Head of Operations asking for your help with shipping: “Can we get a better insight into the effects of delayed shipments? I want to understand how delays in order fulfillment and delivery are affecting our bottom line.”
Your data science lightbulb goes off. “I need to build a machine learning predictor that can anticipate shipping delays before they happen and explain which factors exacerbate them! Ok, I know I’ll be using linear regression or decision tree algorithms, but what about the input data?”
Let’s explore how your story develops in different architectures.
As you approach the problem of the shipping delay prediction, you start to worry. You’ve never worked with this domain data before.
You look through the Orders table in Snowflake (a SQL data warehouse) and try to guess the meaning of data from the field names.
‘shipped_at, shipped_on, what’s the difference between those columns? And why are some orders apparently fulfilled, but there is missing data in both columns?’ you mutter to yourself.
With no metadata or a data catalog describing how the dataset was produced, you quickly abandon this self-exploration path.
You remember that Stella gave an all-hands presentation on an analysis she did about shipping times. You look her up in the office contact book to call her, only to find out she’s on vacation.
Back to the Head of Operations.
With a knot in your stomach, you call him to ask who you should contact about the data. ‘Rob from engineering’, he responds.
After talking to Rob, you realize that the data is complicated and missing crucial features. ‘shipped_at’ describes the moment we as a company shipped the product, while ‘shipped_on’ is used for when our logistics partners shipped it.
‘So why are some dates missing?’ you ask. Rob responds he doesn’t own the data and has no clue.
You dig through Stella’s previous Python and R notebooks. After a couple of hours, you finally understand that products that were returned and re-shipped have NULLs for ‘shipped_on’ and ‘shipped_at’, and that the data for the return shipment needs to be computed from some metadata in another table. But that table is stored on-premises, not in the cloud data warehouse, and you don’t have access to it.
You haven’t even started building the machine learning model and you have already wasted a day. As you wonder whether all big data is messy data, you call the Head of Operations again and ask whether he’s fine with a shipping model that excludes returns.
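The detective work in this part of the story boils down to a simple profiling check. Here is a minimal sketch in Python, assuming a handful of made-up order rows. The `shipped_at` and `shipped_on` columns come from the story, while `order_id` and the `was_returned` flag are hypothetical and added only for illustration. Counting missing values per group is enough to surface the pattern:

```python
from collections import Counter

# Made-up sample rows. shipped_at and shipped_on are the columns from the story;
# order_id and was_returned are assumptions added for this illustration.
orders = [
    {"order_id": 1, "shipped_at": "2023-01-02", "shipped_on": "2023-01-03", "was_returned": False},
    {"order_id": 2, "shipped_at": None, "shipped_on": None, "was_returned": True},
    {"order_id": 3, "shipped_at": "2023-01-05", "shipped_on": "2023-01-05", "was_returned": False},
    {"order_id": 4, "shipped_at": None, "shipped_on": None, "was_returned": True},
]


def missing_shipping_dates(rows):
    """Count orders where both shipping timestamps are missing, split by return status."""
    counts = Counter()
    for row in rows:
        if row["shipped_at"] is None and row["shipped_on"] is None:
            counts[row["was_returned"]] += 1
    return counts


missing_by_return_status = missing_shipping_dates(orders)
print(dict(missing_by_return_status))
```

In this toy sample, every order with missing shipping dates is a returned order. That is the kind of rule a documented dataset would state up front, instead of leaving it buried in old notebooks.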
He’s disappointed because order returns are important information for fulfillment. But if this is the best you can do, fine, let’s have a look at it.
You set off to build a machine learning predictor with a bitter taste in your mouth.
To recap the problem of centralized architectures so far:
Now let’s take a look at the story under the data mesh architecture.
You get excited by the Head of Operations’ request!
Shipping data is your bread and butter. Before, you used to work on all analytical data use cases. Your attention was distributed over multiple domains and your business knowledge of each domain was superficial at best.
But since you were moved from the “Data team” to the “Shipping team”, you can focus on building advanced predictors for shipping data. Instead of working alongside other data scientists (who were also embedded in other domain teams), you now work with a data engineer and a data analyst who are also devoted to the shipping domain.
The request you’re working on is already halfway done since you already worked on a predictive model for a quarterly report.
You just spend a couple more hours automating the computation, updating the Data Catalog with information on the new data, and sending the link to the self-service model to the Head of Operations.
You might have noticed we used the scientist and analyst titles interchangeably. This is because the main problems for both roles are the same in the centralized model.
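What might such a Data Catalog update contain? The sketch below is a hypothetical, tool-agnostic structure written in Python. It is not the API of any particular catalog product, and the class and field names (`ColumnDoc`, `CatalogEntry`, `nullable_when`) are assumptions made for illustration. The column descriptions reuse the definitions from the shipping story:

```python
from dataclasses import dataclass, field


@dataclass
class ColumnDoc:
    """Documentation for one column (hypothetical structure)."""
    name: str
    description: str
    nullable_when: str = ""  # plain-language rule for when the value is missing


@dataclass
class CatalogEntry:
    """A catalog record for one dataset, including the team that owns it."""
    dataset: str
    owner_team: str
    columns: list = field(default_factory=list)

    def describe(self, column_name):
        """Return the description of a column, or raise KeyError if undocumented."""
        for col in self.columns:
            if col.name == column_name:
                return col.description
        raise KeyError(column_name)


orders_entry = CatalogEntry(
    dataset="orders",
    owner_team="Shipping team",
    columns=[
        ColumnDoc("shipped_at", "When we as a company shipped the product",
                  nullable_when="order was returned and re-shipped"),
        ColumnDoc("shipped_on", "When our logistics partner shipped the product",
                  nullable_when="order was returned and re-shipped"),
    ],
)

print(orders_entry.describe("shipped_on"))
```

With an entry like this in place, the next person who opens the Orders table can look up what each column means and who owns it, without calling anyone.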
But don't be mistaken. The data mesh approach does not automatically break down silos and solve the centralized data problems.
What it does, though, is acknowledge the intrinsic issue of the centralized model - the organization of work has (unwanted) spillover effects.
Because teams are organized around skills (data science, engineering, analytics), there is no clear data owner for the end-to-end pipeline.
And there is no incentive to maintain the data catalog and metadata documentation, and no push to standardize missing-data handling, because teams rarely share the same artifacts.
In the data mesh approach, the teams are organized by business domains.
The data they consume and produce is their own data product and they are responsible for every touchpoint.
They are effectively the product owners of the domain, and data (and analyses, predictors, and other artifacts) is their product. So there are fewer back-and-forths and there is more product thinking within the business domain. That delivers results faster and with greater clarity.
Now, let’s look at the day in the life of a data engineer.
In the middle of your meeting with the marketing team, the Director of paid advertising asks you to add a new field to the marketing report. “Should be easy, it’s just a field to tell us if we acquired the marketing lead via organic marketing or leadgen activities.”
Let’s see what the story looks like under two architectures.
“Leadgen?” you write in your notebook. Oh, “lead generation!”
You quickly start to sketch the specification for what is needed.
But as you go along, you get more questions than answers:
You spend a boring afternoon emailing back and forth with the Director of paid advertising. She is getting frustrated because you do not understand the business value of this task. You get frustrated because you need to explain the engineering tradeoffs. Meanwhile, other tasks are piling up in your Jira backlog.
Now, let’s look at how the story would play out under the data mesh approach.
Because you’ve been working with leads and marketing data assets for a while, you understand the essence of the task - the Director of paid advertising is trying to validate marketing activities based on segments.
So you suggest a different approach for the same results and less engineering time wasted.
Instead of building a static report table with users and their acquisition status (organic, leadgen), you can pipe the acquisition information from the CRM (where it is already collected) directly into the Facebook Ads and Google Ads platforms to create audiences. This way you can create marketing experiments and reports automatically from the data exported from the advertising platforms.
All you need to do is build a reverse ETL pipeline, which should be easy with your data stack, and you don’t have to worry about the engineering constraints of adding new data.
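To make the idea of reverse ETL concrete, here is a minimal Python sketch under stated assumptions. The CRM rows are made up, the field names `email` and `acquisition` are hypothetical, and `push_audience` is a placeholder rather than a real Facebook Ads or Google Ads call. A production pipeline would use the platforms' own APIs or a ready-made connector at that step.

```python
from collections import defaultdict

# Made-up CRM export; the field names are assumptions for this illustration.
crm_leads = [
    {"email": "ana@example.com", "acquisition": "organic"},
    {"email": "ben@example.com", "acquisition": "leadgen"},
    {"email": "cy@example.com", "acquisition": "organic"},
]


def build_audiences(leads):
    """Group lead emails by acquisition channel (one audience per channel)."""
    audiences = defaultdict(list)
    for lead in leads:
        audiences[lead["acquisition"]].append(lead["email"])
    return dict(audiences)


def push_audience(platform, name, emails):
    """Placeholder for the upload step; it only returns a receipt describing
    what would be sent. It does not call any real advertising API."""
    return {"platform": platform, "audience": name, "size": len(emails)}


audiences = build_audiences(crm_leads)
receipts = [
    push_audience(platform, name, emails)
    for platform in ("facebook_ads", "google_ads")
    for name, emails in audiences.items()
]

for receipt in receipts:
    print(receipt)
```

The point of the design is that the acquisition status never has to be modeled as a new report column. It travels from the CRM straight to the tools where the marketing team already runs its experiments.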
Again, the data mesh approach does not automatically break down silos and solve your data problems.
What it does, though, is speed up understanding between different roles. The data engineer understood the need of the marketing persona faster because they usually work in this domain.
Additionally, the data platform the engineer uses is interoperable, allowing them to quickly build (reverse ETL) pipelines without worrying about the usual engineering constraints.
Keboola can help you accelerate your data mesh deployment with infrastructure as a service.
Keboola is an end-to-end data platform, which offers out-of-the-box:
How can Keboola be used to empower your data mesh architecture? Check our clients’ use cases to get a taste of what Keboola has to offer:
Try it for free. Keboola has an always-free, no-questions-asked plan. So you can explore all the power of the data mesh paradigm. Feel free to give it a go or reach out to us if you have any questions.