Data Mesh Architecture Through Different Perspectives

How do data scientists, data analysts, and data engineers experience the data mesh architecture?

How To
September 15, 2022

We previously wrote about how the data mesh architecture rose as an answer to the problems of the monolithic centralized data model.

To recap, in centralized data models, ETL or ELT data pipelines collect data from various enterprise data sources and ingest it into a single central data lake or data warehouse. Data consumers and business intelligence tools access the data from the central storage to drive insights and inform decision-making. This monolithic organization regulates the enterprise data by defining the data governance rules, data quality standards, and data model schemas.

Unfortunately, the centralized architecture often causes bottlenecks in delivering business results. 

Why? 

Scalability becomes a problem. As the data ecosystem grows, it becomes progressively harder to add new data sources and keep the schema constraints intact. Data ownership is a problem, too. Data teams are organized around their skill sets (data engineering team, data science team, business intelligence team), and not around the end-to-end data products they’re building (sales data team, e-commerce data team, customer retention data team, …). The responsibility for data is passed around like a hot potato.

The distributed data mesh approach tries to solve these issues. 

Instead of a single data pipeline, it introduces a microservices architecture that builds data products for a business domain (e.g. “sales data” vs “data engineer’s data model”). 

Domain teams of cross-functional individuals (engineers and scientists and analysts devoted to the same business domain) are combined to address domain problems and take data ownership over their field. All within a platform that offers tools for data management through self-serve data infrastructure as a service.

Now, this is great. But it is also extremely abstract. 

To better understand how the data mesh architecture addresses analytical use cases, let’s look at it through the eyes of different stakeholders.

The data mesh architecture for a data scientist/analyst

As you log into your computer, you see an email from the Head of Operations asking for your help with shipping: “Can we get a better insight into the effects of delayed shipments? I want to understand how delays in order fulfillment and delivery are affecting our bottom line.”

Your data science lightbulb goes off. “I need to build a machine learning predictor that can anticipate shipping delays before they happen and explain which factors exacerbate them! Ok, I know I’ll be using linear regression or decision tree algorithms, but what about the input data?”

Let’s explore how your story develops in different architectures.

The data scientist/analyst story in a centralized architecture

As you approach the problem of the shipping delay prediction, you start to worry. You’ve never worked with this domain data before.

You look through the Orders table in Snowflake (a SQL data warehouse) and try to guess the meaning of data from the field names. 

‘shipped_at, shipped_on, what’s the difference between those columns? And why are some orders apparently fulfilled, but there is missing data in both columns?’ you mutter to yourself. 

With no metadata or a data catalog describing how the dataset was produced, you quickly abandon this self-exploration path.

You search your memory and recall that Stella gave an all-hands presentation on an analysis she did about shipping times. You search the office contact book to call her, only to realize she’s on vacation.

Back to the Head of Operations.

With a knot in your stomach, you call him to ask who you should contact about the data. ‘Rob from engineering’, he responds.

After talking to Rob, you realize that the data is complicated and missing crucial features. ‘shipped_at’ describes the moment we as a company shipped the product, while ‘shipped_on’ is used for when our logistics partners shipped it.

‘So why are some dates missing?’ you ask. Rob responds he doesn’t own the data and has no clue. 

You dig through Stella’s previous Python and R notebooks. After a couple of hours, you finally understand that products that were returned and re-shipped have NULLs for ‘shipped_on’ and ‘shipped_at’, and that the data for the return shipment needs to be computed from some metadata in another table. But that table is stored on-premises, not in the cloud data warehouse, and you have no data access.
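A quick profile of the missing timestamps is the kind of check that surfaces this pattern. Below is a minimal sketch in Python, not the notebook from the story. Only the ‘shipped_at’ and ‘shipped_on’ column names come from the example above; the sample rows and the ‘is_returned’ flag are hypothetical, made up for illustration (in the story, return information lives in another table).

```python
from collections import Counter

# Hypothetical sample of the Orders table. Only shipped_at / shipped_on
# are column names from the story; is_returned and all values are
# illustrative assumptions.
orders = [
    {"order_id": 1, "shipped_at": "2022-09-01", "shipped_on": "2022-09-02", "is_returned": False},
    {"order_id": 2, "shipped_at": None, "shipped_on": None, "is_returned": True},
    {"order_id": 3, "shipped_at": "2022-09-03", "shipped_on": "2022-09-03", "is_returned": False},
    {"order_id": 4, "shipped_at": None, "shipped_on": None, "is_returned": True},
]


def missing_shipping_profile(rows):
    """Count rows by (is_returned, both shipping timestamps missing)."""
    profile = Counter()
    for row in rows:
        both_missing = row["shipped_at"] is None and row["shipped_on"] is None
        profile[(row["is_returned"], both_missing)] += 1
    return profile


profile = missing_shipping_profile(orders)
for (is_returned, both_missing), count in sorted(profile.items()):
    print(f"returned={is_returned} both_missing={both_missing} rows={count}")
```

If every row with both timestamps missing is also a returned order, that is a strong hint the NULLs are a modeling choice and not a data quality bug.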

You haven’t even started building the machine learning model and you’ve already wasted a day. As you wonder if all big data is messy data, you call the Head of Operations again and ask whether he’s fine with a shipping model that leaves out returns.

He’s disappointed because order returns are important information for fulfillment. But if this is the best you can do, fine, let’s have a look at it.

You set off to build a machine learning predictor with a bitter taste in your mouth.

To recap the problem of centralized architectures so far:

  1. Working in teams organized around skill sets (data science, data engineering, data analysis) instead of domain teams (shipping, revenue, marketing, …) puts data workers at a knowledge disadvantage. This results in more communication back-and-forths and superficial understanding of domain data sets.
  2. Centralized architectures suffer from an overall lack of data ownership. Data operations are focused on pieces of the puzzle (engineering an ETL pipeline, a specific machine learning algorithm), but not on the end-to-end product (shipping data, in the example above).
  3. Centralized architectures usually lack common standards for data governance. This is why the data scientist/analyst had problems understanding where data came from and how to interpret metadata. Sure, some companies implement a data governance policy, but data management is viewed as an (often missing) extra step, not as a federated feature of the data platform. And there are no consistent standards for data governance across all teams.

Now let’s take a look at the story under the data mesh architecture.

The data scientist/analyst story in a data mesh architecture

You get excited by the Head of Operations’ request! 

Shipping data is your bread and butter. Before, you used to work on all analytical data use cases. Your attention was distributed over multiple domains and your business knowledge of each domain was superficial at best.

But since you were reassigned from the “Data team” to the “Shipping team”, you can focus on building advanced predictors for shipping data. Instead of working alongside other data scientists (who were also moved into other domain teams), you now work with a data engineer and a data analyst who are also devoted to the domain of shipping.

The request you’re working on is already halfway done, since you built a predictive model for a recent quarterly report.

You just spend a couple more hours automating the computation, updating the Data Catalog with information on the new data, and sending the link to the self-service model to the Head of Operations.

Why is the world so different for the data scientist/analyst in two different data architectures?

You might have noticed we used the scientist/analyst titles interchangeably. This is because the main problems for both roles are the same in the centralized model.

But don't be mistaken. The data mesh approach does not automatically break down silos and solve the centralized data problems. 

What it does, though, is acknowledge the intrinsic issue of the centralized model - the organization of work has (unwanted) spillover effects.

Because teams are organized around skills (data science, engineering, analytics), there is no clear data owner for the end-to-end pipeline. 

And there is no incentive to take care of the data catalog and metadata documentation, and no push to standardize missing data handling, because it is uncommon to share the same artifacts across teams.

In the data mesh approach, the teams are organized by business domains. 

The data they consume and produce is their own data product and they are responsible for every touchpoint. 

They are effectively the product owners of the domain, and data (and analyses, predictors, and other artifacts) is their product. So there are fewer back-and-forths and there is more product thinking within the business domain. That delivers results faster and with greater clarity.

Now, let’s look at the day in the life of a data engineer.

The data mesh architecture for a data engineer

In the middle of your meeting with the marketing team, the Director of paid advertising asks you to add a new field to the marketing report. “Should be easy, it’s just a field to tell us if we acquired the marketing lead via organic marketing or leadgen activities.”

Let’s see what the story looks like under two architectures.

The data engineer story in a centralized architecture

“Leadgen?” you write in your notebook. Oh, “lead generation!”

You quickly start to sketch the specification for what is needed. 

But as you go along, you get more questions than answers:

  • Should this information be available in real-time as we bid on ads, or is it alright if there is a delay?
  • Is this a one-off campaign or should I aim to automate the pipeline?
  • What format do we need the data to be in? A boolean field saying “True”, this is an organic lead? A separate table with lead contacts? 
  • How to compute the field if a user is both organic and leadgen? Or should we just duplicate the entries then?
  • What is the lifecycle of this information? Does it override the previous information? When can we delete the data?
  • Where can you get the information about a user’s status? You pore over the current data producers, but it seems you will need to integrate a new API to get the info.
  • How will this potential new data processing affect access control, schema consistency, current dimensional aggregates, and other data quality and data governance checks? Who will maintain it?
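The format and overlap questions in the list above have a concrete shape in code. As a minimal sketch (the names `Acquisition` and `classify` are hypothetical, not part of any existing schema), a flag type lets one lead be organic, leadgen, or both without duplicating entries, which a single boolean field cannot express:

```python
from enum import Flag, auto


class Acquisition(Flag):
    """Hypothetical acquisition channels; a lead may carry more than one."""
    ORGANIC = auto()
    LEADGEN = auto()


def classify(touchpoints):
    """Combine a lead's touchpoint labels into a single Acquisition value."""
    channel = Acquisition(0)  # empty flag: no known channel yet
    for label in touchpoints:
        channel |= Acquisition[label.upper()]
    return channel


both = classify(["organic", "leadgen"])
print(Acquisition.ORGANIC in both, Acquisition.LEADGEN in both)
```

Whether this is the right model still depends on the answers from the business side, which is exactly why the questions have to be asked.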

You spend a boring afternoon emailing back and forth with the Director of paid advertising. She is getting frustrated because you do not understand the business value of this task. You get frustrated because you need to explain the engineering tradeoffs. Meanwhile, other tasks are piling up in your Jira backlog.

Now, let’s look at how the story would play out under the data mesh approach.

The data engineer story in a data mesh architecture

Because you’ve been working with leads and marketing data assets for a while, you understand the essence of the task - the Director of paid advertising is trying to validate marketing activities based on segments.

So you suggest a different approach for the same results and less engineering time wasted. 

Instead of building a static report table with users and their acquisition status (organic, leadgen), you can pipe the acquisition information from the CRM (where it is already collected) directly into the Facebook Ads and Google Ads platforms to create audiences. This way you can create marketing experiments and reports automatically from the data exported from the advertising platforms.

All you need to do is build a reverse ETL pipeline, which should be easy with your data stack. And not worry about the engineering constraints of adding new data.
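To make the idea concrete, here is a minimal sketch of the reverse ETL step in Python. It assumes hypothetical CRM records with an `acquisition` field, and the `upload` callable is a stand-in for whatever advertising platform client you use; no real Facebook Ads or Google Ads API call is shown.

```python
from collections import defaultdict

# Hypothetical CRM export; field names and values are illustrative.
crm_leads = [
    {"email": "ana@example.com", "acquisition": "organic"},
    {"email": "ben@example.com", "acquisition": "leadgen"},
    {"email": "cy@example.com", "acquisition": "organic"},
]


def build_audiences(leads):
    """Group lead emails by acquisition channel."""
    audiences = defaultdict(list)
    for lead in leads:
        audiences[lead["acquisition"]].append(lead["email"])
    return dict(audiences)


def sync_audiences(audiences, upload):
    """Send each audience to a destination via the given upload callable.

    `upload(name, emails)` is a hypothetical wrapper around an ad
    platform client; it is injected so the sketch stays runnable.
    """
    for name, emails in audiences.items():
        upload(f"crm_{name}", emails)


sent = []
audiences = build_audiences(crm_leads)
sync_audiences(audiences, lambda name, emails: sent.append((name, len(emails))))
print(sent)
```

Passing the upload step in as a callable keeps the grouping logic independent of any one destination, so the same audiences can be sent to several platforms.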

Why is the world so different for the data engineer in two different data architectures?

Again, the data mesh approach does not automatically break down silos and solve your data problems. 

What it does, though, is speed up understanding between different roles. The data engineer understood the need of the marketing persona faster because they usually work in this domain. 

Additionally, the data platform the engineer uses is interoperable - allowing them to quickly build (reverse ETL) pipelines without worrying about the usual engineering constraints.

The data mesh architecture … for you?

Keboola can help you accelerate your data mesh deployment with infrastructure as a service.

Keboola is an end-to-end data platform, which offers out-of-the-box:

  1. Federated data governance and enterprise-level security standards. Keboola is designed as a SIEM platform, giving your organization end-to-end Security information and event management (SIEM) technology automatically for every event and every touchpoint in the platform.
  2. Data Catalog for sharing data (between data teams and departments) and documenting domain data. The Data Catalog is an essential part of thinking about data as a product, since it acts as a point of entry for coworkers from other business domains, fueling discoverability and data democratization.
  3. Observability through extensive monitoring: every event, job, and user interaction is monitored to the finest granularity, to offer users an overview of the platform’s functioning.
  4. Scalability. Keboola connects to over 250 sources and destinations, without additional engineering or maintenance needed. This allows your domain team to self-serve their engineering and analytic needs and build their own product pipelines. 
  5. Domain-agnosticism. Teams can work together on common data pipelines or separately on their own distributed pipelines (decentralization), and converge when necessary. Use Keboola’s Data Templates to build reusable end-to-end workflows that can be designed centrally and customized by every domain team individually. Data Templates can be deployed in a couple of clicks and are extremely flexible - customize projects, augment them with additional data, build on top of them with novel functionality, … the choice is yours.
  6. Democratization. Keboola allows fine-grained access control to each data set and one-click data sharing to empower your workers with the data they need to do their best jobs.

How can Keboola be used to empower your data mesh architecture? Check our clients’ use cases to get a taste of what Keboola has to offer:

  1. Mall Group restructured their data operations from a centralized to a data mesh architecture, empowering over 100 engineers to build data use cases autonomously within their data team, without relying on a centralized organization. This helped them produce 400+ data products and features yearly (previous record: 15), raising revenues by more than 7 figures (dive deeper on how Mall Group did it).
  2. Olfin Cars built new data products, from predictive demand algorithms to competition pricing, cumulatively raising sales in a single quarter by 760% (check how Olfin Cars grew with the data mesh architecture).
  3. Firehouse Subs built a self-service data infrastructure as a platform with Keboola. Now a single person can support over 1200 franchises (read full story).

Try it for free. Keboola has an always-free, no-questions-asked plan. So you can explore all the power of the data mesh paradigm. Feel free to give it a go or reach out to us if you have any questions.
