
It is Time to Rebundle the Modern Data Stack

Look, unbundling was great. But now it’s like herding cats.

How To
October 26, 2022
(Credit: Richard Nikoley)

When you look closely at the Modern Data Stack (MDS), you need to brace yourself.

The number of tools companies use for their databases, user administration, data extraction, data integration, security, machine learning, and a myriad of other use cases has grown astronomically.

Matt Turck, a VC at FirstMark, compiles a yearly infographic of the hottest tools in the data landscape:

And this is just a shortlist of both the most popular and fastest-growing tools.

The abundance of tools companies rely on is so great that it is hard to even see the big picture (here is the link to the full-resolution image, so you can dive in).

But this was not always the case. 

It Started with Automated Linear Pipelines in Airflow

In the early days, companies built simple linear pipelines: extract data, transform it (clean it, aggregate it, …) and load it into data storage or ingest it into a visualization tool.
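That early pattern can be sketched in a few lines of plain Python. This is a minimal illustration, not a real Airflow DAG: the source data, the aggregation, and the in-memory "warehouse" are all hypothetical stand-ins.

```python
# A minimal sketch of a linear pipeline: extract -> transform -> load.
# The records, the aggregation logic, and the storage are illustrative stand-ins.

def extract():
    """Pull raw records from a source (here: hard-coded sample rows)."""
    return [
        {"user": "alice", "amount": "10.5"},
        {"user": "bob", "amount": "3.2"},
        {"user": "alice", "amount": "7.3"},
    ]

def transform(rows):
    """Clean and aggregate: parse amounts, then sum per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + float(row["amount"])
    return totals

def load(totals, storage):
    """Write the aggregated result into the target storage."""
    storage.update(totals)

warehouse = {}
load(transform(extract()), warehouse)
```

In an orchestrator like Airflow, each of these functions would become a task and the orchestrator would run them in order, retry failures, and schedule the whole chain.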

The data integration tool soon evolved into a platform: a de facto centralized, all-in-one, end-to-end data operations tool. The Swiss Army knife of data tools.

As Gorkem Yurtseven puts it in his great essay “The Unbundling of Airflow”:

“Heavy users of Airflow can do a vast variety of data related tasks without leaving the platform; from extract and load scripts to generating reports, transformations with Python and SQL to syncing back data to BI tools.”

But as platforms grow, more users adopt them, and the needs of the users surpass the technical abilities of any platform. This is when the competitors take notice and carve out a piece of the pie to serve a neglected vertical.

The Great Unbundling of Airflow

The complex, multifaceted data pipelines built in Airflow were soon unbundled, and other tools took over the heavy lifting in neglected or underserved verticals.

To quote the eloquent Gorkem:

“Fivetran and Airbyte took care of the extract and load scripts that one might write with Airflow. dbt came for the data transformations, Census and Hightouch for Reverse ETL. Metrics and experimentation layers are also getting their own focused tooling; metrics with tools like Transform, Metriql, Supergrain and experimentation with Eppo.”

Unbundling was great for a while. 

It offered companies hyper-specialized tools that did one thing and did that one thing extremely well. 

But unbundling also left a gap in modern data practices.

In his essay Rebundling the Data Platform, Nick Schrock gets closer to the real problem of unbundling:

“Having this many tools without a coherent, centralized control plane is lunacy, and a terrible endstate for data practitioners and their stakeholders.”

So what exactly is the pain point behind a sea of specialized tools?

What is the Problem with Multiple Specialized Tools?

There are four main downsides that the unbundling caused.

1. Data UNdemocratization

When multiple specialized tools are needed to construct a singular data pipeline, a lot of your crucial experts are excluded. 

The modern tools are great, but they are tailored to specific niches of data experts. You have tools designed for data engineers, other tools for machine learning practitioners, tools for analysts, and tools for BI experts. 

Each tool has its distinctive logic, quirks, and learning curve. And each tool holds a part of your data pipelines that is not accessible to other data profiles in your company. Let alone domain experts who lack technical know-how but desperately need data for their work.

How many times have you had to wait on data engineers to deploy a new pipeline before a data analyst could compute a new metric?

Or how often have you heard about one team being the bottleneck for a company report to non-technical stakeholders?

Unbundling caused the inverse of data democratization. A data UNdemocratization if you will. It keeps processes and data in silos, causing unwanted dependencies and bottlenecks.

2. Lack of Observability

Each tool is built as its own independent platform. Each has its own way of tracing data lineage, logging errors, and orchestrating runs.

This is all fine and well until something goes wrong. 

Without a holistic view of how the data traveled between your tools, observability becomes a nightmare.

A simple change of dependencies in your architecture, or a root cause analysis of a pesky bug, consumes more hours, nerves, and coffee than anyone is willing to admit.

With the promise of increased agility, the Modern Data Stack has delivered a Trojan horse of endless lineage tracing across unsynchronized tools and vastly divergent tracing standards.

Even though there was an immediate response to this gap - dedicated data discovery and governance tools - each one is yet another tool™ in the mix that has to be integrated with everything else. And should you embrace OpenLineage, OpenMetadata, or some other standard?
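To make the problem concrete, here is a hypothetical, lowest-common-denominator lineage event: the kind of record a unified control plane would need every tool to emit. The field names and example dataset names are assumptions for illustration, not any particular standard's schema.

```python
# A hypothetical, minimal lineage event - the common record a unified
# control plane would need from every tool. Field names are illustrative.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    job: str           # the pipeline step that ran, e.g. "transform_orders"
    inputs: list       # upstream datasets it read
    outputs: list      # downstream datasets it wrote
    status: str        # "running" | "complete" | "failed"
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Without a shared standard, each tool emits its own variant of this record,
# and stitching the variants into one lineage graph is left to you.
event = LineageEvent(
    job="transform_orders",
    inputs=["raw.orders"],
    outputs=["analytics.orders_daily"],
    status="complete",
)
```

Standards like OpenLineage exist precisely to pin down one shared shape for this record, so that tools can interoperate instead of each inventing its own.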

3. Security Is Spread Thin

Similar to observability, each tool needs its own access controls and privilege settings, and each one increases the surface area for security leaks and attacks.

As more tools are added to your company’s data stack, each tool acts as a potential vector of vulnerability. 

Managing access and security becomes a juggling game against time and the inevitable “oh, I forgot about that one” slip.

Furthermore, this access is very often reduced to database-level access rights, plus VIP access for the data engineers.

4. Increased Costs

The fragmented tools bring fragmented pricing. And they increase the total cost of ownership.

Every tool comes with its price tag. But that is not the most expensive part of the Modern Data Stack. 

Because tools are hyper-specialized, they need devoted personnel to not just use them, but also maintain, deploy and fine-tune them. This is an opportunity cost. The time you could spend building shiny new data products is spent maintaining the machine that is supposed to build the data product.

But there is also a hidden cost.

In the same way that vendor lock-in is a risk factor for your operations, person lock-in peeks around the corner, especially in today's environment of workforce scarcity.

If a team of SpecialToolOps configured part of your data stack in their own quirky way, it is hard to decouple the team from the tooling. So when someone leaves the company or is just absent from work, it is hard if not impossible to replace them (or the tool) without causing costly delays and downtimes in your data operations. 

Is there a solution to the problems caused by unbundling?


Enter The Next Phase: Rebundling

There is an increased awareness that we need to re-bundle the Modern Data Stack if we want to make it work. 

But some approaches are better than others.

Certain advocates of rebundling are calling for improving the current tools so they can handle security, observability, quality monitoring, and lineage. 

For example, the open-source Dagster is betting on the future of improved tooling. Their goal is to enable “the explicit modeling and operation of entities such as dbt models and tables in Airbyte destinations natively in Dagster.”

And we get it. Python is great. So is SQL in dbt. But data needs are not limited to just Python or dbt or [insert any one solution to rule them all].

Gorkem - again - chose the perfect words to describe the situation:

“Building a better Airflow feels like trying to optimize writing code that shouldn’t have been written in the first place.”

So what is the alternative solution? 

The Modern Data Stack should be built as a data mesh architecture. The data mesh architecture is built on two core principles:

  1. Common infrastructure as a service. All data operations, from ingestions to visualization, are built on a common infrastructure that has universally set up security, deployment, monitoring, and tracing processes across all teams.
  2. Data as a product. Each team of (technical and domain) experts is responsible for end-to-end data products. The data products here include all the domain data assets: the machine learning outputs, the finalized datasets, as well as the data pipelines that produce the end results.

The architecture is loosely coupled. The infrastructure as a service takes care of the interoperability of all the tools, languages, frameworks, apps, and standards, while the teams choose the tools that serve them best. One note here: “tools” can mean languages, frameworks, or entire applications (for transformations, for instance: SQL, dbt, or a standalone app).
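Loose coupling of this kind can be sketched as a small shared interface: the common infrastructure depends only on the interface, and each team plugs in whichever transformation tool it prefers behind it. The step classes below are hypothetical stand-ins, not real integrations.

```python
# A sketch of loose coupling: the platform knows only a tiny shared
# interface; each team supplies its own implementation behind it.
# The step classes here are hypothetical stand-ins for real tools.
from typing import Protocol

class TransformStep(Protocol):
    def run(self, rows: list[dict]) -> list[dict]: ...

class SqlLikeFilter:
    """Stand-in for a SQL/dbt-style transformation: keep positive amounts."""
    def run(self, rows):
        return [r for r in rows if r["amount"] > 0]

class PythonDoubler:
    """Stand-in for a custom Python transformation: double every amount."""
    def run(self, rows):
        return [{**r, "amount": r["amount"] * 2} for r in rows]

def run_pipeline(rows: list[dict], steps: list[TransformStep]) -> list[dict]:
    # The shared infrastructure only sees the interface, never the tool.
    for step in steps:
        rows = step.run(rows)
    return rows

result = run_pipeline(
    [{"amount": 5}, {"amount": -1}],
    [SqlLikeFilter(), PythonDoubler()],
)
```

Swapping dbt for a standalone app then means swapping one step implementation; the pipeline and the rest of the infrastructure stay untouched.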

There is no such thing as a free lunch. So what is the tradeoff of the data mesh architecture?

Pavel Dolezal, CEO and co-founder of Keboola, puts it nicely:

“Sure you can hire the best 5 baristas to brew the most exquisite coffee for your company. But when 1000 employees need their morning fuel at the same time, your 5 baristas will not be able to handle their orders. Rather than relying on just 5 baristas, get more sophisticated coffee machines and teach people how to press a couple of buttons. Sure, the coffee will not be Michelin-star class, but at least no one will suffer their morning jitters.”

So of course, allowing everyone at your company to build and play with data might not produce the most polished technical products. But it sure beats waiting around on the bottlenecks to resolve before you get your reports.

And the data mesh architecture is exactly what Keboola offers. 

Keboola sets up the infrastructure, security, governance, environment, configuration, and all other DataOps out of the box (no worries, you can dive in and adjust/build your own data engines). You can either use it on its own for your data operations or integrate it with the specialized tools you grew to love. 

It is also called the “Data Stack as a Service” platform, to emphasize the ability to bring your own tools to the data game. Keboola allows you to:

  1. Run all the typical data operations. ELT, ETL, and Reverse ETL are made simple by components that integrate data from over 250 sources and destinations, with no additional engineering or maintenance needed.
  2. Include external solutions in the mix. Running a simple data pipeline tool? Using a custom Reverse ETL? All of it is supported.
  3. Be ready for data science. Out-of-the-box Python and R transformations and workspaces let you run free-code data science or plug and play with your favorite machine learning apps.
  4. Treat security as a first-class citizen. Data governance and enterprise-level security standards are a must for running a smooth operation.
  5. Get observability. Every event, job, and user interaction is monitored at the finest granularity, giving users an overview of the platform’s functioning.
  6. Democratize data. Use the Data Catalog to share data between teams and departments.
  7. Democratize work. From low-code, automation APIs, and user-friendly UIs to a full platform-as-code approach, your personnel can choose the level of technical expertise they want when building data pipelines.
  8. Stay interoperable. Keboola is designed to be fully interoperable with other tools and standards, from OpenLineage for governance to MLflow for data science. There is no vendor lock-in; use Keboola to plug and play with your favorite tools.

Ready to move from theory to practice? 

Keboola has an always free, no-questions-asked plan so you can explore all the power of the data mesh paradigm. Feel free to give it a go.
