Community
December 10, 2021
The modern data stack is broken. It’s time for Data stack as a service (DStaaS).
The modern data stack is complex, limited, and expensive. Read this article to learn about a scalable alternative: Data Stack as a Service.

Yes, I’ve said it. The modern data stack is a pain to work with.

But it wasn’t always like that. 

As companies realized they could leverage data to accelerate growth, new data tools were invented. 

From NoSQL databases that specialize in processing specific data structures (graph, anyone?) to the Python-Pandas-like Spark ecosystem that lets you run queries on Big Data (capital B, mind you). 

But with every new tool added to the data stack, the complexity increased.

We have so many data tools nowadays that the data technology stack is starting to look like the land of JavaScript libraries. Just. Too. Many.

Trying to work with so many different tools is like herding cats. 

And anyone who has herded cats, or worked with a distributed data stack, can attest that this is a nightmare.

What exactly is the problem, you might wonder?

Constant updates and maintenance 

Tools update all the time. Look at Facebook’s Graph and Marketing API changelogs. A massive new update lands every month, sometimes even four times in a single month.

And Facebook is not alone in this. All the data tool providers are guilty of improving their technological solutions. 

So what happens when a tool gets updated? Endpoints get deprecated, methods change names, techniques do not work across different versions, data needs migrating, the current script stops working, and the list goes on.

And with each update, data pipelines tend to break. First, we need to fix the tool that got upgraded, ASAP. But it is not just the updated tool that needs changing: the changes (and bugs!) often propagate downstream. 

When the Facebook Marketing API changes, we need to change Facebook Ads extractor scripts. 

But also the data ingestion scripts, the data exploration scripts, the metric computation scripts... every script that depended on that first tool.
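To make the blast radius concrete, here is a minimal Python sketch of one mitigation: pin the API version in a single module, so a version bump means one edit instead of changes scattered across every downstream script. The endpoint shape and field names are illustrative assumptions, not a guaranteed match for the current Marketing API.

```python
import json
import urllib.parse
import urllib.request

# Assumed version string -- bump it here, in one place, when the API deprecates it.
GRAPH_API_VERSION = "v12.0"
BASE_URL = f"https://graph.facebook.com/{GRAPH_API_VERSION}"


def insights_url(ad_account_id: str, access_token: str) -> str:
    """Build the version-pinned insights URL (field names are illustrative)."""
    query = urllib.parse.urlencode(
        {"fields": "impressions,clicks,spend", "access_token": access_token}
    )
    return f"{BASE_URL}/act_{ad_account_id}/insights?{query}"


def fetch_ad_insights(ad_account_id: str, access_token: str) -> dict:
    """Fetch ad insights; failing loudly here beats silently breaking downstream."""
    with urllib.request.urlopen(
        insights_url(ad_account_id, access_token), timeout=30
    ) as resp:
        return json.load(resp)
```

Every extractor, ingestion, and metric script imports `BASE_URL` from this one module, so a deprecation shows up as a single diff rather than a hunt through the whole stack.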

With so many updates in so many different places, it feels like we are constantly just chasing the dragon of change.

How many data engineers does it take to change an experiment?

We worship experiments as the holy grail of growth. 

The more experiments you do, the more you learn, the faster you optimize your product, the faster you grow. Jeff Bezos is renowned for endorsing this experimentation culture:

“Our success at Amazon is a function of how many experiments we do per year, per month, per week, per day.”

But when you work with a dispersed and varied data stack, that quote reads more like a mug motto than an actual data strategy. Running experiments across so many disconnected tools is slow and cumbersome.

Data pipelines are hard to change. Because tools are separated from each other, it is hard to make changes in all the tools you need to run your experiments. For example, if I want to run an A/B test to determine if one success metric could be improved on a different cohort, I will have to change the cohort identification and labeling in multiple tools before I can even run my experiment. 
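One common way around the per-tool relabeling problem is deterministic, hash-based bucketing: if every tool derives the cohort label from the same function of (user, experiment), there is nothing to reconfigure tool by tool. A minimal Python sketch, with hypothetical names:

```python
import hashlib


def assign_cohort(
    user_id: str,
    experiment: str,
    variants: tuple = ("control", "treatment"),
) -> str:
    """Deterministically bucket a user into an experiment variant.

    Hashing (experiment, user_id) means every tool that imports this one
    function agrees on the label -- the same user always lands in the same
    variant, with no per-tool cohort tables to keep in sync.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Changing the cohort definition then means changing one function, not re-labeling users in every tool the experiment touches.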

Running experiments can have spillover effects. Because each tool is isolated, we often lack introspection into who is running what on which tool. So often a data experiment is hard to interpret because there might be clashing effects from changes in other tools. And don’t get me started on accidentally overriding the production server, because the test server is not properly configured. Yikes. 

“Where did X come from?”

Whether we are solving a bug using Root Cause Analysis or we got a data request under GDPR, we often have to trace data, step-by-step, from its location to its source.

With a distributed data stack, an innocent request becomes a laborious task. Often it is not clear how data jumped from one tool to another. On top of the missing documentation of data flow between tools, each tool has its own rules. 

From opposing naming conventions (must be UPPERCASE vs snake_case only) to changing data types, as data flows through the system, it is hard to trace its journey.
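A small illustration of the convention clash: a normalization helper applied at every tool boundary keeps column names in a single convention, so the same field is traceable across systems. This is a generic sketch, not any particular tool's API:

```python
import re


def to_snake_case(name: str) -> str:
    """Normalize a column name (UPPERCASE, camelCase, kebab-case) to snake_case."""
    name = re.sub(r"[\s\-]+", "_", name.strip())          # spaces/hyphens -> underscores
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)   # split camelCase boundaries
    return name.lower()


def normalize_columns(row: dict) -> dict:
    """Apply one naming convention to a record at the boundary between tools."""
    return {to_snake_case(key): value for key, value in row.items()}
```

Run once on ingest and once on export, and "CUSTOMER ID", "customerId", and "customer-id" all trace back to the same `customer_id`.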

Data observability becomes like finding a needle in a haystack. Except there are multiple haystacks, and each one is on a different continent.

Sharing is caring … unless you share data

Often when you share data, you want to share the entire data pipeline, or “logic” of how a metric, table, or dashboard got constructed.

This wish alone sets off a domino effect: figuring out who has access to which tool and what permissions to grant them, just so they can see how the ETL pipeline was constructed.

Costs are higher (and hidden)

Deploying multiple distributed tools across your data stack increases your costs. 

Each standalone tool has a higher dollar cost than tools that are bundled together. Hence the success of Infrastructure as a Service: AWS and other cloud vendors provide infrastructure at a lower cost than building it yourself piece by piece.

But there are also hidden costs with maintaining multiple dispersed tools. 

From the cost of maintenance to the cost of onboarding different technologies and data architectures, distributed data stacks take a higher toll on the bottom line.

And don’t get me started on trying to figure out how much your entire data stack costs. 

Companies build special pipelines and dashboards just to pull data from disparate systems and consolidate it into a single view of their running costs.

Is there a better way?

There is: Data Stack as a Service (DStaaS). With Data Stack as a Service, all your disparate tools are consolidated and centralized into a single platform. 

Keboola offers exactly that: a centralized Data Stack as a Service.

By bringing disparate tools together, you consolidate and lower costs:

[Chart: data stack as a service vs. the modern data stack]

But cost-cutting is just one of the advantages.

Data observability and governance are easier in a centralized view than tracing data across a myriad of technical tools.

Keboola also takes care of maintaining your data stack (hence “as a service”), so you do not waste time patching old and leaky data pipelines, but instead build new ones for faster time-to-market and experimentation. 

Pipelines you can easily share with others from a single location.

“But I already have tool X that I love!” We get you. There is no arguing with love.

This is why Keboola unifies tools instead of replacing them. It is a universal connector, so you can plug in your existing data stack or adopt a new stack altogether. Building and deploying can be done with a couple of simple clicks. 

Take Keboola for a spin. Keboola has an always-free, no-questions-asked plan. So, you can explore all the power Keboola has to offer. Feel free to give it a go or reach out to us if you have any questions.


Run a 100% data-driven business without any extra hassle.
Pay as you go, starting with our free tier.
