The Risks of Data Fragmentation

Keboola Marketing Team | Aug 9, 2022 | 11 min read

Mass data fragmentation causes companies multiple problems - from missed business opportunities to increased risks of security breaches.

To better appreciate the risks, we need to understand what data fragmentation is, what causes it, and how to solve the problems it creates.

What is data fragmentation?

Data fragmentation describes the problem when your data assets are split across your entire data ecosystem without a common thread uniting them:

Different file formats (Excel, JSON, XML, …).
Different backups in different locations.
Multiple data stores, both on-premises and in the cloud, such as multiple databases and data warehouses.
Multiple non-connected versions of the same data assets (e.g. a production server, a testing server, a development server, and an analytic server all holding out-of-sync data about the same data assets).

The identifying characteristic of data fragmentation is a lack of a single location/solution that unifies all your data assets.

The ground truth represented in your data is only partially covered in each location/app/solution. Often, these partial truths come into conflict with one another.

Do not mistake data fragmentation for the distributed systems process

“Data fragmentation” is also a technical term in distributed systems.

Distributed database management systems (DBMS) split data sets into subsets to optimize the system’s performance. This process is (also) called data fragmentation.

Why is it useful? DBMS’ optimization algorithms use data fragmentation to improve SQL/NoSQL query processing. By running queries on only subsets of all the (big) data sets, the query workloads need lower memory allocations to reach the same results.

Also, it might be helpful to fragment your data for data privacy reasons. For example, a data architect for a healthcare app can design a fragmentation policy that splits data by location - one horizontal fragment for EU customers and GDPR compliance (with its own backups and data sharing policies) and one for the rest of the world. A hybrid policy can then split each fragment vertically as well, for example to keep sensitive columns apart.
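To make the distributed-systems sense of the term concrete, here is a minimal Python sketch of horizontal and vertical fragmentation. The records, column names, and region rule are all hypothetical and invented for illustration; a real DBMS does this inside its storage layer, not in application code.

```python
# Hypothetical patient records; every name and value here is made up.
patients = [
    {"id": 1, "name": "Ana", "region": "EU", "diagnosis": "A"},
    {"id": 2, "name": "Bob", "region": "US", "diagnosis": "B"},
    {"id": 3, "name": "Cleo", "region": "EU", "diagnosis": "C"},
]


def horizontal_fragment(rows, predicate):
    """Keep whole rows that satisfy the predicate (a split by rows)."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, columns, key="id"):
    """Keep only the chosen columns, plus the key needed to rejoin fragments."""
    return [{c: row[c] for c in (key, *columns)} for row in rows]


# Horizontal split: EU customers in one fragment, everyone else in another.
eu_rows = horizontal_fragment(patients, lambda r: r["region"] == "EU")
rest_rows = horizontal_fragment(patients, lambda r: r["region"] != "EU")

# A hybrid policy applies both: split by region, then separate sensitive columns.
eu_identity = vertical_fragment(eu_rows, ["name", "region"])
eu_medical = vertical_fragment(eu_rows, ["diagnosis"])
```

Each vertical fragment keeps the id column, so the fragments can be joined back together when a query needs the full record.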

Distributed system design is an interesting topic in its own right.

However, this article focuses on the problem of having data fragmented across your operations, not on the distributed-systems technique.

Let’s start by exploring what gives rise to a fragmented data ecosystem.

What causes data fragmentation?

There is not a single process that leads to data fragmentation.

The variability of data operations will often lead to a fragmented data ecosystem:

Fragmented data stack. As you work with multiple tools (databases, ETL tools, BI tools), each tool will tend to dominate its own piece of the pie. Unless you synchronize them, they will quickly deviate from one another. For example, you will build some metrics with DAX in Power BI that will differ from the metrics you computed via Transformations in your Snowflake data warehouse.
Data silos. Each department and team working with data will have their own interests at heart. For instance, when Marketing counts new customers, they look at the user’s first purchase on the website. When Sales counts new customers, they look at the first contact they had with a customer. Unless someone unifies the two definitions, you can quickly end up with two conflicting metrics (and double count customers who talked to Sales AND purchased online).
Engineering practices. It is easier to set up separate development, testing, production, and analytical servers (to appease the different technical teams and their data use cases) than to make sure they are all synced with each other.
And multiple other reasons.

As you can see, all of the causes above stem from the lack of a holistic data management policy. And they arise naturally as you use data.

But is this even problematic? What problems does mass data fragmentation cause?

What are the risks of data fragmentation?

When your data is fragmented, you expose your company to 9 crucial data risks.

Data fragmentation risk #1: Conflicting business definitions

Data fragmentation often arises from each department running its own data collection, transformation, and storage practices.

When processes are divided into (departmental) data silos, each silo creates its own version of the business truth.

This is when conflicting business definitions arise.

The example above of Marketing using a different definition for new customers than Sales is just one illustration. The more fragmented your data, the harder it is to have consistent business definitions.
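The Marketing versus Sales example can be sketched in a few lines of Python. The event log, customer ids, and event type names below are hypothetical; the point is only to show how two reasonable definitions produce different numbers from the same data, and how one shared definition removes the double counting.

```python
from datetime import date

# Hypothetical event log; customers, dates, and event types are made up.
events = [
    {"customer": "c1", "type": "sales_contact", "day": date(2022, 8, 1)},
    {"customer": "c1", "type": "purchase", "day": date(2022, 8, 3)},
    {"customer": "c2", "type": "purchase", "day": date(2022, 8, 5)},
    {"customer": "c3", "type": "sales_contact", "day": date(2022, 8, 9)},
]


def new_customers(events, year, month, event_type=None):
    """Customers whose first matching event falls in the given month.

    With event_type=None every event counts, i.e. "first touch of any kind".
    """
    first_seen = {}
    for e in events:
        if event_type is not None and e["type"] != event_type:
            continue
        c = e["customer"]
        if c not in first_seen or e["day"] < first_seen[c]:
            first_seen[c] = e["day"]
    return {c for c, d in first_seen.items() if (d.year, d.month) == (year, month)}


marketing = new_customers(events, 2022, 8, "purchase")      # first purchase
sales = new_customers(events, 2022, 8, "sales_contact")     # first contact
unified = new_customers(events, 2022, 8)                    # one shared definition

# Adding up the two departmental reports counts c1 twice.
naive_total = len(marketing) + len(sales)
```

In this made-up data, adding the departmental numbers reports four new customers, while the unified definition finds three.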

Data fragmentation risk #2: Lack of operational clarity

When data is dispersed across different datasets and systems, your frontline workers suffer and make mistakes.

For example, if your Customer Success Representative (CSR) needs to check four places before getting a definite answer to why a client's access to the platform was cut, there is a higher chance the CSR will look at the wrong payment data and anger the client.

Data fragmentation risk #3: Inaccessible data

When data is fragmented, different access rules are applied for each data set.

Maybe you don't have access credentials for the database you need. Or you do not have download privileges for the Excel file you need. Often you realize this too late, because the information is needed right there and then.

Solving problems in real time becomes an exercise in wishful thinking.

Data fragmentation risk #4: Higher chances of data loss

Similar to the issue above, when data is fragmented it can more easily be deleted by accident. Someone can delete an Excel file in good faith, not knowing this will cause a problem down the line.

Data fragmentation risk #5: Higher storage costs

Fragmented data is often duplicated. Different data silos keep a copy of the same information multiple times.

Replicated information takes up storage space unnecessarily. For the majority of data initiatives, this is not a problem. But the moment your organization works with unstructured data (IoT data, images, videos, sound clips), the storage bill can become prohibitively expensive.
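One rough way to see how much storage exact copies cost you is to group files by a hash of their contents. The sketch below is a minimal illustration, not a production tool: the function names are our own, it only catches byte-identical copies, and it reads each file fully into memory, so very large files would need chunked hashing instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root):
    """Group files under `root` by SHA-256 of their contents.

    Returns only the groups that contain two or more identical files.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


def wasted_bytes(groups):
    """Bytes that would be freed by keeping a single copy per group."""
    return sum(p.stat().st_size for group in groups for p in group[1:])
```

Calling find_duplicate_files on a shared drive folder and passing the result to wasted_bytes gives a first estimate of what duplication costs, before you decide which copy should be the one to keep.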

Data fragmentation risk #6: Duplicated work

Whether it is a data collection pipeline extracting information from a data source you already have in-house or a duplicated database system storing customer details already present in your CRM, data fragmentation duplicates work.

Not knowing what data you have leads to your workers unnecessarily and unknowingly replicating the same efforts as their coworkers.

Data fragmentation risk #7: Slower development and prototyping

Developing data products, experimenting, and prototyping are all about speed.

Figuring out where the data you need is slows you down. Joining data across all your different locations for a single experiment prolongs the time to results.

Data fragmentation risk #8: Increased security risks

When data is dispersed in different locations, under different data management rules, it is harder to manage and enforce consistent security best practices.

Fragmented data enlarges your attack surface and raises the risk of security breaches.

Data fragmentation risk #9: Vicious circle

Data fragmentation follows a vicious circle - the more fragmented your current data, the more likely your future data will be even more fragmented.

Partially, this stems from the broken windows theory: visible disorder invites more disorder. But more importantly, data fragmentation feeds itself, because it is always easier to add a new data source, a new data silo, or a new data storage than to figure out how to find, join, and sync the existing data assets.

Given the multiple risks of data fragmentation problems, how can we prevent them or even fix them?

How to correct data fragmentation?

When you're faced with a fragmented data ecosystem, there is only one solution: bring clarity with a coherent and holistic data management strategy.

Step 1: Take inventory of your data assets

Sit down with all the department/team leaders (or other representatives of data silos) and record all your data assets. Bonus points for understanding their data lineage and how data travels through your systems.

Step 2: Join data assets

Wherever possible, try to join existing assets together. You might need to get technical here. For example, you can join multiple customer identifiers (emails, addresses, telephone numbers, etc.) into a single customer representation view that bridges different data sets.
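As a minimal sketch of what joining customer identifiers can look like, the Python below groups records that share an email or a phone number, directly or through a chain of shared identifiers, using a union-find structure. The records, source names, and the choice of matching on lowercased exact values are assumptions for illustration; real identity resolution usually needs fuzzier matching and manual review.

```python
from collections import defaultdict

# Hypothetical records from different silos; all values are made up.
records = [
    {"source": "crm", "email": "ana@example.com", "phone": None},
    {"source": "webshop", "email": "Ana@Example.com", "phone": "+420111"},
    {"source": "support", "email": None, "phone": "+420111"},
    {"source": "crm", "email": "bob@example.com", "phone": "+420222"},
]


def resolve_customers(records, keys=("email", "phone")):
    """Group record indexes that share any identifier, directly or transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # (key, normalized value) -> index of first record with it
    for i, rec in enumerate(records):
        for key in keys:
            value = rec.get(key)
            if value is None:
                continue
            ident = (key, value.strip().lower())
            if ident in first_owner:
                # Same identifier seen before: merge the two groups.
                parent[find(i)] = find(first_owner[ident])
            else:
                first_owner[ident] = i

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())


customers = resolve_customers(records)
```

Here the support record has no email, but it shares a phone number with the webshop record, which shares an email with the CRM record, so all three end up in one customer group.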

Step 3: Discard data assets

Don’t be afraid to discard data assets, especially when they are outdated or duplicated.

Step 4: Set up a data governance framework

Set up a data governance framework that specifies how:

Data is tracked (plus lineage).
Data is accessed and shared.
Data is validated and quality is upheld.
Ownership of data assets is set up and maintained.

We have written extensively on the topic of data governance automation, so feel free to dive deeper.

A good data governance framework will help you better understand your data assets, spot when fragmentation is happening, and work against fragmentation.

Step 5: Monitor, evaluate, and adjust

Your data operations and systems will tend towards chaos if left unattended. The only way to build resilient systems is to constantly:

Monitor your data assets for data quality and compliance with your data governance framework.
When you find deviations, evaluate how to fix them.
Adjust your current processes, practices, and technological solutions to best solve and prevent future data fragmentations.
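The monitoring step can start very small. The sketch below assumes you keep an inventory of data assets with an owner and a last validation date, and flags assets that break two simple rules. The DataAsset fields, the rule names, and the 30-day threshold are hypothetical choices for illustration, not a prescribed framework.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DataAsset:
    name: str
    location: str
    owner: Optional[str]            # None means nobody owns the asset
    last_validated: Optional[date]  # None means it was never validated


def audit(assets, today, max_age_days=30):
    """Return (asset name, problem) pairs for assets that break the rules."""
    findings = []
    for a in assets:
        if not a.owner:
            findings.append((a.name, "no owner"))
        if a.last_validated is None:
            findings.append((a.name, "never validated"))
        elif (today - a.last_validated).days > max_age_days:
            findings.append((a.name, "validation overdue"))
    return findings


# Hypothetical inventory; names, locations, and dates are made up.
inventory = [
    DataAsset("orders", "warehouse", "finance", date(2022, 8, 1)),
    DataAsset("leads.xlsx", "shared drive", None, None),
    DataAsset("events", "database", "product", date(2022, 6, 1)),
]
findings = audit(inventory, today=date(2022, 8, 9))
```

Running a check like this on a schedule covers the monitoring step; evaluating and adjusting still need a person to decide what to fix.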

Keboola can help you fight data fragmentation

Keboola is an end-to-end data operations platform with built-in data governance tools. They automate much of the heavy lifting when you implement data governance policies that resist data fragmentation:

Track data lineage and operational metadata describing user activity, job activity, data flow, schema evolution, data pipeline performance, compliance with your security rules, etc. Keboola implements data governance by design, which offers you extensive people tracking and audit capabilities, as well as fingerprinting, to comply with regulatory standards on the one hand, and a full understanding of data lineage at the transaction and event level on the other.
Deploy the Data Catalog to centralize and unify data definitions, hence increasing data understanding and accessibility across business departments. Unified definitions allow you to increase data quality as well, by disambiguating different interpretations of the same incoming data.
Guarantee best-in-class security standards out of the box.
Support different governance roles with granular access practices, which safeguard data safety and privacy while empowering every user to get the data they need to do their best work.
Automate all your ETL data pipelines, from data collection through cleaning and transformation to ingestion into other tools.

But Keboola is not just a tool for automating data governance. It is designed to automate and speed up all data operations, from ETL pipeline construction and maintenance to deploying machine learning models in production.

We offer a no-questions-asked always-free tier. Try Keboola out and check for yourself what Keboola can do for you.
