Mass data fragmentation causes companies multiple problems - from missed business opportunities to increased risks of security breaches.
To better appreciate the risks, we need to understand what data fragmentation is, what causes it, and how to solve the problems it creates.
Data fragmentation describes the problem when your data assets are split across your entire data ecosystem without a common thread uniting them:
The identifying characteristic of data fragmentation is a lack of a single location/solution that unifies all your data assets.
The ground truth represented in your data is only partially covered in each location/app/solution. Often, these partial truths come into conflict with one another.
“Data fragmentation” is also a technical term in distributed systems.
Distributed database management systems (DBMS) split data sets into subsets to optimize the system’s performance. This process is (also) called data fragmentation.
Why is it useful? DBMS’ optimization algorithms use data fragmentation to improve SQL/NoSQL query processing. By running queries on only subsets of all the (big) data sets, the query workloads need lower memory allocations to reach the same results.
It can also be helpful to fragment your data for data privacy reasons. For example, a data architect for a healthcare app can design a hybrid fragmentation policy that splits data by customer location: one fragment for EU customers, with the backup and data sharing policies that GDPR compliance requires, and one for the rest of the world.
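To make the distributed-systems meaning concrete, here is a minimal sketch in plain Python. In the usual terminology, splitting a table by rows (for example, by region) is horizontal fragmentation, splitting it by columns is vertical fragmentation, and combining the two is hybrid fragmentation. The patient records, field names, and helper functions below are invented for illustration; a real distributed DBMS does this internally and far more efficiently.

```python
# Hypothetical patient records; all names and values are made up.
records = [
    {"id": 1, "name": "Ana", "region": "EU", "diagnosis": "asthma"},
    {"id": 2, "name": "Bob", "region": "US", "diagnosis": "flu"},
    {"id": 3, "name": "Chloe", "region": "EU", "diagnosis": "migraine"},
]


def horizontal_fragments(rows, key):
    """Row-wise split: group rows by the value of one column."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[key], []).append(row)
    return fragments


def vertical_fragment(rows, columns, primary_key="id"):
    """Column-wise split: keep selected columns plus the primary key,
    so that fragments can be joined back together later."""
    keep = [primary_key] + [c for c in columns if c != primary_key]
    return [{c: row[c] for c in keep} for row in rows]


# Horizontal step: EU rows can live under their own backup/sharing policy.
by_region = horizontal_fragments(records, "region")
eu_rows = by_region["EU"]

# Vertical step on the EU fragment: identity and medical data kept apart.
eu_identity = vertical_fragment(eu_rows, ["name"])
eu_medical = vertical_fragment(eu_rows, ["diagnosis"])

# A query about EU patients now scans 2 rows instead of all 3.
eu_asthma_ids = [row["id"] for row in eu_medical if row["diagnosis"] == "asthma"]
```

Applying the vertical split on top of the horizontal one is what makes this a hybrid scheme. Keeping the primary key in every vertical fragment is the design choice that lets the pieces be rejoined when a query needs the full record.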
Distributed system design is an interesting topic in its own right.
However, this article focuses on the problem of having data fragmented across your operations, not on fragmentation as a distributed-systems technique.
Let’s start by exploring what gives rise to a fragmented data ecosystem.
No single process leads to data fragmentation.
The variability of data operations will often lead to a fragmented data ecosystem:
As you can see, all of the options above stem from a lack of a holistic data management policy. And they happen naturally, as you use data.
But is this even problematic? What problems does mass data fragmentation actually cause?
When your data is fragmented, you expose your company to 9 crucial data risks.
Data fragmentation often arises from each department running its own data collection, transformation, and storage practices.
When processes are divided into (departmental) data silos, each silo creates its own version of the business truth.
This is when conflicting business definitions arise.
The example above of Marketing using a different definition for new customers than Sales is just one illustration. The more fragmented your data, the harder it is to have consistent business definitions.
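A small sketch of how two silos can disagree about "new customers" while reading the same raw records. The two definitions used here (Marketing counts sign-ups, Sales counts first purchases) and the sample data are assumptions invented for illustration, not definitions taken from any real team.

```python
from datetime import date

# Hypothetical customer records shared by both departments.
customers = [
    {"id": "c1", "signed_up": date(2024, 1, 5), "first_purchase": date(2024, 1, 20)},
    {"id": "c2", "signed_up": date(2024, 1, 28), "first_purchase": date(2024, 2, 3)},
    {"id": "c3", "signed_up": date(2024, 1, 15), "first_purchase": None},
    {"id": "c4", "signed_up": date(2023, 12, 30), "first_purchase": date(2024, 1, 2)},
]


def in_month(day, year, month):
    """True if the date exists and falls in the given calendar month."""
    return day is not None and day.year == year and day.month == month


def new_customers_marketing(rows, year, month):
    # Assumed Marketing definition: anyone who signed up in the month.
    return {r["id"] for r in rows if in_month(r["signed_up"], year, month)}


def new_customers_sales(rows, year, month):
    # Assumed Sales definition: anyone whose first purchase was in the month.
    return {r["id"] for r in rows if in_month(r["first_purchase"], year, month)}


marketing = new_customers_marketing(customers, 2024, 1)
sales = new_customers_sales(customers, 2024, 1)

# Customers counted by exactly one of the two departments.
disputed = marketing ^ sales
```

With this sample data, Marketing reports three new customers for January 2024 and Sales reports two, and only one customer appears in both counts. Neither number is wrong; they answer different questions under the same label.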
When data is dispersed across different datasets and systems, your frontline workers suffer and make mistakes.
For example, if your Customer Success Representative (CSR) has to check four places before getting a definite answer to why a client's access to the platform was cut, there is a higher chance that the CSR looks at the wrong payment data and angers the client.
When data is fragmented, different access rules are applied for each data set.
Maybe you don't have access credentials for the database you need. Or you don't have download privileges for the Excel file you need. Often you realize this too late, because the information is needed right then and there.
Solving problems in real time becomes wishful thinking.
Similar to the issue above, when data is fragmented it can more easily be deleted by accident. Someone can delete an Excel file in good faith, not knowing this will cause a problem down the line.
Fragmented data is often duplicated. Different data silos each keep their own copy of the same information.
Replicated information eats up storage space unnecessarily. For the majority of data initiatives, this is not a problem. But the moment your organization works with unstructured data (IoT data, images, videos, sound clips), the storage bill can become prohibitively expensive.
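One simple way to spot exact duplicates across silos is to fingerprint each record and see which fingerprints appear in more than one place. The sketch below uses only the Python standard library; the silo names and records are hypothetical, and it only catches records that are identical after key ordering is normalized. Near-duplicates (typos, different formats) would need fuzzier matching.

```python
import hashlib
import json
from collections import defaultdict

# Hypothetical silos, each holding its own copy of some customer records.
silos = {
    "crm": [
        {"email": "ana@example.com", "plan": "pro"},
        {"email": "bob@example.com", "plan": "free"},
    ],
    "billing_db": [
        {"plan": "pro", "email": "ana@example.com"},  # same record, other key order
    ],
    "marketing_sheet": [
        {"email": "bob@example.com", "plan": "free"},
        {"email": "cy@example.com", "plan": "free"},
    ],
}


def fingerprint(record):
    """Hash a canonical JSON form so that key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(silos):
    """Map each fingerprint seen in more than one silo to the silos holding it."""
    locations = defaultdict(set)
    for silo_name, records in silos.items():
        for record in records:
            locations[fingerprint(record)].add(silo_name)
    return {fp: sorted(names) for fp, names in locations.items() if len(names) > 1}


duplicates = find_duplicates(silos)
```

The same idea carries over to files: hashing file contents in chunks instead of JSON records flags byte-identical images or videos stored in several locations.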
Whether it is a data collection pipeline extracting information from a data source you already have in-house or a duplicated database system storing customer details already present in your CRM, data fragmentation duplicates work.
Not knowing what data you have leads to your workers unnecessarily and unknowingly replicating the same efforts as their coworkers.
Developing data products, experimenting, and prototyping are all about speed.
Figuring out where the data you need lives slows you down. Joining data across all your different locations for a single experiment prolongs the time to results.
When data is dispersed in different locations, under different data management rules, it is harder to manage and enforce consistent security best practices.
Fragmented data enlarges your attack surface and raises the risk of security breaches.
Data fragmentation follows a vicious circle: the more fragmented your current data, the more likely your future data will be even more fragmented.
Partially, this stems from the broken windows theory. But more importantly, data fragmentation feeds itself, because it is always easier to add a new data source, a new data silo, or a new data store than to figure out how to find, join, and sync the existing data assets.
Given the many risks of data fragmentation, how can we prevent them or even fix them?
When you're faced with a fragmented data ecosystem, there is only one solution: bring clarity with a coherent and holistic data management strategy.
Sit with all the department/team leaders (or other representatives of data silos) and record all your data assets. Plus points for understanding their data lineage and how data travels through your systems.
Wherever possible, try to join existing assets together. You might need to get technical here. For example, you can join multiple customer identifiers (emails, addresses, telephone numbers, etc.) into a single customer view that bridges different data sets.
Don’t be afraid to discard data assets, especially when they are outdated or duplicated.
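As a rough sketch of what joining customer identifiers can look like, the code below groups records from different sources into one customer whenever they share an email or a phone number, using a small union-find structure. The records, field names, and the `resolve_customers` helper are hypothetical, and exact matching on normalized values is a deliberately naive assumption: shared phone numbers (a family or an office line, for instance) would be merged incorrectly, so real identity resolution usually needs stricter rules.

```python
from collections import defaultdict

# Hypothetical records describing customers in three different systems.
records = [
    {"source": "crm", "email": "ana@example.com", "phone": "555-0100"},
    {"source": "billing", "email": "Ana@Example.com", "phone": None},
    {"source": "support", "email": "a.novak@example.com", "phone": "555-0100"},
    {"source": "crm", "email": "bob@example.com", "phone": "555-0199"},
]


def resolve_customers(rows, keys=("email", "phone")):
    """Group row indexes that share any identifier, directly or transitively."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (key, normalized value) -> index of first row using it
    for idx, row in enumerate(rows):
        for key in keys:
            value = row.get(key)
            if not value:
                continue  # a missing identifier cannot link records
            ident = (key, value.strip().lower())
            if ident in first_seen:
                union(idx, first_seen[ident])
            else:
                first_seen[ident] = idx

    groups = defaultdict(list)
    for idx in range(len(rows)):
        groups[find(idx)].append(idx)
    return list(groups.values())


customers = resolve_customers(records)
```

Here the first three records collapse into one customer: the CRM and billing rows share an email (after case normalization), and the support row is linked through the phone number even though its email differs. The fourth record stays on its own.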
Set up a data governance framework that specifies how:
We have written extensively on the topic of data governance automation, so feel free to dive deeper.
A good data governance framework will help you better understand your data assets, spot when fragmentation is happening, and work against fragmentation.
Your data operations and systems will tend towards chaos if left unattended. The only way to build resilient systems is to constantly:
Keboola is an end-to-end data operations platform with built-in data governance tools that automate much of the heavy lifting when you implement data governance policies that resist data fragmentation:
But Keboola is not just a tool for automating data governance. It is designed to automate and speed up all data operations, from ETL pipeline construction and maintenance to deploying machine learning models in production.
We offer a no-questions-asked always-free tier. Try Keboola out and check for yourself what Keboola can do for you.