How to get started with data lineage

And cut through the complexity of data operations...

Modern enterprises leverage over 400 data sources to stay ahead of the competition.

The sheer volume and complexity of data operations raise several challenges for intrepid organizations:

Changes to source data are not identified and propagated throughout the system in a timely manner. This delays data alignment and erodes trust in data quality.
Maintenance of interdependent data pipelines that collect and refine data for analysis generates engineering overhead.
Business intelligence reports are inconsistent. Data analysts spend long hours reconciling the divergent metrics and key performance indicators across BI reports.

How do you cut through the complexity of data operations? Enterprises turn to data lineage for the answer.

1. What is data lineage?

Data lineage is the process of recording and visualizing data assets as they flow through your system.

It involves mapping the entire data lifecycle: from the raw data, through each component in the ETL (extract, transform, load) pipeline, to its destination (e.g., data storage or a BI report).

The data journey should specify the following components:

Business applications and specific business processes that gather and manipulate data.
Datastores, where data is permanently or temporarily loaded: files, databases, data warehouses, and data lakes.
Transformations along the pathway, including the business rules applied when changing and aggregating data.
The users, consumers, and owners of data.
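
As a rough illustration, the components listed above can be captured in a small data model. This is a minimal sketch, not a standard schema: the class names (`Kind`, `Component`, `DataJourney`) and the example assets (`web_shop`, `orders_raw`, and so on) are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    APPLICATION = "application"        # business app or process that gathers data
    DATASTORE = "datastore"            # file, database, data warehouse, or data lake
    TRANSFORMATION = "transformation"  # script or business rule that changes data


@dataclass(frozen=True)
class Component:
    name: str
    kind: Kind
    owner: str = "unassigned"  # person or team accountable for the component
    consumers: tuple = ()      # people or teams who use its output
    rule: str = ""             # business rule applied (transformations only)


@dataclass
class DataJourney:
    components: dict = field(default_factory=dict)
    flows: list = field(default_factory=list)  # (source name, target name) pairs

    def add(self, component):
        self.components[component.name] = component
        return self

    def connect(self, source, target):
        # Refuse flows that reference components nobody has documented yet.
        for name in (source, target):
            if name not in self.components:
                raise KeyError(f"undocumented component: {name}")
        self.flows.append((source, target))
        return self


# Hypothetical journey: application -> raw store -> transformation -> clean store.
journey = DataJourney()
journey.add(Component("web_shop", Kind.APPLICATION, owner="sales"))
journey.add(Component("orders_raw", Kind.DATASTORE, owner="data-eng"))
journey.add(Component("dedupe_orders", Kind.TRANSFORMATION, owner="data-eng",
                      rule="keep the latest row per order id"))
journey.add(Component("orders_clean", Kind.DATASTORE, owner="data-eng",
                      consumers=("finance", "bi-team")))
journey.connect("web_shop", "orders_raw")
journey.connect("orders_raw", "dedupe_orders")
journey.connect("dedupe_orders", "orders_clean")
```

Rejecting flows between undocumented components is a design choice for this sketch: it forces every hop in the journey to carry an owner and a kind before it appears on the map.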

Data lineage can seem very similar to other metadata management concepts, such as data provenance and data governance.

Understanding how these concepts overlap and differ helps clarify the role of data lineage in modern enterprises.

What is the difference between data lineage and data provenance?

Data provenance is concerned with establishing data’s origins to help data scientists and other end consumers better understand the data they work with.

Experts in academia and industry practitioners disagree on whether data provenance and data lineage refer to the same set of practices. On the alternative view, data provenance is a more abstract, high-level description of the data journey, while data lineage records the data flow in more detail (e.g., actual table names and metadata changes).

However, they are both related to data governance.

What is the difference between data lineage and data governance?

Data governance refers to the practices, many of them legally required (and all of them highly desirable!), for managing your data across your operational systems.

Data lineage is just one of the tools in the data governance toolbox. It allows users to track and audit data along its flow. But data governance encompasses other practices as well, such as data governance policies, data stewardship, master data management (MDM), and reference data management (RDM).

2. Why is data lineage so important?

Data lineage is essential for two reasons:

It helps to fulfill regulatory compliance.
It brings multiple advantages to data operations.

Modern enterprises are bound by multiple legal requirements that constrain their operations. To name a few: the EU’s General Data Protection Regulation (GDPR) regulates how personal data is tracked and transformed to protect users’ privacy and safety and to keep data integrity high; the Basel Committee on Banking Supervision’s standard 239 (BCBS 239) sets principles for risk data aggregation and risk reporting in financial institutions; and the European Central Bank’s Targeted Review of Internal Models (TRIM) aims for consistent modeling across banks.

Implementing data lineage gives enterprises a way to inspect their data and meet regulators’ requirements for data traceability, auditing, and transparency. But it also has spillover advantages for other business areas.

3. The benefits of data lineage

Data lineage brings multiple benefits to enterprises:

Clarifies operations. Engineering data integrations can be hard. Multiple data sources can cause duplications of ETL pipelines and maintenance overhead. Consulting the data lineage map helps resolve engineering questions faster and speeds up the data integration of new sources.
Improves business decision-making. Data is the foundation of decision-making across departments: marketing, sales, finance, manufacturing, and operations all use data for their strategic positioning. Understanding where data comes from and how it is transformed can resolve ambiguous interpretations. For example, if we know customer demographic data is collected solely through advertising APIs, we will limit our conclusions about “a typical customer” to customers who were acquired via advertising. Knowing such limits helps you determine the appropriate level of trust in the data and therefore improves business decision-making.
Increases data quality. All data pipelines are bound to introduce errors into the data flow at some point. Data lineage supports error resolution via root cause analysis (RCA): consulting the map of the data journey helps you trace an error back along the pathway to its source and remove it. This speeds up error resolution and improves data quality.
Eases data migrations. System migrations are performed for multiple reasons: to increase the IT resilience of the infrastructure supporting your systems, to replicate your databases for accessibility, etc. Consulting data lineage to understand the logic of the data flow helps you reproduce the system accurately when migrating data.
Keeps track of data changes. Data constantly changes at its source: new attributes are added, existing ones are removed, and the interpretation of some data dimensions is altered. Data lineage supports impact analysis, which identifies data changes faster and helps your data scientists and business analysts understand the impact of a change on existing data.
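
Root cause analysis and impact analysis are two directions of the same graph walk: upstream from a broken asset to its possible sources, and downstream from a changed asset to everything it feeds. The sketch below assumes lineage is already available as a list of (source, target) pairs; the asset names and the `upstream` and `downstream` helpers are hypothetical and only serve the illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage map: each pair reads "data flows from source to target".
EDGES = [
    ("ads_api", "raw_customers"),
    ("crm_export", "raw_customers"),
    ("raw_customers", "clean_customers"),
    ("clean_customers", "customer_report"),
    ("clean_customers", "churn_model"),
]


def _reachable(start, neighbours):
    """Breadth-first walk returning every asset reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def upstream(edges, asset):
    """Root cause analysis: everything that could have introduced an error."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
    return _reachable(asset, parents)


def downstream(edges, asset):
    """Impact analysis: everything affected when `asset` changes."""
    children = defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
    return _reachable(asset, children)


suspects = upstream(EDGES, "customer_report")
affected = downstream(EDGES, "raw_customers")
```

Here `suspects` holds the four assets that feed `customer_report`, and `affected` holds the three assets that depend on `raw_customers`. The `seen` set also keeps the walk from looping if the map ever contains a cycle.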

4. How to get started with data lineage?

Recording your data flows can be done in four steps:

Start at the end. Document all end states of your data. That is, meticulously detail all BI reports, all tables, and all saved files where data ends up. Make sure to record the metric names as well as table names and field details.
For each end data asset, reverse-engineer its path to its source. Make sure to note down all components (transformation scripts, intermediary storage, metadata alterations, business rules applied, etc.) that affected the data, as well as everyone who touched the data, for auditing purposes.
Assign owners to subflows. Lineage tracing can be intensive. Make sure to assign data owners who will monitor and understand data changes across their dedicated pathways.
Visualize data journeys in a flow diagram. Create a higher-level abstraction that showcases the pathways data takes from the source to its end destination.
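
The four steps above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the asset names, the hand-recorded `FEEDS` mapping, and the `trace_back` and `to_dot` helpers are all hypothetical, and the diagram is emitted as Graphviz DOT text that any DOT renderer can draw.

```python
# Step 1: start at the end and list the end states of the data.
END_ASSETS = ["revenue_report"]

# Step 2: for each asset, record what directly feeds it (documented by hand).
FEEDS = {
    "revenue_report": ["orders_clean"],
    "orders_clean": ["dedupe_orders.sql"],
    "dedupe_orders.sql": ["orders_raw", "fx_rates"],
    "orders_raw": ["shop_api"],
}

# Step 3: assign owners to the subflows they are responsible for.
OWNERS = {
    "revenue_report": "finance",
    "orders_clean": "data-eng",
    "dedupe_orders.sql": "data-eng",
    "orders_raw": "data-eng",
    "fx_rates": "finance",
}


def trace_back(asset, feeds):
    """Return every path that ends at `asset`, source first.

    Assumes the recorded flows contain no cycles.
    """
    parents = feeds.get(asset, [])
    if not parents:
        return [[asset]]
    return [path + [asset]
            for parent in parents
            for path in trace_back(parent, feeds)]


def to_dot(paths, owners):
    """Step 4: render the traced paths as a Graphviz DOT flow diagram."""
    edges = []
    for path in paths:
        for edge in zip(path, path[1:]):
            if edge not in edges:
                edges.append(edge)
    nodes = []
    for src, dst in edges:
        for name in (src, dst):
            if name not in nodes:
                nodes.append(name)
    lines = ["digraph lineage {"]
    for name in nodes:
        owner = owners.get(name, "unassigned")
        lines.append(f'  "{name}" [label="{name} ({owner})"];')
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


paths = [p for asset in END_ASSETS for p in trace_back(asset, FEEDS)]
dot = to_dot(paths, OWNERS)
```

Any asset without an entry in `OWNERS` (here, `shop_api`) is labeled "unassigned" in the diagram, which makes gaps in ownership visible at a glance.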

Creating a comprehensive data lineage can be exhausting. Manually assembling data lineage becomes practically impossible as soon as your data operations grow beyond a trivial level.

For complex data operations, there are practically only two options:

Create data lineage maps for only a subset of data journeys. Limit your lineage recording to those data pathways that are crucial to your business. This can be sensitive data tracing for regulatory compliance or strategic data tracing for business impact.
Rely on software that helps you record and trace data lineage at scale. Keboola can help you implement data lineage alongside your operations with minimal-to-no additional work.

5. How can Keboola help you with data lineage?

Keboola is an end-to-end data platform. It is built to automate the entire data integration pipeline: from collecting data, through transforming it, to storing it for analysis.

Though its central features are focused on automating data work, Keboola was built with tracing data lineage in mind.

Each data operation within the platform is tagged with operational metadata describing user activity, job activity, data flow, schema evolution, data pipeline performance, compliance with security rules, etc.

This extensive metadata is generated as the platform runs its usual operations, so you do not have to do any additional work. Data lineage is built automatically and on the fly, helping you understand where the data is coming from and how it is used.
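
To illustrate the general idea of deriving lineage from operational metadata, here is a minimal sketch. The record layout below is hypothetical: it is not Keboola's metadata format or API, and the field names (`job_id`, `user`, `inputs`, `outputs`) and helper functions are assumptions made only for this example.

```python
# Hypothetical job records, as a platform might log them while running jobs.
JOB_EVENTS = [
    {"job_id": "job-1", "component": "extractor", "user": "ana",
     "inputs": ["shop_api"], "outputs": ["orders_raw"]},
    {"job_id": "job-2", "component": "transformation", "user": "ben",
     "inputs": ["orders_raw"], "outputs": ["orders_clean"]},
    {"job_id": "job-3", "component": "writer", "user": "ben",
     "inputs": ["orders_clean"], "outputs": ["revenue_report"]},
]


def edges_from_events(events):
    """Map each (source, target) flow to the jobs that produced it."""
    edges = {}
    for event in events:
        for src in event["inputs"]:
            for dst in event["outputs"]:
                edges.setdefault((src, dst), []).append(event["job_id"])
    return edges


def who_touched(events, asset):
    """List the users whose jobs read or wrote `asset`, for auditing."""
    users = {event["user"] for event in events
             if asset in event["inputs"] or asset in event["outputs"]}
    return sorted(users)


lineage = edges_from_events(JOB_EVENTS)
auditors_view = who_touched(JOB_EVENTS, "orders_raw")
```

Because every flow is keyed to the job that produced it, the same records answer both "where did this data come from?" and "who touched it?" without any manual documentation.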

Take Keboola for a spin. Keboola has an always-free, no-questions-asked plan, so you can explore all the power Keboola has to offer. Feel free to give it a go or reach out to us if you have any questions.
