How To
July 1, 2021
The 4 best data lineage tools
Learn why you need a data lineage tool and the four best tools on the market.

The modern enterprise taps into over 400 different data sources to extract the insights that sharpen its competitive edge. 

The complexity, though, does not stop at the origin, where data is generated.

To get valuable insights from raw data, enterprises must extract the data from its source, transform it (clean and aggregate it), and finally load it into a data warehouse or BI tool, where it is served to data scientists for analysis. 

These ETL (extract, transform, load) pipelines require the expert knowledge of data engineers to set up and maintain. 

Without a data lineage tool, which shines a light on how data flows through this complex ecosystem of interdependent pipelines, enterprises would be flying blind.

What is a data lineage tool?

A data lineage tool is software that allows you to view and inspect, well, the data lineage.

So, the question is: what is data lineage? 

It is a mapping of the data lifecycle as it moves from its source to its final destination. 

The data journey should specify multiple entities:

  • The data provenance, aka the sources from which data is extracted (3rd Party Apps, APIs, CSV files, databases, data warehouses, etc.).
  • The transformations each data point undergoes (e.g. applying business logic to define which customers are new, aggregating sales at the quarterly level, removing outliers, …), so you can trace transformed data back to its origin.
  • Datastores, where data is either temporarily or permanently loaded: files, databases, data warehouses, and data lakes.
  • Integrations with other tools, such as business intelligence software (Looker, Metabase, Tableau, PowerBI, …) or CRMs.
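
The four kinds of entities listed above can be pictured as nodes in a directed graph, with an edge wherever data flows from one node to the next. The snippet below is only a minimal sketch of that idea. The class names (Node, Lineage) and the node names (crm_api, tag_new_customers, and so on) are made up for illustration and do not come from any particular lineage tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """One stop on the data journey (hypothetical model)."""
    name: str
    kind: str  # "source", "transformation", "datastore", or "integration"


@dataclass
class Lineage:
    """A directed graph: each edge means 'data flows from A to B'."""
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)

    def add(self, name: str, kind: str) -> None:
        self.nodes[name] = Node(name, kind)

    def flow(self, src: str, dst: str) -> None:
        # Refuse edges between nodes that were never registered.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown node in edge {src!r} -> {dst!r}")
        self.edges.append((src, dst))

    def of_kind(self, kind: str) -> list[str]:
        """Names of all nodes of one kind, sorted alphabetically."""
        return sorted(n.name for n in self.nodes.values() if n.kind == kind)


# Illustrative pipeline: two sources feed one transformation,
# which loads a warehouse table that a BI dashboard reads.
lineage = Lineage()
lineage.add("crm_api", "source")
lineage.add("orders_csv", "source")
lineage.add("tag_new_customers", "transformation")
lineage.add("warehouse.customers", "datastore")
lineage.add("bi_dashboard", "integration")

lineage.flow("crm_api", "tag_new_customers")
lineage.flow("orders_csv", "tag_new_customers")
lineage.flow("tag_new_customers", "warehouse.customers")
lineage.flow("warehouse.customers", "bi_dashboard")
```

Real tools usually capture far more per node (schemas, owners, timestamps), but the graph shape stays the same.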

The data journey - or if you prefer, the flow of data - should be inspectable throughout the lineage. 

That is, data lineage tools provide full visibility of data changes via audit trails that showcase in fine granularity each touchpoint and how it changed the data. 

Data lineage software’s capture capabilities allow in-depth introspection into data assets and how they changed over time: who accessed the data and which person or process initiated each change. They also let an auditor replay the events that led to the state being inspected.
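
To make the idea of an audit trail with replay more concrete, here is a small sketch that assumes a trail is simply an ordered list of change events. The event shape (AuditEvent with actor, action, key, value) and all the actor names are assumptions made for this example, not the format of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    """One recorded touchpoint (hypothetical event shape)."""
    actor: str           # person or process that made the change
    action: str          # "set" or "delete"
    key: str             # field that was changed
    value: object = None


def replay(events):
    """Rebuild the state of a record by applying events in order."""
    state = {}
    for event in events:
        if event.action == "set":
            state[event.key] = event.value
        elif event.action == "delete":
            state.pop(event.key, None)
        else:
            raise ValueError(f"unknown action: {event.action!r}")
    return state


def changed_by(events, key):
    """List, in order, every actor that touched the given field."""
    return [e.actor for e in events if e.key == key]


# Illustrative trail for a single customer record.
trail = [
    AuditEvent("extractor_job", "set", "email", "ANA@EXAMPLE.COM"),
    AuditEvent("cleaning_job", "set", "email", "ana@example.com"),
    AuditEvent("analyst_jo", "set", "segment", "new_customer"),
    AuditEvent("privacy_officer", "delete", "email"),
]

final_state = replay(trail)
# Replaying only a prefix of the trail shows the record at an earlier point.
state_before_deletion = replay(trail[:3])
```

Replaying a prefix of the trail is what lets an auditor see the record as it looked before a given change.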

So, why do enterprises turn to data lineage tools? 

Why do you need a data lineage tool? 

Multiple necessities and benefits are driving the adoption of data lineage software:

  1. Data governance. Data governance refers to the rules and policies set in place for data management. It is especially crucial in modern enterprises that operate with large volumes of data, as clear governance policies and practices offer enterprises introspection, control, and operational clarity for managing data.
  2. Regulatory compliance. Multiple regulations, such as the GDPR, require enterprises to be able to trace data along its lineage. For example, data privacy laws give users the right to request full insight into how their data was collected (data provenance) and who had access to it (audit logs), as well as the right to request full deletion. The latter in particular can be extremely hard if you do not have data lineage software to guide you through all the touchpoints and locations where personal data resides in your data platforms.
  3. Data quality. Data lineage tools offer two advantages to improve data quality:
     1. Higher interpretability. Understanding how business data values were created along the data pipelines helps us (and data analysts) better interpret the data we work with. Let’s say that you define “new customers” to be those purchasers who shopped for the first time in the last 90 days (but not longer ago than that). The “new customer” tag is inserted into your data during a SQL transformation. Knowing this - via data lineage tools - helps you interpret a hypothetical drop in new customers. Instead of looking at purchases this week, you would go back into the transaction database and look at the right data - the last 90 days. The interpretability via data lineage software brings higher accuracy and consistency in decision-making.
     2. Root cause & impact analysis. Issues with data are often discovered downstream, in business reports, not upstream, where the error is produced. Data scientists, analysts, and BI reporters find missing data, corrupted data, or plainly wrong data when searching for insights. Data lineage tools allow you to trace those errors back, via the data lineage, to the source. This expedites error removal and delivers higher data quality faster.
  4. Easier migrations. A successful migration from on-premises to the cloud (or vice versa) does not interrupt business operations. To move data between different systems, data ops engineers map all the workflows of how data travels, so they can recreate the same system at the new location. Data lineage tools allow them to understand the overall mechanisms within the data platforms faster and more accurately. 
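
Root cause and impact analysis both boil down to walking the lineage graph: upstream to find where a bad value could have come from, downstream to find everything a broken source could affect. The following is a rough sketch under the assumption that lineage is available as a list of (upstream, downstream) pairs. The node names and helper functions are invented for illustration.

```python
from collections import deque

# Hypothetical lineage edges: (upstream, downstream)
EDGES = [
    ("transactions_db", "tag_new_customers"),
    ("crm_export", "tag_new_customers"),
    ("tag_new_customers", "warehouse.customers"),
    ("warehouse.customers", "weekly_report"),
    ("warehouse.customers", "churn_model"),
]


def _reachable(start, edges, reverse=False):
    """Breadth-first search; returns every node reachable from start."""
    neighbours = {}
    for up, down in edges:
        src, dst = (down, up) if reverse else (up, down)
        neighbours.setdefault(src, []).append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # only relevant if the graph contains a cycle
    return seen


def upstream(node, edges=EDGES):
    """Root cause analysis: everything the node depends on."""
    return _reachable(node, edges, reverse=True)


def downstream(node, edges=EDGES):
    """Impact analysis: everything that depends on the node."""
    return _reachable(node, edges)


# A wrong number in the weekly report: which upstream steps could be at fault?
suspects = upstream("weekly_report")
# The transactions database is about to change: what might break?
affected = downstream("transactions_db")
```

The same downstream walk is also a starting point for a deletion request or a migration plan, since it lists every place a piece of data ends up.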

Unlock the advantages of data lineage for your enterprise data by choosing the right tool for your company. 

The top 4 data lineage tools of 2021 

1. Keboola

Keboola is the end-to-end data operations platform. With Keboola you can automate your entire data integration pipeline: from collecting data, to transforming it, to storing it for analysis.

At each step of the pipeline, Keboola automatically tracks all relevant metadata and constructs logs, which gives you a granular view of data lineage. 

Each operation on the data platform is extensively tagged with operational metadata describing user activity at the event level, job activity (for automated workloads), schema evolution throughout the changes, and compliance with security rules. There are even monitors for optimal pipeline performance. 

Main advantage? 

All the metadata collection is done automatically. You do not need to waste any energy building up the data lineage, or integrating with any additional tools. Simply perform your usual data operations, and data lineage comes out of the box.

Because Keboola offers more than metadata, you can tap into its rich resources for any end-to-end needs you have, such as:

  • Advanced features for machine learning and artificial intelligence (Jupyter notebooks, experimentation sandboxes, development branches (new!), data sharing, pipeline observability, and repeatability)
  • Features that support a common understanding of data, such as the Data Catalog, which acts as a business glossary of all your important terms and aids you with data discoverability.
  • Multiple languages to build your code: from SQL to Python, from Julia to R, and everything in between.
  • And many, many more. 

Take Keboola for a spin. Keboola has an always-free, no-questions-asked plan. So, you can explore all the power Keboola has to offer. Feel free to give it a go or reach out to us if you have any questions.

2. Dremio

Dremio is a querying engine that allows you to access your data in data lakes, such as on Amazon Web Services (AWS) or Microsoft Azure. 

Dremio establishes data lineage by constructing a data graph, in which the relationships between your data sources, virtual datasets, and all your queries are maintained.

Dremio’s advantage is that it is easily used by non-technical personnel when constructing query-like tasks.

Its main disadvantage is that Dremio was built as querying software, so it does not offer the out-of-the-box flexibility and maturity of features that come with data operations platforms. For example, Dremio has hiccups when connecting the data sources within the platform to other BI tools, and it has limited data governance and auditing features.

3. Kylo

Kylo is an open-source data lake management software platform. 

Kylo is rather versatile. It allows users to ingest, cleanse, validate, profile, and wrangle (in SQL) the data in data lakes, as well as monitor it.

Kylo maintains a rich metadata store that is filled in automatically from the data tables. Users can access field- and table-level metadata by querying, as well as visually inspect data provenance.

A notable shortcoming is that Kylo requires specialized engineering skills to understand and use. It can be thought of as an accelerator for data engineers who want to build their own data lineage, not as a tool for non-technical users.

4. Octopai

Octopai is a different kind of data lineage tool from the ones presented above. Instead of centering on a data operations task (end-to-end, querying, ingestion, …), Octopai does just data lineage.

It connects to your data storage or BI environment and extracts metadata to build data lineage insights. 

An obvious advantage of Octopai is that it does just this one thing and it does it well.

The disadvantage of Octopai is that, because its features are limited to data lineage, business glossary, and data discovery, it does not offer other tools that would pair well with data governance, such as access control, security monitoring, etc. Additionally, being rather new, Octopai integrates with only a select few technologies.

Start with data lineage today

Start your journey towards better data tracking by:

  1. Diving deeper into the nuances of data lineage by reading our in-depth guide.
  2. Taking Keboola for a spin and checking for yourself what it can do for your data ingestion and lineage. 
  3. Contacting us in case you have any questions about lineage. Or if you just want to chat about something else. We are friendly that way. 
