How To
August 26, 2021
What is data ingestion?
Learn everything you need to know about data ingestion.

We rely on advanced data platforms that extract data from multiple sources, clean it, and save it so that data scientists and analysts can gain insights from it.

Data seems to flow seamlessly from one location to another, supporting our data-driven decision-making. The entire system runs smoothly because the engineering operations under the hood are correctly set and maintained.

In this article, we explore a data engineering paradigm called “data ingestion” that facilitates the movement of data throughout our enterprise.


What is data ingestion?

Data ingestion is the process that extracts data from raw data sources, optionally transforms it, and moves it to a storage medium where it can be accessed, further transformed, fed into a downstream data pipeline, or analyzed.

As you can see, data ingestion is an umbrella term encapsulating the movement of data from its data sources to its destination storage (relational database, data warehouse, or data lake).

That sounds very similar to ETL data pipelines. What is the difference?

Data ingestion vs ETL pipelines

The ETL process refers to the movement of data from its raw format to its final cleaned format ready for analytics in three basic steps (E-T-L):

  1. Extract. Data is extracted from its raw data sources.
  2. Transform. Data is transformed (cleaned, aggregated, etc.) to reshape it into a usable form.
  3. Load. The cleaned and aggregated data items are sent to the correct destination where they are loaded (read: saved) to the data storage of your choice.
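The three steps can be sketched in Python. This is a minimal illustration, not a production pipeline - it assumes a hypothetical CSV source file and a local SQLite database standing in for the destination storage:

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a CSV source (hypothetical file layout)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: cast types and drop incomplete records
    return [
        {"user_id": int(r["user_id"]), "amount": float(r["amount"])}
        for r in rows
        if r.get("user_id") and r.get("amount")
    ]

def load(rows, db_path):
    # Load: save the cleaned rows to the destination storage
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (user_id INTEGER, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (:user_id, :amount)", rows)
    con.commit()
    con.close()
```

In a real pipeline each step would be a separate, monitored job; the structure - extract, then transform, then load - stays the same.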

What is the relationship between data ingestion pipelines and ETL pipelines?

There is no clear agreement on how the two architectural patterns relate. However, three interpretations are common:

1. ETL pipelines are the same as data ingestion pipelines

Data ingestion is synonymous with ETL - they are two interchangeable terms used to describe the refining of data for its analytic use.

2. Data ingestion is more similar to ELT

This interpretation focuses on the ELT pattern - the data ingestion process takes care of data extraction and loading (saving to a data warehouse like Snowflake, Amazon Redshift, or Google BigQuery), but not transformation.

Only later on, and in separate processes, data scientists, engineers, and analysts transform data to suit their needs.

The transformation logic is taken out of data ingestion. The proponents of this process emphasize that data ingestion should be considered as a separate unit, since it carries many challenges, such as ingesting unstructured or semi-structured data (more on this later).
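Under this interpretation, the ingestion code only extracts and loads; transformation lives in a separate, later process. A minimal Python sketch, assuming hypothetical raw JSON events and a local SQLite table standing in for the warehouse:

```python
import json
import sqlite3

def ingest_raw(records, con):
    # Extract + Load only: land raw JSON payloads untouched in the store
    con.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    con.executemany(
        "INSERT INTO raw_events VALUES (?)",
        [(json.dumps(r),) for r in records],
    )
    con.commit()

def transform_later(con):
    # Transformation happens afterwards, as a separate process run by
    # analysts or engineers - here, counting signup events from raw payloads
    rows = con.execute("SELECT payload FROM raw_events").fetchall()
    events = [json.loads(p) for (p,) in rows]
    return sum(1 for e in events if e.get("type") == "signup")
```

Note that `ingest_raw` never inspects the payloads - that is the point of the ELT reading of data ingestion.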

3. The data ingestion process is categorically different for big data

The last interpretation considers data ingestion a distinct process from ETL or ELT. 

ETL and ELT processes are reserved for databases, data warehouses, and data marts - that is, all data stores that are not built exclusively for big data.

The moment we operate with big data - data that arrives in large volumes, at high velocity, and in wide variety - the usual architectural patterns fail us.

We need to construct a separate ingestion layer that takes care of moving data from its raw sources to its destination. 

Proponents of this interpretation argue that data ingestion needs devoted data ingestion tools like Apache Kafka, Apache Flume, Hive, or Spark. 

… so, who is right?

We are not going to take a stance on the specific flavor data ingestion is supposed to take. Let’s leave that to the academics.

Instead, we are going to showcase the two types of data ingestion processing and talk about the challenges of data ingestion under all three interpretations.

Two types of data ingestion processing: batch vs stream

There are two approaches to data ingestion processing, depending on the frequency of the processing: batch and stream.

1. Batch processing

Batch processing extracts data from its source in discrete chunks and pushes them to the destination in fixed time intervals.

A batch ingestion process checks for new data after a fixed time interval has passed; if new data has appeared at the source, the process consumes it and moves it down the data pipeline.

You can think of batch processing as a cron job - every 15 minutes, the job checks for new data and runs the data ingestion process script.
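The cron-style loop above can be sketched in Python. This is a simplified illustration; `source`, `sink`, and the timestamp watermark are hypothetical stand-ins for a real connector:

```python
import time

def run_batch_ingestion(source, sink, interval_seconds=900, cycles=None):
    # Every `interval_seconds` (15 minutes here), check the source for
    # records newer than the last watermark and push them downstream.
    watermark = 0  # timestamp of the newest record already ingested
    ran = 0
    while cycles is None or ran < cycles:
        new_records = [r for r in source() if r["ts"] > watermark]
        if new_records:
            sink(new_records)
            watermark = max(r["ts"] for r in new_records)
        ran += 1
        if cycles is None or ran < cycles:
            time.sleep(interval_seconds)
    return watermark
```

The watermark is what makes the job incremental: only records created since the last run are moved, not the whole source.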

2. Stream processing

Stream processing collects data from its source as it is generated or sourced. It represents real-time data ingestion - data is made available to consumers downstream as it is generated.

The low latency of this data ingestion paradigm allows businesses to work with near real-time data and supports real-time analytics and monitoring operations.

Unfortunately, stream processing is a less efficient data ingestion process. It consumes more resources to handle the incoming data, since it must monitor the data sources continuously for new data items.
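A minimal Python sketch of the streaming idea, with a generator standing in for a real streaming source such as a Kafka topic (the event shape here is hypothetical):

```python
def event_stream():
    # Stands in for a real source like a Kafka topic: yields events as generated
    for i in range(5):
        yield {"event_id": i, "value": i * 10}

def stream_ingest(stream, sink):
    # Each event is pushed downstream the moment it arrives - no batching window
    count = 0
    for event in stream:
        sink(event)
        count += 1
    return count
```

The contrast with the batch sketch is the trigger: here the arrival of an event drives ingestion, not a timer.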


The challenges of data ingestion

Data ingestion poses multiple interesting engineering challenges:

  1. Scaling. The data ingestion system needs to be able to scale as large volumes of data suddenly come rushing through the gates.
  2. Changes to source data. The schema, format, and type of data at the source can change. From API versioning to schema migrations, data ingestion requires a lot of engineering maintenance and upkeep to keep functioning correctly as data evolves at the source.
  3. New source data. As enterprises grow, their data grows. A data ingestion system must be built with horizontal growth in mind - it needs to be able to integrate new data sources into its system without affecting the existing operations. 
  4. Fault tolerance. A data ingestion process needs to handle faults in the system - from network downtimes that prevented scheduled batch ingestion to data losses during transformation, the system needs to be fault-tolerant. 
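To illustrate the fault-tolerance point, here is a retry loop with exponential backoff around a failing fetch. This is a sketch of one narrow technique, not a full fault-tolerance strategy; `fetch` and `sink` are hypothetical callables:

```python
import time

def ingest_with_retries(fetch, sink, max_attempts=3, backoff_seconds=1.0):
    # Retry transient failures (e.g. network downtime during a scheduled
    # batch run) with exponential backoff before giving up.
    for attempt in range(1, max_attempts + 1):
        try:
            sink(fetch())
            return attempt  # number of attempts it took to succeed
        except OSError:
            if attempt == max_attempts:
                raise  # exhausted retries: surface the failure to the operator
            time.sleep(backoff_seconds * 2 ** (attempt - 1))
```

Real systems layer more on top (dead-letter queues, idempotent writes, alerting), but retries with backoff are the usual first line of defense.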

Streamline your Data Ingestion with Keboola

Keboola is an end-to-end data operations platform that automates ETL pipelines and data ingestion. 

With over 250 integrations between sources and destinations, Keboola can help you automate your data ingestion processes with a couple of clicks. 

Try it out. Keboola offers a no-questions-asked, always-free tier, so you can play around and build your pipelines leading to the data lake or data warehouse with a couple of clicks. 
