We rely on advanced data platforms that extract data from multiple sources, clean it, and save it so that data scientists and analysts can gain insights from it.
Data seems to flow seamlessly from one location to another, supporting our data-driven decision-making. The entire system runs smoothly because the engineering operations under the hood are correctly set and maintained.
In this article, we explore a data engineering paradigm called “data ingestion” that facilitates the movement of data throughout our enterprise.
Data ingestion is the process that extracts data from raw data sources, optionally transforms the data, and moves the data to a storage medium where it can either be accessed, further transformed, ingested into a downstream data pipeline, or analyzed.
As you can see, data ingestion is an umbrella term encapsulating the movement of data from its data sources to its destination storage (relational database, data warehouse, or data lake).
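The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text stands in for a raw data source, an in-memory SQLite table stands in for the destination storage, and the names `extract`, `transform`, `load`, `ingest`, and the `payments` table are hypothetical, not part of any specific tool.

```python
import csv
import io
import sqlite3

# Assumption: a small CSV string stands in for a raw data source.
RAW_CSV = "id,name,amount\n1,alice,10.5\n2,bob,7\n"

def extract(raw_text):
    """Read rows from the raw source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Optional step: cast types and normalize names."""
    return [
        {"id": int(r["id"]), "name": r["name"].title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, conn):
    """Write rows to the destination (here, a hypothetical 'payments' table)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (:id, :name, :amount)", rows)
    conn.commit()

def ingest(raw_text, conn, transform_fn=None):
    """Extract, optionally transform, then load; returns the number of rows moved."""
    rows = extract(raw_text)
    if transform_fn is not None:
        rows = transform_fn(rows)
    load(rows, conn)
    return len(rows)

conn = sqlite3.connect(":memory:")  # stand-in for a database, warehouse, or lake
count = ingest(RAW_CSV, conn, transform_fn=transform)
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
```

Once the data has landed, it can be queried directly (as `total` does here) or handed to a downstream pipeline.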
That sounds very similar to ETL data pipelines. What is the difference?
The ETL process refers to the movement of data from its raw format to its final cleaned format, ready for analytics, in three basic steps: extract, transform, and load (E-T-L).
What is the relationship between data ingestion pipelines and ETL pipelines?
There is no clear agreement regarding the two data architecture patterns. However, there are three interpretations:
The first interpretation holds that data ingestion is synonymous with ETL - they are two interchangeable terms used to describe the refining of data for analytical use.
The second interpretation focuses on the ELT pattern - the data ingestion process takes care of data extraction and loading (saving to a data warehouse like Snowflake, Amazon Redshift, or Google BigQuery), but not transformation.
Only later on, and in separate processes, data scientists, engineers, and analysts transform data to suit their needs.
The transformation logic is taken out of data ingestion. The proponents of this process emphasize that data ingestion should be considered as a separate unit, since it carries many challenges, such as ingesting unstructured or semi-structured data (more on this later).
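To make the split between ingestion and later transformation concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the warehouse, and the table names `raw_orders` and `daily_revenue`, as well as both function names, are hypothetical. Ingestion lands the records untouched, and a separate step derives an analytics table afterwards.

```python
import sqlite3

# Assumption: SQLite stands in for a warehouse such as Snowflake or BigQuery.
warehouse = sqlite3.connect(":memory:")

def ingest_raw(conn, records):
    """E + L only: land records exactly as received, every column as text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, amount TEXT, created_at TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", records)
    conn.commit()

def build_daily_revenue(conn):
    """T, run later and separately: derive an analytics table from the raw one."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT substr(created_at, 1, 10) AS day,
               SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        GROUP BY day
        """
    )
    conn.commit()

# Ingestion step: no cleaning, no casting.
ingest_raw(warehouse, [
    ("a1", "20.00", "2024-01-01T09:15:00"),
    ("a2", "5.00", "2024-01-01T17:40:00"),
    ("a3", "12.50", "2024-01-02T08:05:00"),
])

# Transformation step: owned by analysts or engineers, on their own schedule.
build_daily_revenue(warehouse)
rows = warehouse.execute(
    "SELECT day, revenue FROM daily_revenue ORDER BY day"
).fetchall()
```

Because the raw table is kept as received, different teams can build their own transformations on top of it without re-ingesting the data.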
The last interpretation considers data ingestion a distinct process from ETL or ELT.
ETL and ELT processes are reserved for databases, data warehouses, and data marts, that is, all data stores that are not built exclusively for big data.
Once we operate with big data, which arrives in large volumes, at high velocity, and in a wide variety of formats, the usual architectural patterns fail us.
We need to construct a separate ingestion layer that takes care of moving data from its raw sources to its destination.
Proponents of this interpretation argue that data ingestion needs dedicated data ingestion tools like Apache Kafka, Apache Flume, Hive, or Spark.
… so, who is right?
We are not going to take a stance on the specific flavor data ingestion is supposed to take. Let’s leave that to the academics.
Instead, we are going to showcase the two types of data ingestion processing and talk about the challenges of data ingestion under all three interpretations.
There are two approaches to data ingestion processing, depending on the frequency of the processing: batch and stream.
Batch processing extracts data from its source in discrete chunks and pushes them to the destination at fixed time intervals.
A batch ingestion process checks for new data each time the interval has passed. If new data has arrived at the source, the process consumes it and moves it down the data pipeline.
You can think of batch processing as a cron job - every 15 minutes, the job checks for new data and runs the data ingestion process script.
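The cron-job analogy can be sketched as a small job that remembers how far it got on the previous run. This is an illustration only: the `BatchIngestor` class, its `run_once` method, and the integer `id` used as a watermark are all assumptions made for the example, and the source and destination are plain Python lists.

```python
from dataclasses import dataclass, field

@dataclass
class BatchIngestor:
    """Hypothetical batch job: each run moves only records newer than the last run."""
    destination: list = field(default_factory=list)
    last_seen_id: int = 0  # watermark; kept in memory here, persisted in practice

    def run_once(self, source):
        # Check the source for records that arrived since the previous run.
        new_records = [r for r in source if r["id"] > self.last_seen_id]
        if not new_records:
            return 0  # nothing new at the origin, so this interval is a no-op
        self.destination.extend(new_records)
        self.last_seen_id = max(r["id"] for r in new_records)
        return len(new_records)

source = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
job = BatchIngestor()

first = job.run_once(source)   # picks up both existing records
second = job.run_once(source)  # no new data, nothing is moved
source.append({"id": 3, "value": "c"})
third = job.run_once(source)   # picks up only the newly arrived record
```

In a real deployment a scheduler would call `run_once` at each interval; with cron, the schedule expression `*/15 * * * *` triggers a job every 15 minutes.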
Stream processing collects data from its source as it is generated or sourced. It represents real-time data ingestion - data is made available to consumers downstream as it is generated.
The low latency of this data ingestion paradigm allows businesses to work with near real-time data and supports real-time analytics and monitoring operations.
Unfortunately, stream processing is a less efficient data ingestion process. It consumes more resources to handle the incoming data, since it needs to monitor the data sources continuously for new data items.
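For contrast with the batch approach, the sketch below forwards each event downstream the moment it arrives. It is a toy model, not how a streaming platform is implemented: a standard-library `queue.Queue` stands in for the stream, a thread plays the role of the source, and the names `producer`, `consume`, `SENTINEL`, and `sink` are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # assumption: marks the end of this demo stream

def producer(events, q):
    """Stand-in for a source that emits events one at a time."""
    for event in events:
        q.put(event)
    q.put(SENTINEL)

def consume(q, sink):
    """Forward each event downstream as soon as it arrives, one by one."""
    while True:
        event = q.get()  # blocks until the next event is available
        if event is SENTINEL:
            break
        sink.append(event)

events = [
    {"sensor": "t1", "reading": 21.5},
    {"sensor": "t1", "reading": 21.7},
]
q = queue.Queue()
sink = []  # stand-in for the downstream consumer

source_thread = threading.Thread(target=producer, args=(events, q))
source_thread.start()
consume(q, sink)
source_thread.join()
```

Note that the consumer sits in a loop waiting on the source the whole time, which is where the extra resource cost of streaming comes from.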
Data ingestion poses multiple interesting engineering challenges:
Keboola is an end-to-end data operations platform that automates ETL pipelines and data ingestion.
With over 250 integrations between sources and destinations, Keboola can help you automate your data ingestion processes with a couple of clicks.
Try it out. Keboola offers a no-questions-asked, always-free tier, so you can play around and build your pipelines leading to the data lake or data warehouse.