Use change data capture for faster yet consistent replications
In the age when data is the new oil, more than 80% of IT decision-makers delay their business decisions due to slow data processing.
Architecting your ETL pipeline with database replication can speed up your data processes.
Database replication creates an analytic database as a separate copy of your production database. This unburdens the transactional database from analytical queries while securing fresh data in the analytical database for faster time-to-insights.
Data replication can be achieved with many replication techniques. But change data capture (CDC) is one of the most popular architectural patterns.
In contrast to other techniques, such as full-table loads that copy entire tables at every replication cycle, change data capture is more ingenious. CDC copies only the rows that have changed since the last replication.
What are the advantages of CDC?
Unlike bulk load updates to sync data, CDC offers multiple benefits when replicating a data source:
- Fast. CDC replicates a lower data volume from the source database - only the rows that have changed - so it completes the replication process faster.
- Less burden on the network. CDC sends less data - only the rows that have changed - via the network from the source to the replica, so your broadband is not put under strain.
- Less burden on the production database. When CDC is implemented with the correct techniques, your production database is unaffected by the replication operations. This frees resources for transactions, instead of locking committed transactions until the replication is completed.
- Synchronous replication. CDC can be used for real-time data integration, allowing you to develop streaming analytics or replication with nearly zero delays.
- Flexible. CDC can adapt to changes in the database and table schemas, flexibly adjusting to your data changes and propagating your updates from the producers to the consumers of data.
How does change data capture work?
Change data capture identifies the rows in source tables that have changed since the last replication.
It replicates transactional changes to data, such as new data being added to a table, existing data being altered, or past data being removed. These are the create-update-delete (CUD) operations also known in SQL by their commands: INSERT, UPDATE, DELETE.
Moreover, CDC also identifies changes in metadata, such as schema migrations (column name changes, the addition of attributes to tables, etc.), and flexibly adjusts the target database for the schema changes.
There are multiple ways of implementing CDC. The implementation details have spillover effects on the efficiency of the CDC replication.
What are the different methods of implementing change data capture?
In general, there are three ways to implement change data capture: by replicating logs, replicating transactions, or by writing a custom script to perform CDC.
1. Log-based CDC
Enterprise databases commit all CUD transactions to the database transaction log for recovery in case of a database failover. The transaction log is kept separate from the operations of the relational database. Log-based CDC parses the logs for instructions on how to replicate transactions from the source database to the analytical one.
Log-based CDC has multiple advantages:
- Data integrity. Because of atomicity, transactions are either executed in full or not at all. Replicating transaction logs necessarily means no changes will be missed since all the changes are recorded there. This guarantees the replica database will be a pristine copy of the original database with extremely high reliability.
- Minimal latency. Transactions are recorded in the transaction logs either immediately at the time of execution or even before a transaction is committed. Continuously monitoring the logs and replicating them allows CDC to replicate in real-time or near-real-time.
- Low-to-no system impact. Logs are separate from the operational database system. Parsing logs does not affect transactions in the database, reducing the SQL query load strain on the production database.
Unfortunately, there are also drawbacks to log-based CDC:
- Different log formats. For example, PostgreSQL uses the write-ahead-log (WAL) format, while MySQL implements the Binary Log (binlog). The differences in formats are burdensome to bridge and it is hard to implement CDC as a universal solution for multiple databases.
- Limited log access. Log formats are usually proprietary, poorly documented, and some databases do not offer access to their logs. This impedes or fully prevents you from implementing a log-based CDC.
Log-based CDC is the most performant implementation of change data capture. Its main challenge is the feasibility of implementation. But once the hurdle is solved, log-based change data capture is the most performant CDC for your database replication.
2. Trigger-based CDC
Trigger-based CDC uses SQL database triggers on source transactions to perform the same transactions on the destination database. For example, you can set up triggers on a table to send data to a destination table “AFTER INSERT” or “AFTER UPDATE” commands are run on the source table.
There are multiple advantages to trigger-based CDC:
- Customizable. You can specify only certain operations in your replication, for example only INSERT triggers, or implement triggers only on certain tables. This allows you to construct a custom replica of the source database under special conditions, which is desirable for constructing data sets for specific cases, for example, a data set of financial data of first received data (INSERT) instead of final form data (UPDATE).
- No latency. Trigger-based CDC creates replicas almost at the same time as original data since the same transaction is committed to both the original database and replica database.
However, trigger-based CDC also has drawbacks:
- Executional overhead. The same transaction needs to be committed twice, increasing the burden on the production database.
- Implementational overhead. Triggers need to be set up for each table separately. The implementation and management overhead can be costly if your source database is big.
- Reliability issues. Some triggers never fire. For example, you cannot set up a trigger on TRUNCATE operations on a SQL server.
Trigger-based CDC can cause severe overheads. It is best deployed when only parts of the source database need to be replicated and there is an advantage exposing replicate data fast.
3. Custom CDC script
The third option for implementing CDC is writing your own CDC script. Custom CDC scripts use store procedures to trigger SQL queries that identify some predefined changes. Changes are usually tracked as separate columns. For example, having a timestamp column LAST_UPDATED that indicates the timestamp of the last change for a given row or a VERSION_NUMBER column that shows the consecutive version of the row update. The script then checks if the LAST_UPDATED/VERSION_NUMBER values differ between the replica and source table, and copies data from source to destination.
The advantages of custom CDC scripts are:
- High levels of specificity in replication. For example, a developer can specify multiple change-tracking columns, such as one for the timestamp of the last update, once for the version number, and one for flagging if the change contains a specific feature. This customizability can be used for highly refined conditional replication, such as specifying to run replication only for “Rows that have changed since 2021-01-01 12:00 and were beforehand changed at least three times (version number), as well as they describe the new feature launch (boolean feature flag)”.
However, custom scripts also come with disadvantages:
- Reliability issues. Schema changes are not automatically reflected in the replication, which can cause the replication to fail (silently) if the script is not manually updated.
- Performance issues. Stored procedures are often implemented as transactions, so the database incurs additional costs by running them, which slows down database performance.
Custom CDC scripts require a lot of maintenance and monitoring for little added benefits, especially in comparison to similar trigger-based CDC implementations. If specific queries are needed on a table, it is better to create a new view in the replica database after all the data has been migrated. Custom CDC is, therefore, most useful with low data volumes and for replication needs that are highly specific and limited in scope.
Simplify CDC with Keboola
Keboola is the end-to-end data platform that streamlines and automates the heavy lifting behind data operations.
CDC is easy to perform with Keboola. Simply connect your source database with an extractor that collects data from your source database and saves it to a staging database. From there, deploy writers to load the data into the replica database or data warehouse of your choice.
Why is Keboola better than other CDC tools?
- The breadth of data integrations. Keboola connects to over 250 sources and destinations. No need to write your change log capture system, Keboola has extractors that do that for you.
- Simplicity. The intuitive UI is built for ease and speed. All processes can be deployed in a couple of clicks without needing a data engineering certificate.
- No maintenance. All the components used in CDC are maintained by Keboola and its platform partners. Save time on debugging custom scripts and faulty open-sourced projects by relying on a professional team to take care of your database replication.
Try it for free. Keboola has an always-free, no-questions-asked plan. So, you can explore all the CDC power Keboola has to offer. Feel free to give it a go.