Understanding Change Data Capture (CDC)
Change Data Capture (CDC) is a technique for identifying and capturing changes made in a source system—like a transactional database—and ensuring that those changes are reflected in downstream systems. CDC enables near real-time data replication, analytics, and automation by tracking inserts, updates, deletes, and even schema changes in your databases.
Traditionally, teams faced several obstacles with CDC:
- Complex setup and configuration processes
- High costs from leading CDC vendors
- Ongoing maintenance and troubleshooting
- Difficulty restoring databases after replication failures
- Challenges capturing deletes and updates, especially on legacy systems without timestamp columns
- Handling schema shifts (adding, renaming, or deleting columns)
Keboola’s CDC solution addresses all these issues with a managed, cost-effective, and user-friendly platform.
Why Choose Keboola for CDC?
- Zero Maintenance: Fully managed service with minimal setup.
- Affordable Pricing: Significantly lower total cost of ownership compared to traditional CDC solutions.
- Resilient Replication: Handles failures gracefully, making database restores straightforward.
- Comprehensive Change Tracking: Captures inserts, updates, deletes, and schema changes—even if your source doesn't have timestamp columns.
- Legacy Compatibility: Works with systems lacking "updated_at" or similar incremental columns.
- Schema Shift Support: Automatically handles schema changes without breaking connectors or data pipelines.
How Traditional Extractors Fall Short
In traditional setups, extracting data from a database involves running queries like:
SELECT * FROM orders WHERE updated_at > last_run_time
This approach:
- Only provides the current state, missing past changes and deletes.
- Requires updated_at or similar columns for incremental fetches.
- Consumes additional database resources due to repeated queries.
- Fails to capture schema changes or deletions.
Without CDC, you end up downloading entire tables repeatedly, leading to inefficiency and increased infrastructure costs.
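To make the limitation concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the integer timestamps are illustrative assumptions, not part of any Keboola component. It simulates two updates and one delete happening between two extractor runs, then runs the timestamp-based query.

```python
import sqlite3

# In-memory database standing in for the source system (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "new", 100), (2, "new", 100)],
)


def poll_changes(conn, last_run_time):
    """Timestamp-based incremental extract: rows modified since the last run."""
    cur = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (last_run_time,),
    )
    return cur.fetchall()


# Between two extractor runs: order 1 changes twice, order 2 is deleted.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 110 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 120 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

changes = poll_changes(conn, last_run_time=100)
print(changes)  # [(1, 'shipped', 120)]
```

The poll returns only the final state of order 1. The intermediate "paid" state and the deletion of order 2 never show up, because neither exists in the table any more when the query runs.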
Keboola’s CDC: A Modern Approach
- Transactional Log Reading: Keboola CDC leverages database logs (e.g., MySQL binlogs, PostgreSQL WAL) to capture every change.
- Minimal Performance Impact: CDC reads changes from the logs instead of running queries against your tables, which keeps the load on the source database low.
- Comprehensive Data: Captures all operations—insert, update, delete, schema alterations—without requiring special columns.
- Configurable Frequency: Run CDC as often as every 5 minutes to achieve near real-time replication.
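Conceptually, log-based replication means replaying an ordered stream of change events against a replica. The sketch below illustrates that idea only. The event shape (op, key, row) is a simplified, hypothetical format invented for this example; it is not the MySQL binlog format, the PostgreSQL WAL format, or Keboola's internal representation.

```python
from typing import Any


def apply_event(replica: dict[int, dict[str, Any]], event: dict[str, Any]) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    key = event["key"]
    if op in ("insert", "update"):
        # Merge the changed columns into whatever the replica already holds.
        replica[key] = {**replica.get(key, {}), **event["row"]}
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


# Hypothetical change log, in commit order.
log = [
    {"op": "insert", "key": 1, "row": {"status": "new"}},
    {"op": "insert", "key": 2, "row": {"status": "new"}},
    {"op": "update", "key": 1, "row": {"status": "paid"}},
    {"op": "update", "key": 1, "row": {"status": "shipped"}},
    {"op": "delete", "key": 2, "row": None},
]

replica: dict[int, dict[str, Any]] = {}
for event in log:
    apply_event(replica, event)

print(replica)  # {1: {'status': 'shipped'}}
```

Because every operation arrives as its own event, the delete and each intermediate update are visible to the consumer, and no timestamp column is needed on the source table.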
What Can CDC Capture?
CDC processes transactional logs, enabling you to:
- Audit every change (who, what, when)
- Mirror entire databases or select tables/columns
- Maintain accurate downstream analytics and reporting
- Track deletes and updates in real time
- Manage schema shifts seamlessly
Supported Databases and Integrations
Keboola CDC currently supports:
Coming soon: Oracle, MongoDB, SQL Server, Cassandra—ask us about early access. Thanks to our foundation on Debezium and DuckDB, new integrations are rapidly deployed. If your database isn’t listed, let us know!
Performance and Scalability
Keboola CDC delivers exceptional throughput:
- Mirrors up to 1 million rows per minute
- Handles hundreds of millions of records in minutes
- Benchmarked: 225M records, 20M changes, synchronized in 22 minutes
Whether you’re running a high-traffic e-commerce store or an enterprise warehouse, Keboola CDC scales to fit your needs—no slowdowns, no bottlenecks.
Getting Started: Easy Setup in Minutes
- Enable Transactional Logs: For MySQL, enable binlogs; for PostgreSQL, enable WAL. Detailed step-by-step guides are available in our help center.
- Connect Keboola: Add the CDC component in Keboola, select your database, and provide credentials.
- Validate Connection: Keboola checks permissions and log access to ensure a seamless connection.
- Select Data: Choose databases, schemas, tables, and columns to replicate. Use include/exclude filters for granular control.
- Apply Data Masking: Mask or hash sensitive columns (e.g., personal data, salaries) for compliance and privacy.
- Choose Synchronization Mode: Options for initial full load, then incremental log-based replication.
- Define Output: Decide on incremental or full loads, with or without deduplication, for your analytics workflows.
- Run and Monitor: Launch replication jobs, monitor progress, and review logs and metadata in Keboola Storage.
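The include/exclude selection in the "Select Data" step can be pictured as pattern matching over qualified table names. The following is a minimal sketch that assumes glob-style patterns; the actual filter syntax of the Keboola component is not described here, and select_tables and the table names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_tables(tables, include, exclude):
    """Keep tables matching any include pattern and no exclude pattern."""
    selected = []
    for name in tables:
        included = any(fnmatchcase(name, pattern) for pattern in include)
        excluded = any(fnmatchcase(name, pattern) for pattern in exclude)
        if included and not excluded:
            selected.append(name)
    return selected


# Hypothetical schema.table names from a source database.
tables = [
    "shop.orders",
    "shop.customers",
    "shop.products",
    "shop.tmp_import",
    "audit.events",
]

selected = select_tables(tables, include=["shop.*"], exclude=["shop.tmp_*"])
print(selected)  # ['shop.orders', 'shop.customers', 'shop.products']
```

Letting exclusions win over inclusions is a common convention: you can include a whole schema and then carve out the few tables you do not want replicated.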
Advanced Features
- Column and Table Exclusions: Skip specific data you don’t need to replicate.
- Flexible Masking: Mask long text fields or hash PII with SHA-256 and custom salts.
- Incremental & Full Loads: Start with a full table load, then switch to incremental changes—reducing resource usage and time.
- Deduplication: Keboola CDC ensures only the latest value is retained for updated records, or tracks every change for slowly changing dimensions.
- Schema Shift Handling: Add, rename, or drop columns safely. Dropped columns appear as null in historical data; nothing breaks.
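As an illustration of salted hashing for sensitive columns, here is a minimal sketch built on Python's standard hashlib module. How Keboola combines the salt with the value is not specified in this article, so the simple "salt followed by value" scheme, the function names, and the sample row are all assumptions.

```python
import hashlib


def hash_value(value: str, salt: str) -> str:
    """Return the hex SHA-256 digest of the salt followed by the value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_row(row: dict, hashed_columns: set, salt: str) -> dict:
    """Return a copy of the row with the selected columns replaced by hashes."""
    masked = {}
    for column, value in row.items():
        if column in hashed_columns and value is not None:
            masked[column] = hash_value(str(value), salt)
        else:
            masked[column] = value
    return masked


# Hypothetical customer record.
customer = {"id": 7, "email": "ana@example.com", "country": "CZ"}
masked = mask_row(customer, hashed_columns={"email"}, salt="demo-salt")

print(masked["id"], masked["country"], len(masked["email"]))  # 7 CZ 64
```

The same input and salt always produce the same digest, so hashed columns can still be used as join keys downstream, while a different salt yields unrelated digests.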
Real-World Example: E-commerce Database
Suppose you manage an e-commerce platform with tables for orders, customers, and products. With Keboola CDC, you can:
- Replicate all tables or just the ones you select.
- Exclude sensitive columns like phone numbers or product descriptions.
- Hash customer emails for GDPR compliance.
- Set up incremental loads to capture every change—new orders, updates, and deletions.
- Automatically track schema shifts, like new columns for promotions or loyalty points.
Everything is monitored and logged, so you have a full audit trail for analytics and compliance.
Monitoring and Auditing Changes
Keboola CDC enriches replicated tables with metadata columns:
- KBC Operation: Indicates if a row was created, updated, or deleted.
- KBC Deleted: Flags deleted records for easy filtering.
- Schema Change Tables: Track all structural changes for downstream systems.
This enables powerful analytics:
- Filter out deleted rows for an up-to-date replica
- Analyze who changed what and when
- Trace schema evolution over time
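A minimal sketch of how these metadata columns might be used to rebuild an up-to-date replica from a table of captured changes. The physical column names (kbc_operation, kbc_deleted), the operation codes, and the seq ordering column are hypothetical stand-ins for this example; check the actual output table for the real names.

```python
def latest_state(change_rows, key="id"):
    """Keep the most recent version of each row, then drop deleted ones."""
    latest = {}
    # Replay in order so later changes overwrite earlier ones per key.
    for row in sorted(change_rows, key=lambda r: r["seq"]):
        latest[row[key]] = row
    return [row for row in latest.values() if not row["kbc_deleted"]]


# Hypothetical change rows as they might land in a replicated table.
change_rows = [
    {"id": 1, "status": "new", "seq": 1, "kbc_operation": "c", "kbc_deleted": False},
    {"id": 2, "status": "new", "seq": 2, "kbc_operation": "c", "kbc_deleted": False},
    {"id": 1, "status": "shipped", "seq": 3, "kbc_operation": "u", "kbc_deleted": False},
    {"id": 2, "status": "new", "seq": 4, "kbc_operation": "d", "kbc_deleted": True},
]

current = latest_state(change_rows)
print([(row["id"], row["status"]) for row in current])  # [(1, 'shipped')]
```

Keeping the unfiltered change rows alongside the deduplicated view gives you both a current replica and a full history for auditing.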
Best Practices for CDC Replication
- For replica databases: Use incremental load with deduplication to mirror the source, filtering out deleted rows as needed.
- For change tracking: Use full load with deduplication to quickly identify what changed between runs.
- For large data volumes: CDC shines on tables with millions of rows—ideal for data warehouses and high-velocity transactional systems.
Non-Relational and Upcoming Integrations
Have data in MongoDB, Oracle, or another system? Keboola CDC’s architecture (built on Debezium) is designed for rapid support of new sources. Let us know your needs, and we’ll prioritize additional connectors.
Pricing and Plans
- Bring Your Own Database: Pay a flat monthly fee for unlimited CDC runs.
- Managed Storage: Pay per minute when using Keboola’s managed databases (BigQuery, Snowflake, etc.).
- Contact our team for custom pricing and volume discounts.
Frequently Asked Questions
- Do I need developer resources to set up CDC? No—Keboola provides guided setup, documentation, and support.
- How fast is CDC replication? Up to 1 million rows per minute, with runs as often as every 5 minutes for near real-time sync.
- What if my database schema changes? Keboola’s CDC automatically tracks and adapts to schema shifts.
- Is my data secure? Yes—apply masking, hashing, and granular access controls for sensitive data.
- Can I replicate only some tables or columns? Absolutely—use filters and exclusions for precise control.
Get Started with Keboola CDC
Keboola’s Change Data Capture solution empowers your business to:
- Maintain up-to-date analytics, dashboards, and reporting
- Reduce time and resources spent on manual data sync
- Ensure accurate, compliant, and secure data flows
- Scale effortlessly as your business grows
Ready to unlock seamless, real-time data replication? Try Keboola CDC now or contact our team for a personalized demo.