[Guest Post] How we build custom data extractors to meet client ETL Needs

No items found.

A guest post by Johnathan Brooks from 4 Mile Analytics on how they're building custom components in Keboola.

Download for Free

Oops! Something went wrong while submitting the form. Try it again please.

Scroll to download

On one of our webinars in April 2020 we talked about the developer portal and how our developer community are pushing the Keboola Connection platform into places that often surprise our own core team. Our partners often are the creative ones, adding their knowledge and expertise to expand our platform in service of our shared customers and their varying needs. This is a guest post, written by Johnathan Brooks, Solutions Architect at 4 Mile Analytics. Here’s their story on how they have developed some good practices building data orchestrations, and how they’re leveraging Keboola for developing custom components.

The search for a SaaS ETL tool: Flexible Data Extractors without Managing Infrastructure on Keboola

When choosing a software-as-a-service (SaaS) ETL tool, data teams typically ask a few key technical questions: Is the service well secured? How is our data protected? What steps will the tool automate in the ETL process, and what steps will our team manage?

All SaaS tools should have a great answer to each of these questions, or why would a team even consider it in the first place? However, a data team must also consider these questions of a prospective ETL SaaS tool:

Does the service support robust transformations in the programming language(s) used by our team?
Does the tool have extractors for all of our data sources?

Oftentimes, at least one of the last two questions is accompanied by a difficult answer, requiring additional work or a supplementary tool.

SaaS ETL tools generally specialize in one of these areas and can answer the question very gracefully — perhaps building as many out-of-the-box extractors as possible or offering support for complex transformations and orchestration of DAGs. The former may be considered ELT-focused, while the latter is often more ETL-focused.

But what if a team needs both?

Creating Robust Custom Data Extractors with Keboola

Enter Keboola , a cloud-based data platform. Given their support for orchestrating transformations in Python, R, Julia, OpenRefine and SQL (powered by Snowflake), most would be inclined to place them in the transformation-focused ETL camp. They also have a long list of data extractors and some out-of-the-box scaffolds to help pipelines run quickly.

Ultimately, not every data source is (or ever could be) ready to go as an extractor. So, what does a data team do if their chosen tool doesn’t have that connector?

One of our clients liked Keboola’s tooling and transformation functionality, but was initially concerned by some missing extractors. Does this mean that Keboola doesn’t actually cover both robust transformations and extractors for all sources?

Not at all. Keboola’s Custom Component Framework allows you to build arbitrary data extractors, data enrichment processors or writers and add them to Keboola. You build the process that runs within Keboola’s Docker runner, so you don’t need to worry about infrastructure. Despite being a rather simple concept, this framework is immensely powerful, with near-seamless flexibility to design or extend data extractors exactly how you need them to work.

Using this framework, we’ve built data extractors that exclude fields brought in by other platforms without the option for removal, which is often unacceptable due to issues with data security. Our team built data extractors for Help Scout and Apple Search Ads within a week, and have frequently handled ad-hoc requests with the custom framework.

Such flexibility is invaluable for many modern data and marketing teams , who are constantly working with new data sources and often messy data that requires a powerful transformation layer.

Building a MySQL Log-Based Extractor with the Keboola Custom Component Framework

We’ve also designed, developed and built MySQL log-based data replication, all within Keboola’s custom framework.

For those unfamiliar with log-based replication, it is a very quick form of change data capture which doesn’t require a replication key, but rather reads MySQL’s binary row-based logs to capture changes, including hard deletes.

This allows data to be refreshed very quickly, but, to put it simply, it’s a very complex project. Reading and interpreting binary logs and data events, handling schema changes, and appropriately interpreting events for all data types was a project and a half!

Today, it’s running live for our client. We also avoided building an expensive streaming platform with Kafka for handling those binlog events, as is typically the standard for reading MySQL binary logs, something like what Yelp published:

Instead, we just wrote the code to run near-real-time data syncs using the binary logs, and Keboola manages the rest. We write the output data in our extractor, and Keboola takes care of the orchestration, runs containers and works with their storage and encryption APIs:

Although the platform’s existing data extractor did not support log-based extraction, we were able to use the custom framework to build this complex extractor on Keboola. We now have the ability to replicate gigabytes of data changes to our data warehouse every half hour (including hard deletes), something that many people would have thought impossible on what might look like an ETL-focused platform at first glance.

Key Takeaways

For many data teams, choosing an ETL tool may feel like comparing trade-offs between robust data transformations and customizations (ETL), or a more hands-off set of pre-built data extractors, with data transformations after data load (ELT).

Although each has its place, Keboola has built a platform that can effectively function as both - a platform for complex transformations and orchestrations, as well as one with many pre-built extractors, including the ability to easily build your own!

At 4 Mile, we constructed a MySQL log-based data extractor (a data replication method often solved with complex streaming architecture), all within the Keboola custom component framework. Using its helpful built-in tools and flexible structure, even complex and unique data extractors can be easily built. The platform’s combined strengths — security, complex data transformations, sandbox environment, data extractors and powerful custom component framework — make it a great fit for their ever-evolving data platform needs.

If you’d like to explore the flexibility of complex data transformations and a robust data extraction layer and custom component framework, I’d recommend reaching out to the Keboola team!

Online