ClickHouse is cruising and has been gaining altitude fast, especially with teams that care about flexibility, speed, and control (sound familiar?). As a cutting-edge, ingest-first database, it’s built for the kind of workloads that define today’s modern data infrastructure. It only made sense for Airbyte to build a first-class ClickHouse destination connector.
Here’s the story of how we built it.
Why ClickHouse, Why Now?

ClickHouse represents a new generation of warehouse technology: ultra-fast ingestion, decoupled compute and storage, flexible deployment options (self-hosted or cloud-native), and an incredibly powerful table engine.
From the very beginning, our community wanted better, more robust support for ClickHouse as a destination. In fact, a ClickHouse destination connector was one of the most upvoted GitHub issues in Airbyte’s open-source history.
But the existing connector wasn’t getting the job done.
Building It The Right Way

While Airbyte and ClickHouse share a flexible approach to data ingestion and configuration, seamlessly integrating the two technologies presented some challenges in mapping concepts between the two domains. Airbyte has 15 distinct types; ClickHouse has roughly twice as many, and many of those types have subvariants dictating their precision (e.g. UInt8, UInt16, UInt32, etc.). With such an imbalance, we were forced into the messy business of choosing one-size-fits-all precisions for our numerics and making JSON support configurable: folks on older ClickHouse instances that don’t support JSON can opt to store Airbyte Objects as Strings. Through a short beta process we landed on reasonable defaults for most workflows, and we can expose more configuration options around the mapping between the two domains if the need arises.
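To make that imbalance concrete, here is a minimal sketch of what such a type mapping might look like. The table and the JSON fallback rule are illustrative only, not the connector's actual mapping:

```python
# Illustrative mapping of Airbyte types onto one-size-fits-all
# ClickHouse precisions (NOT the connector's real table).
AIRBYTE_TO_CLICKHOUSE = {
    "string": "String",
    "boolean": "Bool",
    "integer": "Int64",        # one fixed precision instead of Int8/16/32/64
    "number": "Float64",
    "date": "Date32",
    "timestamp_with_timezone": "DateTime64(3)",
    "object": "JSON",          # may fall back to String, see below
}

def column_type(airbyte_type: str, json_supported: bool = True) -> str:
    """Resolve an Airbyte type to a ClickHouse column type, storing
    Objects as Strings when the target instance lacks JSON support."""
    ch_type = AIRBYTE_TO_CLICKHOUSE[airbyte_type]
    if ch_type == "JSON" and not json_supported:
        return "String"
    return ch_type
```

The interesting design decision is the fallback: rather than failing on older ClickHouse versions, the Object type degrades gracefully to a String column.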
With the new v2 ClickHouse client, we were able to choose from 70+ input formats, which presents a challenge in itself. Thankfully, there is an excellent blog post precisely benchmarking the various options, and we settled on a fast and ergonomic binary format (RowBinary) with the default LZ4 compression.
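For a feel of why RowBinary is fast and ergonomic, here is a hand-rolled sketch of its wire layout for a single (UInt32, String) row: fixed-width little-endian integers, and strings prefixed with an unsigned LEB128 varint length. In practice the client library handles this encoding; this is purely illustrative:

```python
import struct

def leb128(n: int) -> bytes:
    """Unsigned LEB128 varint, used by RowBinary for string lengths."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_row(user_id: int, name: str) -> bytes:
    """Encode one (UInt32, String) row in RowBinary layout:
    4-byte little-endian integer, then varint-prefixed UTF-8 string."""
    payload = name.encode("utf-8")
    return struct.pack("<I", user_id) + leb128(len(payload)) + payload
```

There is no per-row framing, no field names, and no escaping, which is a large part of why binary row formats benchmark so well against text formats like JSONEachRow.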
When in the business of moving data, it is a breath of fresh air to see data-driven literature from the database devs themselves about how to best load data into their systems. As a cherry on top, their benchmark suite provided the examples we needed to build the first POC.
At Airbyte, we provide the ability to “dedupe” records in the destination data store. We are deliberately opaque about how we accomplish this, as it is very destination specific. With such a necessarily vague feature, one of our guiding principles is to implement it the way an avid user of the destination database would expect it to be implemented; we want to avoid a worst-for-everyone compromise that only mostly works across all destinations.
Thankfully, for ClickHouse this was a fairly straightforward task. Out of the box, ClickHouse provides several table engines that perform a compaction-like background consolidation of rows sharing the sorting key (or composite of keys) specified in the table definition; i.e., it dedupes data in the background while ingesting. While there are a few engine options, the ReplacingMergeTree engine best matches the “dedupe” feature we provide to our customers. Leaning on this engine lets us focus on other features, like schema change logic (detecting changing table structure and data types across syncs), and leave the deduping to ClickHouse.
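As a sketch of how this plays out, here is a hypothetical helper that builds the kind of ReplacingMergeTree DDL a dedupe sync might issue. The helper and its use of Airbyte's `_airbyte_extracted_at` meta column as the version column are illustrative assumptions, not the connector's exact DDL:

```python
def dedupe_table_ddl(table: str, primary_key: str, columns: dict[str, str]) -> str:
    """Build CREATE TABLE DDL for a deduped stream (illustrative sketch).
    Rows sharing the ORDER BY key are collapsed during background merges;
    the row with the greatest version column (_airbyte_extracted_at) wins."""
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns.items())
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"ENGINE = ReplacingMergeTree(_airbyte_extracted_at)\n"
        f"ORDER BY ({primary_key})"
    )
```

One caveat worth knowing as a ClickHouse user: ReplacingMergeTree collapses duplicates only at merge time, so queries that must see fully deduplicated data can read with `SELECT ... FINAL`.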
Technical Tradeoffs & What’s Next

This release wasn’t just about building a connector; it was about laying a foundation. We made informed tradeoffs around performance and flexibility, leaned into the table engine’s strengths, and exposed the right parameters to support today’s data pipelines.
We’ll be adding features for ClickHouse continuously. The good news is we’ve built a solid base: the tradeoffs we made allow for improved speed, scale, and loading of data into ClickHouse. It only gets better from here.
There are a few limitations with the new Airbyte connector for ClickHouse:
- When you configure your Airbyte sync to load data in 'Dedupe' mode, the primary key and sorting key in ClickHouse are set to the primary key of the corresponding stream in Airbyte. You cannot pass ClickHouse a sorting key that differs from the primary key used to dedupe data.
- When you configure your Airbyte sync to load data in 'Append' mode, the primary key and sorting key will not be set in ClickHouse. To set a sorting key on data replicated from Airbyte, you will need to create a view on top of the table loaded by Airbyte.
- Airbyte does not support replicating data into tables with partition keys.

Just the Beginning

Compared to the original marketplace connector, the certified connector comes with improved loading performance, Airbyte support, and continuous updates. As such, we are deprecating the original marketplace connector and encourage all users to upgrade to the new certified connector.
ClickHouse is here to stay as one of the leading modern data warehouses with unparalleled flexibility. Our job was to give it a connector in line with ClickHouse’s high bar for quality. Now, with a fully supported, production-ready destination, ClickHouse users can go further. And so can you.