ClickHouse is cruising and has been gaining altitude fast, especially with teams that care about flexibility, speed and control (sound familiar?). As a cutting-edge, ingest-first database, it’s built for the kind of workloads that define today’s modern data infrastructure. It only made sense that Airbyte needed a first-class ClickHouse destination connector.
If you are not familiar with us, Airbyte is the open standard for data movement. We offer a platform to move data from more than 600 sources into target systems such as ClickHouse. We recommend considering Airbyte alongside ClickHouse when you need to pull data from SaaS APIs such as Salesforce, HubSpot, or Zendesk Support and consolidate it in a single place for downstream data modeling and analytics. You can easily deploy Airbyte Open Source on your own machine, or start a free trial of Airbyte Cloud.
Here’s the story of how we built the ClickHouse destination connector, and how you can set it up yourself.
Why ClickHouse, Why Now?
ClickHouse represents a new generation of warehouse technology: ultra-fast ingestion, compute and storage decoupling, flexible deployment methods (self-hosted or cloud-native), and an incredibly powerful table engine.
From the very beginning, our community wanted better, more robust support for ClickHouse as a destination. In fact, an improved ClickHouse destination connector was one of the most upvoted GitHub issues in Airbyte’s open-source history. The existing connector we had in place wasn’t getting the job done. With this new release, you can now load data into ClickHouse using Airbyte with:
- Over 3x improved performance
- Support for much larger datasets moved in a single sync, including volumes above 1 TB
- Proper data typing and direct loading - no more JSON blobs loaded into your database
- A connector actively maintained and supported by the Airbyte team
Building It The Right Way - Implementation Overview
While Airbyte and ClickHouse share a lot of similarities in their flexible approach to data ingestion and configuration, seamlessly integrating the two technologies presented some challenges in mapping concepts between the two domains. While there are 15 different Airbyte types, there are roughly twice as many types in ClickHouse, and many of those types have subvariants dictating their precision (e.g. UInt8, UInt16, UInt32, etc.). With such an imbalance, we at Airbyte were forced into the messy business of choosing one-size-fits-all precisions for our numeric types and making JSON support configurable: folks on older ClickHouse instances that don’t support the JSON type can opt to store Airbyte Objects as Strings. Through a short beta process we landed on reasonable defaults for most workflows, and we are able to expose more configuration options around the mapping between the two domains if the need arises.
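To make the Object-handling option concrete, here is a hypothetical sketch (the table and column names are invented for illustration; this is not the connector’s actual DDL) of the same Airbyte Object field stored two ways:

-- On a server where the JSON type is available, an Airbyte Object can land as native JSON
CREATE TABLE users_json_enabled
(
    id UInt64,
    profile JSON          -- Airbyte Object stored as a native JSON column
)
ENGINE = MergeTree
ORDER BY id;

-- On older servers without JSON support, the same field can be stored as a serialized String
CREATE TABLE users_json_disabled
(
    id UInt64,
    profile String        -- Airbyte Object stored as a plain string
)
ENGINE = MergeTree
ORDER BY id;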
With the new v2 ClickHouse client, we were able to choose from 70+ input encodings, which also presented a challenge. Thankfully, there is an excellent blog post precisely benchmarking the various options, and we settled on a fast and ergonomic binary format (RowBinary) with the default LZ4 compression.
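If you want a feel for the format itself, the clickhouse-client round trip below is a purely illustrative sketch (the connector talks to ClickHouse through the Java client internally, and source_table and target_table are placeholders):

# Export a few rows as RowBinary, then load them back into another table with the same schema
clickhouse-client --query "SELECT * FROM source_table LIMIT 10 FORMAT RowBinary" > rows.bin
clickhouse-client --query "INSERT INTO target_table FORMAT RowBinary" < rows.bin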
At Airbyte, we provide the ability to “dedupe” records in the destination data store. With such a necessarily vague feature, one of our guiding principles is to implement it the way an avid user of the destination database would expect it to be implemented; we want to avoid a worst-for-everyone compromise that only mostly works across all destinations.
Thankfully, for ClickHouse this was a fairly straightforward task. Out of the box, ClickHouse provides several table engines that perform a compaction-like background consolidation of rows sharing the sorting key (or composite of keys) specified in the table definition; in other words, it dedupes data in the background while ingesting. While there are a few engine options, the ReplacingMergeTree engine best matches the “dedupe” feature we provide to our customers. Being able to use this option more or less allows us to focus on other features like schema change logic (detecting changing table structure and data types across syncs) and leave the deduping to ClickHouse.
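For intuition, a deduped table produced this way looks roughly like the sketch below. The column names (such as _airbyte_extracted_at) are assumptions standing in for Airbyte’s metadata columns, not the connector’s exact output; the engine and query patterns are standard ClickHouse:

-- Rough shape of a deduped destination table (illustrative only)
CREATE TABLE users
(
    id UInt64,
    email String,
    _airbyte_extracted_at DateTime64(3)   -- assumed metadata column used as the "version" for replacement
)
ENGINE = ReplacingMergeTree(_airbyte_extracted_at)
ORDER BY id;

-- Background merges collapse rows sharing the same sorting key;
-- FINAL forces deduplication at query time if merges haven't caught up yet
SELECT * FROM users FINAL WHERE id = 42;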
Current Limitations & Coming Soon
We’ll be adding features for ClickHouse continuously. The good news is: we’ve built a solid base, and it only gets better from here. In building the connector, we made informed tradeoffs around performance and flexibility, allowing for improved speed, scale, and loading of data into ClickHouse.
That said, there are a few limitations with the new Airbyte connector for ClickHouse. More specifically, certain query patterns and use cases are not yet well supported when loading data into ClickHouse with Airbyte; these are the areas of improvement we’ll focus on next:
- When you configure your Airbyte sync to load data in 'Dedupe' mode, the primary key and sorting key in ClickHouse are set to the primary key of the corresponding stream in Airbyte. You cannot pass a sorting key to ClickHouse that differs from the primary key used to dedupe data.
- When you configure your Airbyte sync to load data in 'Append' mode, the primary key and sorting key will not be set in ClickHouse. To set a sorting key on data replicated from Airbyte, you will need to create a view on top of the table loaded by Airbyte.
- Airbyte does not support replicating data into tables with partition keys.
Guide: Setting Up the ClickHouse Connector
The rest of this post is a step-by-step outline to help you get started. We’ve filled in the boilerplate so you can drop in the specifics as you go.
Prerequisites:
- A running ClickHouse instance (ClickHouse Cloud or self-hosted) with server version 21.8.10.19 or above (see the quick check below)
- Permissions to create users in ClickHouse
- An existing ClickHouse database
- An Airbyte instance (self-hosted or Airbyte Cloud)
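If you’re not sure which server version you’re running, a quick check against the ClickHouse HTTP interface (substituting your own host and credentials) looks like this:

curl --user 'default:{password}' \
--data-binary "SELECT version();" \
https://{id}.us-east1.gcp.clickhouse.cloud:8443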
First, you can find the default admin user for your ClickHouse instance by clicking ‘Connect’ in the ClickHouse console (shown in the screenshot above). The Connect page also lists the host and port for your instance (“{hostname}:{port}”). Keep these handy; we’ll need them later. Using the credentials for this default admin user, we will first create a separate ClickHouse user with permissions to create databases and tables, write rows, and alter, drop, or exchange tables. Airbyte will use this user to write data into ClickHouse:
curl --user 'default:{password}' \
--data-binary "CREATE USER airbyte_user IDENTIFIED BY '{password}';" \
https://ljj6ots60q.us-east1.gcp.clickhouse.cloud:8443
Second, grant your new Airbyte user the required permissions to write data to the intended database.
curl --user 'default:{password}' \
--data-binary "GRANT CREATE * ON *.* TO airbyte_user;" \
https://{id}.us-east1.gcp.clickhouse.cloud:8443
curl --user 'default:{password}' \
--data-binary "GRANT CREATE TABLE, DROP TABLE, SELECT, INSERT, ALTER, TRUNCATE ON default.* TO airbyte_user;" \
https://{id}.us-east1.gcp.clickhouse.cloud:8443
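Optionally, confirm the grants took effect before moving on:

curl --user 'default:{password}' \
--data-binary "SHOW GRANTS FOR airbyte_user;" \
https://{id}.us-east1.gcp.clickhouse.cloud:8443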
Third, we transition to Airbyte, where you can create a new destination. You can do this via the UI or the API (for an API reference, see here). Let’s start by selecting ClickHouse from the list of destinations:
From here, you can fill out the form. The ‘Connect’ view in ClickHouse shown above provides the host name and port you need. Pass in the credentials for the newly created airbyte_user and the ClickHouse database you would like data to be written into.
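If you’d rather script this step, a rough sketch against Airbyte’s public API might look like the following. Treat it as an assumption-laden example: the configuration field names below are illustrative, so check the connector’s spec and the API reference linked above for the exact schema.

curl -X POST https://api.airbyte.com/v1/destinations \
  -H "Authorization: Bearer $AIRBYTE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "clickhouse-destination",
    "workspaceId": "{workspace_id}",
    "configuration": {
      "destinationType": "clickhouse",
      "host": "{hostname}",
      "port": 8443,
      "database": "default",
      "username": "airbyte_user",
      "password": "{password}"
    }
  }'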
Once you’ve configured the ClickHouse destination, attach it to an existing source. Airbyte has more than 600 source connectors, making it easy to consolidate data into one place from APIs, files, databases, knowledge bases, and more. If you’d like to experiment with Airbyte before creating your first source connector, you can use the ‘Sample Data’ source, which can be created without any credentials.
Once your connection is up and running, you can monitor its status, inspect the logs, and watch in real time as records are loaded into ClickHouse. Here’s an example of the data once it has landed in your ClickHouse database.
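As a final sanity check, you can query the landed table directly. The table name (default.users) and the _airbyte_extracted_at metadata column below are assumptions for the example, so substitute the table created for your own stream:

curl --user 'airbyte_user:{password}' \
--data-binary "SELECT count(), max(_airbyte_extracted_at) FROM default.users;" \
https://{id}.us-east1.gcp.clickhouse.cloud:8443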