As an analytics engineer, I am always on the search for new tools to improve my data stack. The world of data is constantly evolving and if you’re not adapting along with it, you’re falling behind. This is why it's important for an analytics engineer to stay informed on all the tools available in the market.
While I’ve only ever used Fivetran and Stitch to build data ingestion pipelines, I’ve recently come across a tool called Airbyte. Airbyte is an open-source data ingestion tool that helps you consolidate your data in your data warehouses, lakes and databases. Unlike Fivetran and Stitch, Airbyte is an open-source tool, making it easy to seek support on Slack and update existing connectors.
If you’ve read my articles on dbt and Prefect, you know one of my favorite features they offer are their Slack communities. You can get all of your questions answered within hours and share best practices with other analytics and data engineers. No need to create a support ticket and wait days for a response! This is key for companies looking to move fast.
Airbyte also offers over 100 data connectors, making it easy to create data ingestion pipelines to load data from a source to a destination. I’ve found that it is always best to find one tool that can do it all for each step of the ETL/ELT process. It will save you money, time, and important brainpower.
In this article I’m going to set up my first Airbyte connector, ingesting raw data from Mailchimp into Snowflake. I’m also going to do some basic data cleaning to make the raw data ready to analyze. Throughout, I’ll give my insight on my favorite features of Airbyte and some of the more challenging pieces as well.
Before you set up your first connector, you’ll need to follow the instructions here to set up Airbyte on your localhost using Docker Compose. I’ve found this to be a pain with other tools such as Airflow, but getting Airbyte started locally was a breeze.
Airbyte has a Mailchimp source connector. Like other data ingestion tools, Airbyte gives you the option to name your source on the UI. I recommend sticking with the name of the source unless you have multiple connectors for that same source that you need to set up. Since I have multiple Mailchimp accounts, I’m calling mine madison_mailchimp for my personal newsletter. Then select the data source you wish to connect to.
You can find the API information for Mailchimp by going to account and then selecting “API keys” from the “Extras” dropdown menu. Create a key and copy it into the “apikey” field on Airbyte. Add the username or email you use to sign in to Mailchimp and click “Set up source”.
Now you need to set up the destination that you want the Mailchimp data to be loaded into. I’ll be using Airbyte's Snowflake destination for this tutorial but feel free to use whatever destination works best for you. Similar to the connector name, I recommend choosing a name that matches the destination type. I only have one Snowflake environment, so I am simply calling it Snowflake.
Now, you need to head to Snowflake to find all of the information necessary to set up your destination.
Figuring out the permissions for tools that connect to Snowflake can be tricky. You need to make sure you are giving the AIRBYTE_USER the correct role and all the permissions on that role. I’m not going to lie, I had some failed attempts with setting up the user and role before deciding to copy the exact script given in the Airbyte documentation.
Use their script and simply replace the variables at the top with whatever you wish to name your role, username, database, schema, and warehouse.
Lastly, make sure you select “Internal staging” for the loading method in the last step. This is the best way to load data into Snowflake.
Next, you’ll be brought to a screen that contains more information to create a data ingestion pipeline from Mailchimp to Snowflake.
Choose to “mirror source structure” for your namespace configuration. You also have an option to add a table prefix. I recommend adding one with the name of your source connector. Tables across various connectors can have the same names (like campaigns or orders), so this is a good practice to put into place. Notice how I use madison_mailchimp_ as my table prefix.
If you look in the table, you can also see a column called “Sync mode” which specifies how you want the data to be ingested each time the connector syncs. There are four different options:
I personally prefer incremental append because I don’t have to worry about duplicated rows. It also doesn’t require as much storage in the data warehouse. Full refresh overwrite is only recommended when you have a small amount of data, otherwise this can take a long time to sync. It may also run through precious Snowflake credits if your warehouse is running for a while.
Then, select the tables you wish to upload to Snowflake and choose “basic normalization”. Airbyte automatically creates a schema that fits your data when you choose this option, making our job as analytics engineers much easier! Otherwise, we would then have to parse a JSON object in Snowflake and that’s no fun.
Airbyte also has an option to schedule your data syncs. You can choose every few hours, every 24 hours, or even manually. This all depends on your use case.
After clicking “sync now”, it can be helpful to read through the logs as the data is uploading to ensure everything is going as expected. This is also a great feature in case something goes wrong. Logs make it easy to pinpoint the exact error.
Now let’s head on over to Snowflake. Go to the database and schema that you choose to upload your connector's data to. You should see tables beginning with that table prefix that you specified.
Awesome! I can see all the Mailchimp tables that I ingested using the Airbyte data connector. Now it’s time to transform this data so it is ready to be analyzed.
Let’s start by looking at the “AIRBYTE_DATABASE”.”AIRBYTE_SCHEMA”.”MADISON_MAILCHIMP_CAMPAIGNS” table. Here’s a small subset of what it looks like:
The first thing that jumps out to me is the TYPE and STATUS columns. These are both of data type variant. This data type stores a string, but is not a string itself. I want these to be strings, but more importantly, I want to remove the double quotes.
These double quotes would make analysis quite messy, especially in Snowflake. Snowflake actually uses double quotes to select from the database, schema, and table names.
Using this variant data type with double quotes would cause a lot of messy queries where you’d have to work around that. Next, let’s cast these columns so that they’re strings.
Well, look at that! When you cast these columns from variant to string the double quotes are eliminated. That solves that problem.
Now let’s look at the two columns that we want to be timestamps in our final dataset- send_time and created_time. Right now they are both varchar data types. Casting them to timestamps will allow us to perform date functions like date_part(), dateadd(), and year() on them in future analyses.
Make sure you’re consistently casting the data in your base models to the same type of timestamps and dates. When I first started building my models, I would leave any timestamp columns that already had timestamp data types alone. No casting, nothing. They seemed ready to go!
Then, I started experiencing some weird issues when comparing timestamps. After a lot of digging, I realized my timestamp columns were all different types of timestamps. Now I make sure I cast them all to timestamp_ntz.
Lastly, I recommend renaming the columns in tables like this one to something more descriptive. You’ll be using many of these Mailchimp tables together, so having simple names like id and status will get confusing fast.
I recommend renaming them according to the table name. Here I am looking at the CAMPAIGNS table, so I would rename id to campaign_id. This is a clear and descriptive name across all Mailchimp tables.
Here I renamed:
Now the data is ready to be automated and analyzed by creating a data model! dbt is a great tool to store these SQL files used to create your data model. It then automates them so that you always have clean data available to you in your data warehouse.
Overall, I’ve really enjoyed using Airbyte to ingest my raw data. It was a seamless process besides setting up the correct permissions in Snowflake. The setup is clearly documented, the UI is user-friendly, and my raw data landed at the destination in a format that was easy to transform into a usable model.
However, I’ve struggled more with other data connectors like Facebook and Google. Because of the way they distribute API keys, it can be confusing to obtain the correct permissions. If you’re familiar with software development using these outside companies, this step should be a breeze for you. Otherwise, there may be a bit of a learning curve.
Because Airbyte is run locally, I would be interested to see the difficulty level of deploying it using a cloud solution like AWS. This is where I struggled the most with Prefect. Luckily, I was able to receive lots of support from their Slack channel. I imagine this would be the same with Airbyte.
Get all your ELT data pipelines running in minutes with Airbyte.
Learn how to load Facebook Ads data to BigQuery using Airbyte Cloud.
Learn how to export Google Analytics data to BigQuery using Airbyte Cloud.