
Build a data ingestion pipeline from Mailchimp to Snowflake

Learn how to easily build a data ingestion pipeline to load business data from Mailchimp to Snowflake.

As an analytics engineer, I am always on the lookout for new tools to improve my data stack. The world of data is constantly evolving, and if you’re not adapting along with it, you’re falling behind. That’s why it’s important for an analytics engineer to stay informed about the tools available on the market.

While I’ve only ever used Fivetran and Stitch to build data ingestion pipelines, I’ve recently come across a tool called Airbyte. Airbyte is an open-source data ingestion tool that helps you consolidate your data in your data warehouses, lakes, and databases. Unlike Fivetran and Stitch, it is open source, which makes it easy to get support on Slack and to update existing connectors.

If you’ve read my articles on dbt and Prefect, you know that one of my favorite features they offer is their Slack communities. You can get your questions answered within hours and share best practices with other analytics and data engineers. No need to create a support ticket and wait days for a response! This is key for companies looking to move fast.

Airbyte also offers over 100 data connectors, making it easy to create data ingestion pipelines to load data from a source to a destination. I’ve found that it is always best to find one tool that can do it all for each step of the ETL/ELT process. It will save you money, time, and important brainpower. 

In this article I’m going to set up my first Airbyte connector, ingesting raw data from Mailchimp into Snowflake. I’m also going to do some basic data cleaning to make the raw data ready to analyze. Throughout, I’ll give my insight on my favorite features of Airbyte and some of the more challenging pieces as well. 

Running Airbyte locally

Before you set up your first connector, you’ll need to follow the instructions here to set up Airbyte on your localhost using Docker Compose. I’ve found this to be a pain with other tools such as Airflow, but getting Airbyte started locally was a breeze. 

Setting up a Mailchimp source connector

Airbyte has a Mailchimp source connector. Like other data ingestion tools, Airbyte lets you name your source in the UI. I recommend sticking with the name of the source unless you need to set up multiple connectors for that same source. Since I have multiple Mailchimp accounts, I’m calling mine madison_mailchimp for my personal newsletter. Then select the data source you wish to connect to.

Image by author

You can find the API information for Mailchimp by going to your account page and selecting “API keys” from the “Extras” dropdown menu. Create a key and copy it into the “apikey” field in Airbyte. Add the username or email you use to sign in to Mailchimp and click “Set up source”.

Setting up a Snowflake destination connector

Now you need to set up the destination that you want the Mailchimp data to be loaded into. I’ll be using Airbyte's Snowflake destination for this tutorial but feel free to use whatever destination works best for you. Similar to the connector name, I recommend choosing a name that matches the destination type. I only have one Snowflake environment, so I am simply calling it Snowflake.

Image by author

Now, you need to head to Snowflake to find all of the information necessary to set up your destination.

Figuring out the permissions for tools that connect to Snowflake can be tricky. You need to make sure you give the AIRBYTE_USER the correct role, and give that role all the permissions it needs. I’m not going to lie, I had some failed attempts at setting up the user and role before deciding to copy the exact script given in the Airbyte documentation.

Use their script and simply replace the variables at the top with whatever you wish to name your role, username, database, schema, and warehouse. 
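
To give a sense of what that script does, here is a simplified sketch of its shape. It is not the exact script from the Airbyte docs (treat the documentation as the source of truth), and the names AIRBYTE_ROLE, AIRBYTE_USER, and so on are just the defaults I kept; the password value is a placeholder you would replace.


-- Simplified sketch of the Snowflake setup; the Airbyte docs have the canonical script.
-- Replace the values below with your own role, user, warehouse, database, and schema names.
set airbyte_role = 'AIRBYTE_ROLE';
set airbyte_username = 'AIRBYTE_USER';
set airbyte_password = 'replace-with-a-strong-password';  -- placeholder
set airbyte_warehouse = 'AIRBYTE_WAREHOUSE';
set airbyte_database = 'AIRBYTE_DATABASE';
set airbyte_schema = 'AIRBYTE_SCHEMA';

-- Create the role and user that Airbyte will connect with.
use role securityadmin;
create role if not exists identifier($airbyte_role);
create user if not exists identifier($airbyte_username)
  password = $airbyte_password
  default_role = $airbyte_role
  default_warehouse = $airbyte_warehouse;
grant role identifier($airbyte_role) to user identifier($airbyte_username);

-- Create the warehouse, database, and schema, then hand them over to the Airbyte role.
use role sysadmin;
create warehouse if not exists identifier($airbyte_warehouse) warehouse_size = xsmall;
create database if not exists identifier($airbyte_database);
use database identifier($airbyte_database);
create schema if not exists identifier($airbyte_schema);
grant usage on warehouse identifier($airbyte_warehouse) to role identifier($airbyte_role);
grant ownership on schema identifier($airbyte_schema) to role identifier($airbyte_role);
grant ownership on database identifier($airbyte_database) to role identifier($airbyte_role);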

Lastly, make sure you select “Internal staging” for the loading method in the last step. It loads data into Snowflake much faster than the standard inserts option and doesn’t require you to configure an external stage like S3.

Ingesting Mailchimp data into Snowflake

Next, you’ll be brought to a screen that contains more information to create a data ingestion pipeline from Mailchimp to Snowflake. 

Image by author

Choose “mirror source structure” for your namespace configuration. You also have the option to add a table prefix, and I recommend adding one that includes the name of your source connector. Tables across different connectors can have the same names (like campaigns or orders), so this is a good practice to put in place. Notice how I use madison_mailchimp_ as my table prefix.

If you look in the table, you can also see a column called “Sync mode”, which specifies how the data should be ingested each time the connector syncs. There are four different options:

  • Full Refresh Overwrite: resyncs all of the data and replaces whatever is already in the destination.
  • Full Refresh Append: resyncs all of the data and appends it, creating duplicates of previously synced rows.
  • Incremental Append: syncs only new or updated rows and appends them to the rows already synced.
  • Incremental Deduped History: syncs only new or updated rows and appends them, while also providing a deduplicated view that mirrors the state of the stream in the source.

I personally prefer Incremental Append because I don’t have to worry about duplicated rows, and it doesn’t require as much storage in the data warehouse. Full Refresh Overwrite is only recommended when you have a small amount of data; otherwise the sync can take a long time. It can also burn through precious Snowflake credits if your warehouse is running for a while.

Then, select the tables you wish to load into Snowflake and choose “basic normalization”. When you choose this option, Airbyte automatically creates a schema that fits your data, making our job as analytics engineers much easier! Otherwise, we would have to parse raw JSON objects in Snowflake ourselves, and that’s no fun.

Image by author
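
For context on what “no fun” looks like: without normalization, Airbyte lands each record as raw JSON (at the time of writing, in a _airbyte_raw_* table with a VARIANT column called _airbyte_data), and you’d have to pull every field out yourself with Snowflake’s JSON path syntax. A rough sketch, assuming a raw table named _AIRBYTE_RAW_MADISON_MAILCHIMP_CAMPAIGNS (the exact name depends on your prefix and connector version):


-- Manually flattening raw JSON, which basic normalization does for you
select
  _airbyte_data:id::string as id,
  _airbyte_data:type::string as type,
  _airbyte_data:status::string as status,
  _airbyte_data:send_time::timestamp_ntz as send_time
from "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."_AIRBYTE_RAW_MADISON_MAILCHIMP_CAMPAIGNS"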

Airbyte also has an option to schedule your data syncs. You can run them every few hours, every 24 hours, or even trigger them manually. It all depends on your use case.

Image by author

After clicking “Sync now”, it can be helpful to read through the logs as the data loads to make sure everything is going as expected. The logs are also a great feature when something goes wrong, since they make it easy to pinpoint the exact error.

Now let’s head over to Snowflake. Go to the database and schema that you chose to load your connector’s data into. You should see tables beginning with the table prefix you specified.

Image by author

Awesome! I can see all the Mailchimp tables that I ingested using the Airbyte data connector. Now it’s time to transform this data so it is ready to be analyzed. 

Cleaning the ingested data

Let’s start by looking at the "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."MADISON_MAILCHIMP_CAMPAIGNS" table. Here’s a small subset of what it looks like:

Image by author

Casting string columns

The first thing that jumps out to me is the TYPE and STATUS columns. These are both of the variant data type. Here, each variant value holds a JSON string rather than a plain string, which is why the values display wrapped in double quotes. I want these to be strings, but more importantly, I want to remove the double quotes.

These double quotes would make analysis quite messy, especially in Snowflake, which uses double quotes for quoted identifiers like database, schema, and table names:


select * from "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."MADISON_MAILCHIMP_CAMPAIGNS"

Leaving these columns as variants with embedded double quotes would lead to messy queries full of workarounds. Next, let’s cast these columns so that they’re strings.


select
  id,
  CAST(type AS string),
  CAST(status AS string),
  send_time,
  resendable,
  created_time
from "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."MADISON_MAILCHIMP_CAMPAIGNS"

Well, look at that! When you cast these columns from variant to string, the double quotes are eliminated. That solves that problem.

Casting date columns

Now let’s look at the two columns that we want to be timestamps in our final dataset: send_time and created_time. Right now they are both varchar columns. Casting them to timestamps will let us use date functions like date_part(), dateadd(), and year() on them in future analyses.


select
  id,
  CAST(type AS string),
  CAST(status AS string),
  send_time::timestamp_ntz,
  resendable,
  created_time::timestamp_ntz
from "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."MADISON_MAILCHIMP_CAMPAIGNS"

Make sure you consistently cast the timestamps and dates in your base models to the same types. When I first started building my models, I would leave any columns that already had timestamp data types alone. No casting, nothing. They seemed ready to go!

Then, I started experiencing some weird issues when comparing timestamps. After a lot of digging, I realized my timestamp columns were a mix of Snowflake’s different timestamp types (timestamp_tz, timestamp_ltz, and timestamp_ntz). Now I make sure I cast them all to timestamp_ntz.

Renaming columns

Lastly, I recommend renaming the columns in tables like this one to something more descriptive. You’ll be using many of these Mailchimp tables together, so having simple names like id and status will get confusing fast.

I recommend renaming them according to the table name. Here I am looking at the CAMPAIGNS table, so I would rename id to campaign_id. This is a clear and descriptive name across all Mailchimp tables.


select
  id AS campaign_id,
  CAST(type AS string) AS campaign_type,
  CAST(status AS string) AS campaign_status,
  send_time::timestamp_ntz AS campaign_sent_at,
  resendable AS is_campaign_resendable,
  created_time::timestamp_ntz AS campaign_created_at
from "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."MADISON_MAILCHIMP_CAMPAIGNS"

Here I renamed:

  • type to campaign_type
  • status to campaign_status
  • send_time to campaign_sent_at
  • resendable to is_campaign_resendable
  • created_time to campaign_created_at

Now the data is ready to be modeled and analyzed! dbt is a great tool for storing the SQL files that build your data models and for running them on a schedule, so you always have clean data available in your data warehouse.
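
As a rough sketch of how this could live in dbt, the cleaned-up query above becomes a model. The model name base_mailchimp__campaigns.sql and the source name mailchimp are my own hypothetical choices; they assume you’ve defined a mailchimp source in a sources .yml file pointing at the Airbyte schema.


-- models/base/base_mailchimp__campaigns.sql (hypothetical model and source names)
select
  id as campaign_id,
  cast(type as string) as campaign_type,
  cast(status as string) as campaign_status,
  send_time::timestamp_ntz as campaign_sent_at,
  resendable as is_campaign_resendable,
  created_time::timestamp_ntz as campaign_created_at
from {{ source('mailchimp', 'madison_mailchimp_campaigns') }}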

Conclusion

Overall, I’ve really enjoyed using Airbyte to ingest my raw data. It was a seamless process besides setting up the correct permissions in Snowflake. The setup is clearly documented, the UI is user-friendly, and my raw data landed at the destination in a format that was easy to transform into a usable model.

However, I’ve struggled more with other data connectors like Facebook and Google. Because of the way those companies distribute API keys, it can be confusing to obtain the correct permissions. If you’re familiar with building software against their APIs, this step should be a breeze for you. Otherwise, there may be a bit of a learning curve.

Because I ran Airbyte locally, I’d be interested to see how difficult it is to deploy using a cloud solution like AWS. This is where I struggled the most with Prefect. Luckily, I was able to get lots of support from their Slack channel, and I imagine it would be the same with Airbyte.
