Building your pipeline or Using Airbyte
Airbyte is the only open solution empowering data teams to meet all their growing custom business demands in the new AI era.
- Inconsistent and inaccurate data
- Laborious and expensive
- Brittle and inflexible
- Reliable and accurate
- Extensible and scalable for all your needs
- Deployed and governed your way
Start syncing with Airbyte in 3 easy steps within 10 minutes
Take a virtual tour
Demo video of Airbyte Cloud
Demo video of AI Connector Builder
What sets Airbyte Apart
Modern GenAI Workflows
Move Large Volumes, Fast
An Extensible Open-Source Standard
Full Control & Security
Fully Featured & Integrated
Enterprise Support with SLAs
What our users say
"The intake layer of Datadog’s self-serve analytics platform is largely built on Airbyte.Airbyte’s ease of use and extensibility allowed any team in the company to push their data into the platform - without assistance from the data team!"
“Airbyte helped us accelerate our progress by years, compared to our competitors. We don’t need to worry about connectors and focus on creating value for our users instead of building infrastructure. That’s priceless. The time and energy saved allows us to disrupt and grow faster.”
“We chose Airbyte for its ease of use, its pricing scalability and its absence of vendor lock-in. Having a lean team makes them our top criteria. The value of being able to scale and execute at a high level by maximizing resources is immense”
FAQs
What is ETL?
ETL, an acronym for Extract, Transform, Load, is a vital data integration process. It involves extracting data from diverse sources, transforming it into a usable format, and loading it into a database, data warehouse or data lake. This process enables meaningful data analysis, enhancing business intelligence.
Mailchimp is a global marketing automation platform aimed at small to medium-sized businesses. Mailchimp provides essential marketing tools for growing a successful business, enabling businesses to automate messages and send marketing emails, create targeted business campaigns, expedite analytics and reporting, and effectively and efficiently sell online.
Mailchimp's API provides access to a wide range of data related to email marketing campaigns. The following are the categories of data that can be accessed through Mailchimp's API:
1. Lists: Information about the email lists, including the number of subscribers, the date of creation, and the list name.
2. Campaigns: Data related to email campaigns, including the campaign name, the number of recipients, the open rate, click-through rate, and bounce rate.
3. Subscribers: Information about the subscribers, including their email address, name, location, and subscription status.
4. Reports: Detailed reports on the performance of email campaigns, including open rates, click-through rates, and bounce rates.
5. Templates: Access to email templates that can be used to create new campaigns.
6. Automation: Data related to automated email campaigns, including the number of subscribers, the date of creation, and the automation name.
7. Tags: Information about tags that can be used to categorize subscribers and campaigns.
Overall, Mailchimp's API provides a comprehensive set of data that can be used to analyze and optimize email marketing campaigns.
What is ELT?
ELT, standing for Extract, Load, Transform, is a modern take on the traditional ETL data integration process. In ELT, data is first extracted from various sources, loaded directly into a data warehouse, and then transformed. This approach enhances data processing speed, analytical flexibility and autonomy.
Difference between ETL and ELT?
ETL and ELT are critical data integration strategies with key differences. ETL (Extract, Transform, Load) transforms data before loading, ideal for structured data. In contrast, ELT (Extract, Load, Transform) loads data before transformation, perfect for processing large, diverse data sets in modern data warehouses. ELT is becoming the new standard as it offers a lot more flexibility and autonomy to data analysts.
Mailchimp is a global marketing automation platform aimed at small to medium-sized businesses. Mailchimp provides essential marketing tools for growing a successful business, enabling businesses to automate messages and send marketing emails, create targeted business campaigns, expedite analytics and reporting, and effectively and efficiently sell online.
A cloud data platform, Snowflake Data Cloud provides a warehouse-as-a-service built specifically for the cloud. The Snowflake platform is designed to empower many types of data workloads, and offers secure, immediate, governed access to a comprehensive network of data. Snowflake’s innovative technology goes above the capabilities of the ordinary database, supplying users all the functionality of database storage, query processing, and cloud services in one package.
1. Open the Airbyte platform and navigate to the "Sources" tab on the left-hand side of the screen.
2. Click on the "Add Source" button and select "Mailchimp" from the list of available connectors.
3. Enter a name for your Mailchimp source connector and click on the "Next" button.
4. Enter your Mailchimp API key in the provided field. You can find your API key by logging into your Mailchimp account and navigating to the "Account" section. From there, click on "Extras" and then "API keys".
5. Select the Mailchimp lists you want to sync with Airbyte by checking the boxes next to their names.
6. Choose the sync mode you want to use for your Mailchimp data. You can choose between "Full Refresh" and "Incremental".
7. Click on the "Test" button to ensure that your Mailchimp source connector is working properly.
8. If the test is successful, click on the "Create" button to save your Mailchimp source connector.
9. You can now use your Mailchimp source connector to sync your Mailchimp data with other destinations on the Airbyte platform.
1. First, navigate to the Airbyte website and log in to your account.
2. Once you are logged in, click on the "Destinations" tab on the left-hand side of the screen.
3. Scroll down until you find the Snowflake Data Cloud destination connector and click on it.
4. You will be prompted to enter your Snowflake account information, including your account name, username, and password.
5. After entering your account information, click on the "Test" button to ensure that the connection is successful.
6. If the test is successful, click on the "Save" button to save your Snowflake Data Cloud destination connector settings.
7. You can now use the Snowflake Data Cloud destination connector to transfer data from your Airbyte sources to your Snowflake account.
8. To set up a data transfer, navigate to the "Sources" tab on the left-hand side of the screen and select the source you want to transfer data from.
9. Click on the "Create New Connection" button and select the Snowflake Data Cloud destination connector as your destination.
10. Follow the prompts to set up your data transfer, including selecting the tables or data sources you want to transfer and setting up any necessary transformations or mappings.
11. Once you have set up your data transfer, click on the "Run" button to start the transfer process.
With Airbyte, creating data pipelines take minutes, and the data integration possibilities are endless. Airbyte supports the largest catalog of API tools, databases, and files, among other sources. Airbyte's connectors are open-source, so you can add any custom objects to the connector, or even build a new connector from scratch without any local dev environment or any data engineer within 10 minutes with the no-code connector builder.
We look forward to seeing you make use of it! We invite you to join the conversation on our community Slack Channel, or sign up for our newsletter. You should also check out other Airbyte tutorials, and Airbyte’s content hub!
What should you do next?
Hope you enjoyed the reading. Here are the 3 ways we can help you in your data journey:
As an analytics engineer, I am always on the search for new tools to improve my data stack. The world of data is constantly evolving and if you’re not adapting along with it, you’re falling behind. This is why it's important for an analytics engineer to stay informed on all the tools available in the market.
While I’ve only ever used Fivetran and Stitch to build data ingestion pipelines, I’ve recently come across a tool called Airbyte. Airbyte is an open-source data ingestion tool that helps you consolidate your data in your data warehouses, lakes and databases. Unlike Fivetran and Stitch, Airbyte is an open-source tool, making it easy to seek support on Slack and update existing connectors.
If you’ve read my articles on dbt and Prefect, you know one of my favorite features they offer are their Slack communities. You can get all of your questions answered within hours and share best practices with other analytics and data engineers. No need to create a support ticket and wait days for a response! This is key for companies looking to move fast.
Airbyte also offers over 100 data connectors, making it easy to create data ingestion pipelines to load data from a source to a destination. I’ve found that it is always best to find one tool that can do it all for each step of the ETL/ELT process. It will save you money, time, and important brainpower.
In this article I’m going to set up my first Airbyte connector, ingesting raw data from Mailchimp to Snowflake. I’m also going to do some basic data cleaning to make the raw data ready to analyze. Throughout, I’ll give my insight on my favorite features of Airbyte and some of the more challenging pieces as well.
Running Airbyte locally
Before you set up your first connector, you’ll need to follow the instructions here to set up Airbyte on your localhost using Docker Compose. I’ve found this to be a pain with other tools such as Airflow, but getting Airbyte started locally was a breeze.
Setting up a Mailchimp source connector
Airbyte has a Mailchimp source connector. Like other data ingestion tools, Airbyte gives you the option to name your source on the UI. I recommend sticking with the name of the source unless you have multiple connectors for that same source that you need to set up. Since I have multiple Mailchimp accounts, I’m calling mine madison_mailchimp for my personal newsletter. Then select the data source you wish to connect to.
You can find the API information for Mailchimp by going to account and then selecting “API keys” from the “Extras” dropdown menu. Create a key and copy it into the “apikey” field on Airbyte. Add the username or email you use to sign in to Mailchimp and click “Set up source”.
Setting up a Snowflake destination connector
Now you need to set up the destination that you want the Mailchimp data to be loaded into. I’ll be using Airbyte's Snowflake destination for this tutorial but feel free to use whatever destination works best for you. Similar to the connector name, I recommend choosing a name that matches the destination type. I only have one Snowflake environment, so I am simply calling it Snowflake.
Now, you need to head to Snowflake to find all of the information necessary to set up your destination.
Figuring out the permissions for tools that connect to Snowflake can be tricky. You need to make sure you are giving the AIRBYTE_USER the correct role and all the permissions on that role. I’m not going to lie, I had some failed attempts with setting up the user and role before deciding to copy the exact script given in the Airbyte documentation.
Use their script and simply replace the variables at the top with whatever you wish to name your role, username, database, schema, and warehouse.
Lastly, make sure you select “Internal staging” for the loading method in the last step. This is the best way to load data into Snowflake.
Ingesting data from Mailchimp to Snowflake
Next, you’ll be brought to a screen that contains more information to create a data ingestion pipeline from Mailchimp to Snowflake.
Choose to “mirror source structure” for your namespace configuration. You also have an option to add a table prefix. I recommend adding one with the name of your source connector. Tables across various connectors can have the same names (like campaigns or orders), so this is a good practice to put into place. Notice how I use madison_mailchimp_ as my table prefix.
If you look in the table, you can also see a column called “Sync mode” which specifies how you want the data to be ingested each time the connector syncs. There are four different options:
- Full Refresh Overwrite: resyncs and replaces all data.
- Full Refresh Append: resyncs and creates duplicates of all rows.
- Incremental Append: syncs new rows and appends to already synced rows.
- Incremental Deduped History: syncs new rows and appends to already synced rows, also provides a de-duplicated view mirroring the state of the ingestion.
I personally prefer incremental append because I don’t have to worry about duplicated rows. It also doesn’t require as much storage in the data warehouse. Full refresh overwrite is only recommended when you have a small amount of data, otherwise this can take a long time to sync. It may also run through precious Snowflake credits if your warehouse is running for a while.
Then, select the tables you wish to upload to Snowflake and choose “basic normalization”. Airbyte automatically creates a schema that fits your data when you choose this option, making our job as analytics engineers much easier! Otherwise, we would then have to parse a JSON object in Snowflake and that’s no fun.
Airbyte also has an option to schedule your data syncs. You can choose every few hours, every 24 hours, or even manually. This all depends on your use case.
After clicking “sync now”, it can be helpful to read through the logs as the data is uploading to ensure everything is going as expected. This is also a great feature in case something goes wrong. Logs make it easy to pinpoint the exact error.
Now let’s head on over to Snowflake. Go to the database and schema that you choose to upload your connector's data to. You should see tables beginning with that table prefix that you specified.
Awesome! I can see all the Mailchimp tables that I ingested using the Airbyte data connector. Now it’s time to transform this data so it is ready to be analyzed.
Cleaning the ingested data
Let’s start by looking at the “AIRBYTE_DATABASE”.”AIRBYTE_SCHEMA”.”MADISON_MAILCHIMP_CAMPAIGNS” table. Here’s a small subset of what it looks like:
Casting string columns
The first thing that jumps out to me is the TYPE and STATUS columns. These are both of data type variant. This data type stores a string, but is not a string itself. I want these to be strings, but more importantly, I want to remove the double quotes.
These double quotes would make analysis quite messy, especially in Snowflake. Snowflake actually uses double quotes to select from the database, schema, and table names.
select * from "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."MADISON_MAILCHIMP_CAMPAIGNS"
Using this variant data type with double quotes would cause a lot of messy queries where you’d have to work around that. Next, let’s cast these columns so that they’re strings.
select
id,
CAST(type AS string),
CAST(status AS string),
send_time,
resendable,
created_time
from "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."MADISON_MAILCHIMP_CAMPAIGNS"
Well, look at that! When you cast these columns from variant to string the double quotes are eliminated. That solves that problem.
Casting date columns
Now let’s look at the two columns that we want to be timestamps in our final dataset- send_time and created_time. Right now they are both varchar data types. Casting them to timestamps will allow us to perform date functions like date_part(), dateadd(), and year() on them in future analyses.
select
id,
CAST(type AS string),
CAST(status AS string),
send_time::timestamp_ntz,
resendable,
created_time::timestamp_ntz
from "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."MADISON_MAILCHIMP_CAMPAIGNS"
Make sure you’re consistently casting the data in your base models to the same type of timestamps and dates. When I first started building my models, I would leave any timestamp columns that already had timestamp data types alone. No casting, nothing. They seemed ready to go!
Then, I started experiencing some weird issues when comparing timestamps. After a lot of digging, I realized my timestamp columns were all different types of timestamps. Now I make sure I cast them all to timestamp_ntz.
Renaming columns
Lastly, I recommend renaming the columns in tables like this one to something more descriptive. You’ll be using many of these Mailchimp tables together, so having simple names like id and status will get confusing fast.
I recommend renaming them according to the table name. Here I am looking at the CAMPAIGNS table, so I would rename id to campaign_id. This is a clear and descriptive name across all Mailchimp tables.
select
id AS campaign_id,
CAST(type AS string) AS campaign_type,
CAST(status AS string) AS campaign_status,
send_time::timestamp_ntz AS campaign_sent_at,
resendable AS is_campaign_resendable,
created_time::timestamp_ntz AS campaign_created_at
from "AIRBYTE_DATABASE"."AIRBYTE_SCHEMA"."MADISON_MAILCHIMP_CAMPAIGNS"
Here I renamed:
- type to campaign_type
- status to campaign_status
- send_time to campaign_sent_at
- resendable to is_campaign_resendable
- created_time to campaign_created_at
Now the data is ready to be automated and analyzed by creating a data model! dbt is a great tool to store these SQL files used to create your data model. It then automates them so that you always have clean data available to you in your data warehouse.
Conclusion
Overall, I’ve really enjoyed using Airbyte to ingest my raw data. It was a seamless process besides setting up the correct permissions in Snowflake. The setup is clearly documented, the UI is user-friendly, and my raw data landed at the destination in a format that was easy to transform into a usable model.
However, I’ve struggled more with other data connectors like Facebook and Google. Because of the way they distribute API keys, it can be confusing to obtain the correct permissions. If you’re familiar with software development using these outside companies, this step should be a breeze for you. Otherwise, there may be a bit of a learning curve.
Because Airbyte is run locally, I would be interested to see the difficulty level of deploying it using a cloud solution like AWS. This is where I struggled the most with Prefect. Luckily, I was able to receive lots of support from their Slack channel. I imagine this would be the same with Airbyte.
What should you do next?
Hope you enjoyed the reading. Here are the 3 ways we can help you in your data journey:
Ready to get started?
Frequently Asked Questions
Mailchimp's API provides access to a wide range of data related to email marketing campaigns. The following are the categories of data that can be accessed through Mailchimp's API:
1. Lists: Information about the email lists, including the number of subscribers, the date of creation, and the list name.
2. Campaigns: Data related to email campaigns, including the campaign name, the number of recipients, the open rate, click-through rate, and bounce rate.
3. Subscribers: Information about the subscribers, including their email address, name, location, and subscription status.
4. Reports: Detailed reports on the performance of email campaigns, including open rates, click-through rates, and bounce rates.
5. Templates: Access to email templates that can be used to create new campaigns.
6. Automation: Data related to automated email campaigns, including the number of subscribers, the date of creation, and the automation name.
7. Tags: Information about tags that can be used to categorize subscribers and campaigns.
Overall, Mailchimp's API provides a comprehensive set of data that can be used to analyze and optimize email marketing campaigns.
What should you do next?
Hope you enjoyed the reading. Here are the 3 ways we can help you in your data journey: