It turns out Stripe as a data source can be quite finicky! While we have used Segment to ingest different data like invoices, events, and transactions from Stripe, we have found it to be quite unreliable. There will be gaps in data from outages that occur yet don’t get fixed.
Another solution for ingesting data from Stripe, the Data Pipeline, is built by Stripe and comes with a hefty price tag just to access this single source. While I haven’t tried this out, the price tag alone is enough to deter me from finding a better solution. Luckily there is one!
Airbyte makes it easy to sync your Stripe data to your data warehouse for free, with just the cost of Airbyte Cloud or your own resources that you use to host the connector. The biggest hurdle in setting up your Stripe connections comes from understanding the different sync modes offered by Airbyte and knowing which is right for you. Lucky for you, this is what we will be walking through in today’s article!
What is Stripe?
Stripe is a platform that accepts payments. It is the main source of payment collections for many online businesses, making it such an important data source. Stripe often contains data about a company’s customer orders, invoicing, and taxes.
Airbyte Sync Modes
Airbyte offers four different sync modes that you can set up your connectors with. You even have the option to change the sync mode depending on the table, for the same connector. This allows for a lot of customizability between different tables.
Full Refresh- Overwrite
The full refresh method is one of the simplest methods offered by Airbyte. When a full refresh occurs, the connector is essentially starting from scratch and looking back at all of the available data for that specific Stripe table. In this case, it is also an overwrite mode. Instead of saving data that was previously synced, this method clears data in the destination table and reloads all data each time.
Full refreshes can be quite performance intensive depending on how much data you have available in your Stripe tables. Keep in mind that this mode does not consider history and how the data changes over time. It is just ingesting everything available at the source.
Because of this, it is not ideal for certain Stripe tables like Stripe Events that only sync for the last 30 days. If the sync is constantly clearing the destination table, you will be losing any data that is older than 30 days. So, for tables that only store a certain range of data, it is recommended that you use another sync mode.
Full Refresh- Append
This is another form of the simple full refresh method offered by Airbyte. However, this time, instead of deleting any data in the destination table, this mode appends data. Your previously ingested data will remain available and the data currently available at the source will be ingested into this table again.
Because all data at the source is appended to the destination table, a lot of duplicates will be created. It’s important to consider the nature of the table you are syncing and if this is ok. For many Stripe tables, this could eat up a lot of storage space and cause querying the table to be extremely slow.
However, this full refresh method does allow you to keep track of deletions or changes to a record, whereas the overwrite method does not. If you expect your records to be deleted or updated often, this method would be preferred. Still, I would only use full refresh methods as a last resort if the next two I’m going to mention aren’t available to you.
Incremental Sync- Append
Incremental syncs in Airbyte only sync new or modified records rather than records that have already been synced. No data is ever deleted using the append method because updated or new records are appended to the destination table.
The incremental sync append option requires you to specify something called a cursor, which is a timestamp column that represents when a source was last updated or created. updated_at columns are a good example of this.
Cursory columns allow Airbyte to sync the data from your Stripe sources incrementally, ensuring there are no duplicate values being ingested. For my Stripe tables, I use created as the cursory because this represents the time at which the object was created.
Incremental Sync- Deduped History
This is another type of incremental sync that deduplicates records in the final destination. Incremental sync deduped history is always the sync mode that I prefer using with my Stripe connections, as it is highly performant, eliminates a lot of the modeling on my end, and creates an additional table you can reference.
This specific sync only syncs new or modified data, but instead of appending the new row like with the append method, it updates the row that already exists. It does this by comparing primary keys and ensuring the primary key is unique at the destination. Deduped history does this by using the cursory field to sort the rows, only choosing the most recent one for each primary key.
Because of this, you do need to specify a primary key for Airbyte to use for each table being synced. Airbyte typically picks up on this automatically, but I always recommend double-checking. If it doesn’t, make sure you are validating your data and ensuring that the field you select is a true primary key. For my Stripe syncs, Airbyte chooses the primary key automatically but does not specify what field it is using.
While the end table itself doesn’t show the history of any specific primary key, the deduped history mode creates an intermediate history table (with a suffix of _scd for slowly changing dimension) that appends new records similar to the incremental append mode. Airbyte then models this table for you, looking at the started_at and ended_at columns in the _scd table, creating the deduped end table you will query.
I personally love that it creates two tables, one similar to the appended table that contains all records, and another that produces the output I want to query. Having this historical table with all records makes validation a breeze if I expect something to go wrong.
Important things to consider for Stripe sync modes
When choosing a sync mode to use with your Stripe tables, there are a few essential things to consider: data available to you from Stripe, modes available on each Stripe table, choosing a cursory field, and specifying the correct primary key.
Data available to you from Stripe
Keep in mind that not every source table will have all historical data available to you. Some sources only store data for the last 30, 60, or 90 days, making it imperative that you handle this in a way where you are maintaining these records at your destination. For example, Stripe events is a table that only syncs for the last 30 days. If you don’t ingest data prior to that 30-day window into your warehouse, you will lose it forever.
This also makes it important to maintain the data in the destination at all times, meaning you never want to overwrite it using a full refresh sync mode. That would cause you to lose all of the historical data prior to the 30-day window! Rather than using a full-refresh overwrite, it would be best to use an incremental sync mode since new records being created will simply be appended to the destination.
Read through Stripe’s documentation of the tables you are ingesting so you can plan accordingly.
Sync modes available on the Stripe table
While there are four different Airbyte sync modes, not all of them will always be available for a table. It’s important to look at which ones are available for a particular table and choose the best option.
For example, I’ve found that Stripe invoice line items only allows you to do full refresh syncs and not incremental syncs. Because of this, I’m forced to use full refresh append, creating duplicates and lots of extra records at the source. While this is not ideal, it is the only option that would make sense. It forces me to think through how I need to model the data in the staging layer in order to get rid of duplicates.
Choosing the appropriate cursor field
As mentioned earlier, when using an incremental sync mode, you need to choose a field that accurately determines when a row was generated within Stripe. Sometimes Airbyte will select this automatically, and other times you will need to specify this yourself.
Make sure you are always selecting the field that represents when the record was truly created in the system. It may involve reading through lots of documentation and understanding the nuances between fields to choose the right one. For most of the Stripe tables, Airbyte automatically selects the created field to be the cursory field.
Choosing the appropriate primary key
When using an incremental sync, you also need to choose the field that accurately denotes a unique row in Stripe. Similar to the cursory field, Airbyte often selects this automatically, but at other times you will need to select it yourself. Denoting this primary key helps Airbyte know which records to compare, choosing the most recent record to populate the output table. This helps to ensure that all records are deduped and accurately reflect the current picture at the source. For incremental syncs with Stripe, Airbyte chooses this primary key automatically.
Common Errors When Syncing Stripe Data
Large amounts of data
When I first set up Airbyte connectors for our Stripe data, they were failing 2 times before succeeding. This is because we had so much data in Stripe. The jobs were often timing out before they were able to finish ingesting all of the data available at the source. Be patient and if the connector fails a few times during the first few days it is set up, don’t fret. You may just have a large amount of data being moved! This takes time.
Running out of storage
I kept getting a “failure origin” error for some of my Stripe connectors. While I didn’t know what this meant at the time, I could see some data was loading into the destination table but not all. Again, these connectors were ingesting huge amounts of data. It turns out that I needed to scale up my resources to better handle this large volume of data. Keep in mind that this is only a problem if you are hosting Airbyte on your own resources, like AWS. With Airbyte Cloud, they would handle scaling these resources for you.
Airbyte is a great open-source solution for ingesting Stripe data. It is so important to have control over your own data, especially a source like Stripe which is so integral to the business. By using an open-source tool, you aren’t relying on a middleman or enterprise tool for high-quality data from the source.
With Airbyte you can also set up alerts to send to Slack channels or email, letting you know when a sync has failed. This makes it easy to monitor and alert the business to any potential data issues. Something as simple as alerts is often lacking with enterprise or 3rd-party tools. When you have full control over your data, you only have you as the bottleneck to fix any problems.