Bring Your Own Infra

Davin Chia
April 13, 2023
10 min read

Businesses face increasing regulatory compliance challenges in today’s data-driven world. Traditional SaaS solutions - which require relinquishing data control - often fall short. How is a business to tackle these challenges without being bogged down by complexity?

Enter: Airbyte’s Bring Your Own Infrastructure (BYOI)!

The Need for Data Control and Compliance

Let’s set the stage: as a business owner, what are data control and compliance, and why do they matter?

We’ll motivate this with a fictional example.

Imagine an e-commerce company called "GlobaMart". GlobaMart recently became the top-selling American online grocery chain, growing 500% YOY for the last five years. Riding on this wave of success, GlobaMart’s CEO decides to expand globally and begins selling products to customers around the world.

As GlobaMart expanded, they continued to collect and process customer data in the same way they did in the United States, without accounting for different data protection regulations in other countries. Why change something that’s working? The expansion was a success and growth continued skyrocketing!

One day, one of GlobaMart’s international customers filed a complaint with the local data protection authority, claiming that their personal data had been mishandled by the company. The company was investigated and found to be in violation of several data protection laws, resulting in a large fine and a damaged reputation.

GlobaMart's reputation suffered even further when news of the violations spread to other countries. Customers in those countries began to lose trust in the company and stopped doing business with it. The company's stock price also took a hit... you get the idea!

Although this example is exaggerated, the repercussions of not complying with data protection laws are very real. Legal journals are filled with expensive examples, such as the GDPR fines of £183 million and £99 million that the UK regulator initially proposed for British Airways and Marriott, respectively.

Data control - an organization’s ability to appropriately store, secure and use data per data locality laws - is crucial to compliance as businesses expand globally and deal with data in more and more localities with disparate data protection laws.

Traditional SaaS and Data Control

The ‘buy-and-forget’ benefit of SaaS has dramatically commoditized software and unlocked efficiencies across all industries. However, traditional SaaS has one big disadvantage in the data realm - you no longer control your data.

There is a simple reason for this - almost all SaaS vendors are architected such that user data enters the vendor’s infrastructure as work is done. That is, as a user, I make a call to the vendor’s system with the relevant data, and the vendor takes appropriate action based on the received data. While this convenience is the backbone of the ‘buy-and-forget’ model, it means users no longer control their data and now need to account for data protection and remediation considerations when evaluating SaaS vendors. For example: Where does a vendor process data? Does a vendor support processing data within specific localities? What happens if a vendor suffers a data breach? The list goes on.

The silver bullet: minimize data leaving your infrastructure. Unless a company is fully committed to maximal SaaS with no internal infrastructure, it is highly likely that data-protection-aware processes and architectures already exist. Instead of adding more complexity (and processes), design all these error cases away by simply minimizing the need for data to leave, and focus on keeping internal processes and architectures data-aware.

Naive Approaches to Data Control

“That sounds great. Now how do we actually minimize data leaving our infrastructure?!”

I’m glad you asked! There are several approaches of varying complexity.

Let’s go back to our GlobaMart example. Imagine you are GlobaMart’s Head of Data and get the following email from the CEO: “We are expanding to the EU and will need our data processes to adhere to GDPR policies.”

The first and most naive way of doing this is to have multiple identical deployments. We stand up an entirely parallel system in the EU and make the necessary changes so all systems and users are aware of the US - EU split. This diagram illustrates what this means with a simple data ingestion tool.

This is quick and dirty. We copy-and-paste architecture and infrastructure to ‘get things done’. The downsides are predictable: new education and access requirements, since every user must learn which system to use and be granted access to it; drift risk between the two systems; and operator risk from using the wrong system. Overall, GlobaMart now has increased operational and process complexity, medium error risk, and a system that will not last long term.
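
To make the drift risk concrete, here is a minimal sketch of how two copy-and-pasted deployments can silently diverge. Everything in it is hypothetical: the configuration keys, the values, and the `config_drift` helper are made up for illustration and do not correspond to any real GlobaMart or Airbyte setting.

```python
# Hypothetical helper: compare the configuration of two parallel deployments
# and report every setting that differs, except those expected to differ.
_MISSING = "<missing>"


def config_drift(us_config, eu_config, expected_to_differ=frozenset()):
    """Return {key: (us_value, eu_value)} for every unexpected difference."""
    drift = {}
    for key in sorted(set(us_config) | set(eu_config)):
        if key in expected_to_differ:
            continue  # e.g. the region itself is supposed to be different
        us_value = us_config.get(key, _MISSING)
        eu_value = eu_config.get(key, _MISSING)
        if us_value != eu_value:
            drift[key] = (us_value, eu_value)
    return drift


# Made-up configs for the two copies of the system.
us = {"region": "us", "connector_version": "1.4.0", "retention_days": 30}
eu = {"region": "eu", "connector_version": "1.3.2", "retention_days": 30}

print(config_drift(us, eu, expected_to_differ={"region"}))
```

Someone has to run a check like this, for every setting, forever. That ongoing cost is what the duplicated-deployment approach signs you up for.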

The second and more sophisticated approach is to “upgrade all infrastructure and processes to adhere to the new regulations” and do a lift-and-shift to the EU. For brevity, we skip the US and EU privacy law analysis and jump straight to a general data-aware strategy: move all data processing to the region with stricter data-privacy laws. This is the EU in this scenario. Compliance follows because all data, regardless of origin, is now processed under the stricter regime.

The only wrinkle in this plan: migrations are often one of the hardest projects. Further, this still isn’t a long-term solution. What happens if GlobaMart starts directly processing payments and needs to become PCI compliant? Will we upgrade all infrastructure and processes once more? Overall, we exerted a lot of effort and ended up right where we started. Surely there must be a better way?

Control Data Plane Split: The Key to Flexibility

A far more sophisticated solution is the Control Data Plane Split.

This tried-and-tested approach involves separating the architecture into two distinct components:

  1. Control Plane: The ‘Brain’. This component houses all the business logic complexity and serves as the central configuration location, allowing easy and efficient development and operation.
  2. Data Plane: The ‘Hands’. Simple workers performing atomic operations. These workers can be moved anywhere, enabling businesses to move data processing to the source and maintain compliance with regional regulations.

Let’s use Airbyte, a Data Integration Platform, to illustrate how this works in practice.

This diagram illustrates Airbyte Cloud’s architecture with the Control Data Plane split.

Some details to note:

  • The Control Plane - Airbyte’s Brain - is on the left. This contains all the complex business logic and state, such as scheduling, configuration, permissioning and so on.
  • The Data Planes - Airbyte’s Hands - are on the right. We see two planes. The first is in Paris, while the second is a temporary plane for development.
  • The control and data planes communicate asynchronously via data-plane-specific queues. The control plane’s Routing Service places jobs in the relevant queue. The data planes constantly poll their specific queues for work, execute jobs as soon as they are aware of them, and update the control plane afterwards.

Thus, Airbyte’s Control Data Plane split is a pull-based model with queues. There are many flavors of splits, and analyzing tradeoffs is outside the scope of this blog post.
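
To make the pull-based model concrete, here is a minimal in-memory sketch. It is not Airbyte's implementation: the class names (`ControlPlane`, `DataPlane`, `Job`), the region labels, and the status strings are all hypothetical, and Python's standard `queue.Queue` stands in for whatever queueing system a real deployment would use.

```python
import queue
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Job:
    job_id: str
    region: str  # which data plane should execute this job


@dataclass
class ControlPlane:
    """The 'Brain': holds configuration and state, and routes jobs to queues."""
    queues: Dict[str, queue.Queue] = field(default_factory=dict)
    status: Dict[str, str] = field(default_factory=dict)

    def register_data_plane(self, region: str) -> None:
        # One dedicated queue per data plane.
        self.queues[region] = queue.Queue()

    def route(self, job: Job) -> None:
        # Stand-in for the Routing Service: place the job in the relevant queue.
        self.queues[job.region].put(job)
        self.status[job.job_id] = "queued"

    def report(self, job_id: str, state: str) -> None:
        # Data planes call this to update the control plane after a job runs.
        self.status[job_id] = state


class DataPlane:
    """The 'Hands': pulls jobs from its own queue and runs them locally."""

    def __init__(self, region: str, control_plane: ControlPlane) -> None:
        self.region = region
        self.control_plane = control_plane

    def poll_once(self) -> Optional[Job]:
        try:
            job = self.control_plane.queues[self.region].get_nowait()
        except queue.Empty:
            return None  # no work right now; a real worker would poll again
        # The actual data movement would happen here, inside this region.
        self.control_plane.report(job.job_id, "succeeded")
        return job


control = ControlPlane()
control.register_data_plane("eu-paris")
paris = DataPlane("eu-paris", control)

control.route(Job(job_id="sync-1", region="eu-paris"))
paris.poll_once()
print(control.status["sync-1"])  # succeeded
```

Note the direction of every call: the data plane pulls work and pushes status, while the control plane never reaches into the data plane. Only job metadata crosses the boundary in this sketch; the data being synced stays wherever the data plane runs.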

It is immediately obvious that an architecture like this easily solves GlobaMart’s issue - stand up a data plane in the EU region and configure jobs to be scheduled in the new region. Immediate business value with little to no complexity!

Careful readers will notice one interesting detail in the above diagram - the Control and Data planes are in different Clouds! Indeed, this deployment flexibility is another benefit of this specific flavor of Control Data plane split and is due to the Airbyte Data Plane’s minimal infrastructure requirements.

An Airbyte Data Plane only has two infrastructure requirements:

  1. The ability to run Docker containers.
    Docker is a commonly accepted infrastructure layer. All public Cloud providers have numerous Docker offerings with various tuning levers. Widely understood commodity technology.
  2. The ability to make outbound network connections.
    No firewall rules changes. Security departments can rest easy. 
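
As a rough illustration, the two requirements could be checked with a small preflight script like the one below. This is a sketch under assumptions, not an Airbyte tool: the function names are made up, finding a `docker` executable on the PATH is only a proxy for being able to run containers, and the host and port you probe (for example, wherever your control plane is reachable) are placeholders you would supply.

```python
import shutil
import socket


def docker_cli_found() -> bool:
    # Proxy check only: a `docker` executable on the PATH suggests, but does
    # not prove, that this machine can run containers.
    return shutil.which("docker") is not None


def can_connect_outbound(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    # Try to open an outbound TCP connection. No inbound port is opened.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def preflight(host: str, port: int = 443) -> dict:
    """Check both (assumed) data plane requirements for a given endpoint."""
    return {
        "docker_cli_found": docker_cli_found(),
        "outbound_ok": can_connect_outbound(host, port),
    }
```

Calling `preflight("control-plane.example.com")` (a placeholder hostname) returns a small dictionary of booleans. Nothing in the check listens for inbound traffic, which is the property that lets firewall rules stay as they are.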

This minimal set of requirements explains how the Control and Data planes can exist in different Clouds as the above diagram shows. This is how Airbyte Cloud runs today. We are available in GCP and AWS, and going to Azure - simply spinning up another data plane - is a question of when and not how.

By separating the “Brain” from the “Hands”, the Control Data plane split minimizes operational complexity while ensuring businesses can scale their data operations to meet changing compliance requirements.

Airbyte's BYOI Solution

So, what does this mean for Airbyte’s users?

Cloud Users interested in using Airbyte Cloud while maintaining Data Control are now able to deploy an Airbyte Data Plane into their infrastructure. Airbyte will work with you to do the hard work of operating the Data Plane in your infrastructure.

OSS Users who want to continue using Airbyte OSS across different regions without the hassle of maintaining multiple Airbyte instances are now able to deploy various Airbyte Data Planes within their own infrastructure. This is a premium OSS offering and Airbyte will work with you to help set up the initial data plane and provide continuous operational advice and support as you scale your Airbyte usage.

Both of these are currently in Alpha, so please reach out here if you are interested! Please reach out on Slack for any questions or comments.

Thank you for using Airbyte and being patient as we work to make Airbyte better!
