At the very start of Airbyte, we were wondering which part of the data integration challenge we should solve first. We always had the intuition that open source was the way to go for that particular problem space. But should we start with batch data ingestion (ELT), event-driven data integration or reverse ETL?
We knew that building a vibrant community of contributors was key to solving data integration. Just think of all the places where data can be (APIs, files, databases, queues...) or where data can go (data warehouses, data lakes, databases, files, APIs...). At that time, we were thinking of open sourcing an event-driven data integration platform like Segment. In the end, we weren’t convinced that, with such a platform, we would be able to build a community of contributors or offer a significantly differentiated value proposition.
In this article, we want to share the reasons why, instead, we started by open sourcing data ingestion (ELT), and why and how we’ll continue open sourcing and commoditizing all data integration, including event-based integration.
Why open sourcing ELT is so valuable
In July 2020, we decided to investigate the value behind open sourcing ELT. We reached out to all of Fivetran’s, StitchData’s, and Matillion’s customers we could find online. There were about 250 companies. We ultimately managed to talk to 45 of them over a period of 2 months.
During these calls, we realized that 100% of these companies, in addition to using these closed-source products, were also building and maintaining their own connectors in-house. They did so because either:
- The ELT solution didn’t support the connector they wanted.
- The solution supported it, but not in the way they needed.
- The solution was prohibitively expensive for their largest or most complex data sources.
Indeed, the hard part about data integration is not building the connectors, but maintaining them. This is why the existing tools plateau at around 150-200 connectors. Consequently, every single company is forced to build a data engineering team. These teams write in-house scripts to cover all the custom connector needs, and to handle the database replication that the ELT tools’ volume-based pricing makes too expensive to run through them.
That’s why we went with the open-source approach! Being open source enables us to address the long tail of integrations. Our Connector Development Kit (CDK) lets our users build new connectors in a matter of hours (instead of days!) and share them with the community, while keeping development standardized so that maintenance can be shared too.
In the end, Airbyte users can address all their data ingestion and database replication needs with one platform.
What commoditizing ELT actually means for Airbyte
Open sourcing alone is not enough to commoditize data integration. Commoditizing entails two things:
- The long tail of connectors is supported with a good SLA.
- Companies can leverage those connectors in a scalable way at a price that is a no-brainer for them.
This is why Airbyte Cloud uses an infrastructure pricing model based on compute time. In addition to being predictable and transparent, this model makes it affordable to replicate high-volume sources like databases. As a matter of fact, the volume replicated from a database can be hundreds of times higher than the volume from an API source.
This is also why we are creating Airbyte’s participative model to share revenues with connector contributors as long as they meet their connectors’ SLAs. This is how Airbyte will eventually provide thousands of reliable connectors.
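To make the difference between the two pricing models concrete, here is a rough back-of-the-envelope sketch. Every number in it is a made-up assumption chosen only for illustration: the prices, the row counts, and the throughput figures are not Airbyte’s or any vendor’s actual rates. It also assumes that a database can be read much faster per compute hour than a rate-limited API.

```python
# Illustrative only: all prices, volumes, and throughputs below are
# hypothetical assumptions, not real rates from Airbyte or any vendor.

PRICE_PER_MILLION_ROWS = 10.0   # hypothetical volume-based price (USD)
PRICE_PER_COMPUTE_HOUR = 2.0    # hypothetical compute-time price (USD)


def volume_based_cost(rows: int) -> float:
    """Bill grows in direct proportion to the number of rows moved."""
    return rows / 1_000_000 * PRICE_PER_MILLION_ROWS


def compute_based_cost(rows: int, rows_per_hour: int) -> float:
    """Bill grows with the compute time needed to move the rows."""
    hours = rows / rows_per_hour
    return hours * PRICE_PER_COMPUTE_HOUR


# Hypothetical sources: the database moves 200x more rows than the API,
# but is assumed to be read 50x faster per compute hour.
sources = {
    "api": {"rows": 2_000_000, "rows_per_hour": 1_000_000},
    "database": {"rows": 400_000_000, "rows_per_hour": 50_000_000},
}

costs = {
    name: {
        "volume_based": volume_based_cost(s["rows"]),
        "compute_based": compute_based_cost(s["rows"], s["rows_per_hour"]),
    }
    for name, s in sources.items()
}

for name, c in costs.items():
    print(f"{name}: volume-based ${c['volume_based']:.2f}, "
          f"compute-based ${c['compute_based']:.2f}")
```

Under these assumptions, the volume-based bill grows by the same factor as the row count (200x), while the compute-based bill grows only by the extra compute time (4x). The real ratio depends entirely on actual prices and throughput.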
But ELT has always been only a first step for Airbyte.
Why ELT was the necessary first step
When you think of Airbyte, you think of a batch-based data ingestion process. But the data integration problem space is much larger than that. It also includes the following:
- Streaming/real-time data and events ingestion (this includes use cases like Segment)
- Batch & Streaming data distribution (this includes the reverse-ETL use case)
The issue with the reverse-ETL and Segment use cases is that the users of these technologies have little overlap with the people who could contribute to them.
When you look at batch ETL / ELT, the users are data engineers (and data analysts), and the people who build and maintain the connectors are those same data engineers. This overlap enables Airbyte to grow a community of contributors: the data engineers using Airbyte not only adapt it to their own use case, but also contribute their improvements back.
However, the users of the other use cases generally work in product, marketing, customer success, or sales. They are far removed from the data engineering function that would be able to build those pipelines. You can see this with RudderStack (open-source Segment) and Grouparoo (open-source reverse ETL): neither has been able to grow a community of contributors.
In the end, Airbyte is the only open-source data integration project with a growing community of data engineer contributors.
Reverse-ETL and Event-Driven are next for Airbyte
Airbyte is already offering access to 100+ pre-built ELT connectors, in addition to the CDK that enables any data engineering team to build ELT connectors in hours.
Offering reverse-ETL connectors and a new CDK dedicated to that use case is a natural extension for Airbyte and is planned for 2022. The same data engineers who use Airbyte for ELT are the ones tasked with building custom reverse-ETL connectors for their company, which is an indication that Airbyte’s community will help cover the long tail of reverse-ETL connectors, too.
We will apply the same process for event-driven connectors, so that Airbyte can enable that use case as a third step.
Because companies will prefer to use a unified platform for their data movement, Airbyte’s objective is to become the standard for all data movement. We want to build the first platform that covers the long tail of connectors, provides easy extensibility to address all your custom use cases, and offers all this with transparent, predictable, and scalable pricing.
Becoming the standard for all physical data movement will allow Airbyte to have a great distribution channel for all other data movement products that sit on top of the data pipeline, including data quality and observability, privacy, compliance, data discovery, and more.
That’s our vision: with Airbyte, we want to commoditize all types of data integration, power all organizations’ data movements, and make these movements smart.