Airbyte OSS gets API and Terraform Access, Our Integrations with AI and Datadog | The Drip July Edition

Justin Chau
August 11, 2023

Hey everyone, welcome to the July edition of The Drip, where we take you downstream to cover highlights of our changelog, our community, and anything Airbyte-related.

Terraform Provider Hits OSS

At Airbyte, we've always been committed to enhancing the developer experience and ensuring our users have the best tools at their disposal. Back in 2021, we introduced the Configuration API, designed primarily for internal communications within Airbyte components. While it served its purpose, we recognized the need for a more user-centric approach. That's why we developed the Airbyte API, offering a more refined and intuitive view of the Airbyte platform.

Earlier, our Airbyte API and the much-anticipated Airbyte Terraform Provider were exclusive to Airbyte Cloud. But we've got exciting news! Both these tools are now accessible on Airbyte Open Source and Airbyte Enterprise, making them available to all our users. This move not only showcases our dedication to the open-source community but also ensures that our tools are more widely adopted and beneficial.

For those who've been using the Configuration API, the transition to the Airbyte API is smooth. The main distinction is that the Airbyte API offers a REST interface, replacing the RPC style of the Configuration API. And if you're wondering about the future of the Configuration API, we plan to sunset its official support by early 2024.
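To make the REST-versus-RPC distinction concrete, here is a minimal sketch that builds (but never sends) the same "list connections" request in both styles. The host, the placeholder workspace ID, and the exact paths and query parameter are assumptions for illustration; check the API reference for your deployment before relying on them.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed local Airbyte host, for illustration only
WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder, not a real workspace


def rpc_style_list_connections() -> Request:
    """RPC style: the verb lives in the path and every call is a POST with a JSON body."""
    body = json.dumps({"workspaceId": WORKSPACE_ID}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/v1/connections/list",  # assumed Configuration API path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rest_style_list_connections() -> Request:
    """REST style: the path names a resource and the HTTP method carries the verb."""
    return Request(
        f"{BASE_URL}/v1/connections?workspaceIds={WORKSPACE_ID}",  # assumed Airbyte API path
        headers={"Accept": "application/json"},
        method="GET",
    )
```

In the RPC form the action ("list") is part of the URL and the parameters travel in the request body. In the REST form the URL identifies the resource collection and the HTTP method says what to do with it.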

We're also making strides with our Terraform Provider, which empowers engineering teams to seamlessly integrate with Airbyte. It's all about streamlining configurations and promoting collaboration. With the "Infrastructure-as-Code" approach, teams can now efficiently manage Airbyte resource settings, ensuring consistency and scalability.

Want to dive deeper into these advancements? Read the full article here and discover how we're revolutionizing data integration at Airbyte!

Harnessing the Power of AI: A Glimpse into Our Recent Endeavors

In the ever-evolving landscape of technology, AI continues to be a game-changer, reshaping industries and redefining the way we approach challenges. At Airbyte, we've been at the forefront of this revolution, making strategic moves to harness the potential of AI and deliver innovative solutions to our users. Two of our recent articles shed light on our journey and the strides we've made in this domain.

1. Why AI Shouldn't Reinvent ETL

In this piece, we delve into the intricacies of ETL (Extract, Transform, Load) processes and the role AI plays in them. While AI has made significant inroads in various sectors, it's essential to understand its limitations, especially when it comes to ETL. The article emphasizes the importance of not letting AI overshadow the foundational principles of ETL. Instead, the focus should be on leveraging AI to enhance these processes, ensuring data integrity and optimizing performance. By striking a balance between traditional ETL methods and the capabilities of AI, we can achieve a synergy that drives efficiency and innovation.

2. Chat with Your Data using OpenAI, Pinecone, Airbyte, and Langchain

Taking our commitment to innovation a step further, this tutorial showcases a groundbreaking integration of OpenAI, Pinecone, Airbyte, and Langchain. The fusion of these technologies allows users to interact with their data in a conversational manner. Imagine querying your datasets through natural language, receiving insights as if you're chatting with an expert. This seamless interaction is made possible by the synergy of the aforementioned platforms, each bringing its unique strengths to the table. The tutorial provides a step-by-step guide on setting up this integration, empowering users to tap into the power of conversational AI for data analysis.

Our journey with AI is marked by a commitment to pushing boundaries and exploring new frontiers. By understanding the strengths and limitations of AI, we aim to create solutions that are not only innovative but also grounded in practicality. As we continue to chart this exciting path, we invite our community to join us, share their insights, and be a part of this transformative journey.

Advancements in Our Postgres Integration: Lessons from Handling Large Tables

At Airbyte, our commitment to providing robust and efficient data integration solutions has led us to continuously refine our Postgres Source connector. Over the past year, we've made significant strides in enhancing its capabilities, especially when dealing with very large Postgres tables. Here's a glimpse into our journey and the key lessons we've learned:

1. Reading Data in Its Natural Order

One of the primary challenges with large tables is the time it takes to read and sync the data. We discovered that reading data in the natural order it's stored on the disk, rather than trying to rearrange it, significantly boosts performance. By avoiding unnecessary sorting operations, we've managed to reduce sync times, especially for larger tables.

2. Understanding Data File Structure

Postgres stores table data in separate files, typically 1GB each. These files are divided into blocks, referred to as pages, and each page holds tuples, the entries that contain row data. By understanding this structure and leveraging the Current Tuple ID (CTID), a hidden system column that gives a row's physical location as a page number plus a tuple index within that page, we've optimized our queries to sync data more efficiently.
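As a rough illustration of that layout, the sketch below parses the text form of a CTID, written `(page,tuple)`, and works out which 1GB file a given page falls in. It assumes the default Postgres page size of 8192 bytes, which can differ on custom builds. The helper names are hypothetical and are not taken from the Airbyte connector.

```python
PAGE_SIZE = 8192           # assumed: default Postgres block size in bytes
SEGMENT_SIZE = 1024 ** 3   # 1GB data files, as described above
PAGES_PER_SEGMENT = SEGMENT_SIZE // PAGE_SIZE  # 131072 pages per file


def parse_ctid(ctid: str) -> tuple[int, int]:
    """Split a CTID such as '(131073,7)' into (page number, tuple index)."""
    page, tup = ctid.strip().strip("()").split(",")
    return int(page), int(tup)


def locate_page(page: int) -> tuple[int, int]:
    """Return (segment file index, byte offset of the page within that file)."""
    segment, page_in_segment = divmod(page, PAGES_PER_SEGMENT)
    return segment, page_in_segment * PAGE_SIZE
```

For example, `parse_ctid("(131073,7)")` yields page 131073 and tuple 7, and `locate_page(131073)` places that page in the second file (index 1), 8192 bytes from its start.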

3. Chunking the Data Read Process

Initially, we tried streaming the entire dataset in one go, but this approach proved to be unreliable for larger databases. Instead, we now break down the read process into smaller sub-queries or "chunks." This not only speeds up the sync process but also enhances reliability.
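One way to picture the chunking is as a series of CTID range queries, each covering a fixed number of pages. The sketch below only generates the SQL text; the function name, the chunk size, and the way the total page count is obtained are all assumptions for illustration, not the connector's actual implementation.

```python
from typing import Iterator


def ctid_chunk_queries(table: str, total_pages: int, pages_per_chunk: int) -> Iterator[str]:
    """Yield one SELECT per page range; together the ranges cover every page exactly once."""
    if pages_per_chunk <= 0:
        raise ValueError("pages_per_chunk must be positive")
    for start in range(0, total_pages, pages_per_chunk):
        end = min(start + pages_per_chunk, total_pages)
        # Tuple indexes start at 1, so '(p,0)' sorts before every row on page p.
        yield (
            f"SELECT ctid, * FROM {table} "
            f"WHERE ctid >= '({start},0)'::tid AND ctid < '({end},0)'::tid"
        )
```

Because each range is half-open, no row is read twice and none is skipped between chunks. The table name is interpolated directly for brevity; real code should quote identifiers properly.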

4. Implementing Checkpointing

Given the potential for errors during long sync operations, we introduced checkpointing. This approach allows us to save the state of a sync at regular intervals. In the case of an interruption, we can resume the sync from the last saved checkpoint, ensuring data integrity.
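The idea can be sketched with an in-memory stand-in for the source: the sync records which chunk to read next, and a restart picks up from that point. The state shape (`next_chunk`) and the function name are hypothetical, chosen only to show the pattern.

```python
from typing import Callable, Optional


def sync_with_checkpoints(
    chunks: list[list[str]],
    emit: Callable[[str], None],
    state: Optional[dict] = None,
) -> dict:
    """Emit records chunk by chunk, recording the next chunk index after each one."""
    state = state if state is not None else {"next_chunk": 0}
    for index in range(state["next_chunk"], len(chunks)):
        for record in chunks[index]:
            emit(record)
        # Checkpoint only after the whole chunk is emitted, so a crash mid-chunk
        # re-reads that chunk on resume instead of skipping part of it.
        state["next_chunk"] = index + 1
    return state
```

In this sketch, a failure partway through a chunk means that chunk is read again on resume, so some records may be emitted twice. Nothing is lost, but the receiving side has to tolerate duplicates.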

5. Transitioning to Incremental Syncs

After the initial data snapshot, it's crucial to shift to an incremental approach for subsequent syncs. This ensures that only changed data is synced, reducing the load and time taken. While the CTID approach works well for the initial sync, for incremental updates, we leverage other methods like CDC, xmin, or user-defined columns.
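A minimal sketch of the user-defined column variant follows; CDC and xmin work differently and are not shown. Rows are plain dictionaries standing in for a table, and the function and field names are hypothetical. A real connector would push the comparison into the SQL query rather than filter in memory.

```python
from typing import Any, Optional


def incremental_read(rows: list[dict], cursor_field: str, last_cursor: Optional[Any]):
    """Return rows whose cursor value is past last_cursor, plus the new cursor value."""
    changed = [r for r in rows if last_cursor is None or r[cursor_field] > last_cursor]
    changed.sort(key=lambda r: r[cursor_field])
    # If nothing changed, keep the old cursor so the next sync starts from the same point.
    new_cursor = changed[-1][cursor_field] if changed else last_cursor
    return changed, new_cursor
```

The first sync passes `last_cursor=None` and reads everything. Each later sync passes the cursor returned by the previous one, so only rows updated since then are read.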

6. Continuous Performance Measurement

Our journey has been marked by continuous testing and measurement. By profiling our code and monitoring throughput and reliability, we've been able to identify bottlenecks and make necessary improvements.

Our enhanced approach to the Postgres integration allows us to handle vast amounts of data without stressing the server. We've achieved linear performance, ensuring reliable data syncs irrespective of table size. As we apply these learnings to other databases like MySQL and MongoDB, we remain committed to delivering unparalleled data integration solutions.

Introducing Airbyte's Datadog Integration: Elevate Your Monitoring Game

We're always striving to enhance your data integration experience. We're thrilled to announce our latest integration with Datadog, a leading monitoring and analytics platform. This integration is designed to provide Airbyte Enterprise users with a seamless way to monitor and analyze their data pipelines directly within Datadog dashboards.

Key Features of the Airbyte-Datadog Integration:

  • Extensive Metrics Collection: With the new integration, you can access a plethora of airbyte.* metrics, offering detailed insights into your data synchronization processes.
  • Easy Setup: Setting up the integration is a breeze. It involves configuring the datadog.yaml file, adding the Datadog Agent, updating the Docker Compose configuration, setting the right environment variables, and re-deploying Airbyte alongside the Datadog Agent.
  • Exclusive to Airbyte Enterprise: This integration is available for our Enterprise users, offering premium support, enhanced security features, and much more.

By integrating with Datadog, we aim to provide you with a holistic view of your data pipelines, helping you identify bottlenecks, optimize performance, and ensure seamless data synchronization.

Ready to elevate your monitoring game? Dive into the details of our Datadog integration in Datadog's official documentation and in Airbyte's documentation. Harness the power of real-time metrics and enhance your data integration journey with Airbyte and Datadog!

Wrap Up

And that’s all we have for July’s edition of The Drip. Thanks for reading through. If you have any questions:

  • Please join our Slack community to talk to us on the Airbyte team as well as other fantastic folks in the community!
  • Also sign up for our Newsletter to keep up with the state of the art in Data Integration and the broader Data Engineering Ecosystem!
