We're happy to announce today that PyAirbyte is in public Beta! You can check out the livestream we did on YouTube here.
Until now, the value of Airbyte has been anchored in three pillars: Connectors, a robust platform orchestrating them, and intuitive interfaces. These elements are integrated into all of our product offerings, such as Airbyte Self-Managed, Airbyte Cloud, and Powered by Airbyte.
Today, we bring our users a different option: PyAirbyte, an open-source Python library that ushers in the era of "Pipelines-as-Code". It packages our connectors and makes them available as code, while removing hard dependencies on Docker, Kubernetes, or an Airbyte Cloud account.
This latest addition to the Airbyte ecosystem represents a strategic evolution from our core offerings. Alongside our API and Terraform provider, it caters to engineers who prefer a code-based approach to designing and managing data pipelines.
PyAirbyte is designed to bridge the gap between the flexibility of custom Python scripts and the power of a data integration platform. With PyAirbyte, Python users gain access to the majority of Airbyte’s data connectors, rather than having to build and maintain those themselves.
This initiative comes from recognizing that many engineers, particularly in data and AI, lean towards Python for project initiation due to its versatility and extensive library ecosystem. PyAirbyte caters to this preference, offering an easy-to-integrate Python library that simplifies data pipeline management.
PyAirbyte Architecture Overview

Source: A source object wraps a Python connector and includes a configuration object. The configuration object is a dictionary that contains the connector's configuration, such as authentication or connection details. The source object is used to read data from the connector.

Cache: Data can be read directly from the source object, but it is recommended to use a cache object to store the data. The cache object allows temporary storage of records from the source in a SQL database, such as a local DuckDB file or a Postgres or Snowflake instance.

Result: An object holding the records from a read operation on a source. It allows quick access to the records of each synced stream via the cache object used. Data can be accessed as a list of records, a Pandas DataFrame, or via SQLAlchemy queries.

PyAirbyte Features (Beta)

Installation via PyPI

PyAirbyte is installed using the Python package manager pip, a standard tool for installing Python packages. This makes PyAirbyte easily accessible to anyone with a setup that supports Python >=3.9.
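For example, in a Python >=3.9 environment (the package name matches the import used in the examples below):

# Install PyAirbyte from PyPI
pip install airbyte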
Easy source connector configuration

Users can easily get and configure the available source connectors. The list of source connectors can be obtained programmatically using the list_connectors() method. It’s also possible to install custom source connectors created by the user.
The get_source() method allows users to create and configure source connectors. In this example, a specific version of the Shopify connector is configured, demonstrating how users can specify authentication and other details.
The check() method is a useful tool for validating these configurations.
import airbyte as ab
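# Note: "userdata" below assumes a secrets helper such as Google Colab's
# google.colab.userdata; outside such an environment, read credentials from
# os.environ or another secret store instead.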
# Create and configure the source connector:
source = ab.get_source(
    "source-shopify",
    install_if_missing=True,
    version="2.0.0",
    config={
        "shop": userdata.get("SHOP"),
        "credentials": {
            "auth_method": "api_password",
            "api_password": userdata.get("SHOP_API_PASSWORD")
        }
    }
)
# Verify the config and credentials:
source.check()
Flexible data stream management

The library's stream selection capabilities give users control over their data. By enabling the selection of specific data streams, PyAirbyte conserves computing resources and streamlines data processing.
The select_streams() method allows users to select the specific streams they want to work with.
# Select the streams to load to cache:
source.select_streams(["products", "product_variants", "collections", "customers"])
Other useful methods for stream management are get_available_streams() and select_all_streams().
# List all available streams for the source:
source.get_available_streams()
# Select all streams for reading:
source.select_all_streams()
Versatile caching options

With support for multiple caching backends like DuckDB, MotherDuck, Postgres, Snowflake and BigQuery, PyAirbyte offers flexibility and allows users to align their caching strategy with their infrastructure and performance needs.
Default DuckDB Cache

If users don’t define a specific cache, DuckDB is used as the default. DuckDB is an in-process SQL OLAP database, which means it runs within the same process as PyAirbyte and does not require a separate server. This simplifies the setup and is efficient for analytical queries.
# Read into DuckDB local default cache
cache = ab.get_default_cache()
result = source.read(cache=cache)
Custom Caches

In addition to DuckDB and MotherDuck, PyAirbyte supports Postgres, Snowflake and BigQuery for caching. Here’s an example using Postgres:
from airbyte.caches import PostgresCache
# Define a Postgres Cache and pass the necessary configuration
pg_cache = PostgresCache(
    host="localhost",
    port=5432,
    username="postgres",
    password="postgres",
    database="pyairbyte_demo"
)
# Select all of the source's streams and read data into the previously defined Postgres cache:
source.select_all_streams()
read_result: ab.ReadResult = source.read(cache=pg_cache)
Incremental data reading

PyAirbyte is able to read data incrementally. This feature is key for efficiently handling large datasets and reducing the load on data sources. By tracking changes in the data source, PyAirbyte can process only the new or updated data, which is more efficient than reprocessing the entire dataset.
Incremental reading is done by default for sources that support it. If users don’t want incremental mode, they can use the force_full_refresh boolean flag in cache operations such as read().
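For example, a full refresh can be forced when reading (a minimal sketch, reusing the source and cache configured above):

# Re-read everything from the source instead of only new or updated records:
result = source.read(cache=cache, force_full_refresh=True)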
Direct Cache Interaction

Users can interact with cached objects without re-creating the source each time. The ability to read directly from the PyAirbyte Cache after data has been written is key for continuity of “Pipelines-as-Code”. This is possible because the Cache internally tracks metadata on what has been synced before.
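For instance, a later script or session can re-open the local default cache and work with previously synced data, without re-configuring or re-reading the source (a small sketch; the "products" stream is carried over from the example above, and get_pandas_dataframe() is described in the next section):

# Re-open the default DuckDB cache written by an earlier run:
import airbyte as ab

cache = ab.get_default_cache()
products = cache.get_pandas_dataframe(stream_name="products")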
Interoperability with SQL, Python libraries and AI frameworks

Reading from a source returns a ReadResult object, which contains one or more Dataset objects, one per stream.
Dataset objects can be iterated over to yield records. Dataset objects also have these members:
get_pandas_dataframe() - Returns a Pandas DataFrame object.

# Read from the cache into a Pandas DataFrame:
products = cache.get_pandas_dataframe(stream_name="products")
get_sql_table() - Returns a SQLAlchemy Table object, or raises an exception if there is no cache.

# Get the table object for the "collections" stream:
collections_table = cache.get_sql_table("collections")
The Cache class also has additional methods for common use cases:
get_sql_engine() - Returns a SQLAlchemy Engine object, which can be used to perform advanced SQL operations against the SQL cache tables.

Other methods for interoperability include to_documents(), which can be used to create documents compatible with popular AI frameworks like LlamaIndex and LangChain to facilitate building RAG pipelines.
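Here is a small sketch of SQL interoperability using the engine returned by get_sql_engine() (the "products" table name is assumed for illustration and may differ depending on the cache's schema and table prefix):

from sqlalchemy import text

engine = cache.get_sql_engine()
with engine.connect() as conn:
    # Run an ad-hoc query against the cached "products" table:
    row_count = conn.execute(text("SELECT COUNT(*) FROM products")).scalar_one()
    print(f"Cached product records: {row_count}")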
PyAirbyte's compatibility with various Python libraries, like Pandas and SQL-based tools, opens up a wide range of possibilities for data transformation, analysis, and integration into existing Python-based data workflows and AI frameworks.
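For example, once a stream has been loaded as a DataFrame, ordinary Pandas operations apply (a brief illustration; the "status" column is assumed for the Shopify products stream):

# Count products per status using standard Pandas:
products = cache.get_pandas_dataframe(stream_name="products")
print(products.groupby("status").size())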
For a more complete look at PyAirbyte’s features and usage, take a look at the documentation.
Upcoming PyAirbyte features

These are some of the integration features that we are currently working on and that will be added to the library in the near future.
Integration with Airbyte hosted offerings

For teams looking to scale their operations, PyAirbyte will provide a clear path to migrate to Airbyte's hosted offerings. This transition is designed to preserve data integrity and operational consistency.
Here’s an example of how the integration between PyAirbyte and Airbyte Cloud might look:
Before Migration: A github_source is configured to sync data into a SnowflakeCache, from which the "repos" dataset is accessed and iterated over to print repository names.

After Migration: A get_cloud_connection() function fetches the latest sync results from Airbyte Cloud, and a new sync can be initiated directly from Python if needed. Post-migration, users can still access the "repos" dataset just as before, ensuring a consistent workflow.

Integration with data orchestrators

In addition, we're enhancing the library with integrations for data orchestrators such as Airflow, Dagster and Prefect, enabling more efficient and controlled data workflows.
“Dagster users have been integrating Airbyte as a leading data ingestion service since the inception of the Dagster project. Airbyte has been a solution of choice for building reliable ETL/ELT pipelines at speed. We are very excited at the options that will open up with today’s release of PyAirbyte. Dagster has always believed in a philosophy of code-first, and composable systems that embrace software engineering best practices. Dagster’s Embedded ELT capability reflects this approach. Data Engineers realize the importance of careful version control and testing in building data platforms at scale. With the addition of PyAirbyte, users will be able to build deeper integration with this key data movement solution, working in their lingua franca, Python, and we look forward to demonstrating how Dagster and PyAirbyte can bring new capabilities to all Dagster users, be it in open-source or on Dagster Cloud.” - Pete Hunt, CEO at Dagster
Integration with LangChain

We’ve teamed up with LangChain to offer a new document loading LangChain package built on top of PyAirbyte. The integration will equip AI developers with a large catalog of connectors for ingesting data into their LLM-driven applications. This new package will eventually replace the existing connector-specific LangChain/Airbyte integrations with a single unified interface to interact with Airbyte.
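As a rough sketch of what loading Airbyte data into LangChain might look like with the langchain-airbyte package (the source-faker connector, the "users" stream, and the config values are illustrative):

from langchain_airbyte import AirbyteLoader

# Load records from an Airbyte source as LangChain documents:
loader = AirbyteLoader(
    source="source-faker",
    stream="users",
    config={"count": 100},
)
docs = loader.load()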
"We're ecstatic about PyAirbyte enabling LangChain users to load data from hundreds of sources with no infrastructure setup. Our existing Airbyte connectors are very popular already, and the new langchain-airbyte integration package makes Airbyte's loading pipelines even more accessible." - Erick Friis, Founding Engineer at LangChain

PyAirbyte Benefits

Leveraging the ubiquity of Python

Python's widespread adoption in the data and AI community means that many engineers are already familiar with its paradigms and libraries. PyAirbyte builds on this familiarity, making it easier for professionals to integrate it into their work without the steep learning curve often associated with adopting new technologies. This ubiquity also ensures a vast community for support and collaboration.
Decreasing time to value by enabling fast prototyping

PyAirbyte accelerates the journey from development to actionable insights. By removing the dependency on hosted services or cloud subscriptions, data teams can reduce the time it takes to go from setting up data pipelines to deriving value from them. This is particularly valuable in fast-paced environments where speed is critical to success.
Data engineers can also quickly set up and test data integrations, enabling rapid iteration and development. This means they can see their data in action sooner and make adjustments on the fly, without getting bogged down in the initial setup.
Reducing the need for custom ETL development

One of the significant advantages of PyAirbyte is that it helps eliminate the time-consuming and costly task of building custom ETL pipelines from scratch, a process that is not only prone to errors but also sometimes difficult to scale. Pre-built connectors allow teams to redirect their efforts from ETL coding to more strategic and valuable tasks.
Enabling best practices like version control and CI/CD

With PyAirbyte, data pipelines can be version controlled just like any other code, allowing for better collaboration among team members and historical tracking of changes. This integration into version control systems like Git means that changes to data processes can be documented, reviewed, and reverted if necessary, leading to greater stability and reliability in data operations.
By allowing data engineers to define and manage data pipelines as code, PyAirbyte also integrates into CI/CD practices, enabling automated testing and deployment.
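For example, a pipeline kept in version control can be smoke-tested in CI with an ordinary test (a hypothetical sketch; the connector, secrets, and test framework are illustrative):

# test_pipeline.py - run with pytest in CI; credentials come from environment variables.
import os
import airbyte as ab

def test_source_config_is_valid():
    source = ab.get_source(
        "source-shopify",
        install_if_missing=True,
        config={
            "shop": os.environ["SHOP"],
            "credentials": {
                "auth_method": "api_password",
                "api_password": os.environ["SHOP_API_PASSWORD"],
            },
        },
    )
    # check() raises if the configuration or credentials are invalid.
    source.check()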
Facilitating AI use cases

By leveraging PyAirbyte, developers can connect to a variety of data sources, ensuring that the data feeding into their AI models is not only rich and diverse but also easily accessible. From the outset, our goal has been to make the process of developing AI applications easier; that's why we're excited about adding more integrations, particularly with language model frameworks like LangChain.
Integrating with the Airbyte ecosystem

PyAirbyte is designed to be compatible with the larger Airbyte ecosystem, ensuring that teams can seamlessly move between local and cloud environments as their needs change. This compatibility provides the flexibility to choose the right tools for the job without worrying about interoperability.
Final thoughts

Ready to try PyAirbyte? Begin your journey by watching our getting started with PyAirbyte video. Or, if you're eager to jump into action, explore one of our Quickstart notebooks.
Encountered a bug or have a feature request?

If you come across any bugs, have ideas for new features, or are interested in specific sources and caches, please let us know. Submit your issues and suggestions here. Your contribution is key to making PyAirbyte better for everyone!