PyAirbyte brings the power of Airbyte to every Python developer
How does PyAirbyte work?
Installation via PyPI
- PyAirbyte is installed using pip, making it accessible to anyone with a Python >=3.9 environment.
Data ingestion with one Python statement
- PyAirbyte offers straightforward source connector configuration, with flexible data stream management and versatile caching options.
- Extract data from hundreds of sources and load it to a variety of SQL caches, like DuckDB, Postgres, Snowflake and BigQuery.
Interoperability with SQL, Python libraries and AI frameworks
- PyAirbyte cached data is compatible with various Python libraries, like Pandas and SQL-based tools, as well as popular AI frameworks like LangChain and LlamaIndex, to facilitate building LLM-powered applications.
Compatibility with Airbyte Cloud and Open Source jobs
- PyAirbyte lets you run existing jobs in Airbyte Cloud & OSS, providing convenient access to synchronized datasets.
- Deploy your PyAirbyte connections as Airbyte Cloud or OSS jobs for seamless integration.
Say goodbye to custom ETL scripting
Leverage the Ubiquity of Python
Decrease Time to Value by Enabling Fast Prototyping
Reduce the Need for Custom ETL Development
Facilitate AI Use Cases
Enable Data Engineering Best Practices
Build to scale with your business
Frequently asked questions
Does PyAirbyte replace Airbyte?
No, PyAirbyte complements Airbyte by offering additional capabilities for Python environments. You can start prototyping with PyAirbyte and then transition to another Airbyte offering as your needs evolve or scale.
What is the PyAirbyte cache? Is it an Airbyte destination?
Yes, you can think of the PyAirbyte cache as a built-in destination implementation. We avoid the term “destination” to avoid confusion with our certified destinations.
Can I develop traditional ETL pipelines with PyAirbyte?
Yes, PyAirbyte supports traditional ETL pipeline development. Simply select a cache type that matches your data destination.
Can PyAirbyte import a source connector from a local directory that has python project files?
Yes, PyAirbyte can use any locally installed connector that exposes a CLI, and it will automatically find connectors by name if they are on PATH.
Can I move millions of rows or TB of data using PyAirbyte?
PyAirbyte is designed to handle large data volumes efficiently: records are written to disk and compressed before loading, and the native database provider implementations keep processing fast and memory-efficient.
What are some potential use cases of PyAirbyte?
PyAirbyte is ideal for data experimentation and discovery outside traditional data-warehousing, and for testing data flows before production deployment.
How does PyAirbyte handle non-breaking schema changes?
We check for schema compatibility and plan to soon add support for handling additional columns added upstream.
Are you planning to add support for more cache types or allow custom cache implementations?
We're open to contributions! And if there's significant user demand, we may add the feature ourselves.
Can I use PyAirbyte to develop or test when developing Airbyte sources?
Absolutely. PyAirbyte is a useful tool for development and testing of Python-based sources.
Can I run my existing Airbyte Cloud or Open Source jobs from PyAirbyte?
Yes, PyAirbyte provides full interoperability with Airbyte Cloud and OSS. You can trigger existing hosted jobs and access the resulting synced datasets. You can also deploy new jobs to Airbyte Cloud and OSS via PyAirbyte. Refer to the documentation for usage details.
Is PyAirbyte compatible with data orchestration frameworks like Airflow, Dagster, and Snowpark?
Yes, PyAirbyte is designed to work with various data orchestration frameworks.
Where does PyAirbyte store the state for incremental processing?
PyAirbyte stores the state in the _airbyte_state table, alongside the data, in databases like DuckDB, Postgres, Snowflake, BigQuery, or MotherDuck.
Is it possible to change the normalization step of a destination with PyAirbyte?
While direct modifications to property names aren't available, you can use the get_records() method to retrieve data as a Python dictionary and store it as desired.