It’s now possible to use the Airbyte sources for Gong, Hubspot, Salesforce, Shopify, Stripe, Typeform and Zendesk Support directly within your LangChain-based application, implemented as document loaders.
For example, to load a user’s Stripe invoices, you can use the AirbyteStripeLoader. Installing it is simple: once you have LangChain installed locally, you only need to install the source you are interested in, and you are ready to go:
pip install airbyte-source-stripe
After that, simply import the loader and pass in configuration and the stream you want to load:
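A minimal sketch of what that looks like. The config keys shown here (`client_secret`, `account_id`, `start_date`) follow the Stripe source’s spec; substitute your own credentials, as running the sync requires live access to the Stripe API:

```python
from langchain.document_loaders.airbyte import AirbyteStripeLoader

# Configuration for the Stripe source — the keys mirror the connector's spec.
config = {
    "client_secret": "<your Stripe API key>",
    "account_id": "<your Stripe account id>",
    "start_date": "2023-01-01T00:00:00Z",
}

# Pick the stream you want to load — here, invoices.
loader = AirbyteStripeLoader(config=config, stream_name="invoices")

docs = loader.load()  # runs the sync and returns a list of Documents
```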
Why does this matter?
This is the beginning of making Airbyte’s 300+ sources available as document loaders in LangChain.
Airbyte can move data from just about any source to your warehouse or vector database to power your LLM use case (check out this tutorial for setting up such a data pipeline!). This is normally done by using Airbyte Cloud or a local Airbyte instance, setting up a connection, and running it on a schedule (or via API trigger) to make sure your data stays fresh.
But if you are just getting started and are running everything locally, using a full Airbyte instance (including the UI, scheduling service, scale-out capabilities, etc.) may be overkill.
With this release, it’s easier than ever to run any Python-based source in LangChain directly within your Python runtime - no need to spin up an Airbyte instance or make API calls to Airbyte Cloud.
Moving between hosted and embedded Airbyte
As it’s the same code running under the hood, every Airbyte-built loader is compatible with the respective source in the Airbyte service. This means it’s trivial to lift your embedded loading pipeline into your self-hosted Airbyte installation or your Airbyte Cloud instance. The shape of the configuration object and the records is 100% compatible.
Running syncs on hosted Airbyte means:
- UI to keep track of running pipelines
- Alerting on failing syncs
- Easily running pipelines on a schedule
Running syncs with LangChain loaders means:
- No overhead for running yet another service
- Full control over timing and pipeline execution
Mapping Airbyte records to LangChain documents
By default, each record gets mapped to a Document as part of the loader, with all the various fields in the record becoming the metadata of the record. The text portion of the document is left as an empty string. You can pass in a record handler to customize this behavior to build the text part of a record depending on the data:
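For instance, a handler along these lines — the invoice fields used to build the text (`id`, `amount_due`, `currency`) are illustrative, so adapt them to your stream’s schema, and the config values are placeholders for your own credentials:

```python
from langchain.docstore.document import Document
from langchain.document_loaders.airbyte import AirbyteStripeLoader

config = {
    "client_secret": "<your Stripe API key>",
    "account_id": "<your Stripe account id>",
    "start_date": "2023-01-01T00:00:00Z",
}

def handle_record(record, id):
    # Turn selected fields into the document text; keep the full record as metadata.
    data = record.data
    text = f"Invoice {data.get('id')}: {data.get('amount_due')} {data.get('currency')}"
    return Document(page_content=text, metadata=data)

loader = AirbyteStripeLoader(
    config=config,
    stream_name="invoices",
    record_handler=handle_record,
)
```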
Since your Python application is essentially acting as the Airbyte platform, you have full control over how the “sync” is executed. For example, you can still benefit from incremental syncs if your stream supports them by accessing the “last_state” property of the loader. This lets you load only the documents that changed since the last run, so you can update an existing vector database efficiently:
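A sketch of that flow, assuming the same placeholder Stripe config as before (how you persist the state between runs is up to you — here it just lives in a variable):

```python
from langchain.document_loaders.airbyte import AirbyteStripeLoader

config = {
    "client_secret": "<your Stripe API key>",
    "account_id": "<your Stripe account id>",
    "start_date": "2023-01-01T00:00:00Z",
}

loader = AirbyteStripeLoader(config=config, stream_name="invoices")
docs = loader.load()

# Persist this checkpoint somewhere durable (e.g. alongside your vector store)...
saved_state = loader.last_state

# ...and pass it back in later to fetch only records that changed since then.
incremental_loader = AirbyteStripeLoader(
    config=config,
    stream_name="invoices",
    state=saved_state,
)
new_docs = incremental_loader.load()
```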
For now, the following Airbyte sources are available as pip packages (with more to come):
- Gong: pip install airbyte-source-gong
- Hubspot: pip install airbyte-source-hubspot
- Salesforce: pip install airbyte-source-salesforce
- Shopify: pip install airbyte-source-shopify
- Stripe: pip install airbyte-source-stripe
- Typeform: pip install airbyte-source-typeform
- Zendesk Support: pip install airbyte-source-zendesk-support
However, if you have implemented your own custom Airbyte sources, it’s also possible to integrate them by using the AirbyteCDKLoader base class that works with the Source interface of the Airbyte CDK:
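Roughly like this — the connector package and its config keys here (`source_custom`, `SourceCustom`, `api_key`) are hypothetical stand-ins for your own CDK-based source:

```python
from langchain.document_loaders.airbyte import AirbyteCDKLoader

# Hypothetical: your own connector package built with the Airbyte CDK.
from source_custom.source import SourceCustom

loader = AirbyteCDKLoader(
    source_class=SourceCustom,       # the CDK Source class, not an instance
    config={"api_key": "<your api key>"},  # whatever your source's spec requires
    stream_name="items",
)
docs = loader.load()
```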
You can also install sources from the main Airbyte repository by installing directly via git - for example, to fetch the Github source, simply run:
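For example, something along these lines, installing the connector’s subdirectory straight from the monorepo (the exact path and branch may shift as the repository evolves, so check the repo layout first):

```shell
pip install "source_github@git+https://github.com/airbytehq/airbyte.git@master#subdirectory=airbyte-integrations/connectors/source-github"
```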
After that, the source is available to be plugged into the AirbyteCDKLoader:
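For instance, to load GitHub issues — the config keys shown (`repository`, `credentials.personal_access_token`) follow the GitHub source’s spec at the time of writing, so double-check against the connector’s documentation:

```python
from langchain.document_loaders.airbyte import AirbyteCDKLoader
from source_github.source import SourceGithub

issues_loader = AirbyteCDKLoader(
    source_class=SourceGithub,
    config={
        "repository": "<owner/repo>",
        "credentials": {"personal_access_token": "<your token>"},
    },
    stream_name="issues",
)
docs = issues_loader.load()
```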
Check out the connector development documentation to get started writing your own sources - they’re easy to build, and will let you move seamlessly from local embedded loaders to a hosted Airbyte instance, depending on your needs.
Any questions? We would love to hear from you
If you are interested in leveraging Airbyte to ship data to your LLM-based applications, please take a moment to fill out our survey so we can make sure to prioritize the most important features.
If you have questions, or are interested in other existing sources being exposed as loaders this way, don’t hesitate to reach out on our community Slack channel or in the Airbyte channel on the LangChain Discord server.