Airbyte now supports extracting text from documents

November 7, 2023 • 5 min read

Our S3, Azure Blob Storage and newly introduced Google Drive sources now support extracting text from stored documents. The textual content of your documents is emitted as Markdown, which lets you use this data in search scenarios and when building LLM-powered applications.

Why extract raw text?

Airbyte has great support for extracting highly structured data from tools like Stripe, HubSpot or Zendesk. This data is especially valuable when centralized in a single place like a warehouse, because you can then correlate information from multiple sources to draw conclusions.

However, in most companies a large amount of information isn’t structured data that fits well-defined schemas. It lives in messy text documents without strictly applied structure: meeting notes, specifications, roadmaps, descriptions of planned features and similar.

For a long time, this kind of information couldn’t really be leveraged effectively in automated ways. However, with the recent development of powerful language models, this is changing, making unstructured text as valuable as structured records to data analysts.

Airbyte is a data migration tool which allows you to extract data from a large variety of sources - unstructured data fits this paradigm just as well as structured data.

What is Airbyte’s role in this?

Airbyte can treat your messy file share the same way it can treat a database or the REST API of some bespoke service. It’s able to extract all valuable data and send it to your warehouse for all kinds of downstream processing.

The new experimental “Document File Type Format” is available for S3, Azure Blob Storage and Google Drive. It allows users to extract data from PDFs, Word, PowerPoint and Google documents, just as they would from structured data stored in Avro or CSV files.

Instead of extracting tabular data with different columns as individual fields, the text content of the document is extracted and emitted as Markdown, which retains structural information like headings or lists from the original document. Using OCR technology, text can even be extracted from scanned documents. These connectors use the open-source Unstructured library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the Unstructured docs, and you can learn about other Unstructured tools and services at www.unstructured.io.
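To make "emitted as Markdown" a bit more concrete, here is a minimal sketch of how typed document elements could be rendered so that headings and lists survive. The `Element` class, its `kind` values and the `to_markdown` function are hypothetical names invented for illustration; this is not the connector's or the Unstructured library's actual code.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical parsed document element (not an Unstructured class)."""
    kind: str       # "title", "list_item" or "text"
    text: str
    depth: int = 1  # heading level, only used for titles


def to_markdown(elements: list[Element]) -> str:
    """Render elements as Markdown, keeping headings and list items."""
    lines = []
    for el in elements:
        if el.kind == "title":
            lines.append("#" * el.depth + " " + el.text)
        elif el.kind == "list_item":
            lines.append("- " + el.text)
        else:
            # Plain paragraphs are passed through unchanged.
            lines.append(el.text)
    return "\n".join(lines)


doc = [
    Element("title", "Roadmap Q1"),
    Element("text", "Planned work for the quarter."),
    Element("list_item", "Ship document parsing"),
    Element("list_item", "Add OCR for scanned files"),
]
print(to_markdown(doc))
```

The point of the sketch is only that structural cues (a heading, two list items) remain visible in the emitted text, which is what makes the output useful for downstream search and LLM use.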

From here, Airbyte treats the content of your files just like any other record emitted from a regular source - it can be loaded into any destination such as warehouses or purpose-built databases.

For example, by loading text into a search database like Pinecone, Weaviate or Elasticsearch, it’s possible to build powerful search experiences based on your data. Conversational language models like ChatGPT become much more powerful when given access to the data present in your organization.

Just like for other data sources, Airbyte is able to keep track of changes and run incremental syncs that only extract and load changed files. This keeps the centralized knowledge base that powers your search and chat applications in sync with the file shares used by your company.
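The idea behind an incremental sync can be sketched with a simple "last modified" cursor. This is a simplified illustration under stated assumptions: the function names `select_changed` and `next_cursor` and the file listing are hypothetical, and the actual connectors may keep richer state than a single timestamp.

```python
from datetime import datetime, timezone
from typing import Optional


def select_changed(files: dict[str, datetime], cursor: Optional[datetime]) -> list[str]:
    """Return files modified after the cursor (every file on the first sync)."""
    if cursor is None:
        return sorted(files)
    return sorted(name for name, modified in files.items() if modified > cursor)


def next_cursor(files: dict[str, datetime], cursor: Optional[datetime]) -> Optional[datetime]:
    """The new cursor is the latest modification time seen so far."""
    candidates = list(files.values()) + ([cursor] if cursor is not None else [])
    return max(candidates) if candidates else None


# Hypothetical listing of a file share: file name -> last modified time.
listing = {
    "notes.pdf": datetime(2023, 11, 1, tzinfo=timezone.utc),
    "spec.docx": datetime(2023, 11, 5, tzinfo=timezone.utc),
}

first_sync = select_changed(listing, None)   # first run: both files
cursor = next_cursor(listing, None)

# Someone edits notes.pdf after the first sync.
listing["notes.pdf"] = datetime(2023, 11, 6, tzinfo=timezone.utc)
second_sync = select_changed(listing, cursor)  # only the edited file

print(first_sync, second_sync)
```

Only the edited file is picked up by the second run, so unchanged documents are not re-extracted.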

Getting started

To get started with extracting text from documents, follow these steps:

1. Make sure your local Airbyte instance is up to date, or sign up for an account on cloud.airbyte.com.
2. Create a new S3, Azure Blob Storage or Google Drive source.
3. In the source configuration, create a new stream and select the “Document File Type Format (Experimental)” format.
4. Configure a destination to load your text data into (like your warehouse, Elasticsearch, Pinecone or Weaviate).
5. Run a sync.

For Google Drive, this is what it can look like:

Create a folder in your drive and place some documents in it.

Configure the Google Drive source by pasting in the folder URL, then add a stream called pdf_files that matches all files ending with .pdf.
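As a rough illustration of what "a stream matching all files ending with .pdf" selects, the sketch below applies a glob pattern to a folder listing. The `stream` dictionary and its key names are hypothetical and do not reflect the connector's real configuration schema, and Python's `fnmatchcase` is only a stand-in for whatever glob matching the source performs.

```python
from fnmatch import fnmatchcase

# Hypothetical stream definition; key names are illustrative only.
stream = {"name": "pdf_files", "glob": "*.pdf"}

# Hypothetical contents of the Google Drive folder.
folder_listing = ["roadmap.pdf", "meeting-notes.docx", "spec-v2.pdf", "budget.xlsx"]


def matching_files(paths: list[str], pattern: str) -> list[str]:
    """Return the paths that match the glob pattern (case-sensitive)."""
    return [p for p in paths if fnmatchcase(p, pattern)]


selected = matching_files(folder_listing, stream["glob"])
print(stream["name"], selected)
```

Here only the two PDF files would end up in the pdf_files stream; the Word and Excel files are ignored by this stream.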

Set up a connection to sync the data to your warehouse destination. Once the sync completes, each matching file becomes a row in the pdf_files table, with the document_key column holding the file name and the content column holding the extracted text.
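To show what working with the resulting table might look like, the sketch below uses an in-memory SQLite database as a stand-in for the warehouse. The table and column names come from the example above; the sample rows and the `search` helper are made up for illustration, and a real destination may add further columns of its own.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse destination.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pdf_files (document_key TEXT PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO pdf_files VALUES (?, ?)",
    [
        # Made-up sample rows: file name plus extracted Markdown text.
        ("roadmap.pdf", "# Roadmap\n- Ship document parsing"),
        ("notes.pdf", "# Meeting notes\nDiscussed OCR support"),
    ],
)


def search(connection: sqlite3.Connection, term: str) -> list[str]:
    """Return the document keys whose content contains the term."""
    rows = connection.execute(
        "SELECT document_key FROM pdf_files WHERE content LIKE ? ORDER BY document_key",
        (f"%{term}%",),
    )
    return [key for (key,) in rows]


print(search(conn, "OCR"))
```

A plain substring match like this is the simplest possible lookup; a search or vector database destination would replace it with proper full-text or semantic search.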