Introduction
LLMs tend to hallucinate when they lack sufficient data. Retrieval-augmented generation (RAG) addresses this by providing the LLM with additional context and more recent data. In this tutorial, I will demonstrate, step by step, how to get data from Microsoft SharePoint, build a pipeline from scratch using Airbyte Cloud, and load the data into Zilliz, a vector database. On top of this database, we will create a RAG agent that answers questions using the document content we loaded into Microsoft SharePoint, with LangChain (an official Zilliz partner) and an OpenAI LLM.
Prerequisites
- An Airbyte Cloud account.
- A Microsoft SharePoint tenant ID and a folder path containing a document of your choice.
- An OpenAI API key. Get one from the official OpenAI dashboard.
- A Zilliz vector database account. Create an account if you do not have one.

STEP 1: Set Up Microsoft SharePoint as an Airbyte Data Source
1.1 Load some data into SharePoint
Before setting up SharePoint as a source in Airbyte, load a document into a folder of your choice in your SharePoint account: in your SharePoint drive, create a folder and add a document to it. Then note the tenant ID of your app, the drive name, and the folder path. You will need these to configure the connection in Airbyte.
1.2 Set up Microsoft SharePoint as a source in Airbyte
Navigate to the Airbyte Cloud dashboard, click Sources, then select Microsoft SharePoint from the source options. You require the following:
- Tenant ID
- Client ID
- Client Secret
- UPN (User Principal Name)
For this tutorial, we are using a CSV test document, so we will create a stream with the CSV format. Feel free to set up a format of your choice.
Alternatively, you can use OAuth, which authenticates directly from your PC without having to set up the credentials.
Once the source is successfully connected, go ahead and set up the Milvus destination connector.
STEP 2: Set Up Milvus as an Airbyte Data Destination
2.1 Setting up Zilliz
To get started with Zilliz, create a new cluster and a collection within it. Then navigate to the API Keys page and copy the API key so you can paste it into Airbyte.
2.2 Setting up the Milvus destination connector in Airbyte
This process involves three steps:
- Processing - Splitting individual records into chunks so they fit the context window.
- Embedding - Converting text into a vector representation using a pre-trained model. In this tutorial we will use OpenAI's text-embedding-ada-002.
- Indexing - Storing the vector embeddings for similarity search.
You need an OpenAI API key, a Zilliz API key, your instance endpoint URL, and the name of the collection to load data into. Set the processing chunk size to 512.
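To make the processing step concrete, here is a minimal sketch of fixed-size chunking in Python. It is an illustration only, not the connector's implementation: the function name chunk_text and its overlap parameter are hypothetical, and whitespace-separated words stand in for tokens, which the real splitter may count differently.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words.

    Assumption: words approximate tokens. A real pipeline would count tokens
    with the tokenizer that matches the embedding model.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the record.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    record = "word " * 1200  # a toy record of 1200 words
    sizes = [len(chunk.split()) for chunk in chunk_text(record, chunk_size=512)]
    print(sizes)  # [512, 512, 176]
```

Each chunk is then embedded and stored separately, which is why a smaller chunk size produces more vectors per record.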
After testing the connection, navigate to Connections in the Airbyte dashboard and sync the connection from the Microsoft SharePoint source to the Zilliz vector database.
Make sure that the collection's primary key has auto_id enabled.
Set the metric type to Cosine and the dimension to 1536, which is the size of the vectors produced by text-embedding-ada-002.
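The Cosine metric compares the direction of two embedding vectors rather than their magnitude. The sketch below shows the computation on toy three-dimensional vectors. It is an illustration with a hypothetical helper name, cosine_similarity, not how Zilliz computes the metric internally, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        # Vectors of different sizes cannot be compared.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = [1.0, 0.0, 0.0]
    same_direction = [2.0, 0.0, 0.0]  # longer vector, same direction
    orthogonal = [0.0, 1.0, 0.0]
    print(cosine_similarity(query, same_direction))
    print(cosine_similarity(query, orthogonal))
```

The dimension check mirrors the reason the collection's dimension has to match the output size of the embedding model: vectors of different lengths cannot be compared.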
Once these settings are in place, run the sync from the Microsoft SharePoint source.
STEP 3: LangChain RAG Integration
To initialize a LangChain RAG implementation based on the indexed data, we use the following code.
from langchain_community.vectorstores import Milvus
from langchain_openai import OpenAIEmbeddings

# Use the same embedding model that Airbyte used when loading the data.
embeddings = OpenAIEmbeddings(openai_api_key="my-key")

# Connect to the existing Zilliz collection populated by the Airbyte sync.
vector_store = Milvus(
    embedding_function=embeddings,
    collection_name="my-collection",
    connection_args={"uri": "my-zilliz-endpoint", "token": "my-api-key"},
)

# Call vector_store.fields.append() for every metadata field you need.
vector_store.fields.append("text")

# Retrieve the chunks most similar to the question.
vector_store.similarity_search("test question")

Conclusion
This tutorial walked through loading data into Microsoft SharePoint and streaming it into the Zilliz vector database, which stores the indexed data for similarity search. We then used LangChain to integrate that data into a retrieval-augmented generation workflow.
About the Author