Building a RAG Architecture with Generative AI
Artificial intelligence has undergone a remarkable transformation, largely due to large language models (LLMs). These models have opened up new possibilities in natural language processing (NLP), powering applications like automated chatbots and virtual assistants.
Despite their impressive text generation capabilities, LLMs still have limitations in tasks requiring external knowledge and factual information. Since retraining these large language models from scratch costs millions of dollars and takes months, you need better ways to give your existing LLMs access to custom data.
To facilitate this, a new approach called retrieval-augmented generation (RAG) has emerged. According to research, the global market for RAG was valued at approximately $1,042.7 million in 2023 and is projected to grow at a compound annual growth rate of 44.7% through 2030.
In this article, you'll discover how implementing RAG architecture can enhance the accuracy and relevance of GenAI applications.
What Is Retrieval-Augmented Generation?
Retrieval-augmented generation is a robust technique that enriches the output of large language models by combining them with external knowledge bases. Without RAG, the LLM takes the user input and generates a response based only on its pre-training data, which can lead to outdated or incomplete information.
With RAG, an information retrieval component is introduced that utilizes the user input to first extract relevant data from additional sources. Now, both the user query and the retrieved information are passed to the LLM. The model uses the provided context and generates an accurate response to the query.
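At its core, the flow is: retrieve relevant context, combine it with the user query, and generate. The sketch below illustrates this loop in plain Python; the retrieve and generate functions are stand-ins for a real vector search and LLM call.

```python
def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Stand-in for a semantic search against an external knowledge base.
    knowledge_base = [
        "RAG combines a retrieval step with text generation.",
        "Retrieved context grounds the model's answer in factual data.",
        "Vector databases store document embeddings for similarity search.",
    ]
    return knowledge_base[:top_k]

def generate(prompt: str) -> str:
    # Stand-in for an LLM call (for example, a chat completion request).
    return f"Answer based on: {prompt[:60]}..."

def rag_answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

print(rag_answer("What is retrieval-augmented generation?"))
```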
What Are the Benefits of RAG?
Deploying language models with RAG architecture offers several practical advantages. Here are a few of them:
Improved Accuracy
RAG improves the accuracy of responses by grounding generated outputs in factual data retrieved from external knowledge sources. This approach reduces the risk of generating outdated or incorrect outputs—commonly seen with traditional generative models, which solely depend on pre-trained data.
For example, if a user asks about recent developments in a particular field, RAG can retrieve the latest data from credible sources to ensure the response is well-informed and factually correct.
Cost-Effectiveness
Implementing RAG can be more cost-effective than retraining LLMs for specific tasks. Retraining requires significant computational resources and time, while RAG lets you leverage existing models and improve their performance by integrating external data. This makes it easier and more affordable to include domain-specific knowledge in your AI applications.
Contextual Relevance
Through semantic search, RAG retrieves the most relevant data, ensuring that responses are tailored to the user's query. This combination of retrieval and generation lets the system dynamically adapt to specific questions and deliver context-aware answers. This makes RAG particularly useful for tasks like customer support, financial analysis, research assistance, and medical diagnostics.
Enhanced User Trust
RAG allows the LLM to include citations or references to the sources of information it uses to generate responses. This not only fosters confidence and trust in your generative AI solution but also gives the user the control to verify source documents if they require further clarification or deeper insights.
How Does RAG Work?
Let’s walk through the workflow of the RAG architecture in detail:
1. Gather External Data
External data refers to new information that is not part of the original training dataset of the LLM. This data can come from various sources such as APIs, databases, or document repositories. You should create a rich knowledge library that the model can refer to during response generation.
2. Tokenization and Chunking
The collected data undergoes tokenization, breaking the text into smaller units called tokens, which could be words, subwords, or characters. Next, chunking organizes these tokens into coherent groups for efficient processing. Typically, LLMs have token limits that restrict the number of tokens they can process in a single interaction. Therefore, chunking is essential to ensure that the input fits within these constraints while retaining meaningful context.
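As a simple illustration, the snippet below splits text into overlapping, fixed-size chunks. The sizes are arbitrary; in practice you would tune them to your embedding model's limits or use a library utility such as a LangChain text splitter.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a window across the text so adjacent chunks share some context.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "Retrieval-augmented generation enriches LLM outputs with external data. " * 40
print(len(chunk_text(document)), "chunks produced")
```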
3. Embedding Generation
At this stage, each chunk of data is converted into vector embeddings using specialized embedding models from providers like OpenAI and Cohere. These embeddings capture the semantic essence of the data in a high-dimensional space. Similar chunks have vector representations that lie closer to each other in this space, which is beneficial for tasks like searching, clustering, and classifying.
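For instance, with OpenAI's Python SDK you might embed the chunks as follows; the model name and sample chunks are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment.

chunks = [
    "RAG combines a retrieval step with text generation.",
    "Embeddings capture the semantic meaning of text.",
]

response = client.embeddings.create(
    model="text-embedding-3-small",  # Example embedding model.
    input=chunks,
)
vectors = [item.embedding for item in response.data]
print(len(vectors), "embeddings of dimension", len(vectors[0]))
```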
4. Vector Database Storage
The generated embeddings are then stored in vector databases like Pinecone or Milvus. These databases also maintain metadata for each vector, including title, description, and data type, which can be queried using metadata filters.
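Continuing the embedding example above, the vectors can be upserted into a Pinecone index along with metadata. The index name and metadata fields are assumptions for illustration, and the index is assumed to already exist with a matching dimension.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("rag-demo")  # Hypothetical index created beforehand.

index.upsert(vectors=[
    {
        "id": f"chunk-{i}",
        "values": vector,                      # Embedding from the previous step.
        "metadata": {"title": "RAG overview",  # Example metadata fields.
                     "data_type": "doc",
                     "text": chunk},
    }
    for i, (vector, chunk) in enumerate(zip(vectors, chunks))
])
```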
5. Query Processing
When a user submits a query, it is converted into a vector representation using the same embedding model that processed the data chunks. This ensures that both the query and the stored data are represented in the same vector space, facilitating comparison.
6. Retrieval of Relevant Information
The query embeddings are compared to those stored in the vector database through a semantic similarity search. The system then prioritizes and retrieves the most relevant chunks based on their proximity to the query vector.
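Steps 5 and 6 together might look like this, reusing the OpenAI client and Pinecone index from the earlier sketches.

```python
query = "How does RAG reduce hallucinations?"

# Embed the query with the same model used for the document chunks.
query_vector = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query],
).data[0].embedding

# Retrieve the closest chunks by vector similarity.
results = index.query(vector=query_vector, top_k=3, include_metadata=True)
retrieved_chunks = [match["metadata"]["text"] for match in results["matches"]]
```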
7. Augmentation
The retrieved information is then combined with the user query to form a comprehensive prompt for the LLM. This enriched input provides the model with additional context that it can use to create insightful responses.
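A typical augmentation step simply interpolates the retrieved chunks into a prompt template, for example:

```python
context = "\n\n".join(retrieved_chunks)  # Chunks returned by the retrieval step.

augmented_prompt = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you don't know.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
```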
8. Response Generation
Finally, the LLM processes the augmented prompt to generate an output. By leveraging its pre-trained capabilities alongside the contextual data from the prompt, the LLM produces an accurate answer specific to the user's query.
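The augmented prompt can then be sent to a chat model; the model name below is just an example.

```python
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # Example chat model.
    messages=[{"role": "user", "content": augmented_prompt}],
)
print(completion.choices[0].message.content)
```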
Advanced RAG Techniques to Enhance Performance
Let’s explore some of the advanced RAG techniques:
Intelligent Reranking
Reranking is an advanced technique that refines the list of retrieved documents before passing it to the generation component. You can use a language model to score the relevance of each retrieved chunk, or use Hugging Face cross-encoder models, which encode the query and each retrieved document jointly to produce a similarity score. You can also incorporate metadata into the scoring process for better ranking.
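One possible implementation uses the sentence-transformers library, whose cross-encoder models score query-document pairs directly; the model name below is a commonly used example, not a requirement.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does RAG reduce hallucinations?"
candidates = [
    "RAG grounds answers in retrieved documents.",
    "Vector databases store embeddings.",
    "Chunking splits documents into smaller pieces.",
]

# Score each (query, document) pair jointly, then sort by relevance.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```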
Hierarchical Indices
Hierarchical indexing is an advanced RAG technique that establishes a multi-tiered system for efficient information navigation and retrieval. This approach involves creating a two-tiered indexing structure: the first tier consists of document summaries that provide an overview of the content, while the second tier contains detailed chunks of the documents. Both tiers are linked through metadata that points back to the same source, facilitating rapid access to relevant information.
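A minimal sketch of the idea, using plain Python dictionaries in place of a real vector store, might look like this; the search helper is a crude stand-in for an embedding-based similarity search.

```python
# Tier 1: one summary per document. Tier 2: detailed chunks keyed by the same doc ID.
summaries = {
    "doc-1": "Quarterly earnings report for 2024.",
    "doc-2": "Employee onboarding handbook.",
}
chunks = {
    "doc-1": ["Revenue grew 12% year over year.", "Operating costs fell slightly."],
    "doc-2": ["New hires complete security training.", "Benefits enrollment opens in week one."],
}

def search(query: str, candidates: dict[str, str]) -> str:
    # Crude stand-in for semantic search: pick the entry sharing the most words with the query.
    words = set(query.lower().split())
    return max(candidates, key=lambda key: len(words & set(candidates[key].lower().split())))

query = "How much did revenue grow in the earnings report?"
doc_id = search(query, summaries)   # First pass over the summary tier.
relevant_chunks = chunks[doc_id]    # Second pass restricted to that document's detail chunks.
print(relevant_chunks)
```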
Prompt Chaining with Retrieval Feedback
In this method, each prompt acts as a step that refines the AI's output based on feedback from previous responses, facilitating dynamic adjustments to improve accuracy. The retrieval feedback loop assesses the relevance and correctness of generated answers, enabling the system to request additional context or clarification when necessary. This iterative refinement is particularly effective for complex queries, ensuring that the final output is precise, factually accurate, and contextually relevant.
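A simplified control loop for this pattern could look like the following, reusing the placeholder retrieve and generate functions from the earlier sketch; a production system would use real retrieval and LLM calls and a more robust feedback signal.

```python
def answer_with_feedback(query: str, max_rounds: int = 3) -> str:
    context: list[str] = []
    answer = ""
    for round_number in range(max_rounds):
        # Widen the retrieval window a little on each pass.
        context += retrieve(query, top_k=3 + round_number)
        prompt = (
            "Context:\n" + "\n".join(context)
            + f"\n\nQuestion: {query}\n"
            + "If the context is insufficient, reply exactly: NEED MORE CONTEXT."
        )
        answer = generate(prompt)
        if "NEED MORE CONTEXT" not in answer:
            break  # The model produced a grounded answer; stop iterating.
    return answer
```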
Dynamic Memory Networks (DMNs)
Dynamic memory networks enhance the capacity of neural networks to perform tasks requiring reasoning over structured knowledge. DMNs integrate dynamic memory components to efficiently store and retrieve information as they process inputs. The architecture consists of several key components: an input module, a question module, an episodic memory module, and an answer module. The input module encodes the raw input text, the question module encodes the query, and the episodic memory module updates its contents dynamically based on contextually relevant information. This structure enables the model to access and utilize prior knowledge effectively during processing.
Use Cases of RAG
RAG architecture offers transformative applications across various sectors. Here are a few of them:
Customer Support
RAG-enabled chatbots give contextually accurate responses because they can search through vast knowledge bases and product manuals. Unlike traditional generative models, they deliver grounded answers rather than misleading results.
One notable example is Sendbird’s Shopify chatbot. This AI-powered chatbot assists Shopify store owners in enhancing customer support. It uses store data, advanced LLMs, and RAG AI capabilities to offer personalized product recommendations, increasing sales conversions. This chatbot instantly answers common questions regarding shipping, return policies, and product availability, which improves customer satisfaction.
Document Summarization
RAG aids in summarizing lengthy documents by retrieving key pieces of information and creating concise summaries. For example, Bloomberg, a global provider of financial news and information, launched an innovative AI-powered Earnings Call Summaries tool that revolutionizes financial research for analysts. This tool enables users to extract key insights on financial information, such as capital allocation, supply chain issues, and consumer demand.
Utilizing advanced technologies, including dense vector databases and RAG, the tool understands the essence of each paragraph. It then employs a large language model to summarize the most relevant information into concise bullet points.
Healthcare Diagnostics
RAG improves diagnostic processes by quickly analyzing patient data and retrieving relevant clinical guidelines. For example, when a physician assesses a patient with complex symptoms, RAG can pull historical patient data and recent studies to suggest possible diagnoses. This integration of real-time information enhances diagnostic accuracy and ensures timely interventions for patients.
Legal Research
RAG assists lawyers by sourcing relevant case law and statutes. When a lawyer queries a specific legal question, the RAG system retrieves pertinent documents and generates a comprehensive response that includes citations and summaries of relevant cases. This streamlines the research process and helps professionals to focus on analysis rather than data gathering.
RAG Challenges and Best Practices
Let’s understand the key challenges of RAG and the possible solutions to mitigate them effectively:
Missing Content in the Knowledge Base
When the relevant data isn't available in the knowledge base, the LLM might produce incorrect or misleading answers, a phenomenon known as hallucination. This happens because the model lacks sufficient factual grounding, leading it to fabricate information.
Solution: This can be addressed with prompt engineering. Craft prompts that guide the LLM to recognize the limitations of the knowledge corpus, which reduces the chances of generating false information. For example, you can structure the prompts so the model states, “I don’t know because there is no information about this in the knowledge base.” This helps maintain the system's reliability and trustworthiness.
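For example, a guarded prompt along these lines keeps the model from inventing answers; the context and question shown are placeholders.

```python
context = "..."  # Retrieved chunks joined together.
user_question = "What is our refund policy for digital products?"

guarded_prompt = (
    "You are an assistant that answers strictly from the provided context.\n"
    "If the context does not contain the answer, respond with:\n"
    '"I don\'t know because there is no information about this in the knowledge base."\n\n'
    f"Context:\n{context}\n\n"
    f"Question: {user_question}"
)
```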
Difficulty in Extracting Answers from Retrieved Context
Another common challenge is that although the answer is present in the knowledge base, the LLM fails to extract it correctly. This often happens if the retrieved context contains too much noise or conflicting information, making it difficult for the model to pinpoint the right answer.
Solution: Quality data is key to semantic search and retrieval, which is the foundation of any RAG system. Clean the data by removing duplicates, irrelevant entries, and formatting inconsistencies. By maintaining a well-organized and clean knowledge base, the language model will be better equipped to extract accurate answers from the context it retrieves.
Latency and Scalability Bottlenecks
RAG introduces additional latency due to the retrieval step. As the document corpus expands, retrieval slows down, especially if the underlying infrastructure cannot scale adequately. This is highly problematic in real-time applications where quick responses are essential.
Solution: To reduce the latency caused by the multi-step retrieval and generation process, you can use asynchronous or multi-threaded architectures where retrieval can happen in parallel with other tasks to decrease wait times. Furthermore, implementing caching mechanisms for frequently requested data can be beneficial. This minimizes the need for repeated requests and improves response times for common inquiries.
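As a simple example of caching, frequently asked queries can be memoized so the expensive embedding and retrieval calls run only once per distinct question; retrieve here is the placeholder function from the earlier sketch.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_retrieve(query: str) -> tuple[str, ...]:
    # Return a tuple so the result is hashable and safe to cache.
    return tuple(retrieve(query))

cached_retrieve("What is RAG?")  # First call hits the vector database.
cached_retrieve("What is RAG?")  # Second call is served from the cache.
```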
Ethical and Legal Challenges
The use of external data sources in RAG systems raises essential ethical and legal considerations, particularly concerning data security and intellectual property rights. This can lead to privacy violations, data breaches, or the unintentional use of copyrighted material.
Solution: Implement strict data governance policies that comply with relevant regulations like GDPR. Ensure that the data used for retrieval is obtained and used ethically, with appropriate consent and data protection measures. Additionally, employing techniques such as data anonymization, secure data storage, and regular audits can help mitigate the risks associated with data usage in RAG systems.
Building a RAG Pipeline Using Airbyte
To streamline the process of making your data LLM-ready, you can leverage data movement platforms like Airbyte. It offers an extensive catalog of 550+ pre-built connectors, which you can use to extract data from various sources and load it into your desired target system.
While using Airbyte, you can directly store semi-structured or unstructured data in vector databases such as Pinecone, Chroma, or Milvus. These vector databases can then be integrated with LLM frameworks to enhance responses.
Key Features of Airbyte
Flexible Pipeline Development: Airbyte provides various options for building and managing data pipelines. These include an intuitive UI, APIs, a Terraform Provider, and PyAirbyte. You can pick the one that best aligns with your requirements.
Custom Connectors: It lets you build custom connectors within 30 minutes through its easy-to-use Connector Development Kit (CDK). You can also utilize Airbyte’s AI-powered Connector Builder to speed up the development process.
Retrieval-Based LLM Applications: The platform empowers you to develop retrieval-based conversational interfaces on top of your synced data using frameworks such as LangChain or LlamaIndex. This lets you quickly access relevant data through user-friendly queries.
Sync Resilience: Airbyte's Record Change History feature prevents synchronization failures caused by problematic rows, such as oversized or invalid records. If any record breaks the sync, Airbyte modifies it during transit, logs the changes, and ensures that the sync completes successfully.
Data Orchestration: It allows you to integrate with orchestrators, such as Prefect, Dagster, or Apache Airflow, to manage data pipelines. This helps you streamline complex workflows and data processes effectively.
Self-Managed Enterprise: Airbyte offers an Enterprise Edition with advanced features. These include enterprise source connectors, multitenancy, role-based access control (RBAC), and personally identifiable information (PII) masking to safeguard your sensitive information.
With a strong grasp of Airbyte's prominent features, let's explore how to build RAG pipelines using Airbyte, LangChain, and Pinecone.
Prerequisites
- Sign up for an account on the Airbyte Cloud.
- Create an account on the Pinecone website and generate a new API key from your Pinecone project settings. You can refer to the Pinecone documentation for detailed instructions.
- Sign up for an account on OpenAI. Go to the API section and generate a new API key. For detailed instructions, refer to the OpenAI documentation.
Step 1: Configure Source Connector
- Log in to your Airbyte account. Navigate to Sources on the left-side panel of the dashboard and select the one you require. In this tutorial, let's use GitHub as the data source.
- On the next page, authenticate your GitHub account and click on Set up source.
Step 2: Set Up Pinecone as Destination
- On Airbyte’s dashboard, select Destinations and choose Pinecone.
- On the configuration screen, you'll see three sections:
Processing: Split individual records into manageable chunks that fit within the context window. You can specify which fields to use as context and which to store as supplementary metadata. Note that stored metadata has a 40 KB size limit.
Embedding: Choose an embedding provider, such as OpenAI or Cohere, and enter the corresponding API key.
Indexing: Now, you should provide the connector with configuration information for your Pinecone index, such as the name of the index and an API key.
Step 3: Establish a Connection
- Once the destination is configured successfully, set up a connection from the GitHub source to the Pinecone destination and activate the streams you want to sync. You can also define the frequency of your data syncs.
- Choose between incremental syncs and full refreshes. When enabled, incremental sync ensures only new or updated data is embedded, which can significantly reduce costs.
- Click on the Test Connection button to verify if your setup works.
- If the test is successful, then click Set Up Connection.
With this, you have created a data pipeline to sync the data from the source to the Pinecone vector database.
Step 4: Set Up a Chat Interface
Your data is now ready to be connected to a language model. Let's build a chatbot interface using LangChain as the orchestration framework. Follow the steps below to create your own chatbot that uses GitHub issues to answer questions.
- Install a few pip packages locally:
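The exact package list depends on your setup, but for a LangChain + OpenAI + Pinecone stack it typically looks something like this:

```bash
pip install langchain langchain-openai langchain-pinecone openai pinecone-client
```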
- To initialize a LangChain QA chain based on the indexed data, use the code below:
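A minimal sketch of such a chain is shown below. It assumes recent langchain-openai and langchain-pinecone packages, a Pinecone index named airbyte-github-issues populated by the Airbyte sync, and the same embedding model used during ingestion; adjust the names to match your setup.

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Connect to the index that Airbyte populated, using the same embedding model as the sync.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore(index_name="airbyte-github-issues", embedding=embeddings)

# Build a simple question-answering chain over the retrieved GitHub issues.
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

print(qa_chain.invoke({"query": "Which GitHub issues mention connection timeouts?"}))
```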
- To run this script, you should set OpenAI and Pinecone credentials as environment variables:
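For example, in your shell (the values are placeholders):

```bash
export OPENAI_API_KEY="sk-..."
export PINECONE_API_KEY="your-pinecone-api-key"
```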
- Finally, tailor your chatbot by providing a custom prompt, which helps the model respond more specifically to your use case, as shown below.
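One way to do this with the chain above is to pass a custom prompt template through chain_type_kwargs; the template text below is just an example.

```python
from langchain.prompts import PromptTemplate

issue_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a support assistant for a GitHub repository. "
        "Answer using only the issues below and mention issue titles where possible.\n\n"
        "{context}\n\nQuestion: {question}\nAnswer:"
    ),
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": issue_prompt},  # Inject the custom prompt into the chain.
)
```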
You can also find the full script on GitHub.
Final Thoughts
In this article, you've explored detailed insights on the workflow of RAG architecture and its transformative impact on large language models. RAG enhances the capabilities of LLMs by integrating real-time, external information, thereby addressing the limitations posed by static training datasets. The ability of RAG to dynamically access verified data sources empowers LLMs to deliver more accurate and contextually appropriate outputs across various applications.