Scraping web data from Apify source into Airbyte for Langchain

Learn how to scrape data from a website and load it in a database using PyAirbyte and LangChain. Integrating web data into LLMs can enhance their performance by providing up-to-date and relevant information.

Should you build or buy your data pipelines?

Download our free guide and discover the best approach for your needs, whether it's building your ELT solution in-house or opting for Airbyte Open Source or Airbyte Cloud.

Download now

This tutorial will demonstrate how to scrape data from a website using Apify, load the scraped data using PyAirbyte, and store the data in a database using LangChain. Integrating web data into LLMs can enhance their performance by providing up-to-date and relevant information. This process can be complex, and this guide aims to simplify it for users.

Prerequisites

  1. Apify Account:
    • Follow the instructions in the Apify to set up your apify account and obtain the necessary access keys.
  2. Pinecone Account:
    • Create a Pinecone Account: Sign up for an account on the Pinecone website.
    • Obtain Pinecone API Key: Generate a new API key from your Pinecone project settings. For detailed instructions, refer to the Pinecone documentation.
  3. OpenAI API Key:
    • Create an OpenAI Account: Sign up for an account on OpenAI.
    • Generate an API Key: Go to the API section and generate a new API key. For detailed instructions, refer to the OpenAI documentation.

Install PyAirbyte and other dependencies

# Add virtual environment support in Google Colab
!apt-get install -qq python3.10-venv

# First, we need to install the necessary libraries.
!pip3 install airbyte openai langchain pinecone-client langchain-openai langchain-pinecone python-dotenv langchainhub

Setup Source Apify with PyAirbyte

The provided code configures an Airbyte source to extract data from specific dataset in apify.

To configure according to your requirements, you can refer to this references.

Note: The credentials are retrieved securely using the get_secret() method. This will automatically locate a matching Google Colab secret or environment variable, ensuring they are not hard-coded into the notebook. Make sure to add your key to the Secrets section on the left.

import airbyte as ab

source = ab.get_source(
    "source-apify-dataset",
    config={
        "token": ab.get_secret("API_TOKEN"),
        "dataset_id": ab.get_secret("DATASET_ID"),
    }
)
source.check()

This is a basic process of fetching data from Apify dataset using Airbyte and converting it into a format suitable for further processing or analysis.

source.select_all_streams() # Select all streams
read_result = source.read() # Read the data
review_list = [doc for doc in read_result["item_collection"].to_documents()] # We are only intrested in item_collection stream only

print(str(review_list[10]))

Use Langchain to build a RAG pipeline

The code uses RecursiveCharacterTextSplitter to break documents into smaller chunks. Metadata within these chunks is converted to strings. This facilitates efficient processing of large texts, enhancing analysis capabilities.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunked_docs = splitter.split_documents(review_list)
print(f"Created {len(chunked_docs)} document chunks.")

for doc in chunked_docs:
    for md in doc.metadata:
        doc.metadata[md] = str(doc.metadata[md])

Created 493 document chunks.

from langchain_openai import OpenAIEmbeddings
import os

os.environ['OPENAI_API_KEY'] = ab.get_secret("OPENAI_API_KEY")
## Embedding Technique Of OPENAI
embeddings=OpenAIEmbeddings()

Setting up Pinecone

Pinecone is a managed vector database service designed for storing, indexing, and querying high-dimensional vector data efficiently.

from pinecone import Pinecone, ServerlessSpec
from pinecone import Pinecone

os.environ['PINECONE_API_KEY'] = ab.get_secret("PINECONE_API_KEY")
pc = Pinecone()
index_name = "apifyproductreview" # Replace with your index name


# Uncomment this if you have not created a Pinecone index yet

spec = ServerlessSpec(cloud="aws", region="us-east-1") # Replace with your cloud and region
pc.create_index(
        name = index_name,
        dimension=1536, # Replace with your model dimensions
        metric='cosine', # Replace with your model metric
        spec=spec
)

index = pc.Index(index_name)

index.describe_index_stats()

PineconeVectorStore is a class provided by the LangChain library specifically designed for interacting with Pinecone vector stores. from_documents method of PineconeVectorStore is used to create or update vectors in a Pinecone vector store based on the provided documents and their corresponding embeddings.

from langchain_pinecone import PineconeVectorStore

pinecone = PineconeVectorStore.from_documents(
    chunked_docs, embeddings, index_name=index_name
)

Now setting up a pipeline for RAG using LangChain, incorporating document retrieval from Pinecone, prompt configuration, and a chat model from OpenAI for response generation.

from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = pinecone.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("Langchain RAG pipeline set up successfully.")
print(rag_chain.invoke("What is overall review of products"))

Should you build or buy your data pipelines?

Download our free guide and discover the best approach for your needs, whether it's building your ELT solution in-house or opting for Airbyte Open Source or Airbyte Cloud.

Download now

Similar use cases

No similar recipes were found, but check back soon!