
LangChain으로 Vector DB 만들기

esmile1 2024. 9. 1. 07:10

This post summarizes the process of converting PDF and JSON files into a Vector DB using LangChain, Pinecone, and Large Language Models (LLMs).

1. Setup Environment

  • Install Dependencies: Ensure you have Python installed, then install the necessary libraries:

pip install langchain pinecone-client openai transformers sentence-transformers

  • Pinecone API Key: Get an API key from Pinecone and initialize the client:

import pinecone

pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_ENVIRONMENT")
2. Loading Documents

  • PDF Files: Use LangChain's document loaders for PDFs.

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("path_to_your_pdf.pdf")
documents = loader.load()

  • JSON Files: For structured data like JSON, convert it into a format LangChain can process. LangChain's splitters and vector stores expect Document objects, so wrap each entry accordingly:

import json
from langchain.schema import Document

with open('kjv_bible_verses.json', 'r') as file:
    data = json.load(file)

# Wrap each JSON entry in a LangChain Document so it can be split and embedded
documents = [Document(page_content=str(item), metadata={"source": "kjv_bible_verses"}) for item in data]

3. Text Splitting

  • Split documents into chunks if they're too large.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
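To make the chunk_size and chunk_overlap parameters concrete, here is a minimal pure-Python sketch of fixed-size splitting with overlap. This is not LangChain's actual recursive algorithm (which prefers to split on paragraph and sentence boundaries first), just an illustration of the sliding-window idea:

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split `text` into chunks of at most `chunk_size` characters,
    where each chunk repeats the last `chunk_overlap` characters of
    the previous one, so context isn't lost at chunk borders."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "x" * 2500
chunks = split_with_overlap(text, chunk_size=1000, chunk_overlap=200)
print(len(chunks))  # 3 chunks, covering 0-1000, 800-1800, 1600-2500
```

Overlapping chunks cost some storage, but a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which improves retrieval quality.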

4. Embedding

  • Use an embedding model to convert text into vectors.

from langchain.embeddings import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

5. Creating Pinecone Index

  • If not already created, create a Pinecone index. The dimension must match the embedding model: all-MiniLM-L6-v2 produces 384-dimensional vectors (1536 would be for OpenAI's text-embedding-ada-002).

index_name = "your-index-name"

if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=384, metric='cosine')
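The metric='cosine' setting means Pinecone ranks stored vectors by cosine similarity, i.e. the angle between them, ignoring magnitude. As a quick illustration of what that metric computes, here is a plain-Python sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b:
    1.0 for identical directions, 0.0 for orthogonal, -1.0 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction, different length)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Because magnitude is ignored, two chunks about the same topic score high even if one is much longer than the other, which is why cosine is a common default for text embeddings.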

6. Adding Documents to Pinecone

  • Use LangChain's Pinecone integration.

from langchain.vectorstores import Pinecone

pinecone_index = Pinecone.from_documents(texts, embeddings, index_name=index_name)

7. Querying with LLMs

  • Set up an LLM (like from Hugging Face or OpenAI).

from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
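Putting the pieces together, the query flow is: embed the question, retrieve the most similar chunks from the index, then feed them to the LLM as context. The following is a toy, dependency-free sketch of that retrieve-then-generate pattern — the retriever here just counts shared words as a stand-in for Pinecone's vector search, and the prompt builder stands in for what a LangChain QA chain assembles before calling the LLM:

```python
def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by the number of words they
    share with the query (stand-in for Pinecone's similarity search)."""
    query_words = set(query.lower().split())
    def score(doc):
        return len(query_words & set(doc.lower().split()))
    return sorted(documents, key=score, reverse=True)[:top_k]

def build_prompt(query, context_docs):
    """Stuff the retrieved chunks into a prompt for the LLM."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Pinecone stores embedding vectors for fast similarity search.",
    "LangChain chains together LLM calls and data sources.",
    "PyPDFLoader reads PDF files into documents.",
]
query = "what stores embedding vectors"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)
```

In the real pipeline, `retrieve` is the Pinecone index queried through the embeddings from step 4, and the prompt goes to the HuggingFace model set up above.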