This post summarizes the process of converting PDF and JSON files into a vector DB using LangChain, Pinecone, and Large Language Models (LLMs).
1. Setup Environment
- Install Dependencies: Ensure you have Python installed, then install the necessary libraries:

```bash
pip install langchain pinecone-client openai transformers sentence-transformers
```

- Pinecone API Key: Get an API key from Pinecone and initialize the client:

```python
import pinecone

pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_ENVIRONMENT")
```
2. Loading Documents
- PDF Files: Use LangChain's document loaders for PDFs.
```python
from langchain.document_loaders import PyPDFLoader

# Load the PDF into a list of LangChain Document objects (one per page)
loader = PyPDFLoader("path_to_your_pdf.pdf")
documents = loader.load()
```
- JSON Files: For structured data like JSON, you may need to convert it into a format LangChain can process, or use it directly if your JSON already contains text data.
```python
import json
from langchain.schema import Document

with open('kjv_bible_verses.json', 'r') as file:
    data = json.load(file)

# Wrap each JSON item in a Document so LangChain's splitters and vector stores can process it
documents = [Document(page_content=str(item), metadata={"source": "kjv_bible_verses"}) for item in data]
```
3. Text Splitting
- Split documents into chunks if they're too large.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents into 1000-character chunks with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
```
4. Embedding
- Use an embedding model to convert text into vectors.
```python
from langchain.embeddings import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
```
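Note that all-MiniLM-L6-v2 produces 384-dimensional vectors, and this number must match the Pinecone index dimension in the next step. A quick sanity check can confirm it (the sample string below is arbitrary):

```python
# Embed a throwaway string and inspect the vector size
sample_vector = embeddings.embed_query("sample text")
print(len(sample_vector))  # 384 for all-MiniLM-L6-v2
```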
5. Creating Pinecone Index
- If not already created, create a Pinecone index.
```python
index_name = "your-index-name"

# The index dimension must match the embedding model's output size;
# all-MiniLM-L6-v2 produces 384-dimensional vectors (not 1536, which is
# the dimension of OpenAI's text-embedding-ada-002).
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=384, metric='cosine')
```
6. Adding Documents to Pinecone
- Use LangChain's Pinecone integration.
```python
from langchain.vectorstores import Pinecone

# Embed the chunks and upsert them into the Pinecone index
pinecone_index = Pinecone.from_documents(texts, embeddings, index_name=index_name)
```
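Before wiring up an LLM, it is worth verifying the index with a plain similarity search. A minimal sketch, assuming the documents above were ingested (the query string is only an example):

```python
# Retrieve the top-3 chunks most similar to the query
results = pinecone_index.similarity_search("In the beginning", k=3)
for doc in results:
    print(doc.metadata.get("source"), doc.page_content[:100])
```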
7. Querying with LLMs
- Set up an LLM (like from Hugging Face or OpenAI).
```python
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap the Hugging Face pipeline so LangChain can use it as an LLM
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
```
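With the vector store and LLM in place, a retrieval-augmented query can be run with LangChain's RetrievalQA chain. A minimal sketch, assuming the components defined above (the question string is only an example):

```python
from langchain.chains import RetrievalQA

# Combine the LLM with the Pinecone retriever for question answering
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=pinecone_index.as_retriever(search_kwargs={"k": 3}),
)

answer = qa_chain.run("Summarize the first chapter of Genesis.")
print(answer)
```

Note that gpt2 is used here only as a lightweight placeholder; an instruction-tuned model will give far better answers on retrieval-augmented questions.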