Day 60 retrospective.
It is time to start the third unit project, so I decided to think about a topic. The theme is a question-answering system over internal and external documents integrated with an LLM, but even after looking at previous cohorts' deliverables, nothing comes to mind.
1. RAG
1-1. RAG
Problems with LLMs
- Model Hallucination Problem
  - LLMs generate text based on probability.
  - Without sufficient fact verification, they can generate content that sounds factual but is not consistent with the facts.
- Timeliness Problem
  - The most recent data may not be included in training.
- Data Security Problem
RAG (Retrieval-Augmented Generation)
- Supplies new, external knowledge so that the LLM can give correct answers.
- RAG characteristics
  - Scalability
  - Accuracy
    - The model provides answers grounded in facts obtained through retrieval.
    - This minimizes model hallucination.
  - Controllability
  - Explainability
  - Versatility
- RAG paradigms
  - Native RAG (a minimal end-to-end sketch follows this list)
    - Indexing
      - Index the data
      - Split it into chunks
      - Generate embeddings and build the index
    - Retrieve
    - Generation
  - Advanced RAG
    - Pre-Retrieval Process
      - Optimize data indexing
      - Embedding
    - Post-Retrieval Process
      - ReRank
      - Prompt Compression
    - RAG Pipeline Optimization
  - Modular RAG
    - Native RAG and Advanced RAG can be viewed as special cases of Modular RAG.
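To make the Native RAG flow above concrete, here is a minimal sketch chaining the Indexing → Retrieve → Generation steps with LangChain. It is illustrative only: the file name "data.txt", the FAISS vector store, the OpenAI embedding model, and the chat model are placeholder choices I am assuming here, not something fixed by the course material.
# Minimal Native RAG sketch (assumes langchain-community, langchain-openai, and faiss-cpu
# are installed, OPENAI_API_KEY is set, and a placeholder text file "data.txt" exists)
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# 1) Indexing: load the document, split it into chunks, embed them, and build the index
docs = TextLoader("data.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=50).split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 2) Retrieve: find the chunks most similar to the question
question = "What color are apples?"
retrieved = vectorstore.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in retrieved)

# 3) Generation: let the LLM answer using only the retrieved context
llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
The loaders and splitters covered below can be swapped into step 1 depending on the source format.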
1-2. Native RAG
Loader
- Loads and processes documents from various sources.
- page_content
- metadata
- LangChain Document
from langchain_core.documents import Document
# A single Document
document = Document(
    page_content="Apples are red.",
    metadata={"title": "apple_book"}
)
document
# Document(metadata={'title': 'apple_book'}, page_content='Apples are red.')
document.page_content
# 'Apples are red.'
document.metadata
# {'title': 'apple_book'}
# A list of two or more Documents
docs = [
    Document(
        page_content="Apples are red.",
        metadata={"title": "apple_book"}
    ),
    Document(
        page_content="Blueberries are blue.",
        metadata={"title": "blueberry_book"}
    ),
    Document(
        page_content="Bananas are yellow.",
        metadata={"title": "banana_book"}
    )
]
len(docs)
# 3
docs[0].page_content
# 'Apples are red.'
docs[0].metadata
# {'title': 'apple_book'}
- LangChain Loader
# WebBaseLoader
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(web_path="")
docs = loader.load()
type(docs[0])
# langchain_core.documents.base.Document
# with BeautifulSoup4
import bs4
loader = WebBaseLoader(
    web_paths=[""],  # list of target URLs (placeholder)
    bs_kwargs={
        "parse_only": bs4.SoupStrainer()  # optionally restrict parsing to specific tags/classes
    },
    header_template={
        "User-Agent": ""
    }
)
docs = loader.load()
# TextLoader
from langchain_community.document_loaders import TextLoader
loader = TextLoader(DATA_PATH+".txt")
docs = loader.load()
# CSVLoader
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader(
    file_path=DATA_PATH+".csv",
    encoding="cp949"
)
docs = loader.load()
# source_column uses that column's value as each row's metadata 'source'
loader = CSVLoader(
    file_path=DATA_PATH+".csv",
    encoding="cp949",
    source_column="연도"
)
docs = loader.load()
docs[0].metadata
# {'source': '2004-01-01', 'row': 0}
# JSONLoader
from langchain_community.document_loaders import JSONLoader
loader = JSONLoader(
    file_path=DATA_PATH+".json",
    jq_schema=".[]",  # each element of the top-level array becomes a Document
    text_content=False
)
docs = loader.load()
loader = JSONLoader(
    file_path=DATA_PATH+".json",
    jq_schema=".[].phoneNumbers",  # extract only the phoneNumbers field of each element
    text_content=False
)
docs = loader.load()
docs[0].page_content
# '["483-4639-1933", "947-4179-7976"]'
# PDFLoader - PyPDFLoader
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(file_path)
docs = loader.load()
# PDFLoader - PyPDF2
import PyPDF2
results = []
with open(file_path, "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    print(len(pdf_reader.pages))
    print(f"Title: {pdf_reader.metadata.title}")
    print(f"Author: {pdf_reader.metadata.author}")
    print(f"Producer: {pdf_reader.metadata.producer}")
    print(f"Create Date: {pdf_reader.metadata.creation_date}")
    # Extract the text of every page
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        text = page.extract_text()
        results.append(
            f"Page: {page_num} / Text: {text}\n"
        )
# PDFLoader - PDFPlumber - 1
from langchain_community.document_loaders import PDFPlumberLoader
loader = PDFPlumberLoader(file_path)
docs = loader.load()
# PDFLoader - PDFPlumber - 2
## Text
import pdfplumber
results = []
with pdfplumber.open(file_path) as pdf:
    print(pdf.metadata)
    print(f"Pages: {len(pdf.pages)}")
    for page in range(len(pdf.pages)):
        text = pdf.pages[page].extract_text()
        results.append(
            f"Page: {page} / Text: {text}\n"
        )
## Image
with pdfplumber.open(file_path) as pdf:
    pages = pdf.pages
    im = pages[0].to_image(resolution=150)
    # Draw bounding boxes around the words detected on that page
    im.draw_rects(pages[0].extract_words())
im
Splitter
- Splits documents into chunks so that an LLM with a limited context (token limit) can still draw on multiple passages when answering.
# CharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=250,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False
)
## with open()
with open(DATA_PATH+".txt", encoding="utf-8") as f:  # read the raw text into a string
    file = f.read()
split_texts = text_splitter.split_text(file)
metadatas = [
    {"source": DATA_PATH+".txt"}
]
documents = text_splitter.create_documents(
    texts=[file],
    metadatas=metadatas
)
## with TextLoader()
from langchain_community.document_loaders import TextLoader
loader = TextLoader(DATA_PATH+".txt")
data = loader.load()
docs = text_splitter.split_documents(data)
# RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False
)
## with open()
split_texts = text_splitter.split_text(file)
metadatas = [
    {"document": 1}
]
documents = text_splitter.create_documents(
    texts=[file],
    metadatas=metadatas
)
## with TextLoader()
from langchain_community.document_loaders import TextLoader
loader = TextLoader(DATA_PATH+".txt")
data = loader.load()
docs = text_splitter.split_documents(data)
# TokenTextSplitter
## OpenAI - tiktoken
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=600,
    chunk_overlap=200,
    encoding_name="o200k_base"
)
text = text_splitter.split_text(file)
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=600,
    chunk_overlap=200,
    encoding_name="o200k_base"
)
text = text_splitter.split_text(file)
text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    chunk_size=600,
    chunk_overlap=200,
    encoding_name="o200k_base"
)
text = text_splitter.split_text(file)
## Hugging Face - tokenizer
from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter
model_name = "intfloat/multilingual-e5-large-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=600,
    chunk_overlap=200
)
text = text_splitter.split_text(file)
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=600,
    chunk_overlap=200
)
text = text_splitter.split_text(file)
text_splitter = TokenTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=600,
    chunk_overlap=200
)
text = text_splitter.split_text(file)
# SemanticChunker
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([file])
## Breakpoints
### Percentile
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70
)
docs = text_splitter.create_documents([file])
### Standard Deviation
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25
)
docs = text_splitter.create_documents([file])
### Interquartile
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=0.5
)
docs = text_splitter.create_documents([file])