

Day 60. RAG - Native RAG (Loader, Splitter)


 

Day 60 Retrospective.

 

Now that it is time to start the third unit project, I decided to think about a topic. The theme is an LLM-integrated question-answering system over internal and external documents, but even after looking at previous cohorts' results, nothing comes to mind.

 

 

 

 

1. RAG

 

 

1-1. RAG

 

Problems with LLMs

  • Model Hallucination Problem
    • LLMs generate text based on probability.
    • Without sufficient fact checking, they can produce content that sounds plausible and coherent but is not actually true.
  • Timeliness Problem
    • The most recent data may not have been included in training.
  • Data Security Problem

 

RAG (Retrieval-Augmented Generation)

  • Supplies the LLM with new, external knowledge so that it can give correct answers.
  • RAG characteristics
    • Scalability
    • Accuracy
      • The model grounds its answers in retrieved facts.
      • This minimizes model hallucination.
    • Controllability
    • Explainability
    • Versatility
  • RAG paradigms
    • Native RAG (a minimal pipeline sketch follows this list)
      • Indexing
        • Index the data
        • Split it into chunks
        • Embed the chunks and build the index
      • Retrieve
      • Generation
    • Advanced RAG (a ReRank sketch also follows this list)
      • Pre-Retrieval Process
        • Optimize data indexing
        • Embedding
      • Post-Retrieval Process
        • ReRank
        • Prompt Compression
      • RAG Pipeline Optimization
    • Modular RAG
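
Below is a minimal end-to-end sketch of the Native RAG flow above (Indexing → Retrieve → Generation). It assumes an OpenAI API key and a local FAISS index; the file path, model name, and question are placeholders I added, not part of the class material.

# Native RAG pipeline sketch: Indexing -> Retrieve -> Generation
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Indexing: load the document, split it into chunks, embed them, build the index
docs = TextLoader("data.txt").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=50
).split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Retrieve: fetch the chunks most similar to the question
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
question = "What does the document say about apples?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))

# Generation: answer grounded only in the retrieved context
llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)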
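
The ReRank step listed under Advanced RAG can be sketched with a cross-encoder that rescores the retrieved chunks against the query. The sentence-transformers model name is only an example, and retriever is reused from the sketch above.

# Post-Retrieval ReRank sketch with a cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What does the document say about apples?"
candidates = [d.page_content for d in retriever.invoke(query)]

# Score each (query, chunk) pair and keep the highest-scoring chunks
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)][:3]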

 

 

1-2. Native RAG

 

Loader

  • Loads and processes documents from various sources.
    • page_content
    • Metadata
  • LangChain Document
from langchain_core.documents import Document

# A single Document
document = Document(
    page_content="Apples are red.",
    metadata={"title": "apple_book"}
)
document
# Document(metadata={'title': 'apple_book'}, page_content='Apples are red.')
document.page_content
# 'Apples are red.'
document.metadata
# {'title': 'apple_book'}

# A list of two or more Documents
docs = [
    Document(
        page_content="Apples are red.",
        metadata={"title": "apple_book"}
    ),
    Document(
        page_content="Blueberries are blue.",
        metadata={"title": "blueberry_book"}
    ),
    Document(
        page_content="Bananas are yellow.",
        metadata={"title": "banana_book"}
    )
]
len(docs)
# 3
docs[0].page_content
# 'Apples are red.'
docs[0].metadata
# {'title': 'apple_book'}
  • LangChain Loader
# WebBaseLoader
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(web_path="")
docs = loader.load()

type(docs[0])
# langchain_core.documents.base.Document

# with BeautifulSoup4
import bs4

loader = WebBaseLoader(
    web_paths=[""],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer()
    },
    header_template={
        "User-Agent": ""
    }
)
docs = loader.load()
# TextLoader
from langchain_community.document_loaders import TextLoader

loader = TextLoader(DATA_PATH+".txt")
docs = loader.load()
# CSVLoader
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    file_path=DATA_PATH+".csv",
    encoding="cp949"
)
docs = loader.load()

loader = CSVLoader(
    file_path=DATA_PATH+".csv",
    encoding="cp949",
    source_column="연도"
)
docs = loader.load()
docs[0].metadata
# {'source': '2004-01-01', 'row': 0}
# JSONLoader
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path=DATA_PATH+".json",
    jq_schema=".[]",
    text_content=False
)
docs = loader.load()

loader = JSONLoader(
    file_path=DATA_PATH+".json",
    jq_schema=".[].phoneNumbers",
    text_content=False
)
docs = loader.load()
docs[0].page_content
# '["483-4639-1933", "947-4179-7976"]'
# PDFLoader - PyPDFLoader
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_path)
docs = loader.load()
# PDFLoader - PyPDF2
import PyPDF2

results = []
with open(file_path, "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    
    print(len(pdf_reader.pages))
    
    print(f"Title: {pdf_reader.metadata.title}")
    print(f"Author: {pdf_reader.metadata.author}")
    print(f"Producer: {pdf_reader.metadata.producer}")
    print(f"Create Date: {pdf_reader.metadata.creation_date}")
    
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        text = page.extract_text()
        results.append(
            f"Page: {page_num} / Text: {text}\n"
        )
# PDFLoader - PDFPlumber - 1

from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader(file_path)
docs = loader.load()
# PDFLoader - PDFPlumber - 2

## Text

import pdfplumber

results = []

with pdfplumber.open(file_path) as pdf:
    print(pdf.metadata)
    print(f"Pages: {len(pdf.pages)}")
    
    for page in range(len(pdf.pages)):
        text = pdf.pages[page].extract_text()
        results.append(
            f"Page: {page} / Text: {text}\n"
        )

## Image

with pdfplumber.open(file_path) as pdf:
    pages = pdf.pages
    im = pages[0].to_image(resolution=150)
    im.draw_rects(pages[0].extract_words())  # overlay word bounding boxes on the rendered page
im

 

Splitter

  • Splits documents into chunks so that an LLM with a limited token window can consult multiple passages when answering.
# CharacterTextSplitter

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=250,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False
)

## with open()

with open(DATA_PATH+".txt") as f:  # adjust the encoding (e.g. cp949) if needed
    file = f.read()

split_texts = text_splitter.split_text(file)

metadatas = [
    {"source": DATA_PATH+".txt"}
]
documents = text_splitter.create_documents(
    texts=[file],
    metadatas=metadatas
)

## with TextLoader()

from langchain_community.document_loaders import TextLoader

loader = TextLoader(DATA_PATH+".txt")
data = loader.load()

docs = text_splitter.split_documents(data)
# RecursiveCharacterTextSplitter

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False
)

## with open()

split_texts = text_splitter.split_text(file)

metadatas = [
    {"document": 1}
]
documents = text_splitter.create_documents(
    texts=[file],
    metadatas=metadatas
)

## with TextLoader()

from langchain_community.document_loaders import TextLoader

loader = TextLoader(DATA_PATH+".txt")
data = loader.load()

docs = text_splitter.split_documents(data)
# TokenTextSplitter

## OpenAI - tiktoken

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=600,
    chunk_overlap=200,
    encoding_name="o200k_base"
)
text = text_splitter.split_text(file)

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=600,
    chunk_overlap=200,
    encoding_name="o200k_base"
)
text = text_splitter.split_text(file)

text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    chunk_size=600,
    chunk_overlap=200,
    encoding_name="o200k_base"
)
text = text_splitter.split_text(file)

## Hugging Face - tokenizer

from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter

model_name = "intfloat/multilingual-e5-large-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=600,
    chunk_overlap=200
)
text = text_splitter.split_text(file)

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=600,
    chunk_overlap=200
)
text = text_splitter.split_text(file)

text_splitter = TokenTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=600,
    chunk_overlap=200
)
text = text_splitter.split_text(file)
# SemanticChunker

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([file])

## Breakpoints

### Percentile - split where the distance between adjacent sentence embeddings exceeds the given percentile of all distances

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70
)
docs = text_splitter.create_documents([file])

### Standard Deviation - split where the distance exceeds the mean by the given number of standard deviations

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25
)
docs = text_splitter.create_documents([file])

### Interquartile - split where the distance exceeds the mean plus the given multiple of the IQR

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=0.5
)
docs = text_splitter.create_documents([file])