In this tutorial I will show you how to use the Advanced RAG approach with LlamaIndex to search and ask advanced questions over your expert or company documents. As the language model we will use the open-source Zephyr-7b-alpha together with an API key from the Hugging Face platform.
Why this method?
- Reduce model hallucinations and improve factual accuracy by grounding the model in your own data.
- Simpler RAG systems often do not return enough semantic context for the question asked.
- Queries need to be decomposed and transformed into smaller prompts.
- Splitting more complex questions into smaller ones improves retrieval.
Technologies
- Zephyr-7b-alpha model
- Embedding model (bge-large-en-v1.5)
- LlamaIndex framework
Script
Let's go straight to the code.
Libraries.
!pip install -qqq llama-index llama-hub langchain accelerate==0.21.0 bitsandbytes==0.40.2 transformers sentence_transformers InstructorEmbedding
Import the libraries and prepare the LlamaIndex RAG pipeline.
# import nest_asyncio
# nest_asyncio.apply()
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import json
import torch
from pathlib import Path
import pandas as pd
pd.set_option("display.max_colwidth", -1)
from copy import deepcopy
# transformers
from transformers import BitsAndBytesConfig
# llama_index
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM
from llama_index import download_loader, Document, VectorStoreIndex, ServiceContext
from llama_index.node_parser import SentenceSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine
from llama_index.response.notebook_utils import display_source_node
from llama_index.query_engine import RetrieverQueryEngine
from IPython.display import Markdown, display, HTML
from llama_index.retrievers import VectorIndexRetriever
from sentence_transformers import SentenceTransformer
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
Load the data. In my case it is a PDF of the study RAG VS FINE-TUNING: PIPELINES, TRADEOFFS, AND A CASE STUDY ON AGRICULTURE.
Load the document and split it into chunks.
PDFReader = download_loader("PDFReader")
loader = PDFReader()
docs = loader.load_data(file=Path("/content/autogen_studie.pdf"))
node_parser = SentenceSplitter(chunk_size=256)
nodes = node_parser.get_nodes_from_documents(docs)
len(nodes)
586.
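Each node holds the chunk text plus metadata such as the source page, so it is worth a quick look at one of them (a minimal sketch; get_content() and metadata follow the llama_index node API used here).
# Peek at the first chunk and its metadata to check the split looks sensible
print(nodes[0].get_content()[:300])
print(nodes[0].metadata)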
Set the HF API key, download the Zephyr model and configure its basic parameters.
from google.colab import userdata
# huggingface api token
hf_token = userdata.get('hf_token')
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
)
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == 'system':
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == 'user':
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == 'assistant':
            prompt += f"<|assistant|>\n{message.content}</s>\n"
    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"
    return prompt
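To check what the wrapper produces, we can run it on a small message list (a sketch; it assumes ChatMessage can be imported from llama_index.llms in this version).
from llama_index.llms import ChatMessage
# Quick check of the Zephyr chat template built by messages_to_prompt
example_messages = [
    ChatMessage(role="system", content="You answer questions about the uploaded paper."),
    ChatMessage(role="user", content="What is HyDE?"),
]
print(messages_to_prompt(example_messages))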
llm = HuggingFaceLLM(
model_name="HuggingFaceH4/zephyr-7b-alpha",
tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
context_window=3900,
max_new_tokens=256,
model_kwargs={"quantization_config": quantization_config},
# tokenizer_kwargs={},
generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95, "do_sample":True},
messages_to_prompt=messages_to_prompt,
device_map="auto",
)
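Before wiring the model into an index, an optional completion call confirms that the quantized model loads and generates.
# Optional sanity check that the 4-bit model produces output
print(llm.complete("What is retrieval-augmented generation?"))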
Download the embedding model BAAI/bge-large-en-v1.5.
embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
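A quick way to verify the embedding model works is to embed one sentence and check the vector dimensionality (bge-large-en-v1.5 produces 1024-dimensional vectors).
# Embed a single sentence and check the vector dimensionality (1024 for bge-large-en-v1.5)
vector = embed_model.embed_query("retrieval augmented generation")
print(len(vector))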
Set up the index and the retriever.
# ServiceContext
service_context = ServiceContext.from_defaults(llm=llm,
embed_model=embed_model
)
# index
vector_index = VectorStoreIndex(
nodes, service_context=service_context
)
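The imported VectorIndexRetriever can also be used on its own to inspect which chunks would be retrieved for a query before any LLM synthesis happens (a sketch; similarity_top_k=3 is just an illustrative value).
# Retrieve the top chunks for a query directly, without LLM answer synthesis
retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=3)
retrieved_nodes = retriever.retrieve("what is the difference between the RAG method and finetuning?")
for node in retrieved_nodes:
    display_source_node(node, source_length=500)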
HyDE Query Transformation
HyDE (Hypothetical Document Embeddings) is a newer approach to dense retrieval: instead of embedding the raw query, the LLM first generates a hypothetical answer document for it, and the embedding of that document is used for retrieval.
First we ask a plain query against the base query engine, then the same query with the HyDE transformation applied, and later a summarization-style question as well.
I ask the queries in English because Zephyr is trained primarily on English text. It would answer in Czech too, but the result would be little more than a translation, and a rather poor one.
query_str = "what is the difference between the RAG method and finetuning?"
query_engine = vector_index.as_query_engine()
response = query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))
Answer:
According to the provided context, the primary differences between the RAG method and finetuning are:
- Initial cost: Creating embeddings for the RAG method has a low initial cost, while fine-tuning requires extensive work and a high initial cost.
- Output: RAG provides more verbose and harder-to-steer output, while fine-tuning offers a precise and succinct output.
- Input token size: RAG requires a larger input token size, while fine-tuning necessitates a minimal input token size, making it a more efficient option for handling large data sets.
Overall, RAG is highly effective in instances where data is contextually relevant and can improve accuracy in large models, while fine-tuning is highly effective for learning new skills in a specific domain and providing precise and succinct output. The choice between the two methods will depend on the specific needs and goals of the project.
hyde = HyDEQueryTransform(include_original=True, llm=llm)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response = hyde_query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))
The primary difference between the RAG method and finetuning is that RAG uses pre-trained language models to generate vector representations of data, whereas finetuning involves training a pre-trained language model on a specific task or domain using a smaller dataset.
RAG is an attractive option for generating vector representations of data due to its low initial cost, but it can result in more verbose and harder-to-steer output. On the other hand, finetuning offers a precise, succinct output that is attuned to brevity, but it requires extensive work to fine-tune the model on new data and has a higher initial cost.
In terms of benefits, RAG is highly effective in instances where data is contextually relevant, such as in the interpretation of farm data. Fine-tuning, on the other hand, presents opportunities to learn new skills in a specific domain, like improving crop yield predictions or optimizing irrigation schedules based on weather patterns.
Overall, the choice between RAG and finetuning depends on the specific needs and constraints of the task or domain.
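To see what HyDE changed under the hood, we can inspect the hypothetical document it generated and embedded in place of the raw query (this follows the standard HyDEQueryTransform usage; the generated text will differ between runs).
# Inspect the hypothetical document that HyDE generated for the query
query_bundle = hyde(query_str)
hyde_doc = query_bundle.embedding_strs[0]
print(hyde_doc)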
query_str = "What are the main conclusions of the study?"
query_engine = vector_index.as_query_engine()
response = query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))
The study outlined in the given context information presents a methodology for fine-tuning AI models for generating question-answer pairs within the context of agricultural data. The study evaluates various models and their performance in generating question-answer pairs for RAG on GPT-4, Vicuna, and Llama2 13B. The dataset used in the study includes data from the USA, Brazil, and India. The metrics used to evaluate the effectiveness of the proposed methodology focus on both question and answer evaluation. The study suggests that the findings could be applied to other sectors, potentially leading to the development of more efficient AI models for various applications, including AI copilots for different industries.
hyde = HyDEQueryTransform(include_original=True, llm=llm)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response = hyde_query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))
The main conclusions of the study are that while GPT-4 was only able to learn 47% of the new knowledge presented to it, fine-tuning significantly increased this percentage, with the fine-tuned model being able to learn up to 72% and 74% of the new knowledge, depending on whether RAG was used or not. These results suggest that fine-tuning can significantly improve the performance of GPT-4 in learning new knowledge.
The outputs show that the HyDE approach produces more precise and more factually accurate answers.
Sub-Question Query Engine
The basic approach works fairly well for direct questions that do not go very deep. For more complex queries the result may no longer be as accurate. If we want more relevant answers, we can try the Sub-Question Query Engine, which decomposes the query into a series of sub-questions, answers each of them against the data separately, and then synthesizes the final answer.
So let's set up the engine for sub-questions and answer synthesis.
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.callbacks import CallbackManager, LlamaDebugHandler
import nest_asyncio
nest_asyncio.apply()
# Using the LlamaDebugHandler to print the trace of the sub questions
# captured by the SUB_QUESTION callback event type
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])
# ServiceContext
service_context = ServiceContext.from_defaults(llm=llm,
embed_model=embed_model,
callback_manager=callback_manager
)
# index
vector_query_engine = VectorStoreIndex.from_documents(
docs, service_context=service_context, use_async=True
).as_query_engine()
# setup base query engine as tool
query_engine_tools = [
QueryEngineTool(
query_engine=vector_query_engine,
metadata=ToolMetadata(
name="RAG",
description="examples of the use of RAG and finetuning",
),
),
]
query_engine = SubQuestionQueryEngine.from_defaults(
query_engine_tools=query_engine_tools,
service_context=service_context,
use_async=True,
)
response = query_engine.query("Specific examples of the use of RAG and finetuning in the study and their results?")
Generated 2 sub questions.
[RAG] Q: How was RAG used in the study and what were the results?
[RAG] Q: Can you provide specific examples of the use of RAG and finetuning in the study?
[RAG] A: Yes, the study conducted fine-tuning and RAG on large language models such as LLama 2, GPT-3.5, and GPT-4 to address complex problems in agriculture. Specifically, the study fine-tuned the models on new data to learn new skills in a specific domain, such as improving crop yield predictions or optimizing irrigation schedules based on weather patterns. On the other hand, RAG was used to improve accuracy in large models, especially in instances where data is contextually relevant, such as in the interpretation of farm data.
For example, in the study, the researchers demonstrated how to generate relevant questions and answers for datasets for specific industries, by leveraging structured document understanding, together with GPT-4 for question generation, and RAG for answer generation. The questions generated were highly specific to the respective sections they were derived from, and the model was able to utilize the entire text for generating insightful and comprehensive answers.
The study also established performance baselines for LLMs when using RAG or fine-tuning, and presented insights on the strengths and limitations of LLMs within the agricultural domain. The results from Table 22 demonstrated that GPT-4 learned just
[RAG] A: In the study, RAG (Recursive Autoencoder with Gated Units) was used in conjunction with fine-tuning to improve the accuracy of large language models in addressing complex problems in agriculture. RAG was found to be highly effective in instances where data is contextually relevant, such as in the interpretation of farm data, and led to an increase in accuracy to 72% and 74% with fine-tuning and RAG, respectively. This represents a significant improvement over the baseline accuracy of 47% for the base model without RAG. However, it is important to note that the input token size can increase with RAG, leading to a more verbose and harder to steer output token size. Fine-tuning, on the other hand, offers a precise, succinct output and presents opportunities to learn new skills in a specific domain. The initial cost of fine-tuning is high, but it offers minimal input token size, making it a more efficient option for handling large data sets. Overall, RAG and fine-tuning offer valuable insights into the strengths and limitations of LLMs within the agricultural domain, and can lead to more efficient models starting from the Q&A generation process.
**********
print(response)
In the study, RAG (Recursive Autoencoder with Gated Units) was used in conjunction with fine-tuning to improve the accuracy of large language models in addressing complex problems in agriculture. Specifically, the study fine-tuned the models on new data to learn new skills in a specific domain, such as improving crop yield predictions or optimizing irrigation schedules based on weather patterns. For example, the researchers demonstrated how to generate relevant questions and answers for datasets for specific industries, by leveraging structured document understanding, together with GPT-4 for question generation, and RAG for answer generation. The questions generated were highly specific to the respective sections they were derived from, and the model was able to utilize the entire text for generating insightful and comprehensive answers. RAG was found to be highly effective in instances where data is contextually relevant, such as in the interpretation of farm data. The study found that RAG led to an increase in accuracy to 72% and 74% with fine-tuning and RAG, respectively, compared to the baseline accuracy of 47% for the base model without RAG.
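Because we attached the LlamaDebugHandler, the generated sub-questions and their intermediate answers can also be pulled out of the callback events afterwards (a sketch following the LlamaIndex sub-question example; payload field names may differ slightly between versions).
from llama_index.callbacks.schema import CBEventType, EventPayload
# Print each generated sub-question together with its intermediate answer
for i, (start_event, end_event) in enumerate(
    llama_debug.get_event_pairs(CBEventType.SUB_QUESTION)
):
    qa_pair = end_event.payload[EventPayload.SUB_QUESTION]
    print(f"Sub question {i}: {qa_pair.sub_q.sub_question.strip()}")
    print(f"Answer: {qa_pair.answer.strip()}")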
Keep in mind that this is a non-deterministic model (as most LLMs are), so the answers will differ noticeably between runs. What matters is that they remain factual answers grounded in the document rather than unrelated hallucinations.
Router Query Engine
Now we will build a router that selects the most suitable retrieval method from the approaches above. The router itself should decide which method best fits the given query and return the most accurate answer possible.
from llama_index import VectorStoreIndex, SummaryIndex, SimpleKeywordTableIndex
service_context = ServiceContext.from_defaults(llm=llm,
embed_model=embed_model
)
## Define all the different indexes over same data
# vector index
vector_index = VectorStoreIndex(
nodes, service_context=service_context
)
# summary index
summary_index = SummaryIndex(
nodes, service_context=service_context
)
# keyword index
keyword_index = SimpleKeywordTableIndex(nodes, service_context=service_context)
Set up the query engines and their metadata, then define the individual tools.
summary_query_engine = summary_index.as_query_engine(
response_mode="tree_summarize",
service_context=service_context
)
vector_query_engine = vector_index.as_query_engine(service_context=service_context)
keyword_query_engine = keyword_index.as_query_engine(service_context=service_context)
from llama_index.tools.query_engine import QueryEngineTool
summary_tool = QueryEngineTool.from_defaults(
query_engine=summary_query_engine,
description=(
"Useful for summarization questions related to Efficient RAG VS FINE-TUNING reserach paper"
),
)
vector_tool = QueryEngineTool.from_defaults(
query_engine=vector_query_engine,
description=(
"Useful for retrieving specific context from RAG VS FINE-TUNING reserach paper related to Efficient Finetuning "
),
)
keyword_tool = QueryEngineTool.from_defaults(
query_engine=keyword_query_engine,
description=(
"Useful for retrieving specific context from RAG VS FINE-TUNING reserach paper related to Efficient Finetuning "
"using entities mentioned in query"
),
)
Several LLM selectors are available, and we can pick the one that fits. Let's try them on concrete queries.
LLMSingleSelector
from llama_index.query_engine.router_query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector, LLMMultiSelector
router_query_engine = RouterQueryEngine(
selector=LLMSingleSelector.from_defaults(service_context=service_context),
query_engine_tools=[
summary_tool,
vector_tool,
keyword_tool,
],
service_context=service_context,
)
response = router_query_engine.query("what experiments were used in the study?")
print(str(response))
The experiments section of the study offers a comprehensive evaluation and analysis of various models and their performance in generating Q&A pairs within the context of agricultural data and using these for RAG or fine-tuning. The experiments are designed to be independent yet complementary, providing a multifaceted understanding of the performance of the models. The experiments explore the following areas: 1. Q&A Quality: This experiment assesses the quality of the generated Q&A pairs by three Large Language Models (LLMs), namely GPT-3, GPT-3.5, and GPT-4, under different context setups. The quality is evaluated based on several metrics including relevance, coverage, overlap, and diversity. 2. Context Study: This experiment investigates the impact of different context setups on the performance of the models in generating Q&A pairs. It evaluates the generated pairs under three context setups: no context, context, and external context. 3. Model to Metrics Calculation: This experiment compares how GPT-3.5 and GPT-4 behave when used to calculate the metrics for assessing the quality of the Q&A pairs.
LLMMultiSelector
router_query_engine = RouterQueryEngine(
selector=LLMMultiSelector.from_defaults(service_context=service_context),
query_engine_tools=[
summary_tool,
vector_tool,
keyword_tool,
],
service_context=service_context,
)
# ask the router a question (reusing the query from above)
response = router_query_engine.query("what experiments were used in the study?")
print(str(response))
Here it returns an answer to whatever question you ask; the engine selects the appropriate tools as needed.
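If you want to check which query engine the router actually selected, the response metadata typically records the selector's decision (the exact keys depend on the llama_index version).
# The router stores its tool-selection result in the response metadata
print(response.metadata)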
We can see that plain RAG does not always give optimal answers, but if we bring in advanced RAG techniques such as HyDE, Sub-Question Query or Router Query, we can get substantially better and more comprehensive answers thanks to better handling of the data. There is still plenty of room for experimentation here, because there is currently no universal method that works for everything. The study I examined reached the same conclusion: the methodology has to be adapted to the specific use case and continuously tested.