Choosing the Right Path: RAG or Fine-Tuning

sourajit roy chowdhury
8 min read · Oct 2, 2023

The development of Large Language Models (LLMs) has opened up a world of possibilities for building applications on top of their powerful generative capabilities. At its core, however, an LLM is still a machine learning model, and the effectiveness and suitability of the applications it powers depend on the accuracy of its predictions. This raises a question: how does an LLM make accurate predictions, and where does its knowledge come from? To gain insight into this, it’s essential to grasp the primary knowledge sources that underpin LLM-based applications. At a broad level, these applications draw from three key knowledge sources.

  1. In-context knowledge
  2. External knowledge
  3. Parametric knowledge

In the realm of Large Language Models (LLMs), “in-context knowledge” refers to information that fits within the LLM’s context window, i.e., the maximum amount of input (measured in tokens) that the model can process in a single request. In-context knowledge typically takes the form of a passage of context accompanied by a question. The question may have an answer that can be found within the provided context, or it may be unrelated to it. In either case, the LLM’s goal is to rely solely on the provided context to generate an accurate response to the user’s query.

In-context Knowledge
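To make this concrete, below is a minimal sketch of supplying in-context knowledge through a single prompt. The context passage, question, and model name are illustrative assumptions; any chat-capable LLM client would work similarly.

```python
# A minimal sketch of in-context knowledge: the context and the question are
# packed into one prompt that must fit inside the model's context window.
# The context text and model name below are illustrative assumptions.
from openai import OpenAI

context = (
    "Acme Corp reported Q2 revenue of $12.4M, up 8% year over year, "
    "driven primarily by its subscription business."
)
question = "How much did Acme Corp's Q2 revenue grow year over year?"

prompt = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say you don't know.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat-completion model would do
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```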

The term “external knowledge” lacks a precise definition, but in practice it covers two main scenarios. First, it can denote a body of information too large to fit within an LLM’s context window. Alternatively, it can refer to knowledge spread across diverse sources that must be integrated into the system at design time for the LLM to access it effectively.

To enable LLMs to access external knowledge and provide answers to user queries, we employ a concept known as retrieval-based systems or Retrieval Augmented Generation (RAG). One popular type of RAG involves breaking down extensive datasets into smaller segments and storing them in a vector database. When a user submits a query, the system retrieves the top-k relevant segments based on that query, which, along with the query itself, serves as the in-context knowledge, as discussed earlier. It’s important to note that implementing RAG-based systems involves numerous intricacies and complexities, and they continue to be an active area of research.
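Below is a rough sketch of that flow, using an in-memory list in place of a vector database. The chunking strategy, embedding model, and corpus are assumptions chosen only for illustration.

```python
# A minimal sketch of the RAG flow described above: split documents into
# chunks, embed them, retrieve the top-k chunks for a query, and build the
# in-context prompt. A production system would use a real vector database.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = ["...long document text...", "...another document..."]  # placeholder corpus

def chunk(text, size=500):
    # Naive fixed-size character chunking; real systems use smarter splitters.
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = [c for doc in documents for c in chunk(doc)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any embedding model
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=3):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q           # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]   # indices of the top-k chunks
    return [chunks[i] for i in top]

query = "What does the report say about Q2 revenue growth?"
context = "\n\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}"
# `prompt` now serves as the in-context knowledge passed to the LLM.
```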

Additionally, various types of retrieval systems can be employed based on the source of external knowledge:

1. Structured Data Source (CSV or SQL DB): In this scenario, the LLM translates the user query into SQL or Pandas code, which is then executed separately to retrieve relevant data from structured data sources (a brief sketch of this case appears below).

2. API as Data Source: Many applications utilize APIs to fetch pertinent data based on user queries. LLMs excel at deciphering user queries and selecting the appropriate API to call with the user’s data.

3. Tools as Data Source: Different tools can be employed to retrieve information in response to user queries. LLMs can analyze and redirect user queries to the relevant tool, such as a mathematical, search, or design tool, based on the nature of the inquiry.

Retrieval Based LLM System
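Here is a brief sketch of the structured-data case from item 1 above, where the LLM generates SQL that is executed separately against the database. The schema, model name, and prompt wording are assumptions.

```python
# Sketch of text-to-SQL retrieval: the LLM writes the query, and the database
# (not the LLM) produces the answer. Schema and model name are assumptions.
import sqlite3
from openai import OpenAI

client = OpenAI()
conn = sqlite3.connect("sales.db")  # assumption: an existing SQLite database

schema = "CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, order_date TEXT);"
question = "What is the total order amount in the EMEA region?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption
    messages=[{
        "role": "user",
        "content": (
            "Given this SQLite schema:\n"
            f"{schema}\n"
            f"Write a single SQL query answering: {question}\n"
            "Return only the SQL, with no explanation."
        ),
    }],
)

# Naive cleanup in case the model wraps the SQL in backticks.
sql = response.choices[0].message.content.strip().strip("`")
rows = conn.execute(sql).fetchall()  # execute the generated query separately
print(rows)
```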

“Parametric knowledge”, in essence, signifies the knowledge that becomes ingrained within an LLM’s parameters as it undergoes training. In technical terms, these parameters are commonly referred to as the model’s weights and biases. The acquisition of parametric knowledge occurs through a two-step process.

First, during the initial pre-training phase of the LLM, an extensive dataset from the open internet is employed. Subsequently, the pre-trained LLM is fine-tuned using specialized data from a particular domain or niche. This fine-tuning process often utilizes private data exclusive to the organization or individual involved.

Fine-tuning LLM
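As a concrete (and deliberately simplified) illustration of the second step, the sketch below applies parameter-efficient fine-tuning (LoRA) to a small causal LM with Hugging Face transformers and peft. The base model, dataset file, and hyperparameters are assumptions, not recommendations.

```python
# A minimal sketch of parameter-efficient fine-tuning (LoRA) on a
# domain-specific dataset. Model name, data file, and hyperparameters
# are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with low-rank adapters so only a small fraction of
# parameters is updated during fine-tuning.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Assumption: a local JSONL file with a "text" field of domain examples.
dataset = load_dataset("json", data_files="domain_data.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```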

Fine-tuning adjusts the way a model operates, whereas Retrieval Augmented Generation (RAG) enriches the model’s understanding by introducing external contextual information during the inference process.

Leveraging both RAG and fine-tuning in conjunction can be a powerful approach. However, the appeal of RAG lies in its transparency compared to the more complex fine-tuning process. It’s simpler to implement and monitor, as you can visually assess the input and output of the Large Language Model (LLM).

In the upcoming section, we will analyze both techniques.

Technical Analysis

Fine-tuning

  1. Enhancing the model’s performance by adjusting its parameters with data specific to the particular domain.
  2. Carefully selecting and preparing data, either as instruction-response pairs for supervised instruction tuning or as chat-style conversations for fine-tuning chat models (illustrated in the sketch after this list).
  3. After fine-tuning, the model’s parameters are frozen at a point in time; adapting to new datasets requires re-fine-tuning.
  4. Can produce unwanted hallucinations, which need to be dealt with.
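The sketch below illustrates the two data shapes mentioned in item 2: an instruction-response record for supervised instruction tuning and a chat-style record for fine-tuning chat models. Both records are made up purely for illustration.

```python
# Illustrative (made-up) records showing the two common fine-tuning data shapes.
instruction_record = {
    "instruction": "Summarize the risk factors in the following filing excerpt.",
    "input": "The company derives 60% of revenue from a single customer...",
    "output": "Revenue concentration in one customer is the primary risk...",
}

chat_record = {
    "messages": [
        {"role": "system", "content": "You are a financial research assistant."},
        {"role": "user", "content": "What is a 10-K filing?"},
        {"role": "assistant", "content": "A 10-K is an annual report that..."},
    ]
}
```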

RAG

  1. Implementing a data ingestion pipeline to regularly collect data, including updates from APIs and the integration of new data sources.
  2. Enhancing engineering efficiency by employing various retrieval methods tailored to specific use-cases.
  3. Establishing an agile architecture capable of managing multiple tools, APIs, or data retrievers.
  4. Careful prompt engineering can drive hallucination down to a minimum (see the grounding-prompt sketch after this list).
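A grounding prompt along the lines of item 4 might look like the sketch below: the retrieved chunks are numbered, and the model is told to answer only from them and to cite which ones it used. The wording and chunk texts are illustrative assumptions.

```python
# Sketch of a grounding prompt that constrains the model to the retrieved
# chunks and asks for source citations. Chunk texts are illustrative.
retrieved_chunks = [
    "Index funds passively track a market index such as the S&P 500.",
    "Actively managed funds typically charge higher expense ratios.",
]

numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))

grounded_prompt = (
    "Answer the question using ONLY the numbered sources below. "
    "Cite the source numbers you used, and reply exactly "
    "'I don't know based on the provided sources.' if the answer is not there.\n\n"
    f"Sources:\n{numbered}\n\n"
    "Question: How do index funds differ from actively managed funds?"
)
```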

Cost Analysis

Fine-tuning

  1. The training expenses can escalate due to the necessity of specialized GPU accelerators, with cloud options costing around $30 per hour.
  2. Because the model’s parameters are frozen after training, keeping it current requires periodic re-fine-tuning on fresh datasets, which adds to the expense.
  3. Conversely, prompts at inference time can be shorter and more direct, saving tokens per request. This efficiency allows cost-effective operation of applications with higher query rates per minute.

RAG

  1. Since there is no need for training, you won’t incur any expenses associated with GPU-based hardware.
  2. However, it’s worth noting that managing the data intake process and making calls to external APIs may lead to ongoing costs, especially when you need to bring in new data or make updates to APIs.
  3. When dealing with extensive prompts that encompass both user queries and external contextual knowledge, it’s important to be aware that in high-volume applications, the cost per token can become quite significant.
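A back-of-the-envelope comparison of the per-query token economics implied by the points above might look like the sketch below. All prices and token counts are made-up assumptions, not actual vendor pricing.

```python
# Toy cost comparison: short parametric-knowledge prompts vs. long RAG prompts.
# Prices and token counts are illustrative assumptions only.
PRICE_PER_1K_INPUT_TOKENS = 0.001   # assumed $ per 1K input tokens
PRICE_PER_1K_OUTPUT_TOKENS = 0.002  # assumed $ per 1K output tokens

def query_cost(input_tokens, output_tokens):
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# Fine-tuned model: short, direct prompt because the knowledge is parametric.
fine_tuned = query_cost(input_tokens=150, output_tokens=300)

# RAG: the same question plus several retrieved chunks in the prompt.
rag = query_cost(input_tokens=3500, output_tokens=300)

monthly_queries = 1_000_000
print(f"Fine-tuned: ${fine_tuned * monthly_queries:,.0f}/month")
print(f"RAG:        ${rag * monthly_queries:,.0f}/month")
```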

Application Analysis

Fine-tuning

  1. Fine-tuning Large Language Models (LLMs) can significantly enhance the performance of applications associated with Artificial Narrow Intelligence (ANI), such as sentiment analysis, machine translation, and entity extraction, where they can outperform smaller language models.
  2. Fine-tuning becomes particularly valuable for use cases that possess extensive, carefully curated supervised datasets specific to their domain.
  3. In scenarios where applications demand rapid throughput and low latency, a fine-tuned LLM may serve as a suitable option. However, it’s crucial to conduct a more in-depth and comprehensive assessment before finalizing this choice.

RAG

  1. Applications that depend on fast-moving external information sources.
  2. The creation of conversational agents requiring access to external tools, APIs, databases, and similar information sources.
  3. Applications demanding precision, with robust mechanisms for tracing back to the sources utilized by the Large Language Model to generate responses, all while maintaining a high level of transparency.

So far, we’ve explored both the fine-tuning and RAG approaches independently and examined them from various angles. Now, let’s consider whether these two approaches can be used together effectively.

Combining RAG & Fine-tuning

We will delve into the integrated approach by examining a practical use case.

Use-Case Description: Our objective is to develop an intelligent conversational agent specialized in the field of financial investment. This agent should be capable of addressing a wide range of user inquiries specific to this domain. It should not only comprehend the intricacies of financial jargon but also possess the ability to evaluate complex relationships to provide nuanced answers to end-user queries.

It’s clear that a general pre-trained LLM alone may not consistently deliver the desired results in terms of efficiency, complexity, accuracy, and reliability. All of these aspects matter, as any shortcoming could erode the value of the investment organization in the eyes of its stakeholders.

System Design: The integrated approach involving both fine-tuning and RAG holds significant importance. The question of what to fine-tune and how to leverage RAG effectively may arise. While the detailed system design encompasses multiple facets and requires meticulous planning of the overall architecture, I will provide a high-level concept on how to utilize both fine-tuning and RAG concepts to address the aforementioned use case.

Fine-tuning can be approached in at least two phases. The first phase involves enhancing the parametric knowledge of a base Large Language Model (LLM) to comprehend financial language intricacies, understand complex relationships, and decode other nuanced aspects. The second phase focuses on fine-tuning the embedding model using financial data. This ensures that semantic and syntactic similarities are accurately captured in the dense embeddings, which can be more valuable than standard pre-trained embedding models. However, it’s important to note that the fine-tuning process may not be straightforward and can pose challenges such as continuous learning, gradient norm growth, and stagnation in validation loss.
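For the second phase, a minimal sketch of fine-tuning an embedding model on financial question-passage pairs with sentence-transformers is shown below; the base model and the training pairs are illustrative assumptions.

```python
# Sketch of fine-tuning an embedding model on (query, relevant passage) pairs
# so that financial semantics are better captured. Data is made up.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any base embedder

train_examples = [
    InputExample(texts=[
        "What is the expense ratio of the Alpha Growth Fund?",
        "The Alpha Growth Fund charges an annual expense ratio of 0.45%.",
    ]),
    InputExample(texts=[
        "How is a bond's yield to maturity calculated?",
        "Yield to maturity is the discount rate that equates a bond's price "
        "with the present value of its future cash flows.",
    ]),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # treats other in-batch passages as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("finetuned-financial-embedder")
```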

The assessment of the fine-tuned model becomes especially crucial and relies heavily on the specific use case. In the context of financial investment, it is imperative that the model maintains a high standard of ethics and transparency in its predictions. This often necessitates thorough benchmark testing to ensure its reliability and accuracy.

On the flip side, once the fine-tuning phase is complete, we move on to the primary development of our conversational agent, incorporating the Retrieval-Augmented Generation (RAG) concept. In this stage, the agent retrieves relevant data from the vector database and passes it, along with the user’s query, to the fine-tuned LLM, which crafts a nuanced, personalized response.

Moreover, the fine-tuned embedding model will play a crucial role in vectorizing data during the data ingestion phase and also in generating vectors for user queries during the retrieval process.
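A compact sketch of how the two fine-tuned components might come together at inference time is shown below. The model paths and the retrieve helper (assumed to query the vector database built during ingestion) are hypothetical.

```python
# Sketch of combined inference: the fine-tuned embedder vectorizes the query
# for retrieval, and the fine-tuned LLM generates the grounded answer.
# Paths and the `retrieve` helper are hypothetical.
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

embedder = SentenceTransformer("./finetuned-financial-embedder")   # assumed path
tokenizer = AutoTokenizer.from_pretrained("./finetuned-financial-llm")
llm = AutoModelForCausalLM.from_pretrained("./finetuned-financial-llm")

def answer(query, retrieve):
    # `retrieve` is assumed to look up the vector database built during
    # ingestion with embeddings from the same fine-tuned embedder.
    context = "\n\n".join(retrieve(embedder.encode(query), k=4))
    prompt = (f"Context:\n{context}\n\nQuestion: {query}\n"
              "Answer using only the context above.")
    inputs = tokenizer(prompt, return_tensors="pt")
    output = llm.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```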

It’s worth noting that the creation of a highly scalable and accurate RAG system presents numerous intricate challenges, both in terms of architecture and design, which are essential for achieving optimal results for your users.

The decision on whether to employ fine-tuning or RAG hinges on the specific task and the needs of the application.

Fine-tuning is preferable when striving for superior performance in tasks that demand the model to grasp intricate patterns and relationships. This method is well-suited for scenarios such as customer code migration and machine translation.

RAG offers a more efficient and transparent approach, making it a viable option when dealing with tasks where obtaining labeled data is a challenge due to scarcity or high costs. It shines in use cases like creative content generation and Question & Answering, offering improved accuracy and reduced risk of generating incorrect information.

It’s important to note that fine-tuning and RAG are not in competition; instead, they complement each other by enhancing Language Model capabilities. Each technique has its distinct objectives, mechanisms, advantages, and drawbacks, but they can also work together synergistically for mutual benefit.
