Integrated Knowledge Solutions

Retrieval Augmented Generation: What is it and Why do we need it?

What is Retrieval Augmented Generation?

Generative AI is currently garnering lots of attention. While the responses provided by large language models (LLMs) are satisfactory in most situations, sometimes we want better-focused responses when employing LLMs in specific domains. Retrieval-augmented generation (RAG) offers one such way to improve the output of generative AI systems. RAG enhances the LLMs' capabilities by providing them with additional knowledge context through information retrieval. Thus, RAG aims to combine the strengths of both retrieval-based methods, which focus on selecting relevant information, and generation-based methods, which produce coherent and fluent text. 

RAG works in the following way:

  1. Retrieval: The process starts with retrieving relevant documents, passages, or pieces of information from a pre-defined corpus or database. These retrieved sources contain content that is related to the topic or context for which you want to generate text.
  2. Generation: After retrieving the relevant content, the generation step takes over. It uses the retrieved information as input or context to guide the generation of coherent and contextually relevant text. Most commonly, the retrieved content is placed in the prompt given to a large language model such as GPT-3; the model can also be fine-tuned on the retrieved content.
  3. Combination: The generated text is produced while taking into consideration both the retrieved information and the language model's inherent creative abilities. This allows the generated text to be more informative, accurate, and contextually appropriate.

How is RAG Useful?

Retrieval-augmented generation is useful for several reasons:

  1. Content Quality: By incorporating information from retrieved sources, the generated text can be more accurate, relevant, and factually sound. This is particularly important for applications where accuracy and credibility are crucial.
  2. Data Augmentation: Retrieval-augmented generation can be used to expand the dataset for fine-tuning language models. By combining the model's generative capabilities with real-world information, it can learn to produce more contextually relevant and diverse text.
  3. Expertise Integration: In domains that require domain-specific knowledge or expertise, retrieval-augmented generation can ensure that the generated content aligns with expert knowledge.
  4. Abstractive Summarization: When generating summaries, retrieval-augmented approaches can help ensure that the generated summary captures the most important and relevant information from the source documents.
  5. Question Answering: In question answering tasks, retrieval-augmented generation can improve the accuracy of generated answers by incorporating relevant information from a corpus of documents.
  6. Content Personalization: For chatbots and content generation systems, retrieval-augmented generation can enable more personalized and contextually relevant responses by incorporating information retrieved from a user's history or relevant documents.

The success of the RAG approach depends greatly on how semantically close the retrieved documents are to the user request that the generative AI system is responding to. Retrieving meaningful chunks of text is done by nearest neighbor search implemented in a vector database, with the text being represented by embeddings. Look for my next post to learn about this aspect of RAG implementation.
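The retrieval and prompting steps can be sketched in a few lines of Python. This is only a toy illustration: the three-sentence corpus, the `embed` function (a bag-of-words count standing in for a learned embedding model), and the prompt wording are all assumptions made for this example, and a real system would use a vector database rather than a sorted list.

```python
import math
from collections import Counter

# Toy corpus standing in for a document store (hypothetical content).
CORPUS = [
    "LoRA fine-tunes large language models with low-rank weight updates.",
    "Symbolic regression searches a space of analytical expressions.",
    "A vector database supports nearest neighbor search over embeddings.",
]

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned model."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return Counter(w for w in words if w)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Place the retrieved passages in the prompt as context for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

query = "What does symbolic regression search?"
top = retrieve(query, CORPUS, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The resulting prompt would then be sent to the LLM of your choice; that call is left out because it depends on the provider.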

It's important to note that retrieval-augmented generation is a research-intensive area and involves challenges such as selecting the right retrieval sources, managing biases in retrieved content, and effectively integrating retrieved information with the language model's creative capabilities. However, it holds promise for improving the quality and utility of generated text across various NLP applications.









LLaMA 2 and its Symbolic Regression Explanation

On July 17, Meta announced LLaMA 2, a new family of AI models. LLaMA 2 is trained on a mix of publicly available data. According to Meta, LLaMA 2 performs significantly better than the previous generation of LLaMA models. Two flavors of the model were released: LLaMA 2 and LLaMA 2-Chat, a model fine-tuned for two-way conversations. Each flavor comes in three versions, with the number of parameters ranging from 7 billion to 70 billion. Meta is also freely releasing the model weights and code for researchers to build upon and improve the technology.

There are several ways to access LLaMA 2 for development work; you can download it from Hugging Face or access it via Microsoft Azure or Amazon SageMaker. For those interested in interacting with the LLaMA 2-Chat version, you can do so by visiting llama2.ai, a chatbot model demo hosted by the venture capital firm Andreessen Horowitz. This is the route I took to interact with LLaMA 2-Chat.

Since I was reading an excellent paper on symbolic regression, I decided to query LLaMA 2-Chat about this topic. Before I show my chat with the model, let me explain symbolic regression in case you are not familiar with it. In traditional regression, the model form (linear, polynomial, etc.) is assumed in advance and only the coefficients/parameters of the model are determined to get the best possible accuracy. In contrast, symbolic regression searches a space of analytical expressions, together with the corresponding parameter values, to find the expression that best models a given dataset. 
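To make the contrast concrete, here is a deliberately tiny sketch of the idea. It is not how practical symbolic regression tools work (they typically search a far larger expression space with genetic programming); it simply enumerates a hand-picked set of candidate forms, fits one coefficient for each by least squares, and keeps the best. The data, the candidate list, and all names are assumptions made for this illustration.

```python
import math

# Synthetic data generated from y = 3 * x**2 (an assumption for illustration).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [3.0 * x ** 2 for x in xs]

# A tiny, hand-picked space of candidate expressions of the form y = c * f(x).
CANDIDATES = {
    "c * x": lambda x: x,
    "c * x**2": lambda x: x ** 2,
    "c * sin(x)": math.sin,
    "c * exp(x)": math.exp,
    "c * log(x)": math.log,
}

def fit(f):
    """Least-squares estimate of c for y = c * f(x), plus the squared error."""
    fx = [f(x) for x in xs]
    c = sum(a * y for a, y in zip(fx, ys)) / sum(a * a for a in fx)
    sse = sum((y - c * a) ** 2 for a, y in zip(fx, ys))
    return c, sse

# Search over the expression space: both the form and its parameter are chosen.
results = {name: fit(f) for name, f in CANDIDATES.items()}
best = min(results, key=lambda name: results[name][1])
print(best, results[best])
```

Ordinary regression corresponds to fixing one entry of `CANDIDATES` ahead of time and calling `fit` once; symbolic regression treats the choice of entry as part of the search.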

I started off by asking if LLaMA 2-Chat is better than GPT-4. I followed it up by asking about symbolic regression as shown below. 


The answer provided was not specific. So I asked LLaMA 2 for a concrete example. This resulted in the conversation shown below.

Clearly, the example provided is that of linear regression and not of symbolic regression. Pointing this out to LLaMA 2 resulted in the following conversation, where again I had to point out that symbolic regression involves searching over different functions.



As you can see, LLaMA 2 had difficulty explaining symbolic regression and had to be prompted about its mistakes. Next, I decided to go to ChatGPT to see what kind of response it would produce. Below is the ChatGPT output.






As you can see, ChatGPT was clear in explaining symbolic regression and even mentioned the use of genetic algorithms and genetic programming, which are key to symbolic regression.


So my take is to stick with ChatGPT for getting help on topics of interest. LLaMA 2 is lacking in providing clear explanations. Of course, my take is based on a conversation about only one topic.












Low Rank Adaptation (LoRA): Enhancing Fine-Tuning of LLMs

Pre-trained large language models (LLMs) are being used for numerous natural language processing applications. These models perform well out of the box and can be fine-tuned for any desired downstream application. However, fine-tuning these models to adapt to specific tasks often poses challenges due to their very large number of parameters. To address this, a technique called Low Rank Adaptation (LoRA) has emerged, enabling efficient fine-tuning of LLMs. In this post, we will try to understand LoRA, and delve into its importance and application in fine-tuning LLMs. We will begin our journey by first looking at the concept of rank of a matrix, followed by a look at matrix factorization, and then move on to LoRA.

Rank of a Matrix

The rank of a matrix is the number of linearly independent rows (or, equivalently, columns) in the matrix. As an example, consider the following 4x4 matrix A:

A = [[2, 4, 6, 8], [1, 3, 5, 7], [4, 8, 12, 16], [3, 9, 15, 21]]

Looking at the first and third rows of this matrix, we see that the third row is just the first row scaled up by a factor of 2. Similarly, the fourth row is the second row scaled up by a factor of 3. Thus, the rank of matrix A is 2 as there are only two independent rows. 
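The rank can also be checked numerically. The short sketch below assumes NumPy is available and uses its `numpy.linalg.matrix_rank` function on the same matrix.

```python
import numpy as np

# The 4x4 example matrix A from the text.
A = np.array([
    [2, 4, 6, 8],
    [1, 3, 5, 7],
    [4, 8, 12, 16],
    [3, 9, 15, 21],
])

# Rows 3 and 4 are multiples of rows 1 and 2, so they add no new information.
third_is_multiple = np.array_equal(A[2], 2 * A[0])
fourth_is_multiple = np.array_equal(A[3], 3 * A[1])

# matrix_rank counts the singular values above a small numerical tolerance.
rank = np.linalg.matrix_rank(A)
print(third_is_multiple, fourth_is_multiple, rank)
```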

The rank of a matrix of size m x n cannot be greater than min{m,n}. In other words, the rank of a matrix cannot be greater than the smaller of its two dimensions. We say a matrix is a full rank matrix if its rank equals the largest possible rank for that matrix. 

When a matrix is not a full rank matrix, it tells us that the matrix has some redundancy in it that can be exploited for data compression or dimensionality reduction. This is done by obtaining a low-rank approximation of the matrix. The process of obtaining a low-rank approximation of a matrix involves matrix factorization. Some of these factorization methods are briefly described below. 

Matrix Factorization

Matrix factorization is the process of decomposing a matrix into a product of multiple factors. Some of the common matrix factorization methods are:

1. Singular Value Decomposition (SVD)

In SVD, a real-valued matrix $A$ of size m x n is factorized as $A = UDV^t$, where $U$ is an orthogonal matrix of size m x m whose columns are the left singular vectors and $V$ is an orthogonal matrix of size n x n whose columns are the right singular vectors. The matrix $D$ is a diagonal matrix of size m x n holding the singular values $d_{jj}$ in decreasing order. If $A$ has rank r, a low rank approximation to it is obtained by keeping only the k largest singular values and the corresponding left and right singular vectors $U_j$ and $V_j$ (the j-th columns of $U$ and $V$), as given by the following expression. In other words, the approximation is obtained as a weighted sum of rank one matrices.
$ \hat{A} = \sum\limits_{j=1}^{k} d_{jj} U_j V_j^t,\text{   }k\leq r$

SVD is a popular matrix factorization method that is commonly used for data compression and dimensionality reduction. It has also been used for compressing convolutional neural networks. You can read more about SVD and its use for compression at this blog post.
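As a small illustration of the low rank approximation, the sketch below applies NumPy's `numpy.linalg.svd` to the 4x4 rank-2 matrix used in the rank discussion and rebuilds it from its first one and first two rank one terms. The helper name `low_rank` is my own, introduced only for this example.

```python
import numpy as np

# Rank-2 example matrix from the section on matrix rank.
A = np.array([
    [2, 4, 6, 8],
    [1, 3, 5, 7],
    [4, 8, 12, 16],
    [3, 9, 15, 21],
], dtype=float)

# d holds the singular values in decreasing order; Vt is V transposed.
U, d, Vt = np.linalg.svd(A)

def low_rank(k):
    """Sum of the first k rank one terms d_j * U_j * V_j^t."""
    return sum(d[j] * np.outer(U[:, j], Vt[j, :]) for j in range(k))

A1 = low_rank(1)  # best rank-1 approximation (in the least-squares sense)
A2 = low_rank(2)  # uses both non-zero singular values

err1 = np.linalg.norm(A - A1)
err2 = np.linalg.norm(A - A2)
print(d, err1, err2)
```

Because the matrix has rank 2, only two singular values are meaningfully non-zero, and two terms are enough to reconstruct it up to floating-point error.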

2. Principal Component Analysis (PCA)

PCA aims to find the principal components that capture the most significant variance in the data. It works with data matrices that have been centered so that every feature has zero mean. Let's say $X$ of m rows and n columns is one such data matrix where each row represents an observation vector of n features. PCA computes the eigenvalues and eigenvectors of the n x n covariance matrix $C = \frac{1}{m-1}X^tX$ by factorizing it as $C = WDW^t$, where $W$ is an orthogonal matrix of eigenvectors and $D$ is the diagonal matrix of eigenvalues. Projecting the data onto the eigenvectors with the largest eigenvalues yields a lower dimensional representation, which makes PCA a popular technique for dimensionality reduction.
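The steps just described can be sketched with NumPy as follows. The data here is synthetic (random numbers with unequal spread along three features), and all variable names are assumptions for this illustration; in practice a library routine such as scikit-learn's PCA would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m = 100 observations (rows) of n = 3 features (columns).
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])
X = X - X.mean(axis=0)          # center every feature to zero mean
m = X.shape[0]

# Covariance matrix of the features, size n x n.
C = X.T @ X / (m - 1)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, W = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]       # re-sort to descending variance
eigvals, W = eigvals[order], W[:, order]

# Keep the two leading principal components for dimensionality reduction.
Z = X @ W[:, :2]
print(eigvals, Z.shape)
```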

3. Non-Negative Matrix Factorization (NMF)

NMF is another technique for obtaining a low rank representation of matrices whose elements are all non-negative. Given a data matrix $A$ of m rows and n columns with every element $a_{ij} ≥ 0$, NMF seeks matrices $W$ and $H$ of size m rows and k columns, and k rows and n columns, respectively, such that $A≈WH$, and every element of matrices $W$ and $H$ is either zero or positive. The value of k is set by the user and is required to be less than or equal to the smaller of m and n. The matrix $W$ is generally called the basis matrix, and $H$ is known as the expansion or coefficient matrix. The underlying idea of this terminology is that each column of the data matrix $A$ can be expressed as a sum of the k basis vectors (columns of $W$) weighted by the coefficients in the corresponding column of $H$. Compared to SVD, the NMF based factorization offers a better interpretation of the original data matrix as it is represented/approximated as a sum of non-negative matrices/vectors. NMF has been used for document clustering, making recommendations, visual pattern recognition such as face recognition, gene expression analysis, feature extraction, source separation, etc. Basically, it can be used in any application where the data matrix $A$ has no negative elements. You can read more about NMF at this blog post.
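One common way to compute such a factorization is the multiplicative update rule of Lee and Seung, which is not described in this post but is short enough to sketch here. The function name `nmf`, the iteration count, and the small constant `eps` (added to avoid division by zero) are assumptions for this illustration; library implementations such as the one in scikit-learn are more robust.

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Approximate A (m x n, non-negative) as W @ H with W: m x k, H: k x n."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps    # random non-negative starting point
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(iters):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(A - W @ H))
    return W, H, errors

# Non-negative example matrix (the rank-2 matrix from the rank section).
A = np.array([
    [2, 4, 6, 8],
    [1, 3, 5, 7],
    [4, 8, 12, 16],
    [3, 9, 15, 21],
], dtype=float)

W, H, errors = nmf(A, k=2)
print(errors[0], errors[-1])
```

Each column of `W @ H` is a non-negative combination of the two columns of `W`, which is what gives NMF its parts-based interpretation.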


Low Rank Adaptation (LoRA) of Large Language Models

The first thing to note is that LoRA doesn't perform a low rank approximation of the weight or parameter matrix; rather, it modifies it by generating a new low rank matrix that captures the parameter changes needed as a result of fine-tuning the LLM. The pre-trained matrix $W$ is frozen during fine-tuning and the weight changes are captured in a delta weight matrix $\Delta W$ through gradient-based learning. The delta weight matrix is a low rank matrix which is set as a product of two small matrices, i.e. $\Delta W = AB$. The $A$ matrix is initialized with values drawn from a Gaussian distribution while the $B$ matrix is initialized with all elements equal to zero. This ensures that $\Delta W$ is zero, and hence the pre-trained weight matrix is the only contributing matrix, at the start of fine-tuning. The figure below illustrates this setup for LoRA.


LoRA Scheme: Matrix W is kept fixed and only A and B are trained.
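The effect of this initialization can be seen in a minimal NumPy sketch of a single adapted layer. The layer sizes, the random stand-in for the pre-trained weights, and the function name `forward` are assumptions for this illustration; no actual training is performed, and the "update" to $B$ is just a made-up perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 5, 2     # small, arbitrary sizes for illustration

W = rng.normal(size=(d_in, d_out))   # stand-in for the frozen pre-trained weights
A = rng.normal(size=(d_in, r))       # trainable, Gaussian initialization
B = np.zeros((r, d_out))             # trainable, all zeros at the start

def forward(x, W, A, B):
    """Output of the adapted layer, x (W + AB), without forming AB explicitly."""
    return x @ W + (x @ A) @ B

x = rng.normal(size=(3, d_in))

# With B = 0 the delta weight AB vanishes, so the layer behaves like the original.
same_at_start = np.allclose(forward(x, W, A, B), x @ W)

# Pretend a training step moved B away from zero: the output now shifts.
B_trained = B + 0.1
changed_after_update = not np.allclose(forward(x, W, A, B_trained), x @ W)
print(same_at_start, changed_after_update)
```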

Let's now try to understand the reasoning behind LoRA and its advantages. The main motivation is that the pretrained models are over-parameterized with low intrinsic dimensionality. Further, the authors of LoRA hypothesize that the change in weights during model fine-tuning also has a low intrinsic rank. Thus, it suffices to use a low rank matrix to capture the weight changes during fine-tuning. LoRA offers several advantages. First, it is possible to share the pretrained model across several downstream tasks, with each task having its own LoRA model. This saves storage and makes task switching easier. Second, LoRA makes the adaptation of LLMs to different tasks easier and more efficient. Third, it is easy to combine with other fine-tuning methods, if desired. As an example of the parameter efficiency of LoRA, consider a pretrained matrix of size 200x400. To perform adaptation, let matrix $A$ be of size 200x8 and matrix $B$ be of size 8x400, giving rise to a delta weight matrix of the desired size of 200x400. The number of parameters needed by LoRA is thus only 200*8+8*400 = 4800, compared to the 200*400 = 80000 parameters that would need to be adjusted without LoRA.
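The arithmetic in this example is easy to verify. The sketch below, which assumes NumPy and fills $A$ and $B$ with random values purely for illustration, counts the parameters and confirms that the product has the full 200x400 shape while its rank is limited to 8.

```python
import numpy as np

m, n, r = 200, 400, 8        # sizes taken from the example in the text

full_params = m * n          # parameters adjusted by full fine-tuning
lora_params = m * r + r * n  # parameters in A (m x r) plus B (r x n)

rng = np.random.default_rng(0)
A = rng.normal(size=(m, r))  # random values, only to inspect shape and rank
B = rng.normal(size=(r, n))
delta_W = A @ B

rank = np.linalg.matrix_rank(delta_W)
print(full_params, lora_params, delta_W.shape, rank)
```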

An important consideration in using LoRA is the choice of the rank of the $\Delta W$ matrix. Choosing a smaller rank leads to a simpler low-rank matrix, which results in fewer parameters to learn during adaptation. However, the adaptation with a smaller rank $\Delta W$ may not lead to the desired performance. Thus, the rank choice offers a tradeoff that typically requires experimentation to get the best adaptation. 

LoRA in PEFT

PEFT is a general parameter-efficient fine-tuning library from Hugging Face that includes LoRA as one of its techniques. The few lines of code below illustrate its basic use.

       

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, LoraConfig, TaskType

# Pre-trained sequence-to-sequence model to adapt
model_name_or_path = "bigscience/mt0-large"

# LoRA settings: r is the rank of the delta weight matrix
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

# Load the base model and wrap it so that only the LoRA matrices are trainable
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)

# Report how many parameters will actually be trained
model.print_trainable_parameters()
# output: trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282

       
 

In the above example, the mt0-large model is being fine-tuned for a sequence-to-sequence task. The rank of the delta weight change is specified as 8. The model has 1.2B parameters but LoRA needs to train only 2.36M parameters, about 0.19% of the total. If we change the rank to 12, the number of trainable parameters increases to 3538944, about 0.29% of the total. Clearly, the choice of rank is an important consideration when using LoRA.
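These percentages can be reproduced from the printed counts with a few lines of plain Python. The sketch relies on one assumption that I am stating explicitly: for a fixed set of adapted layers, the number of LoRA parameters grows linearly with the rank, so the rank 8 count can be rescaled to other ranks. The function name `lora_stats` is my own.

```python
# Counts reported by print_trainable_parameters() for r = 8.
trainable_r8 = 2_359_296
all_params_r8 = 1_231_940_608

# Parameters of the frozen base model (everything that is not LoRA).
base_params = all_params_r8 - trainable_r8

def lora_stats(r):
    """Trainable parameter count and percentage for rank r, assuming linear scaling in r."""
    trainable = trainable_r8 // 8 * r
    total = base_params + trainable
    return trainable, 100 * trainable / total

for r in (8, 12):
    trainable, pct = lora_stats(r)
    print(f"r={r}: trainable={trainable}, trainable%={pct:.3f}")
```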

LoRA's performance has been evaluated against full fine-tuning and other parameter-efficient fine-tuning techniques. It has generally been found to outperform the other efficient techniques while yielding comparable or better performance than full fine-tuning. 

To wrap up, LoRA is an efficient technique for fine tuning large pretrained models. It is poised to play an important role in fine tuning and customizing LLMs for numerous applications.

It would be my pleasure to hear your comments/suggestions to make this site more interesting.