google.com, pub-5261878156775240, DIRECT, f08c47fec0942fa0 Integrated Knowledge Solutions: Language Models
Showing posts with label Language Models. Show all posts
Showing posts with label Language Models. Show all posts

Difference Between Semi-Supervised Learning and Self-Supervised Learning

There are many styles of training machine learning models including the familiar supervised and unsupervised learning to active learning, semi-supervised learning and self-supervised learning. In this post, I will explain the difference between semi-supervised and self-supervised styles of learning. To get started, let us first recap what is supervised learning, the most popular machine learning methodology to build predictive models. Supervised learning uses annotated or labeled data to train predictive models. A label attached to a data vector is nothing but the response that the predictive model should generate  for that data vector as input during the model training. For example, we will label pictures of cats and dogs with labels cat and dog to train a Cat versus Dog classifier. We assume a large enough training data set with labels is available when building a classifier.

When there are no labels attached to the training data, then the learning style is known as unsupervised learning. In unsupervised learning the aim is to partition the data into different groups based upon similarities of the training vectors. The k-means clustering is the most well-known unsupervised learning technique. Often, the number of the data groups to be formed is specified by the user.

Semi-Supervised Learning

In a real world setting, training examples with labels need to be acquired for a predictive modeling task. Labeling or annotating examples is expensive and time-consuming; many application domains require expert annotators. Thus, we often need ways to work with a small labeled training data set. In certain situations, we may be able to acquire, in addition to a small labeled training data set, additional training examples without labels with labeling being too expensive to perform. In such cases, it is possible to label the unlabeled examples using the small available set of labeled examples. This type of learning is referred as semi-supervised learning and it falls somewhere between supervised and unsupervised learning. 

The term semi-supervised classification is often used to describe the process of labeling training examples using a small set of labeled examples for classification modeling. A similar idea is also used in clustering in an approach known as the semi-supervised clustering. In semi-supervised clustering, the goal is to group a given set of examples into different clusters with the condition that certain examples must be clustered together and certain examples must be put in different clusters. In other words, some kind of constraints are imposed on resulting clusters in terms of cluster memberships of certain specified examples. For an example of semi-supervised classification, you can check this blog post. In another blog post, you can read about constrained k-means clustering as a technique for semi-supervised clustering.

Transfer Learning

In certain situations we have a small set of labeled examples but cannot acquire more training examples even without the labels. One possible solution in such situations is transfer learning. In transfer learning, we take a trained predictive model that was trained on a related task and re-train it with our available labeled data. The re-training fine-tunes the parameters of the trained model to make it perform well for our predictive task. Transfer learning is popular in deep learning where many trained predictive models are publicly available. While performing transfer learning, we often employ data augmentation to the available labeled examples to create additional examples with labels. The common data augmentation operations include translation, rotation, cropping and resizing, and blurring.

Self-Supervised Learning

The Self-supervised learning is essentially unsupervised learning wherein the labels, the desired predictions, are provided by the data itself and hence the name self-supervised learning. The objective of the self-supervised learning is to learn the latent characteristics of the data that could be useful in many ways. Although the self-supervised learning has been around for a long time, for example as in autoencoders, its current popularity is primarily due its use in training the large language models. 

The example below shows how the desired output is defined via self-learning. In the example, the words in green are masked and the model is trained to predict the masked words using the surrounding words. Thus, the masked words function as labels. The masking of the words is done in a random fashion for the given corpus and thus no manual labeling is needed.




The idea of random masking is not the only way to self-generate labels; several variations at the word level as well as the sentence level are possible and have been successfully used in different language modeling efforts. For example, self-learning can be employed to predict the neighboring sentences that come before and after a selected sentence in a given document. 

The tasks defined to perform self-supervised learning are called pretext tasks because these tasks are not the end-goal and the results of these tasks are used for building the final systems. 

Self-generation of labels for prediction is easily extended to images to define a variety of pretext tasks for self-supervised learning. As an example, images can be subjected to rotations of (90 degrees, 180 degrees etc.) and the pretext task is defined to predict the rotation applied to the images. Such a pretext task can make the model learn the canonical orientation of image objects. Data augmentation is also commonly used in self-supervised learning to create image variations. 

All in all, self-supervised learning is a valuable concept that eliminates the need for external annotation. The success of large language models can be majorly attributed to this style of machine learning.

Retrieval Augmented Generation: What is it and Why do we need it?

What is Retrieval Augmented Generation?

Generative AI is currently garnering lots of attention. While the responses provided by the large language models (LLMs) are satisfactory in most situations, sometimes we want to get better focused responses when employing LLMs in specific domains. Retrieval-augmented generation (RAG) offers one such way to improve the output of generative AI systems. RAG enhances the LLMs capabilities by providing them with additional knowledge context through information retrieval. Thus, RAG aims to combine the strengths of both retrieval-based methods, which focus on selecting relevant information, and generation-based methods, which produce coherent and fluent text. 

RAG works in the following way:

  1. Retrieval: The process starts with retrieving relevant documents, passages, or pieces of information from a pre-defined corpus or database. These retrieved sources contain content that is related to the topic or context for which you want to generate text.
  2. Generation: After retrieving the relevant content, the generation step takes over. It involves using the retrieved information as input or context to guide the generation of coherent and contextually relevant text. This can involve techniques such as fine-tuning large language models like GPT-3 on the retrieved content or using it as a prompt.
  3. Combination: The generated text is produced while taking into consideration both the retrieved information and the language model's inherent creative abilities. This allows the generated text to be more informative, accurate, and contextually appropriate.

How is RAG Useful?

Retrieval-augmented generation is useful for several reasons:

  1. Content Quality: By incorporating information from retrieved sources, the generated text can be more accurate, relevant, and factually sound. This is particularly important for applications where accuracy and credibility are crucial.
  2. Data Augmentation: Retrieval-augmented generation can be used to expand the dataset for fine-tuning language models. By combining the model's generative capabilities with real-world information, it can learn to produce more contextually relevant and diverse text.
  3. Expertise Integration: In domains that require domain-specific knowledge or expertise, retrieval-augmented generation can ensure that the generated content aligns with expert knowledge.
  4. Abstractive Summarization: When generating summaries, retrieval-augmented approaches can help ensure that the generated summary captures the most important and relevant information from the source documents.
  5. Question Answering: In question answering tasks, retrieval-augmented generation can improve the accuracy of generated answers by incorporating relevant information from a corpus of documents.
  6. Content Personalization: For chatbots and content generation systems, retrieval-augmented generation can enable more personalized and contextually relevant responses by incorporating information retrieved from a user's history or relevant documents.

The success of the RAG approach greatly depends upon how semantically close are the retrieved documents to help the generative AI system when it is responding to a user request. Retrieving meaningful chunks of text is done by nearest neighbor search implemented in a vector database with text being represented by word embeddings. Look for my next post to learn about this aspect of RAG implementation.

It's important to note that retrieval-augmented generation is a research-intensive area and involves challenges such as selecting the right retrieval sources, managing biases in retrieved content, and effectively integrating retrieved information with the language model's creative capabilities. However, it holds promise for improving the quality and utility of generated text across various NLP applications.









Exploring Large Language Models: Types and Applications

Large language models (LLMs) are currently the craze. Who hasn't heard of ChatGPT that can deliver all kinds of responses to user prompts, be a recipe or suggestions for vacation or an essay on a topic for a term paper. It is all possible because of the underlying large language models.

So what are large language models? How do these models work? What can we do with these models? Let's try to answer these questions without going into much technical details.

What are Large Language Models?

We will begin by first trying to understand what is a language model. Think about using your cell phone for messaging. As you enter text, your cell phone tries to guess the word you are typing, see the figure below. Under the hood, a language model is computing probabilities for the next character/word and is displaying the top three or five most probable characters/words. 


There are a few types of language models such as rule-based models, statistical language models, and the recurrent neural networks (RNNs). The rule-based models rely on predefined linguistic rules and heuristics to perform their calculations. These models require experts to manually create and fine-tune rules, making them inflexible and limited in handling complex language patterns. 

Statistical language models use probabilistic methods to estimate the likelihood of a sequence of words. These models utilize n-grams, which are sequences of n words, to predict the probability of the next word based on the previous ones. While statistical models offer improved language processing capabilities, they still struggle with understanding context and long-range dependencies.

RNNs are neural networks with memory; these are designed to process sequential data, making them ideal for modeling language. The internal memory enables them to consider context from previous words while predicting the next word. However, standard RNNs are unable to capture long-term dependencies due to a training bottleneck, the "vanishing gradient" problem. 

The large language models are deep learning models that use the transformer architecture to learn the dependencies among words. These models have 100+ billion parameters that are set by training. There are a number of features of the transformer architecture that have made them the architecture of choice for sequential data. Even, images can be used with the transformer architecture by considering them as sequences of small blocks of pixels. The foremost feature of the transformer architecture is the self-attention mechanism which weighs importance of different words in a given context. It, thus, allows the  transformer architecture to capture dependencies across the entire input sequence making them highly effective in language modeling tasks. Another important feature is that the architecture looks at all the input words of a sentence at the same time which is a key to the use of the attention mechanism. 

The transformer architecture consists of two main components: the encoder and the decoder. Both the encoder and decoder are composed of multiple layers of self-attention and feedforward neural networks. The encoder receives an input sequence and produces a sequence of hidden states. The decoder accepts a target sequence and uses the encoder’s output to generate a sequence of predictions. Exceedingly large amounts of text data, sourced from books, websites, wikipedia, and multitude of other sources, are used to train the transformer model. The training is done by following the self-supervised learning modality. Typical approaches to self-supervised learning is to mask certain amount of text and train the transformer to predict the text. Instead of masking, the next sentence prediction is also used for training. It is the self-supervised learning approach that has made the training of large language models removing the need for expert annotators.

Pre-trained LLMs

There are a multitude of pre-trained large language models that have been released for use. Before listing some of the popular pre-trained models, let's categorize them in terms of their architecture and usage.

  • Encoder-only Models
  • Decoder-only Models
  • Encoder-Decoder Models

The encoder-only models are the models that are trained to predict masked or missing words. The pre-trained models produce a high-dimensional vector representation of the input text, known as embeddings. [You can read about embeddings at the post "Words as Vectors".] These models can be fine-tuned for a variety of NLP tasks, such as sentiment analysis, named entity recognition, and question answering. These models are also called auto-encoding models.

The decoder-only models as one would expect use only the decoder part of the transformer architecture. These models are generally trained by having the model to predict the next word of the input text. These models are best suited for text generation. These models are also called autoregressive models.

The encoder-decoder models use both the encoder and the decoder components of the transformer architecture. The pre-training of these models replaces a chunk of the input text by a single mask and the model is trained to predict the entire chunk of the masked input text. These models are also known as sequence-to-sequence models. These models are suitable for text summarization, translation, or generative question answering tasks.

In many cases, you want to adapt a pre-trained model for your specific task in a particular domain, for example finance. This is done by applying transfer learning to the pre-trained model with a dataset specific to the application domain. Such models are called fine-tuned models.

Examples of Large Language Models (LLMs)

Below is a non-exhaustive list of LLMs.

1. GPTs

The GPT (Generative Pre-trained Transformer) series of models from OpenAi is perhaps the most well-known LLMs. The release of ChatGPT based on GPT3.5 in November 2022 kind of created an artificial intelligence storm. This series of models are decoder-only models and are being used for text generation, summarization and question-answers. GPT-4, the most recent model in the series, is being used in Microsoft's Bing Chat.

2. LaMDA

LaMDA which stands for Language Model for Dialogue Applications is a LLM from Google. It was trained on dialogue and thus exhibits superior conversational performance. It is mainly being used internally at Google and an earlier version of Google Bard was based on this model.

3. PaLM-2

This model was released by Google in May of this year. It is a state-of-the-art language model with improved multilingual, reasoning and coding capabilities. It was trained with text from over 100 languages, scientific papers, and code from numerous public sources. As a result, PaLM-2 is claimed to offer multilingual, reasoning, and programming capabilities. The current version of Google Bard is based on PaLM-2.

4. LLaMA

This model was released by Meta in February earlier this year. It is an auto-regressive language model and comes in different sizes: 7B, 13B, 33B and 65B parameters. It is good for question answering, and reading comprehension tasks.

5. BERT

BERT from Google stands for Bidirectional Encoder Representations from Transformers. It is an encoder-only type LLM. BERT uses bidirectional context to generate representation for words. What this means is that in the sentence "I bought an apple phone", the unidirectional context for encoding the word "apple" is "I bought an" while the bidirectional context brings in the next word "phone" also. Clearly, the bidirectional context provides a more targeted representation. BERT has been used for question-answer, sentiment analysis, and text classification. DistilBERT is a compressed version of BERT with fewer parameters but with equally good performance.

6. T5

This is an encoder-decoder transformer model from Google. It is suitable for tasks including machine translation, question answering, abstractive summarization, and text classification.

An Example of LLM Usage

Here, we are going to take a look at using LLMs for our daily tasks. The example that we are going to look at about using ChatGPT to get code for building an app to perform next day stock price prediction. We will give a prompt to ChatGPT specifying what we want. The prompt and the response from ChatGPT are shown below. If we want to do this on your own end, you will need to get an account with OpenAI.


In the subsequent prompts, I ask ChatGPT to give me code for downloading stock data. Then I prompt ChatGPT to make a python app out of it. All of this is performed satisfactorily and the app works fine. You can read about the responses from ChatGPT and get the complete code at "Create a Simple Stock Price Prediction App using ChatGPT" blogpost. 

Issues with LLMs Usage

Many organizations including Microsoft have been quick to deploy LLMs in their products. At the same time, a large group of researchers have been concerned with potential harms that can result with LLMs becoming more and more powerful. Some issues that have emerged from the current LLMs are:

1. Incorrect and Made-up Answers

Instances of incorrect and fabricated yet convincing responses have been reported by many. Fabricated and inaccurate answers. Thus, the responses from LLMs shouldn't be taken at face value and must be reviewed before usage.

2. Data Privacy and Confidentiality 

One needs to observe caution as any sensitive, confidential, and proprietary information used in prompts may end up being included in responses to other users. 

3. Model Bias

The LLMs have been found to exhibit bias which arises from the data from the wild that is used for training them. Bias exhibited by a model in use can create legal issues. 

4. Intellectual and Copyright Issue

Since models like ChatGPT have been trained using data from the web, the training data includes copyrighted material available of the web. This can result in copyright violations.

5. Fraud and Scamming Risk

Given that it is easy to generate fake data and misinformation, the scams using LLMs are definitely going to increase. As a consumer, we need to be on alert for such possibilities.

Going Forward with LLMs

LLMs are here to greatly impact the society on almost all of its facets. There will be large benefits from the use of LLMs and at the same time certain challenges are emerging in terms of dealing with fake yet convincing looking information being spread. While the thrust of LLMs development so far has been on producing bigger and bigger models, it appears that the focus is shifting to making LLMs more efficient and more accurate in their responses. We are also going to see the models being made domain specific.

I hope you enjoyed reading this exploration of LLMs. If you want to learn more, I suggest you visit huggingface transformer library where you will find information on many transformer models as well as demos to show their usage.