Clare S. Y. Huang Data Scientist | Atmospheric Dynamicist

Generative AI with LLMs Week 2 (2)

Here are course notes I am taking from the DeepLearning.ai course on Coursera: Generative AI with Large Language Models.

Topic: Parameter efficient fine-tuning (PEFT)

Parameter efficient fine-tuning (PEFT)

Full fine-tuning of large LLMs is challenging

  • Trainable weights Additional 12-20x weights may be too large to handle on consumer hardware:
  • Temp memory
  • Forward Activations
  • Gradients
  • Optimizer states

PEFT

  • can perform on single GPU
  • less prone to catastrophic forgetting
  • PEFT fine-tuning saves spaces and is flexible
  • efficient adaptation of the original model to multiple tasks

PEFT Trade-offs

  • Parameter efficiency
  • Memory efficiency
  • Model performance
  • Inference cost
  • Training speed

PEFT methods

Selective

  • Select subset of initial LLM parameters to fine-tune
  • Researchers have found that the performance of these methods is mixed
  • there are significant trade-offs between parameter efficiency and compute efficiency

Reparameterization

  • Reparameterize model weights using a low-rank representation
  • LoRA

Additive

  • Add trainable alyers or parameters to model
  • Adapters:
    • add new trainable layers to the architecture of the model
    • typically inside the encoder or decoder components after the attention or feed-forward layers
  • Soft Prompts:
    • keep the model architecture fixed and frozen
    • focus on manipulating the input to achieve better performance
    • can be done by:
      • adding trainable parameters to the prompt embeddings
      • keeping the input fixed and retraining the embedding weights

PEFT techniques 1: LoRA (Low-Rank Adaptation of Large Language Models)

Slide p.95 captures the idea!

  • this model has the same number of parameters as the original: little to no impact on inference latency
  • Researchers have found that applying LoRA to just the self-attention layers of the model is often enough to fine-tune for a task and achieve performance gains
  • (in principle, you can also use LoRA on other components like the feed-forward layers)
  • biggest savings in trainable parameters achieved by applying LoRA to attention layers’ weights matrices

Concrete example using base Transformer as reference

Slide p.96 captures the idea!

LoRA: Low Rank Adaption of LLMs

Slide p.99 captures the idea!

Sample ROUGE metrics for full vs. LoRA fine-tuning

Comparable evaluation metrics!

Choosing the LoRA rank (trade-offs)

  • Hu et al. 2021, “LoRA: Low-Rank Adaptation of Large Language Models”
  • Effectiveness of higher rank appears to plateau
  • Relationship between rank and dataset size needs more empirical data

PEFT techniques 2: Soft prompts

Prompt tuning is not prompt engineering!

Limitations of prompt engineering:

  • require a lot of manual effort to write and try different prompts
  • limited by the length of the context window
  • not guaranteed to achieve desired performance

Prompt tuning adds trainable “soft prompt” to inputs

  • add additional trainable tokens to your prompt
  • let supervised learning process to determine their optimal values
  • the set of trainable tokens is called a “soft prompt”
    • it gets prepended to embedding vectors that represent your input text
    • its vectors have the same length as the embedding vectors of the language tokens
    • including somewhere between 20 and 100 virtual tokens can be sufficient for good performance

Soft prompts

  • virtual tokens that can take on any value within the continuous multidimensional embedding space
  • via supervised learning, the model learns the values for these virtual tokens that maximize performance for a given task

Full Fine-tuning vs prompt tuning

Full fine-tuning

  • the training data set consists of input prompts and output completions or labels
  • the weights of the large language model are updated during supervised learning.

prompt tuning

  • the weights of the large language model are frozen
  • the underlying model does not get updated - the embedding vectors of the soft prompt gets updated over time to optimize the model’s completion of the prompt

  • Prompt tuning is a very parameter efficient strategy because only a few parameters are being trained.
  • You can train a set of soft prompts for one task and a different set for another.

Performance of prompt tuning

  • Lester et al. 2021, “The Power of Scale for Parameter-Efficient Prompt Tuning”
  • prompt tuning doesn’t perform as well as full fine tuning for smaller LLMs.
  • prompt tuning performance increases as the model size increases
  • As number of model parameters get to around 10B, prompt tuning can be as effective as full fine tuning

Interpretability of soft prompts

  • issue: interpretability of learned virtual tokens
  • because the soft prompt tokens can take any value within the continuous embedding vector space, the trained tokens don’t correspond to any known token, word, or phrase in the vocabulary of the LLM
  • an analysis of the nearest neighbor tokens to the soft prompt location shows that they form tight semantic clusters
  • i.e., the words closest to the soft prompt tokens have similar meanings
<< Previous Page