Instruction tuning LLMs
Large language models (LLMs) are artificial intelligence (AI) systems trained to process and generate human language. They are pre-trained on vast amounts of internet data and learn grammar, facts, and other linguistic patterns by optimizing a next-word-prediction objective: the model is trained to predict the next token in a sequence.
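To make the next-word-prediction objective concrete, here is a minimal sketch in pure Python. The vocabulary, the toy "model", and its probabilities are invented for illustration; real models compute these distributions with neural networks over large vocabularies.

```python
import math

# Toy illustration of the next-word-prediction objective. The vocabulary
# and probabilities below are invented for this example.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_model(context):
    """Return a made-up probability distribution over the next token."""
    if context == ["the", "cat"]:
        return {"the": 0.05, "cat": 0.05, "sat": 0.7, "on": 0.1, "mat": 0.1}
    return {w: 1.0 / len(VOCAB) for w in VOCAB}  # uniform fallback

def next_token_loss(context, target):
    """Cross-entropy loss for predicting `target` after `context`."""
    return -math.log(toy_model(context)[target])

# A confident, correct prediction yields a lower loss than a uniform guess.
print(next_token_loss(["the", "cat"], "sat") < next_token_loss(["sat"], "on"))  # True
```

Training minimizes this loss averaged over every position in the training text, which is what pushes the model toward assigning high probability to the actual next token.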
Building on this foundation, instruction tuning is a method that improves LLM usability by fine-tuning a pre-trained LLM on a dataset of instruction-response pairs. This aligns the model more closely with human-centric queries and the outputs users expect.
Advantages of Instruction Tuning
Instruction-tuned models align more closely with human preferences and respond more reliably to prompts.
Instruction tuning is a far more cost-efficient alternative to retraining a full model from scratch.
Challenges of Instruction Tuning
Creating high-quality instruction-output pairs is labor-intensive and requires significant expert involvement, in both specialized and general domains.
This approach is limited to models and applications that support supervised fine-tuning.
Some researchers argue that instruction tuning captures only surface-level patterns of the target tasks rather than genuine task understanding.
Methodology
The instruction tuning methodology consists of two primary steps:
Step 1: Instruction dataset creation. Each entry in the dataset contains a natural-language instruction specifying a task, an optional input providing additional context, and the desired output. Entries can be written by human experts or generated by capable LLMs.
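A dataset entry from step 1 can be sketched as follows. The three-field layout (instruction / optional input / output) follows a common convention popularized by Alpaca-style datasets; the field names and example texts here are illustrative, not a fixed standard.

```python
# Illustrative instruction-tuning dataset entries (Alpaca-style fields).
instruction_dataset = [
    {
        "instruction": "Summarize the text below in one sentence.",
        "input": "Instruction tuning fine-tunes a pre-trained LLM on "
                 "instruction-response pairs to improve its usability.",
        "output": "Instruction tuning adapts a pre-trained LLM to follow instructions.",
    },
    {
        "instruction": "Translate 'good morning' into French.",
        "input": "",  # the input field is optional and may be left empty
        "output": "Bonjour.",
    },
]

# Every entry carries an instruction and a desired output; the input is optional.
assert all({"instruction", "input", "output"} <= set(e) for e in instruction_dataset)
```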
Step 2: Supervised fine-tuning. The constructed dataset is used to further train a pre-trained model, which learns to map each instruction (and optional input) to the associated output via standard supervised learning.
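A common way to prepare one example for step 2 is sketched below: the prompt (instruction plus optional input) and the response are concatenated, and the prompt positions are masked so the loss is computed only on the response. The template and the whitespace "tokenizer" are simplifying assumptions; real pipelines use the model's own tokenizer and template.

```python
IGNORE_INDEX = -100  # common convention for label positions excluded from the loss

PROMPT_TEMPLATE = "Instruction: {instruction}\nInput: {input}\nResponse:"

def build_training_example(entry):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    prompt_tokens = PROMPT_TEMPLATE.format(**entry).split()
    response_tokens = entry["output"].split()
    tokens = prompt_tokens + response_tokens
    # Loss is taken only on response positions; prompt positions are ignored.
    labels = [IGNORE_INDEX] * len(prompt_tokens) + response_tokens
    return tokens, labels

entry = {"instruction": "Name the capital of France.", "input": "", "output": "Paris."}
tokens, labels = build_training_example(entry)
print(labels)  # prompt positions are IGNORE_INDEX, response positions are the targets
```

During fine-tuning the model then minimizes the next-token loss only over the unmasked (response) positions, so it learns to produce answers rather than to reproduce instructions.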
Synthetic Data via Distillation
Synthetic data generation is one of the fastest and most cost-efficient ways to create a dataset of instruction-output pairs. This matters because having human annotators write data, or filtering data from the internet, demands substantial time and effort.
Distillation imparts knowledge from a highly capable teacher model to a smaller student model. The resulting student is cheaper to run, even on consumer-grade hardware.
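The data-generation side of distillation can be sketched as follows: seed instructions are sent to a capable teacher model, and the returned responses become instruction-output pairs for fine-tuning the student. The `query_teacher` function here is a stand-in with canned answers so the sketch runs offline; in practice it would be a call to a real teacher-model API.

```python
def query_teacher(instruction):
    """Stand-in for a teacher-model API call; returns canned responses."""
    canned_responses = {
        "Define overfitting in one sentence.":
            "Overfitting is when a model fits training noise and generalizes poorly.",
        "Give an antonym of 'cold'.": "Hot.",
    }
    return canned_responses.get(instruction, "I don't know.")

seed_instructions = [
    "Define overfitting in one sentence.",
    "Give an antonym of 'cold'.",
]

# Each teacher response becomes a synthetic instruction-output training pair.
synthetic_pairs = [
    {"instruction": ins, "output": query_teacher(ins)} for ins in seed_instructions
]
print(len(synthetic_pairs))  # 2
```

Real pipelines additionally filter the generated pairs (for length, duplication, and quality) before using them for supervised fine-tuning.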
InstructGPT
One of the earliest models to use instruction fine-tuning (combined with reinforcement learning from human feedback) was InstructGPT (175B), developed by OpenAI and initialized from GPT-3. Its fine-tuning process was:
Supervised fine-tuning on a human-curated instruction dataset.
Training a reward model to predict human preferences from a curated dataset in which responses are ranked from best to worst.
Further training the model from step 1 with proximal policy optimization (PPO), a reinforcement learning algorithm, using the reward model as the reward signal. The process is repeated until the desired output quality is achieved.
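The reward-model step above is typically trained with a pairwise preference loss: given scalar scores for a human-preferred ("chosen") and a dispreferred ("rejected") response, minimize `-log(sigmoid(r_chosen - r_rejected))`. The sketch below shows this loss in pure Python; the example scores are invented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss: low when the chosen response scores higher."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss is small when the reward model already ranks the preferred
# response higher, and large when the ranking is inverted.
print(preference_loss(2.0, 0.5) < preference_loss(0.5, 2.0))  # True
```

Minimizing this loss over many ranked comparisons teaches the reward model to assign higher scores to responses humans prefer, which PPO then uses as its training signal.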
After fine-tuning, InstructGPT outperformed GPT-3 by 10% on TruthfulQA in terms of truthfulness and by 7% on RealToxicityPrompts in terms of reduced toxicity.
