InstructGPT: fine-tuned GPT-3
Large language models (LLMs) frequently produce nonsensical, toxic, or fabricated text that can easily mislead typical users. These unintended behaviours stem from a mismatch in the language modelling objective: predicting the next token in a sequence is not the same as following the user's instructions helpfully and safely. To be genuinely useful, LLMs must produce helpful, honest, and harmless output in response to users' instructions.
A useful LLM should meet the following criteria:
Assist users in completing their tasks effectively.
Avoid fabricating information or misleading users.
Avoid causing physical or psychological harm to users or to society.
This article provides a brief introduction to the methodology used to train the InstructGPT (1.3B) model, developed by OpenAI. InstructGPT was one of the first models successfully aligned with user intent: its outputs were preferred over those of its predecessor, GPT-3 (175B), despite GPT-3 being over 100x larger, because InstructGPT followed human instructions more faithfully.
Methodology
OpenAI used Reinforcement Learning from Human Feedback (RLHF) to fine-tune GPT-3 to follow user instructions. Human annotators created a dataset of instructions paired with desired outputs. This dataset was used to train a baseline model and a reward model; the reward model was trained to score a model's output for a given instruction according to human preferences.
The reward model was then used to fine-tune the baseline model to follow instructions as judged by human evaluators, with the Proximal Policy Optimisation (PPO) algorithm used to maximise the reward. The process can be summarised as follows:
1. Collect demonstration data and train a supervised policy: human labellers write the desired outputs for a set of prompts, and this dataset is used to fine-tune a pre-trained GPT-3 model with supervised learning (see the first sketch after this list).
2. Collect comparison data and train a reward model: the fine-tuned model generates several outputs for each instruction, and human labellers rank them from best to worst. This comparison dataset is used to train a reward model (RM) to predict which outputs humans prefer (see the second sketch below).
3. Optimise a policy against the reward model using PPO: the supervised model is fine-tuned further to maximise the RM's score on new prompts (see the third sketch below).
Steps 2 and 3 are repeated until the model achieves the desired level of performance.
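
As a concrete illustration of step 1, here is a minimal PyTorch sketch of the supervised fine-tuning loss. This is not OpenAI's training code: `model` stands in for any causal language model that maps token IDs to next-token logits, and the batch layout (prompt tokens followed by the labeller-written demonstration) is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_lens):
    """Cross-entropy over demonstration tokens only (hypothetical helper).

    input_ids:   (batch, seq_len) prompt tokens followed by the
                 labeller-written demonstration
    prompt_lens: (batch,) number of prompt tokens per example
    """
    logits = model(input_ids)  # (batch, seq_len, vocab)
    # Shift so that position t predicts token t+1.
    logits, targets = logits[:, :-1], input_ids[:, 1:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)
    # Mask out the prompt: only demonstration tokens contribute to the loss.
    positions = torch.arange(targets.size(1), device=targets.device)
    demo_mask = positions[None, :] >= (prompt_lens[:, None] - 1)
    return (loss * demo_mask).sum() / demo_mask.sum()
```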
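Step 2's reward model is trained on pairwise comparisons: each labeller ranking is decomposed into (preferred, rejected) pairs, and the RM learns to score the preferred response higher. A minimal sketch, where `reward_model` is a hypothetical network that returns one scalar score per (prompt, response) sequence:

```python
import torch.nn.functional as F

def reward_ranking_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar scores
    r_rejected = reward_model(rejected_ids)  # (batch,)
    # Minimise -log sigmoid(r_chosen - r_rejected): the loss shrinks as
    # the human-preferred response is scored above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```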
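In step 3, PPO maximises the RM score, but InstructGPT also applies a per-token KL penalty that keeps the policy close to the supervised model. Below is a sketch of that penalised reward, the quantity PPO then optimises with its usual clipped surrogate objective; `beta` and the argument names are illustrative assumptions:

```python
def kl_penalised_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Sequence-level reward for one sampled response (hypothetical helper).

    rm_score:        scalar reward-model score for (prompt, response)
    policy_logprobs: (resp_len,) log-probs of response tokens under the
                     current policy
    sft_logprobs:    (resp_len,) log-probs under the frozen SFT model
    """
    kl = (policy_logprobs - sft_logprobs).sum()  # sample estimate of KL(policy || SFT)
    return rm_score - beta * kl
```

Without the KL term, the policy can drift towards outputs that score highly under the RM but read poorly to humans; the penalty regularises the policy towards the SFT distribution.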
Improvements
This process improves the instruction-following capability of the resulting fine-tuned model. The key findings were:
Outputs from InstructGPT (1.3B) are preferred over those from GPT-3 (175B), even though GPT-3 has over a hundred times as many parameters.
On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3.
InstructGPT models generate approximately 25% fewer toxic outputs than GPT-3 when prompted to be respectful.
