RLHF (reinforcement learning from human feedback)
A post-training technique to align AI models with human preferences using feedback.
Detailed Definition
RLHF is a post-training technique that goes beyond next-token prediction and fine-tuning by teaching AI models to behave the way humans want them to—making them safer, more helpful, and aligned with our intentions. This process works in two stages: First, human evaluators compare pairs of outputs and choose which is better, training a "reward model" that learns to predict human preferences. Then, the AI model learns through reinforcement learning by trying to generate responses that the reward model predicts humans would prefer.
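The two stages can be sketched in code. The snippet below is a toy illustration under stated assumptions, not a production recipe: it assumes each response is already encoded as a small fixed-size feature vector (in a real system the reward model and the policy are full language models), and the names RewardModel, FEATURE_DIM, and the random "preference data" are hypothetical stand-ins introduced only for this example.

```python
# Toy sketch of the two RLHF stages, assuming responses are pre-encoded
# as fixed-size feature vectors. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM = 16  # assumed size of a response's feature vector


class RewardModel(nn.Module):
    """Scores a response; trained so human-preferred responses score higher."""
    def __init__(self):
        super().__init__()
        self.score = nn.Linear(FEATURE_DIM, 1)

    def forward(self, response_features):
        return self.score(response_features).squeeze(-1)


# ---- Stage 1: learn a reward model from human preference pairs ----
reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Stand-in preference data: each pair is (chosen, rejected) feature vectors.
chosen = torch.randn(64, FEATURE_DIM) + 0.5    # outputs humans preferred
rejected = torch.randn(64, FEATURE_DIM) - 0.5  # outputs humans rejected

for _ in range(200):
    # Pairwise preference loss: push the chosen score above the rejected score.
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# ---- Stage 2: reinforcement learning against the learned reward ----
# A real system would update a language model's generation policy (e.g. with
# PPO); here a single output vector is simply nudged toward higher reward.
policy_output = torch.zeros(FEATURE_DIM, requires_grad=True)
policy_opt = torch.optim.Adam([policy_output], lr=1e-1)

for _ in range(100):
    reward = reward_model(policy_output.unsqueeze(0)).mean()
    (-reward).backward()  # maximize reward by minimizing its negative
    policy_opt.step()
    policy_opt.zero_grad()

print("final predicted reward:", reward_model(policy_output.unsqueeze(0)).item())
```

In this sketch the pairwise loss plays the role of the human comparisons, and the second loop plays the role of the reinforcement-learning step that steers the model toward outputs the reward model rates highly.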
More in this Category: Learning Methods
Few-Shot Learning
The ability of AI models to learn new tasks with only a small number of training examples.
Fine-tuning
A post-training technique to specialize a trained model on specific data for a particular task.
Post-training
Additional steps taken after a model's initial training is complete to make it more useful.
Training/Pre-training
The process by which an AI model learns by analyzing massive amounts of data.
Supervised learning
A type of machine learning where a model is trained on labeled data with correct answers provided.
Unsupervised learning
A type of machine learning where a model is given unlabeled data to find patterns on its own.