
RLHF (reinforcement learning from human feedback)

Learning Methods

A post-training technique to align AI models with human preferences using feedback.

Detailed Definition

RLHF is a post-training technique that goes beyond next-token prediction and supervised fine-tuning by teaching AI models to behave the way humans want them to, making them safer, more helpful, and better aligned with human intentions. The process works in two stages. First, human evaluators compare pairs of model outputs and choose which is better; these comparisons are used to train a "reward model" that learns to predict human preferences. Second, the AI model is optimized with reinforcement learning to generate responses that the reward model predicts humans would prefer.
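
A minimal sketch of the two stages, under simplifying assumptions: "responses" are fixed-length vectors rather than real text, the reward model is a single linear layer, and the policy update is a basic REINFORCE-style step rather than a specific production algorithm. All names and sizes here are illustrative.

```python
# Toy sketch of RLHF's two stages (assumed setup: vectors stand in for encoded text).
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 16  # toy feature dimension standing in for an encoded response

# --- Stage 1: train a reward model on human pairwise preferences -------------
reward_model = nn.Linear(DIM, 1)   # scores a response with a scalar reward
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

chosen = torch.randn(64, DIM)      # responses human evaluators preferred
rejected = torch.randn(64, DIM)    # responses human evaluators rejected

for _ in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry style loss: push the preferred response's score above the other's
    rm_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    rm_opt.zero_grad()
    rm_loss.backward()
    rm_opt.step()

# --- Stage 2: optimize the policy against the frozen reward model ------------
policy = nn.Linear(DIM, DIM)       # toy "policy": maps a prompt to a response
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
reward_model.requires_grad_(False)

prompts = torch.randn(64, DIM)

for _ in range(100):
    mean = policy(prompts)
    dist = torch.distributions.Normal(mean, 1.0)
    response = dist.sample()                        # sample a response from the policy
    reward = reward_model(response).squeeze(-1)     # reward model's predicted preference
    # REINFORCE-style update: make responses the reward model scores highly more likely
    log_prob = dist.log_prob(response).sum(dim=-1)
    pi_loss = -(log_prob * (reward - reward.mean())).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
```

In practice, the policy is a full language model and the reinforcement learning step typically also penalizes drifting too far from the original fine-tuned model, so the policy keeps its language ability while chasing higher predicted preference scores.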