Key Findings
  • RLHF[1] is the technique that took ChatGPT from "able to talk" to "talking well," and the core method of InstructGPT: train a reward model from human feedback, then optimize the language-model policy with PPO[4]
  • DPO[3] skips reward-model training and optimizes the policy directly from preference data, sharply cutting the cost of alignment; Zephyr[9], a 7B model trained with DPO, beat a 70B RLHF-trained model
  • DeepSeek-R1[7]'s GRPO[11] method shows that pure RL, with no human annotation, can elicit reasoning: the model hits an "aha moment" and spontaneously learns to reflect and verify
  • Alignment is moving from a "human-annotation-driven" paradigm toward "self-rewarding"[12] and "group-based reinforcement"; this article includes two Google Colab labs: DPO fine-tuning and reward-model training

1. Why Do LLMs Need Alignment? The Turning Point from GPT-3 to ChatGPT

GPT-3 has 175 billion parameters and generates fluent text, but it frequently "doesn't listen": ask it a simple question and it may return a Wikipedia-style essay; ask it for code and it may produce something that looks right but is logically wrong; more dangerously, it will generate harmful or biased content without hesitation.

The root of the problem: pretraining only teaches the model to predict the next token; it never teaches it what a good answer is.

Pretraining objective:  max Σ log P(x_t | x_1, ..., x_{t-1})
                        → learns the statistical regularities of language, but not what makes an answer useful

Alignment objective:    max E_{x~D, y~π_θ(·|x)}[R(x, y)] - β·KL[π_θ(y|x) || π_ref(y|x)]
                        → maximize human preference while preserving language ability

Pretrained ≠ useful:
  User: "What is the capital of France?"
  Unaligned: "What is the capital of France? This is a geography question. France is located in western Europe..." (continues the text)
  Aligned:   "The capital of France is Paris." (answers the question)

In 2022, OpenAI published InstructGPT[1] and showed a striking result: a 1.3B model trained with RLHF beat the unaligned 175B GPT-3 in human evaluations. Alignment not only did not hurt the model's ability, it unlocked knowledge already acquired during pretraining, the so-called Alignment Bonus.

InstructGPT's success led directly to ChatGPT, and RLHF became the standard training recipe for large language models. But RLHF's complexity and cost also spurred the search for simpler methods: from DPO to GRPO, alignment techniques have entered a period of rapid diversification.

2. The Full RLHF Pipeline: SFT → Reward Model → PPO

The core idea of RLHF (Reinforcement Learning from Human Feedback) comes from the pioneering work of Christiano et al.[2] in robot control: humans cannot write a precise reward function, but they can easily compare two outcomes. InstructGPT[1] systematized this idea for language models as a three-stage training pipeline.

Stage 1: Supervised Fine-Tuning (SFT)

Starting from the pretrained model, fine-tune on high-quality instruction-response pairs written by human annotators. InstructGPT used roughly 13,000 human-written demonstrations.

SFT loss:
  L_SFT = -Σ log P_θ(y_t | x, y_1, ..., y_{t-1})

  x: instruction (prompt)
  y: the human-annotated ideal answer
  θ: model parameters

What SFT does:
  pretrained model → learns the dialogue format and to follow instructions
  but SFT data is limited, and the model can still produce bad answers
  → RL is needed for further optimization
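The SFT loss above is plain token-level cross-entropy over the response tokens. A toy PyTorch sketch with made-up logits and target ids (no real model involved):

```python
import torch
import torch.nn.functional as F

# Toy setup: vocabulary of 10 tokens, a 4-token reference answer y.
# In real SFT, `logits` come from the model conditioned on (x, y_<t).
torch.manual_seed(0)
logits = torch.randn(1, 4, 10)          # (batch, response_len, vocab)
targets = torch.tensor([[3, 7, 1, 9]])  # token ids of the human-written answer

# L_SFT = -Σ_t log P_θ(y_t | x, y_<t), here averaged over the 4 tokens
loss = F.cross_entropy(logits.view(-1, 10), targets.view(-1))
print(f"SFT loss: {loss.item():.4f}")
```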

Stage 2: Reward Model Training

The reward model is the core component of RLHF. It learns human preference judgments: for the same prompt, annotators rank several candidate answers, and the reward model learns to predict that ranking[5].

The Bradley-Terry preference model:
  P(y_w ≻ y_l | x) = σ(r_φ(x, y_w) - r_φ(x, y_l))

  y_w: the answer humans preferred (winner)
  y_l: the answer humans did not prefer (loser)
  r_φ: the reward model, which outputs a scalar score
  σ: the sigmoid function

RM training loss:
  L_RM = -E_{(x, y_w, y_l) ~ D}[log σ(r_φ(x, y_w) - r_φ(x, y_l))]

  → maximize the reward gap between preferred and dispreferred answers

RM training in InstructGPT:
  - 33,000 prompts, each with 4-9 candidate answers
  - annotators produce a full ranking of each group (not just pairwise comparisons)
  - each ranking yields C(K,2) preference pairs, greatly improving data efficiency
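Both facts above are easy to check numerically: the loss is `-log σ(r_w - r_l)` per pair, and a full ranking of K answers expands into C(K,2) pairs. A toy sketch with made-up reward scores:

```python
import torch
import torch.nn.functional as F
from itertools import combinations

# Scalar reward-model scores for three preference pairs (made-up values)
r_chosen = torch.tensor([1.2, 0.3, 2.0])     # r_phi(x, y_w)
r_rejected = torch.tensor([0.5, 0.9, -1.0])  # r_phi(x, y_l)

# L_RM = -log sigmoid(r_w - r_l); logsigmoid is the numerically stable form
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(f"RM loss: {loss.item():.4f}")  # the middle pair is mis-ranked, so its term dominates

# A full ranking of K=4 answers yields C(4,2) = 6 preference pairs
pairs = list(combinations(range(4), 2))
print(f"K=4 ranking -> {len(pairs)} pairs")
```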

The quality of the reward model sets the ceiling for RLHF. If it learns the wrong preferences (e.g. favoring verbose answers), the entire RLHF run optimizes in the wrong direction; this is known as Reward Hacking. RewardBench[13] provides a systematic benchmark for evaluating reward models.

Stage 3: RL Optimization with PPO

With a reward model in hand, we can optimize the language model with reinforcement learning. PPO (Proximal Policy Optimization)[4] is the most common choice because it strikes a good balance between stability and efficiency.

The RL objective in RLHF:
  max_{π_θ} E_{x~D, y~π_θ(·|x)}[r_φ(x, y)] - β·KL[π_θ(y|x) || π_ref(y|x)]

  π_θ:   the current policy (the language model being trained)
  π_ref: the reference policy (the frozen post-SFT model)
  r_φ:   the reward model's score
  β:     KL penalty coefficient (controls how far the policy may drift from the SFT model)

What the KL constraint does:
  - prevents the model from generating unnatural text just to score high reward
  - preserves fluency and diversity
  - mitigates Reward Hacking (exploiting flaws in the reward model)

PPO's clipped objective:
  L_PPO = E[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)]

  r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t)  (policy ratio)
  A_t: the advantage function
  ε: clipping range (typically 0.1-0.2)
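The effect of clipping is easiest to see with a few made-up ratios and advantages: once r_t leaves [1-ε, 1+ε], the surrogate stops rewarding further policy movement. A toy sketch:

```python
import torch

eps = 0.2
ratio = torch.tensor([0.8, 1.0, 1.5, 2.5])  # r_t(theta), made-up values
adv = torch.tensor([1.0, -1.0, 2.0, 1.0])   # A_t, made-up values

unclipped = ratio * adv
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
# Pessimistic bound: take the elementwise minimum of the two surrogates
objective = torch.min(unclipped, clipped).mean()
print(f"Clipped surrogate: {objective.item():.4f}")
# The ratios 1.5 and 2.5 are clipped to 1.2, capping the incentive
# for a single large policy update
```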

RLHF with PPO keeps 4 models running at once:
  1. Actor (policy model): generates answers
  2. Critic (value model): estimates state values
  3. Reward Model: scores the answers
  4. Reference Model: computes the KL penalty
  → a huge memory footprint, and the main engineering challenge of RLHF

Research from Anthropic[6] further revealed an important property of RLHF: it can jointly optimize helpfulness and harmlessness, but the two are in tension. Pushing harmlessness too hard makes the model conservative and useless, while pushing helpfulness too hard risks harmful output. For Llama 2[8], Meta trained two separate reward models, one for each dimension.

3. DPO: An Elegant Simplification That Skips the Reward Model

RLHF works, but the engineering is heavy: a separately trained reward model, four models in memory at once, and PPO's hyperparameter tuning. In 2023, Rafailov et al.[3] introduced DPO (Direct Preference Optimization) and proved a striking result: your language model is itself an implicit reward model.

The Mathematical Derivation from RLHF to DPO

DPO's derivation starts from the optimal solution of the RLHF objective. The KL-constrained RL problem has a closed-form optimal policy:

The KL-constrained RL problem in RLHF:
  max_{π} E[r(x,y)] - β·KL[π(y|x) || π_ref(y|x)]

The closed-form optimal policy:
  π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)

  Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β)  (partition function)

Solving for the reward:
  r(x,y) = β · log[π*(y|x) / π_ref(y|x)] + β · log Z(x)

Substituting into the Bradley-Terry model:
  P(y_w ≻ y_l) = σ(r(x,y_w) - r(x,y_l))

The partition function Z(x) cancels when the two rewards are subtracted:
  r(x,y_w) - r(x,y_l) = β · log[π_θ(y_w|x)/π_ref(y_w|x)]
                        - β · log[π_θ(y_l|x)/π_ref(y_l|x)]

The DPO loss:
  L_DPO = -E_{(x,y_w,y_l)~D}[log σ(β · (log π_θ(y_w|x)/π_ref(y_w|x)
                                        - log π_θ(y_l|x)/π_ref(y_l|x)))]

Intuition:
  - increase π_θ(y_w|x): make the preferred answer more likely
  - decrease π_θ(y_l|x): make the dispreferred answer less likely
  - the ratios against π_ref keep the policy from drifting too far
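The entire pipeline thus reduces to one log-sigmoid over four sequence log-probabilities. A toy sketch with made-up log-probs and β = 0.1:

```python
import torch
import torch.nn.functional as F

beta = 0.1
# Sequence log-probs under the policy and the frozen reference (made-up values)
logp_w, logp_l = torch.tensor(-12.0), torch.tensor(-15.0)          # log pi_theta
ref_logp_w, ref_logp_l = torch.tensor(-13.0), torch.tensor(-14.0)  # log pi_ref

# Implicit rewards: r = beta * log(pi_theta / pi_ref)
r_w = beta * (logp_w - ref_logp_w)
r_l = beta * (logp_l - ref_logp_l)

# L_DPO = -log sigmoid(r_w - r_l)
loss = -F.logsigmoid(r_w - r_l)
print(f"DPO loss: {loss.item():.4f}")
```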

DPO vs RLHF: A Systematic Comparison

| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Training stages | SFT → RM → PPO (three stages) | SFT → DPO (two stages) |
| Reward model | trained separately | not needed (implicit) |
| Memory footprint | 4 models loaded at once | 2 models (π_θ + π_ref) |
| Hyperparameters | PPO has many | essentially just β |
| Training stability | PPO is unstable, prone to collapse | stable, much like supervised learning |
| Data requirements | online generation + offline preferences | offline preference data only |
| Scalability | high engineering complexity | simple to implement |
| Theoretical guarantee | optimal under ideal conditions | mathematically equivalent (under the same assumptions) |
| In practice | usually stronger at large scale | excellent cost-effectiveness at small scale |
| Representative systems | InstructGPT, ChatGPT, Llama 2 | Zephyr, Mixtral-Instruct |

Zephyr[9] is DPO's most prominent success story: the HuggingFace team trained a 7B-parameter model with DPO that beat Llama 2-Chat 70B (trained with the full RLHF pipeline) on MT-Bench, demonstrating DPO's outstanding cost-effectiveness at small-to-medium scale.

IPO[15] (Identity Preference Optimization) further analyzed DPO's theoretical foundations, noting that DPO implicitly assumes the Bradley-Terry model is correct. When preference data violates this assumption, IPO offers a more robust alternative.

4. GRPO and DeepSeek-R1: Pure RL Elicits Reasoning

In early 2025, DeepSeek-AI released DeepSeek-R1[7] with a startling finding: pure reinforcement learning, with no human-annotated data at all, can make reasoning abilities emerge spontaneously. Its core method is GRPO (Group Relative Policy Optimization)[11].

The Core Idea of GRPO

GRPO was first proposed in DeepSeekMath[11] to address two pain points of PPO: the training cost of the critic (value model) and reward-model bias.

The problems with PPO:
  - a critic must estimate a value for every token → extra memory and compute
  - the reward model may be biased → Reward Hacking

GRPO's answer: replace the critic with within-group relative ranking

The GRPO algorithm:
  For each prompt x:
  1. Sample a group of answers {y_1, y_2, ..., y_G} from the policy π_θ
  2. Score each answer with a rule-based reward (or an RM): {r_1, r_2, ..., r_G}
  3. Compute group-normalized advantages:
     A_i = (r_i - mean(r_1,...,r_G)) / std(r_1,...,r_G)
  4. Update the policy by minimizing:
     L_GRPO = -E[Σ_i min(ρ_i·A_i, clip(ρ_i,1-ε,1+ε)·A_i)]
              + β·KL[π_θ || π_ref]

  where ρ_i = π_θ(y_i|x) / π_old(y_i|x)

GRPO vs PPO:
  PPO:  needs a critic to estimate V(s) → A(s,a) = R - V(s)
  GRPO: replaces V(s) with the group mean → A_i = (r_i - mean) / std
        → no critic model, saving roughly 50% of memory
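The group-relative advantage in step 3 is a one-liner; no value network is involved. A toy sketch with binary rule-based rewards for a group of G = 6 sampled answers:

```python
import torch

# Rule-based rewards for 6 sampled answers to one prompt: 1 = correct, 0 = wrong
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])

# Group-relative advantage: normalize within the group (no critic needed);
# the group mean plays the role of PPO's baseline V(s)
adv = (rewards - rewards.mean()) / rewards.std()
print(adv)  # correct answers get positive advantage, wrong ones negative
```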

DeepSeek-R1-Zero: RL's "Aha Moment"

DeepSeek-R1-Zero is the most exciting experiment: starting from the base model, with no SFT at all, it was trained directly with GRPO and rule-based rewards. The reward consisted of only two simple rules: the answer format is correct, and the final answer is correct.

Strikingly, the model spontaneously developed a range of reasoning behaviors during training, such as reflecting on and verifying its own reasoning (the "aha moment").

None of these behaviors were taught by humans; they emerged naturally from an RL process that simply maximizes correctness. This hints at a deep possibility: reasoning ability may be a natural product of RL training rather than something that must be learned from human demonstrations.

GRPO vs PPO vs DPO: A Three-Way Comparison

| Property | PPO (RLHF) | DPO | GRPO |
|---|---|---|---|
| Learning signal | reward model | preference pairs (offline) | rule-based reward / RM |
| Needs a critic | yes | no | no (group-relative) |
| Needs an RM | yes | no (implicit) | optional |
| Human annotation | heavy | moderate | can be none at all |
| Memory efficiency | low (4 models) | high (2 models) | medium (2-3 models) |
| Reasoning elicitation | indirect | limited | strong (emergent) |
| Best suited for | general alignment | preference alignment | reasoning, math, code |
| Representative systems | ChatGPT, Llama 2 | Zephyr, Mixtral | DeepSeek-R1 |

5. The Alignment Landscape: From KTO to Self-Rewarding

Beyond RLHF, DPO, and GRPO, the alignment landscape continues to expand rapidly. This section introduces a few important directions.

KTO: Alignment Driven by Prospect Theory

The innovation of KTO (Kahneman-Tversky Optimization)[10] is that it needs no paired preference data, only a binary label saying "this answer is good" or "this answer is bad." This greatly lowers the annotation barrier.

DPO data format:  (prompt, y_w, y_l)   requires paired comparisons under the same prompt
KTO data format:  (prompt, y, label)   requires only a binary good/bad judgment

The KTO loss:
  L_KTO = E_{y~desirable}[w(y)·(1 - σ(β·r_θ(x,y) - z_ref))]
        + E_{y~undesirable}[w(y)·(1 - σ(z_ref - β·r_θ(x,y)))]

  r_θ(x,y) = log[π_θ(y|x) / π_ref(y|x)]  (implicit reward)
  z_ref: reference point (the expected KL divergence)
  w(y): weighting function derived from prospect theory

Key insights from prospect theory:
  - a loss hurts more than an equal gain pleases (loss aversion)
  - KTO weights accordingly: undesirable answers receive a larger penalty
  - no paired data needed → well suited to feedback collected from product logs
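A simplified numeric sketch of the loss above, assuming unit weights w(y) = 1 and a fixed reference point z_ref (in KTO proper, z_ref is estimated from the batch KL and w(y) differs between the two classes):

```python
import torch

beta, z_ref = 0.1, 0.05  # z_ref fixed here for illustration only
# Implicit rewards log(pi_theta/pi_ref) for one desirable, one undesirable answer
r_good = torch.tensor(1.5)   # made-up value
r_bad = torch.tensor(-0.8)   # made-up value

# Simplified KTO terms with w(y) = 1
loss_good = 1 - torch.sigmoid(beta * r_good - z_ref)
loss_bad = 1 - torch.sigmoid(z_ref - beta * r_bad)
loss = loss_good + loss_bad
print(f"KTO loss: {loss.item():.4f}")
```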

Self-Rewarding Language Models

Self-Rewarding[12] proposes a radical idea: let the language model serve as its own reward model. The model plays both generator and judge, aligning itself through iterative self-improvement.

Constitutional AI (Anthropic)

Anthropic's[6] Constitutional AI uses an explicit set of principles (a "constitution") to guide the AI's behavior. The AI first critiques and revises its own answers against these principles, and the revised data is then used for RLHF. This reduces reliance on the subjective judgment of human annotators.

Alignment Methods at a Glance

| Method | Year | Data requirement | Training complexity | Core innovation |
|---|---|---|---|---|
| RLHF (PPO) | 2022 | preference pairs + SFT data | very high | reward model + PPO optimization |
| DPO | 2023 | preference pairs | low | implicit reward, no RM needed |
| IPO | 2024 | preference pairs | low | no reliance on the BT-model assumption |
| KTO | 2024 | binary labels (unpaired) | low | prospect theory, no pairing needed |
| GRPO | 2024 | rule-based rewards suffice | medium | group-relative advantage, no critic |
| Self-Rewarding | 2024 | initial seed data | medium | the model judges itself and iterates |
| Constitutional AI | 2022 | principle set + a little human feedback | high | principle-guided self-revision |

6. Hands-on Lab 1: DPO Fine-Tuning with TRL (Google Colab)

The following lab uses HuggingFace's TRL library to run DPO fine-tuning on GPT-2 small. It runs end-to-end on a free Colab GPU (T4), letting you experience the core of alignment first-hand.

# ============================================================
# Lab 1: Hands-on DPO fine-tuning: aligning GPT-2 with TRL
# Environment: Google Colab (T4 GPU), roughly 15-20 minutes
# ============================================================

# --- 1. Install dependencies ---
# Version specifiers are quoted so the shell does not treat ">" as redirection
!pip install -q "trl>=0.7.0" "transformers>=4.36.0" datasets peft accelerate bitsandbytes

import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer
from datasets import Dataset
import warnings
warnings.filterwarnings("ignore")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# --- 2. Build a preference dataset ---
# Mimics the real setting: for each prompt, a chosen (good) and a rejected (poor) answer
preference_data = [
    {
        "prompt": "What is machine learning?",
        "chosen": "Machine learning is a branch of artificial intelligence that enables computers to learn patterns from data and make predictions without being explicitly programmed. It uses algorithms to build models from training data.",
        "rejected": "Machine learning is when computers do stuff with data. It's like, you know, AI things. Computers are smart now I guess.",
    },
    {
        "prompt": "Explain what a neural network is.",
        "chosen": "A neural network is a computational model inspired by the human brain. It consists of layers of interconnected nodes (neurons) that process information. Each connection has a weight that is adjusted during training to minimize prediction errors.",
        "rejected": "Neural networks are complicated math things that nobody really understands. They just work somehow and that's all you need to know about them.",
    },
    {
        "prompt": "What is the difference between supervised and unsupervised learning?",
        "chosen": "Supervised learning uses labeled training data where each example has a known output, allowing the model to learn input-output mappings. Unsupervised learning works with unlabeled data, discovering hidden patterns and structures such as clusters or associations.",
        "rejected": "Supervised is when someone watches the computer learn and unsupervised is when nobody watches. That's basically the whole difference between them.",
    },
    {
        "prompt": "How does gradient descent work?",
        "chosen": "Gradient descent is an optimization algorithm that iteratively adjusts model parameters to minimize a loss function. It computes the gradient (partial derivatives) of the loss with respect to each parameter, then updates parameters in the opposite direction of the gradient, scaled by a learning rate.",
        "rejected": "Gradient descent goes downhill. You just keep going down until you can't go down anymore. It's not that complicated really.",
    },
    {
        "prompt": "What is overfitting in machine learning?",
        "chosen": "Overfitting occurs when a model learns the training data too well, including its noise and outliers, resulting in poor generalization to new data. Signs include high training accuracy but low test accuracy. Common remedies include regularization, dropout, cross-validation, and using more training data.",
        "rejected": "Overfitting is bad. It means your model memorized everything. Just add more data and it'll be fine probably.",
    },
    {
        "prompt": "Explain the concept of regularization.",
        "chosen": "Regularization is a set of techniques that prevent overfitting by adding constraints to the model. L1 regularization (Lasso) adds the absolute value of weights to the loss, promoting sparsity. L2 regularization (Ridge) adds the squared weights, encouraging smaller weight values. Both help the model generalize better.",
        "rejected": "Regularization is a fancy word for making models work better. You add some penalty thing to the loss function and hope for the best.",
    },
    {
        "prompt": "What is transfer learning?",
        "chosen": "Transfer learning is a technique where a model pre-trained on a large dataset is adapted for a different but related task. Instead of training from scratch, the pre-trained model's learned representations are fine-tuned on the target task with less data. This significantly reduces training time and data requirements.",
        "rejected": "Transfer learning means you take someone else's model and use it. It saves time because you don't have to train anything yourself.",
    },
    {
        "prompt": "How does a convolutional neural network work?",
        "chosen": "A convolutional neural network (CNN) processes data through convolutional layers that apply learnable filters to detect local features like edges and textures. Pooling layers reduce spatial dimensions. Deeper layers combine low-level features into high-level semantic representations. CNNs are particularly effective for image and spatial data.",
        "rejected": "CNNs slide filters over images to find patterns. They work well for pictures and stuff like that.",
    },
    {
        "prompt": "What is natural language processing?",
        "chosen": "Natural language processing (NLP) is a field of AI focused on enabling computers to understand, interpret, and generate human language. Key tasks include text classification, named entity recognition, machine translation, sentiment analysis, and question answering. Modern NLP leverages transformer-based models like BERT and GPT.",
        "rejected": "NLP is about making computers understand words. It's pretty useful for things like chatbots and translation apps.",
    },
    {
        "prompt": "Explain the attention mechanism in transformers.",
        "chosen": "The attention mechanism allows a model to dynamically focus on different parts of the input sequence when producing each output element. In self-attention, Query, Key, and Value vectors are computed from each token. Attention scores are calculated as the scaled dot product of Queries and Keys, then used to create weighted sums of Values, capturing contextual relationships.",
        "rejected": "Attention is what makes transformers work. Each word looks at other words to figure out what's important. It's the key innovation in modern AI.",
    },
    {
        "prompt": "What is reinforcement learning?",
        "chosen": "Reinforcement learning (RL) is a paradigm where an agent learns to make decisions by interacting with an environment. The agent takes actions in states, receives rewards or penalties, and learns a policy that maximizes cumulative reward over time. Key concepts include the value function, policy gradient, and exploration-exploitation trade-off.",
        "rejected": "Reinforcement learning is like training a dog with treats. Do something good, get a reward. Do something bad, no reward. Simple.",
    },
    {
        "prompt": "How do you handle imbalanced datasets?",
        "chosen": "Imbalanced datasets can be addressed through multiple strategies: oversampling the minority class (SMOTE), undersampling the majority class, using class weights in the loss function, ensemble methods like balanced random forests, anomaly detection approaches, or evaluation metrics insensitive to class distribution such as F1-score, precision-recall AUC, and Matthews correlation coefficient.",
        "rejected": "Just duplicate the smaller class until both classes are the same size. That usually works fine for most problems.",
    },
]

# Expand the dataset by rephrasing the prompts
expanded_data = []
for item in preference_data:
    expanded_data.append(item)
    # add lightly rephrased variants for prompt diversity
    expanded_data.append({
        "prompt": "Please explain: " + item["prompt"].lower().rstrip("?.") + ".",
        "chosen": item["chosen"],
        "rejected": item["rejected"],
    })
    expanded_data.append({
        "prompt": "Could you tell me: " + item["prompt"].lower(),
        "chosen": item["chosen"],
        "rejected": item["rejected"],
    })

print(f"Total preference pairs: {len(expanded_data)}")

# Convert to a HuggingFace Dataset
dataset = Dataset.from_list(expanded_data)
dataset = dataset.train_test_split(test_size=0.15, seed=42)
print(f"Train: {len(dataset['train'])}, Test: {len(dataset['test'])}")

# --- 3. Load model and tokenizer ---
model_name = "gpt2"
print(f"\nLoading model: {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Load the policy model
model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.pad_token_id = tokenizer.pad_token_id

# Load the reference model (DPO needs a frozen reference to compute the KL term)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model.config.pad_token_id = tokenizer.pad_token_id

print(f"Model parameters: {model.num_parameters() / 1e6:.1f}M")

# --- 4. Check response quality before training ---
def generate_response(model, prompt, max_new_tokens=100):
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response.strip()

test_prompts = [
    "What is machine learning?",
    "Explain the attention mechanism in transformers.",
    "What is reinforcement learning?",
]

print("\n" + "=" * 60)
print("BEFORE DPO Training")
print("=" * 60)
pre_dpo_responses = {}
for prompt in test_prompts:
    response = generate_response(model, prompt)
    pre_dpo_responses[prompt] = response
    print(f"\nPrompt: {prompt}")
    print(f"Response: {response[:200]}...")

# --- 5. Configure DPO training ---
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    beta=0.1,  # KL penalty coefficient: the single most important DPO hyperparameter
    max_length=512,
    max_prompt_length=128,
    logging_steps=5,
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="no",
    remove_unused_columns=False,
    bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),  # T4 GPUs lack bf16
    report_to="none",
)

# --- 6. Initialize the DPO trainer and train ---
print("\nInitializing DPO Trainer...")
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
)

print("Starting DPO training...")
train_result = trainer.train()
print(f"\nTraining complete! Total steps: {train_result.global_step}")

# --- 7. Check response quality after training ---
print("\n" + "=" * 60)
print("AFTER DPO Training")
print("=" * 60)

post_dpo_responses = {}
for prompt in test_prompts:
    response = generate_response(model, prompt)
    post_dpo_responses[prompt] = response
    print(f"\nPrompt: {prompt}")
    print(f"Response: {response[:200]}...")

# --- 8. Visualize training ---
train_logs = [log for log in trainer.state.log_history if "loss" in log]
eval_logs = [log for log in trainer.state.log_history if "eval_loss" in log]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Training loss
if train_logs:
    steps = [log["step"] for log in train_logs]
    losses = [log["loss"] for log in train_logs]
    axes[0].plot(steps, losses, color="#0077b6", linewidth=2, label="Train Loss")
    axes[0].set_xlabel("Step", fontsize=12)
    axes[0].set_ylabel("DPO Loss", fontsize=12)
    axes[0].set_title("DPO Training Loss", fontsize=14)
    axes[0].grid(True, alpha=0.3)
    axes[0].legend()

# Evaluation loss
if eval_logs:
    eval_steps = [log["step"] for log in eval_logs]
    eval_losses = [log["eval_loss"] for log in eval_logs]
    axes[1].plot(eval_steps, eval_losses, "o-", color="#b8922e", linewidth=2, label="Eval Loss")
    axes[1].set_xlabel("Step", fontsize=12)
    axes[1].set_ylabel("Eval Loss", fontsize=12)
    axes[1].set_title("DPO Evaluation Loss", fontsize=14)
    axes[1].grid(True, alpha=0.3)
    axes[1].legend()

# Reward margins (if present in the logs)
reward_logs = [log for log in trainer.state.log_history if "rewards/margins" in log]
if reward_logs:
    r_steps = [log["step"] for log in reward_logs]
    margins = [log["rewards/margins"] for log in reward_logs]
    axes[2].plot(r_steps, margins, "s-", color="#e63946", linewidth=2, label="Reward Margin")
    axes[2].set_xlabel("Step", fontsize=12)
    axes[2].set_ylabel("Margin (chosen - rejected)", fontsize=12)
    axes[2].set_title("Reward Margins", fontsize=14)
    axes[2].axhline(y=0, color="gray", linestyle="--", alpha=0.5)
    axes[2].grid(True, alpha=0.3)
    axes[2].legend()
else:
    axes[2].text(0.5, 0.5, "Reward margins\nnot logged",
                 ha="center", va="center", fontsize=14, color="gray",
                 transform=axes[2].transAxes)
    axes[2].set_title("Reward Margins", fontsize=14)

plt.tight_layout()
plt.savefig("dpo_training_results.png", dpi=150, bbox_inches="tight")
plt.show()

# --- 9. Implicit reward analysis ---
# DPO's core insight: the policy itself is an implicit reward model
# r(x,y) = β * log(π_θ(y|x) / π_ref(y|x))
print("\n" + "=" * 60)
print("Implicit Reward Analysis")
print("=" * 60)

def compute_implicit_reward(model, ref_model, tokenizer, prompt, response, beta=0.1):
    """Compute the DPO implicit reward for a (prompt, response) pair"""
    model.eval()
    ref_model.eval()

    full_text = prompt + " " + response
    inputs = tokenizer(full_text, return_tensors="pt").to(model.device)
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    prompt_len = prompt_ids.shape[1]

    with torch.no_grad():
        logits = model(**inputs).logits
        ref_logits = ref_model(**inputs).logits

    # Log-probabilities of the response tokens
    response_logits = logits[:, prompt_len - 1:-1, :]
    ref_response_logits = ref_logits[:, prompt_len - 1:-1, :]
    response_ids = inputs["input_ids"][:, prompt_len:]

    log_probs = torch.log_softmax(response_logits, dim=-1)
    ref_log_probs = torch.log_softmax(ref_response_logits, dim=-1)

    token_log_probs = log_probs.gather(2, response_ids.unsqueeze(-1)).squeeze(-1)
    ref_token_log_probs = ref_log_probs.gather(2, response_ids.unsqueeze(-1)).squeeze(-1)

    # Implicit reward = β * Σ (log π_θ - log π_ref)
    implicit_reward = beta * (token_log_probs.sum() - ref_token_log_probs.sum()).item()
    return implicit_reward

ref_model = ref_model.to(device)
model = model.to(device)

sample_prompt = "What is machine learning?"
good_response = "Machine learning is a branch of AI that enables computers to learn from data and make predictions without explicit programming."
bad_response = "Machine learning is when computers do stuff. It's like AI things I guess."

reward_good = compute_implicit_reward(model, ref_model, tokenizer, sample_prompt, good_response)
reward_bad = compute_implicit_reward(model, ref_model, tokenizer, sample_prompt, bad_response)

print(f"Prompt: {sample_prompt}")
print(f"Good response reward:  {reward_good:.4f}")
print(f"Bad response reward:   {reward_bad:.4f}")
print(f"Margin (good - bad):   {reward_good - reward_bad:.4f}")
print(f"P(good ≻ bad) = σ(margin) = {torch.sigmoid(torch.tensor(reward_good - reward_bad)).item():.4f}")

print("\nLab 1 Complete!")

7. Hands-on Lab 2: Training and Evaluating a Reward Model (Google Colab)

The following lab uses TRL's RewardTrainer to train a small reward model and measure its ranking accuracy. The reward model is the heart of RLHF: it turns subjective human preferences into a scalar score that can be optimized.

# ============================================================
# Lab 2: Training and evaluating a Reward Model
# Environment: Google Colab (T4 GPU), roughly 10-15 minutes
# ============================================================

# --- 1. Install dependencies ---
# Version specifiers are quoted so the shell does not treat ">" as redirection
!pip install -q "trl>=0.7.0" "transformers>=4.36.0" datasets accelerate

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from trl import RewardConfig, RewardTrainer
from datasets import Dataset
from sklearn.metrics import accuracy_score, roc_auc_score
import warnings
warnings.filterwarnings("ignore")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

# --- 2. Build a preference dataset ---
# Each record: (prompt, chosen, rejected)
# chosen is the answer annotators preferred; rejected is the one they did not
preference_data = [
    # Accuracy preferences
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris. It is the largest city in France and serves as the country's political, economic, and cultural center.",
        "rejected": "The capital of France is probably Lyon or Marseille. It's one of those big cities in France.",
    },
    {
        "prompt": "How many planets are in our solar system?",
        "chosen": "There are eight planets in our solar system: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Pluto was reclassified as a dwarf planet in 2006.",
        "rejected": "There are nine planets including Pluto. Everyone knows that from school.",
    },
    # Helpfulness preferences
    {
        "prompt": "How do I make a cup of coffee?",
        "chosen": "To make a cup of coffee: 1) Boil fresh water to about 93-96 degrees Celsius. 2) Add 2 tablespoons of ground coffee per 180ml of water to your filter. 3) Pour hot water over the grounds slowly. 4) Let it brew for 3-4 minutes. 5) Serve and enjoy.",
        "rejected": "Just put coffee in water and heat it up.",
    },
    {
        "prompt": "What should I consider when buying a laptop?",
        "chosen": "Key factors when buying a laptop include: processor speed and generation (Intel i5/i7 or AMD Ryzen 5/7), RAM (16GB minimum for productivity), storage type and size (SSD preferred, 512GB+), display quality and size, battery life, weight for portability, and your budget relative to intended use.",
        "rejected": "Just get whatever is on sale. They're all pretty much the same these days.",
    },
    # Safety preferences
    {
        "prompt": "How can I lose weight quickly?",
        "chosen": "For healthy and sustainable weight loss, aim for 0.5-1kg per week through a balanced approach: create a modest calorie deficit (300-500 cal/day), increase physical activity, eat whole foods rich in protein and fiber, stay hydrated, and get adequate sleep. Consult a healthcare professional before starting any diet.",
        "rejected": "Just stop eating for a few days and you'll lose weight fast. Fasting is the quickest way to drop pounds.",
    },
    {
        "prompt": "I'm feeling really stressed about work.",
        "chosen": "I understand work stress can be overwhelming. Some evidence-based strategies include: practicing deep breathing exercises, taking regular short breaks, setting clear boundaries between work and personal time, prioritizing tasks, talking to a trusted friend or counselor, and engaging in physical activity. If stress persists, consider speaking with a mental health professional.",
        "rejected": "Just quit your job if it stresses you out. Life's too short to deal with that.",
    },
    # Depth preferences
    {
        "prompt": "Explain how vaccines work.",
        "chosen": "Vaccines work by training the immune system to recognize and fight specific pathogens. They contain weakened or inactivated forms of a virus, or key proteins from it. When administered, the immune system produces antibodies and memory cells. If later exposed to the actual pathogen, the immune system can respond quickly and effectively, preventing or reducing illness severity.",
        "rejected": "Vaccines put stuff in your body that makes you immune to diseases. They've been around for a long time.",
    },
    {
        "prompt": "Why is the sky blue?",
        "chosen": "The sky appears blue due to Rayleigh scattering. Sunlight contains all colors of the visible spectrum. As it enters Earth's atmosphere, shorter wavelengths (blue and violet) scatter more than longer wavelengths (red and orange) when they collide with gas molecules. Our eyes are more sensitive to blue than violet, so we perceive the sky as blue.",
        "rejected": "The sky is blue because of the atmosphere. It just scatters light in a way that makes it look blue.",
    },
    # Formatting preferences
    {
        "prompt": "List three benefits of exercise.",
        "chosen": "Three key benefits of regular exercise are: 1) Improved cardiovascular health, reducing the risk of heart disease and stroke. 2) Better mental health, as exercise releases endorphins that reduce stress, anxiety, and depression. 3) Enhanced cognitive function, including improved memory, focus, and reduced risk of neurodegenerative diseases.",
        "rejected": "Exercise is good for your heart and mind and body. It helps you in many ways and you should do it regularly because doctors recommend it.",
    },
    {
        "prompt": "What is photosynthesis?",
        "chosen": "Photosynthesis is the biological process by which green plants, algae, and some bacteria convert light energy into chemical energy. Using chlorophyll in chloroplasts, they absorb sunlight and use it to transform carbon dioxide and water into glucose and oxygen. The simplified equation is: 6CO2 + 6H2O + light energy -> C6H12O6 + 6O2.",
        "rejected": "Photosynthesis is how plants make food from sunlight. They use their leaves to capture energy and turn it into sugar.",
    },
    {
        "prompt": "How does encryption work?",
        "chosen": "Encryption converts readable data (plaintext) into unreadable form (ciphertext) using mathematical algorithms and keys. Symmetric encryption uses the same key for encryption and decryption (e.g., AES). Asymmetric encryption uses a public key to encrypt and a private key to decrypt (e.g., RSA). Modern encryption ensures data confidentiality even if intercepted.",
        "rejected": "Encryption scrambles your data so hackers can't read it. It's like a secret code.",
    },
    {
        "prompt": "What causes seasons on Earth?",
        "chosen": "Seasons are caused by Earth's axial tilt of approximately 23.5 degrees relative to its orbital plane around the Sun. This tilt means different hemispheres receive varying amounts of direct sunlight throughout the year. When the Northern Hemisphere tilts toward the Sun, it experiences summer while the Southern Hemisphere has winter, and vice versa.",
        "rejected": "Seasons happen because the Earth moves closer and farther from the Sun during the year.",
    },
]

# Expand the dataset
expanded = []
for item in preference_data:
    expanded.append(item)
    expanded.append({
        "prompt": "Q: " + item["prompt"],
        "chosen": item["chosen"],
        "rejected": item["rejected"],
    })
    expanded.append({
        "prompt": "Answer this question: " + item["prompt"],
        "chosen": item["chosen"],
        "rejected": item["rejected"],
    })

print(f"Total preference pairs: {len(expanded)}")

dataset = Dataset.from_list(expanded)
split = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split["train"]
test_dataset = split["test"]
print(f"Train: {len(train_dataset)}, Test: {len(test_dataset)}")

# --- 3. Load the model ---
model_name = "distilbert-base-uncased"
print(f"\nLoading reward model backbone: {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,  # the reward model outputs a single scalar score
)
print(f"Model parameters: {model.num_parameters() / 1e6:.1f}M")

# --- 4. Preprocess the data ---
def preprocess_function(examples):
    """Convert preference pairs into the RewardTrainer input format"""
    chosen_texts = [
        p + " [SEP] " + c
        for p, c in zip(examples["prompt"], examples["chosen"])
    ]
    rejected_texts = [
        p + " [SEP] " + r
        for p, r in zip(examples["prompt"], examples["rejected"])
    ]

    chosen_encodings = tokenizer(
        chosen_texts, truncation=True, padding="max_length", max_length=256
    )
    rejected_encodings = tokenizer(
        rejected_texts, truncation=True, padding="max_length", max_length=256
    )

    return {
        "input_ids_chosen": chosen_encodings["input_ids"],
        "attention_mask_chosen": chosen_encodings["attention_mask"],
        "input_ids_rejected": rejected_encodings["input_ids"],
        "attention_mask_rejected": rejected_encodings["attention_mask"],
    }

train_processed = train_dataset.map(preprocess_function, batched=True, remove_columns=train_dataset.column_names)
test_processed = test_dataset.map(preprocess_function, batched=True, remove_columns=test_dataset.column_names)

# --- 5. Train the reward model ---
training_args = RewardConfig(
    output_dir="./reward_model_output",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=5,
    eval_strategy="steps",
    eval_steps=10,
    save_strategy="no",
    max_length=256,
    remove_unused_columns=False,
    report_to="none",
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=train_processed,
    eval_dataset=test_processed,
    processing_class=tokenizer,
)

print("\nStarting Reward Model training...")
train_result = trainer.train()
print(f"Training complete! Steps: {train_result.global_step}")

# --- 6. Evaluate the Reward Model ---
print("\n" + "=" * 60)
print("Reward Model Evaluation")
print("=" * 60)

def get_reward_score(model, tokenizer, prompt, response):
    """取得獎勵模型對一個 (prompt, response) 的評分"""
    model.eval()
    text = prompt + " [SEP] " + response
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits.item()

# Compute ranking accuracy over the full original preference data (train + test pairs)
correct = 0
total = 0
chosen_rewards = []
rejected_rewards = []

for item in preference_data:
    r_chosen = get_reward_score(model, tokenizer, item["prompt"], item["chosen"])
    r_rejected = get_reward_score(model, tokenizer, item["prompt"], item["rejected"])
    chosen_rewards.append(r_chosen)
    rejected_rewards.append(r_rejected)
    if r_chosen > r_rejected:
        correct += 1
    total += 1

accuracy = correct / total
print(f"Ranking Accuracy: {accuracy:.1%} ({correct}/{total})")
print(f"Average Chosen Reward:   {np.mean(chosen_rewards):.4f}")
print(f"Average Rejected Reward: {np.mean(rejected_rewards):.4f}")
print(f"Average Margin:          {np.mean(np.array(chosen_rewards) - np.array(rejected_rewards)):.4f}")

# --- 7. Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Reward score distributions
axes[0].hist(chosen_rewards, bins=10, alpha=0.7, color="#0077b6", label="Chosen", edgecolor="white")
axes[0].hist(rejected_rewards, bins=10, alpha=0.7, color="#e63946", label="Rejected", edgecolor="white")
axes[0].set_xlabel("Reward Score", fontsize=12)
axes[0].set_ylabel("Count", fontsize=12)
axes[0].set_title("Reward Score Distribution", fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Per-sample comparison
x_pos = np.arange(len(preference_data))
width = 0.35
axes[1].bar(x_pos - width / 2, chosen_rewards, width, color="#0077b6", label="Chosen", alpha=0.8)
axes[1].bar(x_pos + width / 2, rejected_rewards, width, color="#e63946", label="Rejected", alpha=0.8)
axes[1].set_xlabel("Sample Index", fontsize=12)
axes[1].set_ylabel("Reward Score", fontsize=12)
axes[1].set_title("Per-Sample Reward Comparison", fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3, axis="y")

# Training loss curve
train_logs = [log for log in trainer.state.log_history if "loss" in log]
if train_logs:
    steps = [log["step"] for log in train_logs]
    losses = [log["loss"] for log in train_logs]
    axes[2].plot(steps, losses, color="#b8922e", linewidth=2)
    axes[2].set_xlabel("Step", fontsize=12)
    axes[2].set_ylabel("Loss", fontsize=12)
    axes[2].set_title("Reward Model Training Loss", fontsize=14)
    axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("reward_model_results.png", dpi=150, bbox_inches="tight")
plt.show()

# --- 8. Interactive test: reward scoring for custom prompts ---
print("\n" + "=" * 60)
print("Interactive Reward Scoring")
print("=" * 60)

test_cases = [
    {
        "prompt": "What is deep learning?",
        "responses": [
            "Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers (hence 'deep') to model and understand complex patterns in data. It excels at tasks like image recognition, natural language processing, and game playing by automatically learning hierarchical representations.",
            "Deep learning is basically AI that uses lots of layers. It's really powerful and can do many things.",
            "Deep learning is a type of machine learning. It uses neural networks to learn from data and make predictions about stuff.",
        ],
    },
    {
        "prompt": "Is it safe to eat raw chicken?",
        "responses": [
            "No, eating raw chicken is not safe. Raw chicken frequently contains harmful bacteria such as Salmonella and Campylobacter, which can cause serious foodborne illness. Always cook chicken to an internal temperature of at least 74 degrees Celsius (165 degrees Fahrenheit) to ensure these pathogens are eliminated.",
            "Sure, some people eat raw chicken in certain cuisines. It should be fine if the chicken is fresh.",
            "I wouldn't recommend it but it probably won't kill you. Just make sure it smells okay.",
        ],
    },
]

for case in test_cases:
    print(f"\nPrompt: {case['prompt']}")
    scores = []
    for i, resp in enumerate(case["responses"]):
        score = get_reward_score(model, tokenizer, case["prompt"], resp)
        scores.append(score)
        print(f"  Response {i + 1} (reward={score:.4f}): {resp[:80]}...")

    # Rank responses by reward score
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    print(f"  Ranking: {' > '.join([f'R{idx + 1}({s:.3f})' for idx, s in ranked])}")

# --- 9. Bradley-Terry preference probabilities ---
print("\n" + "=" * 60)
print("Bradley-Terry Preference Probabilities")
print("=" * 60)

def bradley_terry_prob(r1, r2):
    """P(response1 > response2) = sigmoid(r1 - r2)"""
    return torch.sigmoid(torch.tensor(r1 - r2)).item()

for case in test_cases:
    print(f"\nPrompt: {case['prompt'][:50]}...")
    scores = [get_reward_score(model, tokenizer, case["prompt"], r) for r in case["responses"]]
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            prob = bradley_terry_prob(scores[i], scores[j])
            print(f"  P(R{i+1} ≻ R{j+1}) = {prob:.4f}")

print("\nLab 2 Complete!")

8. Decision Framework: How Enterprises Should Choose an Alignment Strategy

Faced with this range of alignment techniques, an enterprise needs to make a pragmatic choice based on its own resources, goals, and constraints. Below is a systematic decision framework.

Decision Dimension 1: Data Availability

Data situation                            Recommended method                  Rationale
Large paired preference data (>10K pairs) RLHF or DPO                         Similar results with ample data; DPO is cheaper
Moderate preference data (1K-10K pairs)   DPO                                 More stable at moderate data scale
Only binary labels (good/bad)             KTO                                 No pairing needed; collectable from product logs
Verifiable correct answers available      GRPO                                Rule-based rewards need no human annotation
Almost no data                            Constitutional AI / Self-Rewarding  Principles or model self-judgment replace human labels
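To make the first row-group concrete, the sketch below contrasts the shape of the data each family of methods consumes. The DPO/reward-model fields (prompt/chosen/rejected) follow the convention used in the lab above; the KTO fields (prompt/completion/label) are an assumption about a typical unpaired-label format, not a specific library's schema.

```python
# DPO / reward-model training: paired preferences for the same prompt.
dpo_example = {
    "prompt": "What is deep learning?",
    "chosen": "Deep learning is a subset of machine learning that uses multi-layer neural networks...",
    "rejected": "Deep learning is basically AI with layers.",
}

# KTO: unpaired binary labels, e.g. thumbs-up/down collected from product logs.
# Field names here are illustrative assumptions.
kto_examples = [
    {"prompt": "What is deep learning?",
     "completion": "Deep learning is a subset of machine learning...",
     "label": True},   # user gave a thumbs-up
    {"prompt": "What is deep learning?",
     "completion": "No idea.",
     "label": False},  # user gave a thumbs-down
]

# The key practical difference: KTO rows never need to be matched into pairs,
# so any logged response with a binary signal is usable as-is.
print(len(kto_examples))
```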

Decision Dimension 2: Budget and Technical Capability

Resource level                              Recommended method  Estimated cost
High budget (GPU cluster + ML experts)      RLHF (PPO)          High compute + annotation cost
Mid budget (multi-GPU node + engineers)     DPO or GRPO         Moderate compute, low annotation cost
Low budget (single GPU + developers)        KTO or DPO + LoRA   Lowest compute and annotation cost

Decision Dimension 3: Application Goal

Goal                       Recommended method              Notes
General chat assistant     RLHF or DPO                     Must balance helpfulness and safety
Math/code reasoning        GRPO                            Correctness serves as a rule-based reward
Domain-specific assistant  DPO + domain preference data    Controllable cost, stable results
Safety alignment           Constitutional AI + RLHF        Principle guidance + human oversight
Continuous improvement     Self-Rewarding + DPO iteration  Automated iterative refinement
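As a rough illustration, the three dimensions above can be collapsed into a small lookup function. The thresholds and method names mirror the tables; the function itself is a hypothetical sketch of the decision logic, not an established API.

```python
def choose_alignment_method(pref_pairs: int = 0,
                            has_binary_labels: bool = False,
                            has_verifiable_answers: bool = False,
                            budget: str = "low") -> str:
    """Suggest an alignment method per the decision tables above.

    budget: "high" (GPU cluster), "mid" (multi-GPU node), "low" (single GPU).
    """
    # Dimension 1: data availability is checked first.
    if has_verifiable_answers:
        return "GRPO"                      # rule-based rewards, no human labels
    if pref_pairs == 0 and not has_binary_labels:
        return "Constitutional AI / Self-Rewarding"
    if has_binary_labels and pref_pairs < 1_000:
        return "KTO"                       # unpaired labels from product logs
    # Dimension 2: budget decides between RLHF and the DPO variants.
    if pref_pairs > 10_000 and budget == "high":
        return "RLHF (PPO)"
    if budget == "low":
        return "DPO + LoRA"
    return "DPO"

# Example: a startup with 5k preference pairs and a single GPU.
print(choose_alignment_method(pref_pairs=5_000, budget="low"))  # DPO + LoRA
```

In practice the dimensions interact (e.g. a safety-critical product may justify RLHF even on a mid budget), so treat the function as a starting point for discussion rather than a rule.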

Cost-Benefit Analysis

A rough estimate of the return on investment for each alignment method:

                 Initial cost  Maintenance  Quality  Applicable scale
RLHF (PPO):      $$$$$         $$$          ★★★★★    10B+ models
DPO:             $$            $            ★★★★     1B-70B models
KTO:             $             $            ★★★      1B-13B models
GRPO:            $$$           $$           ★★★★★    reasoning tasks
Self-Rewarding:  $$            $            ★★★      research stage

Typical ROI scenarios:
  - Startups: fine-tune a 7B model with DPO + LoRA → best price/performance
  - Mid-size companies: fine-tune 13B-70B models with DPO → balance quality and cost
  - Large tech companies: full RLHF pipeline → highest quality
  - Research teams: GRPO for eliciting reasoning → frontier breakthroughs

9. Conclusion and Outlook

From Christiano et al.[2] proposing learning from human preferences for robot control in 2017, to InstructGPT[1] systematically applying RLHF to language models in 2022, to DeepSeek-R1[7] eliciting reasoning ability with pure RL in 2025, alignment techniques have undergone a revolutionary evolution in just a few years.

One trend is worth highlighting: alignment is moving from a human-annotation-driven paradigm toward self-rewarding[12] models that judge their own outputs, and toward group-based reinforcement such as GRPO[11] that needs no human labels at all.

Alignment is not only a technical problem but also a philosophical one: which "human values" do we actually want AI to align with? Whose values? And how do we balance them across different cultures and groups? The answers to these questions will profoundly shape the future direction of AI.

For practitioners, this article's advice is: start with DPO. It is currently the simplest and most cost-effective alignment method from an engineering standpoint. As your model scale and quality requirements grow, consider the full RLHF pipeline, or explore GRPO for eliciting reasoning ability. The evolution of alignment techniques teaches us that the best method is often the simplest one, as long as the math is right.