Ch02: Supervised Learning

一句話解釋

從有標註的資料中學習 input → output 的映射，並期望在未見資料上表現良好。

基本框架

核心概念

監督式學習的組成

給定： - 訓練資料 $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ - $x_i \in \mathcal{X}$：輸入空間（input space） - $y_i \in \mathcal{Y}$：輸出空間（output space） 目標： 找一個 hypothesis $h: \mathcal{X} \to \mathcal{Y}$，使得： - $h(x) \approx y$ 對訓練資料成立 - $h(x)$ 在**未見資料**上也表現良好（generalization）

三個關鍵決策

決策	說明
1. Model（Hypothesis Space H）	定義候選模型的範圍例：線性模型、決策樹、神經網路
2. Loss Function	如何衡量預測錯誤例：0/1 loss、squared loss、cross-entropy
3. Optimization Algorithm	如何找到最佳 h 例：梯度下降、EM algorithm

Hypothesis Space

Hypothesis h：候選的「機器」（模型），從 hypothesis space $\mathcal{H}$ 中挑選。

$\mathcal{H}$ 的選擇決定了模型的表達能力和學習難度。

誤差類型

Training vs. Generalization

Training Error（經驗風險）： $$ \hat{R}(h) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}[h(x_i) \neq y_i] $$ $h$ 在訓練集上的錯誤率。 Generalization Error（真實風險）： $$ R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\mathbb{1}[h(x) \neq y]] $$ $h$ 在所有可能資料上的真實錯誤率（來自未知的真實分佈 $\mathcal{D}$）。

核心問題

⚠️ 關鍵挑戰：
Training error 低 $\neq$ Generalization error 低

Overfitting：模型在訓練資料上表現完美，但在新資料上表現差。

為什麼會 Overfitting？

模型太複雜： $\mathcal{H}$ 太大，能「記住」訓練資料的雜訊
訓練資料太少： 無法充分代表真實分佈
訓練時間太長： 過度優化訓練誤差

VC Dimension

模型複雜度

定義

Definition: VC Dimension

Vapnik-Chervonenkis Dimension** 衡量 hypothesis class $\mathcal{H}$ 的**複雜度（表達能力）。 Shattering： 若 $\mathcal{H}$ 能對某個大小為 $N$ 的資料集實現**所有可能的 labeling**（$2^N$ 種），稱 $\mathcal{H}$ **shatters** 這個資料集。 $$ VC(\mathcal{H}) = \max\{N : \mathcal{H} \text{ can shatter some dataset of size } N\} $$

直觀理解

VC dimension 是 $\mathcal{H}$ 能完美記住（shatter）的最大資料集大小。

VC dimension 越大 → 模型越複雜 → 表達能力越強 → 需要更多資料才能學好。

經典例子

例 1：平面上的直線分類器

$\mathcal{H} = {h(x, y) = \text{sign}(ax + by + c)}$

問：VC dimension 是多少？

答：$VC(\mathcal{H}) = 3$

證明概念：

存在 3 個點，可以被直線 shatter（任意 $2^3 = 8$ 種 labeling 都能分開）
任意 4 個點，至少有一種 labeling 無法被直線分開（XOR 問題）

例 2：$\mathbb{R}^n$ 上的線性分類器

$\mathcal{H} = {h(x) = \text{sign}(w^T x + b)}$，其中 $x \in \mathbb{R}^n$

\[VC(\mathcal{H}) = n + 1\]

意義：參數數量 = $n + 1$（$w$ 有 $n$ 個元素 + bias $b$）

一般來說，參數數量可以作為 VC dimension 的粗略估計。

VC Dimension 與 Generalization

Theorem: VC Generalization Bound

若 $VC(\mathcal{H}) = d$，則以機率至少 $1 - \delta$： $$ R(h) \leq \hat{R}(h) + \sqrt{\frac{d \log(N/d) + \log(1/\delta)}{N}} $$ 意義： - Generalization error 由 training error + 複雜度懲罰組成 - $d$ 越大 → 複雜度懲罰越大 → 需要更多資料 - $N$ 越大 → 複雜度懲罰越小 → generalization 越好

PAC Learning

可學習性理論

定義

Definition: PAC Learning

**Probably Approximately Correct Learning** 給定： - 誤差上限 $\epsilon$（**Approximately** Correct） - 失敗機率上限 $\delta$（**Probably**）若存在演算法，使用 $N$ 個訓練樣本後，以機率至少 $1 - \delta$ 輸出 hypothesis $h$ 滿足： $$ R(h) \leq \epsilon $$ 則稱 $\mathcal{H}$ 是 **PAC learnable**。

Sample Complexity

問：需要多少訓練資料 $N$？

Theorem: Haussler (有限 Hypothesis Space)

若 $|\mathcal{H}| < \infty$（有限個 hypothesis），則： $$ N \geq \frac{1}{\epsilon} \left( \ln |\mathcal{H}| + \ln \frac{1}{\delta} \right) $$ 保證以機率 $1 - \delta$，generalization error $\leq \epsilon$。

三個變數的關係

**1. $\epsilon$ 越小（要求越精確）** $$ N \propto \frac{1}{\epsilon} $$ 需要更多資料 **2. $\delta$ 越小（要求更有把握）** $$ N \propto \ln \frac{1}{\delta} $$ 需要更多資料（但只是對數成長） **3. $|\mathcal{H}|$ 越大（模型越複雜）** $$ N \propto \ln |\mathcal{H}| $$ 需要更多資料

實務意涵

Occam’s Razor（奧坎剃刀原則）：
在表現相同時，偏好較簡單的模型。

原因：

簡單模型需要較少資料
泛化能力更好
更易理解和維護

⚠️ No Free Lunch Theorem：
沒有一個演算法在所有問題上都最優。

必須根據問題特性和先驗知識選擇合適的 $\mathcal{H}$。

Regression vs. Classification

兩種主要任務

	Classification	Regression
Output 空間	離散（class label）	連續（實數）
例子	垃圾郵件偵測、圖像識別	房價預測、溫度預測
常見 Loss	0/1 loss, cross-entropy	Squared loss, absolute loss
評估指標	Accuracy, Precision, Recall, F1	MSE, MAE, $R^2$

共同點：

都屬於 supervised learning
都需要標註資料
都面臨 overfitting 問題
都可用 VC dimension 和 PAC learning 理論分析

總結

監督式學習的核心：

從 hypothesis space $\mathcal{H}$ 中選擇最佳 $h$
平衡 training error 與 generalization error
模型複雜度與資料量的權衡

理論工具：

VC Dimension： 衡量模型複雜度
PAC Learning： 分析樣本複雜度
Occam’s Razor： 偏好簡單模型

下一步：

Bayesian Decision Theory：如何做最優決策
具體演算法：Linear Models, SVM, Decision Trees 等

參考資料

Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 2-6
Abu-Mostafa et al., Learning from Data, Chapters 1-2
Vapnik, The Nature of Statistical Learning Theory

決策	說明
1. Model（Hypothesis Space H）	定義候選模型的範圍例：線性模型、決策樹、神經網路
2. Loss Function	如何衡量預測錯誤例：0/1 loss、squared loss、cross-entropy
3. Optimization Algorithm	如何找到最佳 h 例：梯度下降、EM algorithm