[Study Notes] Hung-yi Lee's Machine Learning

A speed-run through the course.

General Guide to Machine Learning Tasks

Possible problems:

  1. Overfitting
  2. Model Bias (the model is not flexible enough to contain a good function)
  3. Optimization Issue (the optimizer fails to find a good solution even though one exists)

Optimization Issue

Method: Start from shallower networks (or other models that are easier to optimize). If a deeper network cannot even match their training loss, the problem is optimization rather than model bias.

Overfitting

Small loss on training data, large loss on testing.

Method:

  • More training data.
  • Data augmentation.
  • Constrain the model (too many constraints lead back to model bias: a trade-off):
    1. Less parameters, sharing parameters.
    2. Less features
    3. Early stopping
    4. Regularization
    5. Dropout

Cross Validation

Split the training set into a (smaller) training set and a validation set.

N-fold Cross Validation

Split the data into $N$ folds; in turn take each fold as the validation set and the other $N-1$ folds as the training set, for $N$ runs in total.

This avoids the case where a single validation split happens to be chosen badly.

Mismatch

Training data and testing data have different distributions.

When Gradient is Small

Critical Point (where gradient is small):

  • Local Minima
  • Saddle Point.

How can we tell which one it is?

Taylor series approximation:

$$L(\theta) \approx L(\theta')+(\theta-\theta')^{\mathrm{T}} g+\frac{1}{2}(\theta-\theta')^{\mathrm{T}} H(\theta-\theta')$$

where $g$ is the gradient and $H$ is the Hessian matrix.

At a critical point, $g=0$.

  • If $H$ is positive definite (all eigenvalues positive): local minimum;
  • If $H$ is negative definite (all eigenvalues negative): local maximum;
  • Otherwise: saddle point.

In other words, at a saddle point, if $H$ has an eigenvalue $\lambda<0$ with eigenvector $u$, then updating along $u$ decreases the loss (see the sketch below).
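
A minimal numpy sketch of this rule on a made-up quadratic loss $L(w_1, w_2) = w_1^2 - w_2^2$, whose critical point at the origin is a saddle point (the loss and the step size are my own illustration, not from the lecture):

```python
import numpy as np

def grad(theta):
    w1, w2 = theta
    return np.array([2 * w1, -2 * w2])           # gradient of L(w1, w2) = w1^2 - w2^2

def hessian(theta):
    return np.array([[2.0, 0.0], [0.0, -2.0]])   # constant Hessian for this quadratic loss

theta = np.array([0.0, 0.0])           # the critical point: g = 0 here
eigvals, eigvecs = np.linalg.eigh(hessian(theta))  # eigh: the Hessian is symmetric

if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
else:
    # saddle point: move along an eigenvector u whose eigenvalue is negative,
    # since L(theta' + u) ~ L(theta') + 0.5 * lambda * ||u||^2 < L(theta')
    u = eigvecs[:, np.argmin(eigvals)]
    print("saddle point, new point after escaping:", theta + 0.1 * u)
```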

Saddle Point v.s. Local Minima

A local minimum in low dimensions may be a saddle point when viewed in higher dimensions.

$$\text{Minimum ratio} = \frac{\text{Number of Positive Eigenvalues}}{\text{Number of Eigenvalues}}$$

Batch & Momentum

Batch

  • Large batch size does not require longer time to compute the gradient (up to the GPU's parallelism)
  • Smaller batch requires longer time for one epoch (longer time for seeing all data once)

A large batch size may cause optimization to fail.

  • Smaller batch size has better performance
  • “Noisy” update is better for training (wow!)
  • Small batch is better on testing data (!). One explanation: minima can be flat or sharp (flat meaning more "rounded"); since the testing loss surface may be shifted from the training one, the loss changes only slightly around a flat minimum.

Momentum

Momentum: movement is determined not only by the current gradient but also by the previous movement:

$$m^i = \lambda m^{i-1} - \eta g^{i-1}$$
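
A tiny numpy sketch of this update on a toy loss $L(\theta)=\theta^2$ (the loss and the hyperparameters are my own choices); the movement $m^i$ is added to the parameters, so each step also carries the previous direction:

```python
import numpy as np

def gradient(theta):
    return 2 * theta            # gradient of the toy loss L(theta) = theta^2

theta = np.array([5.0])
m = np.zeros_like(theta)        # movement m^0 = 0
lam, eta = 0.9, 0.1             # lambda: momentum coefficient, eta: learning rate

for _ in range(200):
    g = gradient(theta)
    m = lam * m - eta * g       # m^i = lambda * m^{i-1} - eta * g^{i-1}
    theta = theta + m           # the update includes the previous movement
print(theta)                    # approaches the minimum at 0
```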

Error Surface is Rugged

Training Stuck $\ne$ Small Gradient

Different parameters need different learning rates.

$$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t$$

Root Mean Square

$$\sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{\tau=0}^t (g_i^\tau)^2}$$

Used in Adagrad.

RMSProp

Let $0<\alpha<1$,

$$\sigma_i^t = \sqrt{\alpha(\sigma_i^{t-1})^2 + (1-\alpha)(g_i^t)^2}$$

Adam: RMSProp + Momentum
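
A rough numpy sketch of an "RMSProp + Momentum" update on a toy problem with very different gradient scales; it uses moving averages for both $m$ and $\sigma$, as Adam does, but omits Adam's bias correction, and all constants are my own choices:

```python
import numpy as np

def gradient(theta):
    return np.array([2 * theta[0], 200 * theta[1]])   # very different gradient scales

theta = np.array([1.0, 1.0])
m = np.zeros(2)                  # moving average of gradients (momentum)
sigma2 = np.zeros(2)             # (sigma_i^t)^2: moving average of squared gradients
eta, lam, alpha, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1000):
    g = gradient(theta)
    m = lam * m + (1 - lam) * g                     # momentum term
    sigma2 = alpha * sigma2 + (1 - alpha) * g ** 2  # RMSProp-style sigma
    theta = theta - eta / (np.sqrt(sigma2) + eps) * m
print(theta)
```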

Learning Rate Scheduling

  • Learning Rate Decay: as training proceeds, gradually reduce the learning rate.
  • Warm Up: increase the learning rate first, then decrease it.

Why warm up? One explanation: when $t$ is small, the estimate of $\sigma_i^t$ has large variance (see RAdam). A schedule sketch follows below.
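
One possible warm-up-then-decay schedule, as a sketch (linear warm-up followed by inverse-square-root decay; the exact shape and constants are my own choice, not something fixed by the lecture):

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=1000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # linear warm-up
    return peak_lr * (warmup_steps / step) ** 0.5   # inverse-sqrt decay afterwards

print([round(learning_rate(s), 6) for s in (1, 500, 1000, 4000, 16000)])
```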

Various Improvements

$$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta_t}{\sigma_i^t} m_i^t$$

Classification

Class as one-hot vector: this avoids the implicit ordering among classes that appears when they are encoded directly as integers.

Logits: the values fed into Softmax; Softmax turns them into a probability distribution $y'$.

Mean Square Error (MSE): $e=\sum_i (\hat{y}_i - y_i')^2$.

Cross-Entropy: $e=-\sum_i \hat{y}_i \ln y_i'$ (minimizing cross-entropy is equivalent to maximizing likelihood, and it is better suited to classification).
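
A small numpy comparison of the two losses on a made-up 3-class example, with $y'$ obtained from the logits via softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

y_hat = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
z = np.array([2.0, 1.0, 0.1])       # logits
y_prime = softmax(z)

mse = np.sum((y_hat - y_prime) ** 2)
cross_entropy = -np.sum(y_hat * np.log(y_prime))
print(mse, cross_entropy)
```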

Pokemon (?)

Hoeffding’s Inequality

$$P(D_{\text{train}} \text{ is bad due to } h) \le 2 \exp(-2N \varepsilon^2)$$

$$P(D_{\text{train}} \text{ is bad}) \le |\mathcal{H}| \cdot 2 \exp(-2N \varepsilon^2)$$
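
Plugging arbitrary numbers into the union bound above shows how quickly it shrinks as $N$ grows (the chosen $N$, $\varepsilon$, and $|\mathcal{H}|$ are purely illustrative):

```python
import math

def bad_prob_bound(N, eps, H_size):
    return H_size * 2 * math.exp(-2 * N * eps ** 2)

for N in (100, 1000, 10000):
    print(N, bad_prob_bound(N, eps=0.1, H_size=10000))
```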

Convolutional Neural Network (CNN)

CNN is not invariant to scaling and rotation, so we need data augmentation.

Why Validation Set but Still Overfit?

Each time, we effectively pick, from several candidate models, the one that performs best on the validation set.

But if there are too many candidates, we may still overfit (this can be seen as "training" on the validation set through the choice among candidates).

Why Deep Learning?

$$h^{all} = \argmin_{h\in \mathcal{H}} L(h, \mathcal{D}_{all})$$

How can we make the loss small while also keeping the number of candidates in $\mathcal{H}$ small?

A deep network can get by with a relatively small $|\mathcal{H}|$, whereas a shallow network needs a relatively large $|\mathcal{H}|$ ($|\mathcal{H}|$ is related to the number of parameters).

Deep networks outperform shallow ones when the required functions are complex and regular.

Generation

Network as Generator

$x$ + a sample from a simple distribution (whose formulation we know) $\to y$

Why Distribution?

Especially for tasks that require creativity, where the same input can have several acceptable outputs.

Generative Adversarial Network (GAN)

Unconditional Generation

Example: there is no $x$; the input is sampled from a normal distribution, and the generator must turn it into anime faces (a complex distribution).

Discriminator

It is also a function (a neural network): it takes an image and outputs a scalar indicating how real the image looks.

Algorithm

  • Initialize Generator and Discriminator
  • In each training iteration (see the sketch after this list):
    1. Fix generator $G$ and update discriminator $D$: sample some real objects from the database and some objects generated by $G$, and train $D$ to tell them apart;
    2. Fix discriminator $D$ and update generator $G$: the generator learns to "fool" the discriminator, i.e. $G$ is updated so that its outputs get higher scores from $D$.
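
A minimal PyTorch sketch of this alternating procedure on a toy 2-D "real" distribution; the architectures, data, and hyperparameters are all my own choices, not from the lecture:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))                 # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

def sample_real(n):                     # stands in for sampling from the database
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(1000):
    # 1. Fix G, update D: distinguish real samples from generated ones
    real = sample_real(64)
    fake = G(torch.randn(64, 8)).detach()        # detach: G is fixed in this step
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # 2. Fix D, update G: make generated samples score high under D ("fool" it)
    fake = G(torch.randn(64, 8))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
```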

Theory Behind GAN

Objective:

$$G^* = \argmin_G \operatorname{Div}(P_G, P_{data})$$

where $P_G$ and $P_{data}$ are distributions and $\operatorname{Div}$ is a divergence between them.

How to compute Divergence?

PGP_GPdataP_{data} 里 sample 即可,Discriminator DD 起到了计算 Divergence 的功能。

  • Training: D=arg maxDV(D,G)D^* = \argmax_D V(D, G)
  • Objective Function

V(G,D)=EyPdata[logD(y)]+EzPG[log(1D(G(z)))]V(G, D) = \mathbb{E}_{y\sim P_{data}}[\log D(y)] + \mathbb{E}_{z\sim P_G}[\log(1-D(G(z)))]

其中 maxDV(D,G)\max_D V(D, G) 和 JS divergence 有关。

Tips for GAN

JS divergence is not suitable

In most cases, $P_G$ and $P_{data}$ do not overlap.

When the two distributions do not overlap, the JS divergence is always $\log 2$.

Intuition: if two distributions do not overlap, a binary classifier can always achieve $100\%$ accuracy.

Wasserstein distance - WGAN

The smallest average distance needed to "move" one distribution onto the other (earth mover's distance).

$$\max_{D\in 1\text{-Lipschitz}} \{\mathbb{E}_{y \sim P_{data}}[D(y)]-\mathbb{E}_{y \sim P_G}[D(y)]\}$$

The constraint means that $D$ has to be sufficiently smooth. How can it be enforced? See Improved WGAN (gradient penalty), Spectral Normalization (SNGAN), ...
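
A sketch of the critic objective with the original WGAN weight-clipping trick as a crude way to keep $D$ roughly 1-Lipschitz (the network, optimizer, and clipping constant are my own choices; Improved WGAN and SNGAN replace clipping with gradient penalty and spectral normalization):

```python
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # critic: no sigmoid
opt_D = torch.optim.RMSprop(D.parameters(), lr=5e-5)

def critic_step(real, fake, clip=0.01):
    # maximize E_data[D(y)] - E_G[D(y)]  ==  minimize the negative
    loss = -(D(real).mean() - D(fake).mean())
    opt_D.zero_grad()
    loss.backward()
    opt_D.step()
    for p in D.parameters():            # weight clipping after each update
        p.data.clamp_(-clip, clip)
    return loss.item()

# dummy tensors standing in for samples of P_data and P_G:
print(critic_step(torch.randn(64, 2) + 2.0, torch.randn(64, 2)))
```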

GAN is still challenging…

Generator and Discriminator need to match each other.

GAN for Sequence Generation

When taking the max logit, a small change in the decoder may not change the output tokens at all, so the discriminator provides no useful training signal.

Then why is max pooling in CNNs not a problem?

  • Because GAN training has to alternate between the generator and the discriminator.

One can try reinforcement learning, but that is also very hard.

See ScratchGAN.

Evaluation of Generation

How to evaluate the quality of the generated image?

Image $\to$ off-the-shelf image classifier $\to$ a concentrated class distribution means higher visual quality.

Diversity:

  • Mode Collapse: the generated samples all look alike (low diversity)
  • Mode Dropping: some modes of the real distribution never appear in the generated samples

One way to check diversity: the Inception Score (IS), computed with the Inception network (a network trained for classification).

$$\frac{1}{N} \sum_{i=1}^N P(c~|~y^i)$$

Good quality: the class distribution of a single image is concentrated; high diversity: the average class distribution over many images is spread out.

Frechet Inception Distance (FID)

Take the feature vectors just before the final softmax, fit a Gaussian to the real features and another to the generated features, and compute the Fréchet distance between the two Gaussians. Smaller is better.
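
A sketch of the FID computation between two sets of feature vectors; random vectors stand in for the Inception features here:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real, feat_gen):
    mu1, mu2 = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov1 = np.cov(feat_real, rowvar=False)
    cov2 = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):        # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean)

print(fid(np.random.randn(500, 16), np.random.randn(500, 16) + 0.5))
```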

Possible issues:

  • Are the features really Gaussian?
  • Many samples are needed.

We don't want the GAN to simply memorize the training data.

Conditional Generation (CGAN)

$x$ + $z$ sampled from a normal distribution $\to y$

A better discriminator takes the pair $(x, y)$ $\to$ a scalar: whether $y$ is realistic and whether $x$ and $y$ are matched.

$x$ can be text or an image (pix2pix, ...).

Learning from Unpaired Data

Using GAN on unpaired (unsupervised) data.

We only have the domains $\mathcal{X}$ and $\mathcal{Y}$, with no paired data.

Cycle GAN

Sampling only from domain $\mathcal{X}$ and keeping everything else as above is not enough: the generator could even ignore the input, making the input and output completely unrelated.

Train two generators $G_{x\to y}$ and $G_{y\to x}$, and one discriminator $D$.

Require that $x$ passed through $G_{x\to y}$ gives a $y$ that passes discriminator $D$, and that passing this $y$ through $G_{y\to x}$ gives an $x'$ as close to $x$ as possible (cycle consistency).

This way $G_{x\to y}$ cannot ignore the input $x$ (see the sketch below).
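
A toy PyTorch sketch of the generator loss with the cycle-consistency term, using small MLPs on vectors instead of image networks (the architectures and the weighting factor are my own simplification):

```python
import torch
import torch.nn as nn

G_xy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
G_yx = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
D_y  = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

x = torch.randn(64, 2)                   # samples from domain X
y_fake = G_xy(x)                         # should fool D_y
x_back = G_yx(y_fake)                    # should reconstruct x

adv_loss = nn.BCELoss()(D_y(y_fake), torch.ones(64, 1))
cycle_loss = nn.L1Loss()(x_back, x)      # cycle consistency: x' close to x
g_loss = adv_loss + 10.0 * cycle_loss    # 10.0 is an arbitrary weighting
# g_loss.backward() would then drive the update of G_xy (and G_yx)
```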

But could the mapping the machine learns be different from the one we want?

Empirically it seems to work well.

Of course, one can also train two discriminators, $D_x$ and $D_y$, and run the same process in the reverse direction, with $D_x$ in the middle.

See SELFIE2ANIME

Self-Attention

Sophisticated Input

A set of vectors.

Self-Attention for Speech

If the sliding window is 10 ms, the number of vectors becomes too large.

Use Truncated Self-Attention: only attend within a limited range.

Self-Attention v.s. CNN

CNN can be viewed as a simplified special case of self-attention.

  • CNN: Good for less data.
  • Self-Attention: Good for more data.

Self-Attention v.s. RNN

RNN: information has to travel across the distance between positions, and the computation is non-parallel.

Self-Attention for Graph

Consider edge: only consider attention to connected nodes.

This is one type of Graph Neural Network (GNN).

Batch-Normalization

Can we directly change the landscape of the error surface?

Suppose the inputs are $x_1$ and $x_2$, and across the whole dataset $x_1$ is always small while $x_2$ is always large.

$x_1$ is multiplied by $w_1$ and $x_2$ by $w_2$. Because $x_1$ is small, a change $\Delta w_1$ has little effect on the loss; likewise, $\Delta w_2$ has a large effect. As a result, the gradients of $w_1$ and $w_2$ differ greatly in magnitude.

Feature Normalization

For $x^r = (x_1^r, x_2^r, \cdots, x_d^r)$, compute for each dimension $i$ the mean $\mu_i$ and standard deviation $\sigma_i$ over the data:

$$\hat{x}_i^r = \frac{x_i^r - \mu_i}{\sigma_i}$$

That is, each dimension now has mean 0 and variance 1.

Because computing $\mu_i, \sigma_i$ requires all the data, this effectively feeds the whole dataset in at once. Since the dataset may be too large, in practice the normalization is only done over one batch at a time $\to$ batch normalization.

Since $\mu_i, \sigma_i$ are estimated from the batch, the batch size cannot be too small. We can also add learnable $\gamma, \beta$ to adjust the resulting distribution:

$$\hat{x}^i \leftarrow \gamma \odot \hat{x}^i + \beta$$

Batch Normalization - Testing

At testing time there may not be a batch, so $\mu_i, \sigma_i$ cannot be computed directly. Instead, use the moving average of the $\mu_i, \sigma_i$ computed over the batches during training:

$$\bar{\mu} \leftarrow \lambda \bar{\mu} + (1-\lambda) \mu$$
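
A numpy sketch of batch normalization that keeps the moving averages for test time ($\gamma$, $\beta$, and the momentum $\lambda$ are kept as plain arrays and constants):

```python
import numpy as np

class BatchNorm:
    def __init__(self, dim, lam=0.99, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.mu_bar, self.sigma_bar = np.zeros(dim), np.ones(dim)
        self.lam, self.eps = lam, eps

    def __call__(self, x, training=True):
        if training:
            mu, sigma = x.mean(axis=0), x.std(axis=0)
            # moving averages, to be used at test time
            self.mu_bar = self.lam * self.mu_bar + (1 - self.lam) * mu
            self.sigma_bar = self.lam * self.sigma_bar + (1 - self.lam) * sigma
        else:
            mu, sigma = self.mu_bar, self.sigma_bar
        x_hat = (x - mu) / (sigma + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm(3)
out = bn(np.random.randn(32, 3) * 10 + 5)   # one training batch
print(out.mean(axis=0), out.std(axis=0))    # roughly mean 0, std 1
```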

Internal Covariate Shift?

Transformer

Decoder - Non-Autoregressive (NAT)

How to decide the output length for NAT decoder?

  • Another predictor for output length;
  • Output a very long sequence and truncate it at EOS;

Advantages: parallel decoding, controllable output length.

NAT is usually worse than AT. (Why? Multi-modality.)

Seq2Seq Tips

Copy Mechanism

Guided Attention

In some tasks, input and output are monotonically aligned.

Monotonic Attention, Location-aware attention.

Greedy decoding: always choose the word with highest probability.

Not possible to check all the paths.

Beam Search: at each step, keep only the top $k$ paths (see the toy example below).
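
A toy beam search over a fixed per-step probability table (a real decoder's probabilities would depend on the tokens chosen so far, which this example ignores):

```python
import math

probs_per_step = [
    {"A": 0.6, "B": 0.4},
    {"A": 0.4, "B": 0.6},
    {"A": 0.9, "B": 0.1},
]

def beam_search(steps, k=2):
    beams = [("", 0.0)]                  # (sequence, log-probability)
    for dist in steps:
        candidates = [(seq + tok, lp + math.log(p))
                      for seq, lp in beams for tok, p in dist.items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]           # keep only the top-k paths
    return beams

print(beam_search(probs_per_step))
```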

Optimizing Evaluation Metrics?

How to do the optimization? USE REINFORCEMENT LEARNING.

Scheduled Sampling

Various Kinds of Attention

Local Attention / Truncated Attention

Only compute attention within a neighborhood: similar to CNN.

Stride Attention

Attend with a stride, skipping positions.

Global Attention

Add special token into original sequence.

  • Attend to every token $\to$ collect global information.
  • Attended by every token $\to$ every token knows the global information.

No attention between non-special tokens.

With multi-head attention, the different attention patterns above can be used at the same time (one pattern per head).

Focus on Critical Parts

Set small attention values directly to $0$.

How to quickly estimate the portion with small attention weights?

  • Clustering: put similar vectors into the same cluster and only compute attention within a cluster.

  • Learnable Patterns: Sinkhorn Sorting Network.

Do we need full attention matrix?

Choose some representative keys (why not queries? reducing the queries would change the output sequence length).

How to reduce number of keys?

  • Compressed Attention
  • Linformer (apply different linear transformations to the keys and combine them)

Attention-Free?

Self-Supervised Learning

BERT

GPT

Predict Next Token

Prompt Tuning.

Self-Supervised Learning for Speech

Quite a lot of material is skipped here.

Domain Adaptation

Domain Shift

The input data of the source domain and the target domain follow different distributions.

If we have some knowledge of the target domain, we can train on the source domain first and then fine-tune with the target data.

Challenges: the target data is scarce, so we must avoid overfitting; and what if the target data has no labels?

Basic Idea

Train a feature extractor that, for example in colored-digit recognition, learns to ignore the colors.

Concretely, feed both the source data and the target data into the same feature extractor, with the goal of making their features follow similar distributions.

Domain Adversarial Training

Train a domain classifier alongside the feature extractor and let them play an adversarial game, similar to a GAN.

Why doesn't the feature extractor simply output all zeros?

Because the feature extractor is also followed by a label predictor, so it has to keep enough features to support that predictor.

The feature extractor's loss can be seen as: the loss on the label-predictor side minus the loss on the domain-classifier side.

Unknown Label?

What if the target domain contains labels that the source domain does not have?

Domain Generalization

What if we know nothing about the target domain?

Use data augmentation to broaden the source domain.

Reinforcement Learning

Intro

After fast-forwarding through a pile of material, we finally reach RL!

In self-supervised learning there are actually still labels; we just don't have to annotate them ourselves.

Environment 给 Actor 一个 Observation, Actor 输出一个 Action 给 Environment, Environment 给 Actor 一个 Reward 和新的 Observation.

The function to find is the actor, $f(\text{Observation}) = \text{Action}$.

  • Step 1: function with unknown parameters
  • Step 2: define loss function
  • Step 3: optimization

Step 1: Function with Unknown Parameters

Policy Network (Actor)

Input: the observation,

Output: each action corresponds to a neuron in output layer.

The action is usually sampled using the scores as probabilities, rather than taking the argmax directly.

Step 2: Define Loss Function

The whole process from start to finish, i.e. the entire sequence of actions, is called an episode.

Each action $a_t$ receives a reward $r_t$, so the total reward (return) is $R = \sum_{t=1}^T r_t$.

Our goal is therefore to maximize $R$, or equivalently to minimize $-R$ as the loss.

Step 3: Optimization

Actor: $s_i \to a_i$; Env: $a_i \to s_{i+1}$.

Reward function: $(s_i, a_i) \to r_i$.

The difficulty is that the environment and the reward are black boxes with randomness.

How to do the optimization here is the main challenge in RL.

Policy Gradient

How to control your Actor

For $s_i$ and a candidate action $\hat{a}_i$, use $e_i$ (with sign $\pm 1$) to express whether this training example means "take" or "do not take" $\hat{a}_i$.

We then assign each pair $(s_i, \hat{a}_i)$ a value $A_i$ describing how good this action is, and set $L=\sum_i A_i e_i$.

Version 0 : Wrong !

Directly use the immediate reward $r_i$ as $A_i$.

This is a short-sighted version.

An action affects the subsequent observations and thus subsequent rewards.

  • Reward Delay: Actor has to sacrifice immediate reward for future reward.

Version 1

Cumulated reward: use $G_i = \sum_{j\ge i} r_j$ as $A_i$. The problem is that rewards far in the future are not necessarily credit of the earliest action.

Version 2

Version 1 plus a discount factor $\gamma < 1$: $G_i' = \sum_{j\ge i} \gamma^{j-i} r_j$.

Version 3

Good or bad reward is “relative”.

Subtract a baseline $b$, so that $G'_t - b$ takes both positive and negative values (see the sketch below).
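
A small numpy sketch computing the three versions of $A_t$ for one episode (the rewards and the constant baseline are made-up numbers):

```python
import numpy as np

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]
gamma, baseline = 0.9, 0.5
T = len(rewards)

G = np.array([sum(rewards[t:]) for t in range(T)])                 # Version 1: cumulated reward
G_disc = np.array([sum(gamma ** (j - t) * rewards[j]
                       for j in range(t, T))
                   for t in range(T)])                             # Version 2: discounted
A = G_disc - baseline                                              # Version 3: minus a baseline
print(G, G_disc, A)
```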

Policy Gradient Theorem

  • Initialize actor network parameters $\theta^0$.
  • For training iteration $i=1 \sim T$:
    • Using actor $\theta^{i-1}$ to interact
    • Obtain data $\{ s_1, a_1 \}, \{ s_2, a_2 \}, \cdots, \{ s_N, a_N \}$
    • Compute $A_1, A_2, \ldots , A_N$
    • Compute loss $L$
    • $\theta^i \leftarrow \theta^{i-1} - \eta \nabla L$

Data collection happens inside each iteration: after every parameter update, the data has to be collected again.

  • On-Policy: The actor to train and the actor for interacting is the same.
  • Off-Policy: The actor to train and the actor for interacting is different.

Off-policy training does not require collecting new data after every update (how is that done???).

Off-Policy: Proximal Policy Optimization (PPO)

The actor to train has to know its difference from the actor to interact.

Collecting Training Data: Exploration

The actor needs to have randomness during data collection.

Actor-Critic

Critic: given actor $\theta$, how good it is when observing $s$ (and taking action $a$).

Value Function

Value function $V^\theta(s)$: when using actor $\theta$, the discounted cumulated reward expected to be obtained after seeing $s$. A kind of "prediction before the fact": the expected reward obtainable by acting (stochastically) according to $\theta$ starting from $s$.

The output values of a critic depend on the actor being evaluated (i.e. $\theta$).

How to estimate $V^\theta(s)$?

  • Monte-Carlo (MC) based approach: the critic watches actor $\theta$ interact with the environment and uses the complete episode. (But some games may never end.)
  • Temporal-Difference (TD) approach.

$$V^\theta(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots, \quad V^\theta(s_{t+1}) = r_{t+1} + \gamma r_{t+2} + \cdots$$

$$V^\theta(s_t) = r_t + \gamma V^\theta(s_{t+1})$$
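
A sketch of a TD-style update of a tabular value estimate, using $r_t + \gamma V^\theta(s_{t+1})$ as the target for $V^\theta(s_t)$; the toy episode below is made up, and a neural-network critic would simply regress on the same target:

```python
import numpy as np

V = np.zeros(4)                        # value estimates for 4 toy states
gamma, lr = 0.9, 0.1
episode = [(0, 0.0, 1), (1, 1.0, 2), (2, 0.0, 3)]   # (s_t, r_t, s_{t+1})

for s_t, r_t, s_next in episode:
    td_target = r_t + gamma * V[s_next]
    V[s_t] += lr * (td_target - V[s_t])   # move V(s_t) toward the TD target
print(V)
```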

MC v.s. TD

The two approaches can produce different estimates from the same data.

Version 3.5

In Version 3, for $\{ s_t, a_t \}$ take $A_t = G_t' - V^\theta(s_t)$ (i.e. use $b = V^\theta(s_t)$ as the baseline).

But this $G'_t$ comes from a single sampled trajectory, so it is still rather noisy.

Version 4

GtG'_t 换成 rt+γVθ(st+1)r_t + \gamma V^\theta(s_{t+1}). 其实就是 E(Gt)\mathbb{E}(G'_t).

“Advantage Actor-Critic”

Tip of Actor-Critic

  • The parameters of actor and critic can be shared.

Deep Q Network (DQN)

Reward Shaping

Sparse Reward

Sparse feedback: most actions receive no explicit reward.

The developers define extra rewards to guide agents $\to$ reward shaping.

VizDoom

This requires a lot of domain knowledge.

Curiosity

Obtaining extra reward when the agent sees something new (but meaningful)

No Reward: Learning from Demonstration

Motivation:

  • Even defining the reward can be challenging in some tasks.
  • Hand-crafted rewards can lead to uncontrolled behavior.

Imitation Learning

Actor can interact with the environment, but reward function is not available.

We have demonstrations of the expert. Each $\hat{\tau}$ is a trajectory of the expert.

“But isn't this just supervised learning?”

Problem:

  • The experts only sample limited observations (e.g. when learning to drive, the expert never demonstrates crashing the car)
  • The agent will copy every behavior, even irrelevant actions.

Inverse Reinforcement Learning

Now there is no reward; we only have the environment and demonstrations of the expert.

The idea is to infer the reward function from the expert's demonstrations in the environment. (ohhhh)

Then use this inferred reward function to find the optimal actor.

  • Principle: the teacher is always the best (but this does not mean every behavior of the teacher must be imitated)
  • Basic idea:
    • Initialize an actor
    • In each iteration:
      • The actor interacts with the environment to obtain some trajectories
      • Define a reward function, which makes the trajectory of the teacher better than the actor
      • The actor learns to maximize the reward based on the new reward function (By RL)
    • Output the reward function and the actor learned from the reward function.

GAN and IRL are very similar: the actor plays the role of the generator, and the reward function plays the role of the discriminator.

Robot

Network Compression

Network Pruning

Networks are typically over-parameterized (there are significantly redundant weights or neurons).

  • Importance of a weight
  • Importance of a neuron

After pruning, the accuracy will drop.

But we can then fine-tune on the training data to recover it.

Don’t prune too much at once, or the network will not recover.

Practical Issue:

  • Weight pruning: The network architecture becomes irregular (Hard to implement, hard to speedup)
  • Neuron pruning: The network architecture is regular (Easy to implement, easy to speedup)

Why prune a large network instead of training a small one directly? It is widely known that a smaller network is more difficult to train successfully from scratch.

Knowledge Distillation

Temperature for Softmax:

$$p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$$
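
A small numpy sketch of the softened softmax and a simple distillation loss between made-up "teacher" and "student" logits:

```python
import numpy as np

def softmax_T(z, T=1.0):
    z = z / T
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([5.0, 2.0, 0.5])
student_logits = np.array([3.0, 2.5, 1.0])
T = 4.0                                # higher T gives a softer distribution

p_teacher = softmax_T(teacher_logits, T)
p_student = softmax_T(student_logits, T)
distill_loss = -np.sum(p_teacher * np.log(p_student))   # cross-entropy on soft targets
print(p_teacher, distill_loss)
```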

Parameter Quantization

  1. Using less bits to represent a value
  2. Weight clustering: group similar weights into the same cluster and represent them with a single value.
  3. Huffman Coding.

Binary Weights: weights restricted to $\pm 1$ (this puts a strong constraint on the model, so it is less prone to overfitting).

Architecture Design

  • Depthwise Separable Convolution (principle: low-rank approximation); see the sketch after this list.
    1. Depthwise Convolution: Filter number = Input channel number; each filter only considers one channel.
    2. Pointwise Convolution: $1\times 1$ filters (model the relationship between different channels).
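
A PyTorch sketch of a depthwise separable convolution block, with a parameter count comparison against a standard convolution (the channel counts and kernel size are arbitrary examples):

```python
import torch
import torch.nn as nn

in_ch, out_ch = 16, 32
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)  # one filter per channel
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)                          # 1x1, mixes channels

x = torch.randn(1, in_ch, 28, 28)
y = pointwise(depthwise(x))
print(y.shape)  # torch.Size([1, 32, 28, 28])

# parameter comparison with a standard 3x3 convolution
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(depthwise) + n_params(pointwise), "vs", n_params(standard))
```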

Dynamic Computation

The network adjusts the computation it needs.

  • Dynamic Depth: attach an extra layer (a different one per depth) to the output of each layer, and train all of them to be close to the ground truth.
  • Dynamic Width.
  • Computation based on Sample Difficulty: dynamically adjust the amount of computation according to how hard the sample is.

Life-Long Learning