Introduction

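Semi-supervised learning means that, in addition to data in which every input is paired with a label, there is a second set of data that has inputs but no labels.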

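Although unlabeled data has no corresponding output, its distribution still carries information that can influence our decisions. For this reason, semi-supervised learning usually relies on some assumption about the data distribution.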

There are two settings:

transductive learning (the unlabeled data is the testing data)
inductive learning (the unlabeled data is not the testing data)

Semi-supervised Generative Model

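Supervised generative models are covered in detail in 「概率分类与Logistic回归」 (Probabilistic Classification and Logistic Regression). The first part is the prior: assuming that each class follows a Gaussian distribution, the mean and covariance of each class are estimated from the training set. The second part is the posterior: given a new input, the posterior probability is used to classify it, which determines where the decision boundary lies.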

In a semi-supervised generative model, the unlabeled data also affects the parameter estimates, that is, the mean and covariance of each class, and therefore the decision boundary.

Procedure

Initialize estimators

Given initial estimators $\theta = \left\{ P(C_1), P(C_2), \mu^1, \mu^2, \Sigma \right\}$

These initial parameters are estimated from the labeled data alone.

E-Step/Compute the posterior probability of unlabeled data

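The value of this probability depends on the model $\theta$: it is the posterior probability $P(C_1|x^u)$ computed with the parameters from step 1 (initialization).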

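This step resembles testing the model: give it an input and look at its output. The difference from testing data is that the output is not used to evaluate how well the model has learned; instead, it serves as data for updating the model's parameters.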

M-Step/Update model

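Update the model parameters using the probabilities, computed in step 2 (the E-step), that each unlabeled data point belongs to Class 1 under the current model.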

How to update $P(C_1)$

$P(C_1) = \frac{N_1+ \sum_{x^u}{P(C_1|x^u)}}{N}$

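In the numerator, $N_1$ is the number of labeled examples that belong to Class 1.

Without unlabeled data, $P(C_1)$ is simply $N_1$ divided by the number of labeled examples.

With unlabeled data, the sum of the posterior probabilities $P(C_1|x^u)$ over all unlabeled examples is added to the numerator, and the denominator $N$ counts all examples, labeled and unlabeled, so that $P(C_1)$ and $P(C_2)$ still sum to 1.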

How to update $\mu^1$

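Without unlabeled data, the mean of Class 1 is just the average of all inputs that belong to Class 1. With unlabeled data, each unlabeled point is multiplied by its posterior probability of belonging to Class 1 and added to the sum, and the result is divided by the total weight, that is, $N_1$ plus the sum of those probabilities.

$\mu^1 = \frac{\sum_{x^r\in C_1}x^r + \sum_{x^u}{P(C_1|x^u)}x^u}{N_1 + \sum_{x^u}P(C_1|x^u)}$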

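EM algorithm: after all parameters have been updated, go back to step 2 and keep alternating between steps 2 and 3 until the parameters stop changing. The procedure converges, although the result depends on the initial values.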
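The three steps can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions: two classes with a shared covariance $\Sigma$, labels encoded as 1 for Class 1 and 0 for Class 2, a fixed number of iterations instead of a convergence check, and synthetic data. The names `gaussian_pdf` and `semi_supervised_em` are made up for this example.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(expo) / norm

def semi_supervised_em(x_lab, y_lab, x_unl, n_iter=50):
    """y_lab holds 1 for Class 1 and 0 for Class 2 (an assumed encoding)."""
    x1, x2 = x_lab[y_lab == 1], x_lab[y_lab == 0]
    n_total = len(x_lab) + len(x_unl)
    # Step 1: initialize theta from the labeled data only.
    p1 = len(x1) / len(x_lab)
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    centered = np.vstack([x1 - mu1, x2 - mu2])
    cov = centered.T @ centered / len(x_lab)
    for _ in range(n_iter):
        # Step 2 (E-step): posterior P(C_1 | x^u) under the current theta.
        a = p1 * gaussian_pdf(x_unl, mu1, cov)
        b = (1 - p1) * gaussian_pdf(x_unl, mu2, cov)
        post1 = a / (a + b)
        post2 = 1 - post1
        # Step 3 (M-step): labeled points count fully, unlabeled ones by posterior.
        w1 = len(x1) + post1.sum()
        w2 = len(x2) + post2.sum()
        p1 = w1 / n_total
        mu1 = (x1.sum(axis=0) + post1 @ x_unl) / w1
        mu2 = (x2.sum(axis=0) + post2 @ x_unl) / w2
        d1u, d2u = x_unl - mu1, x_unl - mu2
        scatter = ((x1 - mu1).T @ (x1 - mu1) + (x2 - mu2).T @ (x2 - mu2)
                   + (d1u * post1[:, None]).T @ d1u
                   + (d2u * post2[:, None]).T @ d2u)
        cov = scatter / n_total
    return p1, mu1, mu2, cov

# Synthetic data: two well-separated clusters, only 5 labeled points per class.
rng = np.random.default_rng(0)
c1 = rng.normal([-2.0, 0.0], 0.5, (55, 2))
c2 = rng.normal([2.0, 0.0], 0.5, (55, 2))
x_lab = np.vstack([c1[:5], c2[:5]])
y_lab = np.array([1] * 5 + [0] * 5)
x_unl = np.vstack([c1[5:], c2[5:]])

p1, mu1, mu2, cov = semi_supervised_em(x_lab, y_lab, x_unl)
print("P(C_1):", p1, "mu^1:", mu1, "mu^2:", mu2)
```

A real implementation would normally stop when the parameters or the likelihood change by less than some tolerance, and would work in log space to avoid numerical underflow.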

Assumption 1: Low-density Separation

The world is "black or white": near the boundary between classes the data density is low, so a very clear boundary can be drawn.

Self-training

Self-training is the simplest way to implement low-density separation.

How it works

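Start with an unlabeled data set and a labeled data set. Train a model on the labeled data set, then apply it to the unlabeled data set to obtain predicted pseudo-labels. Next, remove part of the data from the unlabeled data set (how to choose the data?) and add it, together with its pseudo-labels, to the labeled data set, then update the model.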

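Take care in tuning the probability threshold, or sample multiple times, so that only reasonably trustworthy data is added. For example, with pos_threshold = 0.8, only data with prediction > 0.8 is labeled 1.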
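A minimal sketch of this loop is shown below. It assumes a plain logistic regression written in NumPy as the model, and one possible answer to the "how to choose data" question: move a point to the labeled set when its prediction is above `pos_threshold` (pseudo-label 1) or below `1 - pos_threshold` (pseudo-label 0). The symmetric rule, the helper names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_logistic(x, y, lr=0.1, n_steps=500):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def predict_proba(x, w, b):
    """Predicted probability of class 1 for each row of x."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def self_training(x_lab, y_lab, x_unl, pos_threshold=0.8, max_rounds=10):
    x_lab, y_lab, x_unl = x_lab.copy(), y_lab.astype(float), x_unl.copy()
    w, b = fit_logistic(x_lab, y_lab)
    for _ in range(max_rounds):
        if len(x_unl) == 0:
            break
        proba = predict_proba(x_unl, w, b)
        # Assumed selection rule: keep only confident predictions.
        confident = (proba > pos_threshold) | (proba < 1 - pos_threshold)
        if not confident.any():
            break
        pseudo = (proba[confident] > 0.5).astype(float)  # hard pseudo-labels
        x_lab = np.vstack([x_lab, x_unl[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        x_unl = x_unl[~confident]                        # remove from unlabeled set
        w, b = fit_logistic(x_lab, y_lab)                # update the model
    return w, b, len(x_unl)

# Synthetic data: two clusters, 5 labeled points per class, 50 unlabeled per class.
rng = np.random.default_rng(0)
neg = rng.normal([-2.0, 0.0], 0.5, (55, 2))
pos = rng.normal([2.0, 0.0], 0.5, (55, 2))
x_lab = np.vstack([neg[:5], pos[:5]])
y_lab = np.array([0] * 5 + [1] * 5)
x_unl = np.vstack([neg[5:], pos[5:]])
y_true = np.array([0] * 50 + [1] * 50)  # held back, used only to check the result

w, b, n_left = self_training(x_lab, y_lab, x_unl)
accuracy = np.mean((predict_proba(x_unl, w, b) > 0.5) == y_true)
print("unlabeled points never pseudo-labeled:", n_left, "accuracy:", accuracy)
```

Raising `pos_threshold` adds fewer but more reliable pseudo-labels per round; lowering it adds more data at the risk of reinforcing early mistakes.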

Comparison with the Generative Model

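The procedure is clearly similar to that of the semi-supervised generative model. The main difference lies in how the outputs on the unlabeled data set are handled.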

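Self-training uses hard labels: each unlabeled data point is assigned to exactly one class. The generative model uses soft labels: each unlabeled data point belongs to Class 1 with some probability and to Class 2 with some probability.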

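Clearly, the soft labels of the generative model are at odds with the "black or white" Low-density Separation assumption behind self-training. For the same reason, soft labels do not work with a neural network: if the network's own output is used as the target for the same input, the loss is already at its minimum and the parameters do not change, so a hard label is needed.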
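The effect of soft versus hard labels on a neural network can be checked numerically. The sketch below assumes a softmax output layer trained with cross entropy, for which the gradient with respect to the logits is `softmax(logits) - target`; the logit values are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_cross_entropy_wrt_logits(logits, target):
    """Gradient of -sum(target * log(softmax(logits))) with respect to the logits."""
    return softmax(logits) - target

logits = np.array([0.8, -0.1])            # hypothetical network output before softmax
output = softmax(logits)

soft_target = output                       # soft label: the network's own output
hard_target = np.eye(2)[output.argmax()]   # hard label: one-hot of the likeliest class

grad_soft = grad_cross_entropy_wrt_logits(logits, soft_target)
grad_hard = grad_cross_entropy_wrt_logits(logits, hard_target)
print("gradient with soft label:", grad_soft)
print("gradient with hard label:", grad_hard)
```

With the soft label the gradient vanishes, so training on it changes nothing; with the hard label the gradient pushes the output further toward the predicted class.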

Entropy-based Regularization

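Since the unlabeled data should push the model toward "black or white" outputs, it is natural to turn to information entropy: the output distribution that the model produces for each unlabeled example should have as low an entropy as possible.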

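Entropy of the model's output on an unlabeled example, where $y_m^u$ is the $m$-th component of the output distribution and the sum runs over all of its components: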

$E(y^u) = -\sum_{m=1}^{N^u}y_m^u\ln{y_m^u}$

The model's loss function can then be redefined as

$L = \sum_{x^r}{C(y^r, \hat{y}^r)} + \lambda \sum_{x^u}{E(y^u)}$

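The first term uses cross entropy to measure the distance between the model's output and the true label on the labeled data; smaller is better. The second term uses entropy to measure how spread out the model's output is on the unlabeled data; again, smaller is better. The parameter $\lambda$ controls how much weight is given to the unlabeled data.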
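As a small numerical illustration of this loss, the sketch below evaluates it for hand-written output distributions rather than a trained network. The function names, the value `lam=0.5`, and the small `eps` added inside the logarithm for numerical safety are assumptions of this example; the entropy is summed over all unlabeled examples.

```python
import numpy as np

def cross_entropy(y_pred, y_true_onehot, eps=1e-12):
    """Cross entropy between predictions and one-hot labels, one value per row."""
    return -np.sum(y_true_onehot * np.log(y_pred + eps), axis=1)

def entropy(y_pred, eps=1e-12):
    """Entropy of each predicted distribution (each row sums to 1)."""
    return -np.sum(y_pred * np.log(y_pred + eps), axis=1)

def total_loss(y_lab_pred, y_lab_true, y_unl_pred, lam=0.5):
    """Labeled cross entropy plus lam times the entropy on unlabeled outputs."""
    return cross_entropy(y_lab_pred, y_lab_true).sum() + lam * entropy(y_unl_pred).sum()

y_lab_pred = np.array([[0.9, 0.1]])                 # output on one labeled example
y_lab_true = np.array([[1.0, 0.0]])
confident = np.array([[0.99, 0.01], [0.02, 0.98]])  # "black or white" unlabeled outputs
uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])      # maximally spread unlabeled outputs

loss_confident = total_loss(y_lab_pred, y_lab_true, confident)
loss_uncertain = total_loss(y_lab_pred, y_lab_true, uncertain)
print(loss_confident, loss_uncertain)
```

The labeled term is identical in both cases, so the difference between the two losses comes entirely from the entropy penalty on the unlabeled outputs.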

Semi-supervised SVM

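A semi-supervised Support Vector Machine enumerates all possible labelings of the unlabeled data, fits an SVM for each one, and keeps the labeling that gives the largest margin with the least error.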

Assumption 2: Smoothness Assumption

Similar $x$ have the same $\hat{y}$.

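Because the distribution of x is not uniform (very dense in some places and very sparse in others), if x1 and x2 are close to each other within a high density region, their output labels should be close as well. (Perhaps x1 and x2 can be viewed as linked through similar intermediate states, while dissimilar data can only be reached by a jump?)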

Cluster and then label

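Cluster and then label is the simplest way to implement the smoothness assumption, but it requires the clustering to be very good. In practice, one usually first finds a good representation of the image, extracts features, and then clusters on those features.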

Graph-based Approach

A region whose points are connected by a high density path is a high density region; building it amounts to drawing edges between nodes.

Edge construction and similarity

There are two ways:

K Nearest Neighbor
e-Neighborhood

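As in graph theory, edges can also carry weights. Each edge is given a weight proportional to the similarity between the two points it connects. A common choice for the similarity is the Gaussian Radial Basis Function: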

$s(x^i, x^j) = \exp(-\gamma\|x^i-x^j\|^2)$

Compute the squared distance between the two points, multiply it by a parameter $\gamma$, negate it, and take the exponential.
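Both edge-construction rules, combined with the RBF similarity as edge weight, can be sketched as below. The function names, the choice to symmetrize the K nearest neighbor graph (an edge is kept if either endpoint selects the other), and the toy one-dimensional data are assumptions of this example.

```python
import numpy as np

def pairwise_sq_dist(x):
    """Squared Euclidean distance between every pair of rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def knn_graph(x, k, gamma=1.0):
    """Weight matrix with RBF weights on K nearest neighbor edges."""
    sq = pairwise_sq_dist(x)
    sim = np.exp(-gamma * sq)
    np.fill_diagonal(sq, np.inf)                 # a point is not its own neighbor
    nearest = np.argsort(sq, axis=1)[:, :k]      # indices of the k closest points
    mask = np.zeros(sq.shape, dtype=bool)
    mask[np.repeat(np.arange(len(x)), k), nearest.ravel()] = True
    mask = mask | mask.T                         # assumed: keep edge if either side picks it
    return np.where(mask, sim, 0.0)

def epsilon_graph(x, eps, gamma=1.0):
    """Weight matrix with RBF weights on edges between points closer than eps."""
    sq = pairwise_sq_dist(x)
    mask = sq <= eps ** 2
    np.fill_diagonal(mask, False)                # no self-loops
    return np.where(mask, np.exp(-gamma * sq), 0.0)

# Toy data: two groups on a line, far apart from each other.
x = np.array([[0.0], [0.1], [0.3], [5.0], [5.2]])
w_knn = knn_graph(x, k=1)
w_eps = epsilon_graph(x, eps=0.25)
print(np.round(w_knn, 3))
print(np.round(w_eps, 3))
```

Because the similarity decays exponentially with squared distance, only points that are very close receive a weight near 1, which is what keeps the two groups from being connected through a few moderately distant points.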

Quantifying the smoothness of the labels on the graph

$S = \frac{1}{2}\sum_{i,j}{\omega_{i,j}(y^i-y^j)^2} = \boldsymbol{y}^TL\boldsymbol{y}$

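In the first expression, the factor $\frac{1}{2}$ is included only to make later differentiation more convenient; $\omega_{i,j}$ is the weight between two data points, that is, the edge weight; and $(y^i-y^j)^2$ is the squared difference between the labels of the two data points. Note that the sum runs over all data points in the graph, labeled and unlabeled. The smaller $S$ is, the smoother the labels.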

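In the second expression, $\boldsymbol{y}$ is an (R+U)-dimensional vector (R labeled plus U unlabeled data points), and $\boldsymbol{L}$ is an (R+U)x(R+U) matrix called the Graph Laplacian.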

$\boldsymbol{y} = [\cdots y^i \cdots y^j \cdots]^T$

$\boldsymbol{L} = \boldsymbol{D} - \boldsymbol{W}$

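Here $\boldsymbol{W}$ is the weighted adjacency matrix of the graph, and $\boldsymbol{D}$ is the diagonal matrix whose diagonal entries are the row sums of $\boldsymbol{W}$.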
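The equality between the pairwise form and the matrix form of $S$ can be checked on a small example. The 4-node chain graph, its weights, and the two label vectors below are made up for illustration.

```python
import numpy as np

def smoothness_pairwise(w, y):
    """S = 1/2 * sum over i, j of w_ij * (y_i - y_j)^2."""
    diff = y[:, None] - y[None, :]
    return 0.5 * np.sum(w * diff ** 2)

def graph_laplacian(w):
    """L = D - W, where D holds the row sums of W on its diagonal."""
    return np.diag(w.sum(axis=1)) - w

def smoothness_laplacian(w, y):
    """S = y^T L y."""
    return y @ graph_laplacian(w) @ y

# Hypothetical chain graph 1-2-3-4 with edge weights 2, 1, 3.
w = np.array([[0, 2, 0, 0],
              [2, 0, 1, 0],
              [0, 1, 0, 3],
              [0, 0, 3, 0]], dtype=float)
y_smooth = np.array([1.0, 1.0, 0.0, 0.0])  # labels change only across the weakest edge
y_rough = np.array([1.0, 0.0, 1.0, 0.0])   # labels change across every edge

s_smooth = smoothness_pairwise(w, y_smooth)
s_rough = smoothness_pairwise(w, y_rough)
print(s_smooth, smoothness_laplacian(w, y_smooth))
print(s_rough, smoothness_laplacian(w, y_rough))
```

Both forms give the same number, and the labeling that only changes across the weakest edge gets the smaller (smoother) value.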

From smoothness to the loss function

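The original loss function contains the cross entropy on the labeled data; to it we add the smoothness of the labels on the graph term, weighted by $\lambda$. This term acts as a regularizer: the smoothness depends on the neural network's outputs, and finding the neural network means finding a set of parameters, so the smoothness term can be minimized together with the cross entropy.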

The smoothness of the labels on the graph term does not have to be applied at the output layer; it can also be applied to some of the hidden layers.