# Semi-Supervised Learning

はじめまして、データ分析部のシャオです。今回は半教師あり学習(Semi-supervised Learning)を英語で紹介します。

This blog introduces representative methods in section 2, including Generative Model, Support Vector Machines, Graph-Based Model and co-training. In section 3, the demonstration of text classification by R is presented. The co-training exmaple is in section 3.2.1 and generative model is in section 3.2.2.

# 1 Introduction

Semi-supervised learning (SSL) is a technique that uses both labeled and unlabeled data. It is between supervised learning (using labeled data) and unsupervised learning (using unlabeled data only). Usually, there are a great amount of unlabeled data which are easy and cheap to acquire but only few labeled data. For instance, we can easily extract website text data by web crawlers. However, labelling an extremely small amount of these data is not only expensive but also time-consuming.

# 2 Theory and Methods

How can unlabeled data help to learn a better classifier? The ideal situation is that we find a classifier which can correctly predict labeled data and make unlabeled data fit the model well too. In Balcan and Blum paper (2010), they use the notation of compatibility between data distribution and the target function. For example, now we have 10 labeled data and linear classifier as Figure 1. The black separator looks good, since it perfectly classifies the label data. However, after seeing unlabeled data, we could find the possible per class data distribution, P(x|c). Then, if the distribution is true, it is obvious the blue one is better than the black one. In this situation, the target function is compatible with data distribution if it does not slice through any high-density zone of any class.

Figure 1

The following part focuses on binary classifier. Here are symbols used:
l: labeled data.
u: unlabeled data.
f: binary classifier.
y: true label.
x: feature.

### 2.1 Generative Model

The generative model assumes the joint probability is $p(x,y|\theta)=p(y|\theta)p(x|y,\theta)$. $p(y|\theta)$ is the class prior distribution. $p(x|y,\theta)$ is the class conditional distribution. By Bayes rule, we can find the classifier

$f_\theta (x) \equiv \text {argmax}_y p( y | x, \theta) \equiv \text {argmax}_y \frac {p(x,y|\theta)} {\sum_y' p(x, y'| \theta)}$

For SSL, besides how good the classifier (target function) is, a good classifier should be compatible with data distribution. That is, the likelihood,$p(x_i,y | \theta)$, should be high. For unlabeled data, the likelihood is $\sum_{y \in \mathcal{Y}} p(x_i,y | \theta)$ because the exact label y is unknown. Consequently, the $\hat \theta$ is the one maximizing log likelihood:

$\text {argmax} _{\theta} \ \text{logp} ($ { $x_i, y_i$ } $_l |\theta) + \lambda \text{logp} ($ { $x_i$ } $_u |\theta)$

where $\lambda$ is a balancing weight between [0,1]. The first log term is log likelihood of all labeled data, and the second log term is the log likelihood of all unlabeled data. If $\lambda=0$, it retreats to usual supervised learning (SL). With $\hat \theta$ in hand, we can obtain classifier $f_{\hat \theta}(x)$

### 2.2 Support Vector Machines

A good SVM has the decision boundary with large margin, and few data falls into the margin. Margin is the region bounded by two hyperplanes. If an instance falls into the margin, classifier is not confident which group it should belong to. Therefore, a good classifier should prevent data from falling into the margin.

For labeled instances, SVM will punish the instance that falls onto the wrongside. The farther the wrong prediction from the boundary, the more punishment it gets. For unlabeled instances, semi-supervised SVM (or S3VM) punishes data which falls into margin. When too many data fall into the margin, it means this classifier thinks these data could be either way. This is the last thing we want because it simply loses its function of classification. To punish this situation, hat loss is proposed. Hat loss is $\text{max} (1-|f(x)|,0)$. When data is outside the boundary, $1-|f(x)| < 0$, no hat loss at all. The hat loss will be largest when the instance stands in the middle of the margin. Hence, to find the good classifier f is to minimize:

$\sum_{i \in l}$ max $(0, 1-y_if(x_i)) + \lambda_1$ $||f||^2$ $+ \lambda_2 \sum_{i \in u}$ max $(0, 1-|f(x_i)|)$

The first and second term are the supervised SVM’s hinge loss and regularization term of labeled data. The third term is the hat loss of unlabeled data.

However, hat loss is non-convex function so it is hard to find a solution. Therefore, there are different approaches to find the optimal solution. The following introduces some of them.

##### 2.2.1 SVM light

SVM light first trains supervised SVM by labeled data, then uses it to classify some
proportion of unlabeled data. Then, start with a small $\lambda_2$ and train a classifier f to minimize:

$\sum_{i \in l}$ max $(0, 1-y_if(x_i)) + \lambda_1$ $||f||^2$ $+ \lambda_2 \sum_{i \in u}$ max $(0, 1-y_if(x_i))$

After obtaining a new classifier, try to switch unlabeled data. The switch is made if two are labeled differently by the new classifier and if hinge loss will decrease after switching. The figure 2 shows some switchable cases. The red line is the hinge loss. After switching, there is no hinge loss in case 1. In case 2 and 3, hinge loss decreases.

Figure 2

After no labels are switchable anymore, SVM light is going to train a new classifier. In this step, it uses the higher $\lambda_2$ than the previous one, together with the updated labels from the switching step. Then, train a new f and switch labels again until $\lambda_2 =1$ or user-defined control level.

The higher $\lambda_2$ is, the higher it thinks of unlabeled data. It plays the same role as $\lambda$ in generative model in 2.1 section.

##### 2.2.2 Gradient descent based S3VM

∇S3VM smooths the hat loss with a similar-looking Gaussian curve. By doing so, it can approximate the objective function which now is differentiable.

##### 2.2.3 Concave-convex procedure (CCCP)

Concave-convex procedure (CCCP) also deals with hat loss function. Hat loss function is the sum of convex and concave term. That is, $\text{max}(1-|f|,0) = \text{max} (|f|-1)+1-|f(x)|$. CCCP uses the sequence of z and linearized the concave part $R_{cave} (w)=f(z)+\nabla f(z)'(w-z)$. Hence, CCCP can find the sequence of local optimal of convex function and just pick the best one.

#### 2.3 Graph-Based Models

The graph $G={V,E}$. V is vertices, all labeled and unlabeled data here. E is undirected edges connecting instances $x_i,x_j$. $w_{ij}$ is the weight reflecting the proximity of $x_i,x_j$. The logic is simple: if two instances are alike, a good classifier should classify them into the same class. Then f is:

$\text {argmin}_f {\sum} _{i,j \in { l,u } } w _{ij} (f(x_i) - f(x_j))^2$
##### 2.3.1 Mincut

One algorithm is mincut algorithm. Mincut finds a cut which costs (weight) the least and to separate the graph into two sides. The intuition is that if $x_i,x_j$ are so unlike, it does not hurt if f classifies them into two groups. The mincut algorithm tries iteratively ($\infty$ ) to solve:

min $\ \infty \sum_{i \in l} (y_i-Y_i)^2+\sum_{i,j \in { l,u } } w_{ij}(y_i-y_j)^2$

The frist term is to minimize loss of labeled data, and the second term is to find the mincut.

##### 2.3.2 Manifold regularization

Another example is manifold regularization algorithm. It relaxes labels to be continuous. The classification is based on the sign of $f(x)$. This algorithm tries different weight ($\lambda_1, \lambda_2$), and kernel h to parameterize x into the same domain as f, where $f(x)=h(x)+b$. With kernel h, it could construct similarity graph, and obtain weights among instances. The classifier f is:

$\text { argmin } _f {\sum} _{i \in l} \text { max } (0, 1 - y_if(x_i) ) + \lambda_1$ $\Vert h \Vert^2$ $+ \lambda_2 {\sum} _{i,j \in { l,u } } w _{ij}(f(x_i) - f(x_j))^2$

The first and second term is hinge loss and regularization term. The third term is as same as above, the punishment term when f classifies two similar samples into different groups.

### 2.4 Co-training

Sometimes, data have multiple views. For web pages, we can have image features and text features. The co-training trains each feature by labeled data alone first. Then, it uses feature classifiers to label unlabeled data. It selects the k-most-confident data of one classifier as the training data for other feature classifiers next time. That is, it uses the unlabeled data which are not vulnerable to misclassification, just as labeled data. For instance, if some unlabeled documents can easily be classified by its text feature. We can use their image feature and predicted labels based on text feature to train image feature classifier.

When can co-training be used? First, it should be able to train a good classifier when using one view alone. If it cannot achieve this sufficient assumption, the k-most-confident data are useless since the classifier itself is not trustable. Another assumption is independent assumption. The feature should be conditionally independent given the label. Under this assumption, the unlabeled instances with its predicted labels based on one feature classifier will distribute randomly to other classifiers, so it is as informative as random samples.

If co-training works well, no matter which feature classifier is used, the predicted label should be the same. Thence, co-training includes multiple view classifiers $f^{(1)}, f^{(2)}, ..., f^{(m)}$ are functions to minimize:

$\sum_{j=1}^m (\frac{1}{l} \sum_{i=1}^l c^{(j)}(f^{(j)}(x_i), y_i) + \lambda_1 \Omega^{(j)}(f^{(j)})) \ + \lambda_2 \sum_{i \in u} \sum_{j,k=1}^m c(f^{(j)}(x_i), f^{(k)}(x_i))$

Here, c is loss function and Ω is regularization term. The first summation is the same as supervised learning of m view classifiers. The second summation is to punish f when multi classifiers cannot lead to the same prediction for the same instance.

Nevertheless, most datasets are one-view but with several features. Then making a fake feature split is an option. If after splitting, the sufficient and independent assumption can be satisfied, co-training will work well, too. Here are four splitting methods introduced in Du et.all (2011):

##### 2.4.1 Random split

As its name, it just splits the feature randomly and verify assumptions empirically.

##### 2.4.2 Entropy split

It calculates entropy of each features first. The larger the entropy is, the higher predictive power it has. If we want to split into two views, then we simply assign the odd-number (first, third, etc.) highest entropy features to view 1, and even-number (second, fourth, etc.) highest features to view 2. By doing so, sufficient assumption is satisfied since both views have some high-entropy features, but it does not ensure independent assumption.

##### 2.4.3 Entropy-Start Hill Climbing

To solve the previous problem, entropy-start hilling climbing is proposed. It switches each feature to other view once, so generates a group of new splits. Then, we can verify two assumptions empirically using these splits and pick the most qualified one.

##### 2.4.4 Random-Restart Hill Climbing

Since entropy-start hill climbing only starts from the entropy-based split, it might be stuck in local optima. Random-restart hill climbing starts from n random splits and searches the splits which are best compatible with co-training assumptions empirically.

# 3 R example

The dataset used here is a tiny version of 20 Newsgroups (available at http://www.cs.nyu.edu/~roweis/data.html). It contains 16,242 articles and binary occurrence data for 100 words. The label is the newsgroup to which the article was posted. These 20 newsgroups can be classified into 4 subjects, computer, recreation, science, and talk (http://qwone.com/~jason/20Newsgroups/ for details).

We are going to demonstrate two simple codes of co-training with fake split and generative model, all by Expectation-Maximization (EM) algorithm and under naïve Bayes assumption.

### 3.1 Theory behind codes

The example here basically follows Nigam et al ‘s paper (2000), so please refer it for details.

Naïve Bayes classifier assumes

$p(x|y=c, \theta)=\prod_{j=1}^D p(x_j|y=c, \theta_{jc})$

$\theta$ is parameter and will be revisited later. It is naïve because it assumes features to be mutually independent.

The document consists of a sequence of words. The generative process of a document $d_i$, which belongs to class $c_j$, are as follows. First, decide the length of document, $|d_i|$, based on $p(|d_i|)$. Then, generate each word according to class-specific word probability, p({ $w_{d_i,1}, w_{d_i,2}, ..., w_{d_i,|d_i|}$ } $|c_j; \theta)$. Here, naive Bayes are applied to assume each word is generated independently, though it is unrealistic. Thus, class- specific word probability can be written as product of $p(w_{d_i, k}|c_j; \theta)$. In summary,

$p(d_i|c_j; \theta)=p(|d_i|) \prod_{k=1}^{|d_i|}p(w_{d_i, k}|c_j; \theta)$

With a class-specific document distribution in hand, the document $d_i$ can be created by selecting a class according to prior class probability $p(c_j| \theta)$. Then, the likelihood of document is

$p(d_i|\theta)=\sum_{j \in C} p(c_j| \theta)p(d_i|c_j; \theta)$

Hence, the document distribution is multinomial, with mixture weight $p(c_j| \theta)$. The common conjugate prior for multinomial is Dirichelt distribution. $P(\theta)=Dir(\theta|\alpha)$. We do not discuss further here, but just remember $\alpha$ is a constant greater than 1. In the following code, $\alpha$ is set 2, resulting in Laplace smoothing.

Turn to EM algorithm. In E-step, it labels unlabeled data by using current prior class and class-specific word probability. In M-step, it re-estimates these probabilities by using the updated labels in E-step. Other parameter used in EM algorithm is $\lambda$, which is mentioned above. By setting $\lambda > 0$, EM uses unlabeled data in training classifier.

### 3.2 R code

First, the function used are as follows:

# prior_class : p(class | theta)
# word_p: p(w|class, theta)
estimate_theta <- function(l.x, l.y, u.x, u.y, nV, lambda) {

if (!is.null(u.y)) u.x <- matrix(u.x, nrow=length(u.y))
l.x <- matrix(l.x, nrow=length(l.y))
prior_class <- sapply(class, function(c) {
nomi <- 1 + lambda * length(which(u.y==c)) + length(which(l.y==c))
denomi <- nC + lambda * length(u.y) + length(l.y)
return (nomi/denomi)
})

word_p <- lapply(class, function(c) {
if (length(which(u.y==c)) == 0 ) { u_xc <- t(matrix(rep(0, nV), nc=1)) } else {
u_xc <- matrix(u.x[which(u.y==c),],nc=nV) }
if (length(which(l.y==c)) == 0 ) { l_xc <- t(matrix(rep(0, nV), nc=1)) } else {
l_xc <- matrix(l.x[which(l.y==c),],nc=nV) }

nomi <- 1 + lambda * apply(u_xc, 2, sum) + apply(l_xc, 2, sum)
return (nomi/sum(nomi))
})

word_n_p <- lapply(class, function(c) {
if (length(which(u.y!=c)) == 0 ) { u_xc <- t(matrix(rep(0, nV), nc=1)) } else {
u_xc <- matrix(u.x[which(u.y!=c),],nc=nV) }
if (length(which(l.y!=c)) == 0 ) { l_xc <- t(matrix(rep(0, nV), nc=1)) } else {
l_xc <- matrix(l.x[which(l.y!=c),],nc=nV) }
nomi <-  1 + lambda * apply(u_xc, 2, sum) + apply(l_xc, 2, sum)
return (nomi/sum(nomi))
})

return(list(prior_class=prior_class, word_p=word_p, word_n_p=word_n_p))
}

# classifier #
classifier <- function(model, x) {
prior_class <- model$prior_class word_p <- model$word_p
x <- matrix(x, nr=ifelse(is.null(nrow(x)), 1, nrow(x)))
est_class_p <- sapply(1:nrow(x), function(d) {
nomi <- sapply(class, function(c) prior_class * prod(word_p[c][which(x[d,]==1)]) )
return(nomi/sum(nomi))
}) %>% matrix(., nr=nC) %>% t()

label <-  apply(est_class_p , 1, function(a) order(a)[max(class)])
return(list(label=label, p=est_class_p ))
}

log_theta <- function(prior_class, word_p) {sum (log(prior_class))+ sum(log(unlist(word_p)) )}
log_D <- function(prior_class, word_p, x, z) {
x <- matrix(x, nr=ifelse(is.null(nrow(x)), 1, nrow(x)))
lik_D <- 0
for (c in class) {
lik_D <- lik_D + z[,c] %*% (log(prior_class) +  x %*% log(word_p[c]) + (1-x) %*% log(1-word_p[c] ) )
}
return(lik_D)
}

EM <- function(init, epsilon, lambda, l.x, l.y, u.x, nV) {
model <- init
lik <- -10^20
prob.ly <- matrix(sapply(class, function(c) ifelse(l.y==c,1,0)), nr=length(l.y))
iter <- 0
repeat {
iter <- iter + 1
# E step (label u.x by current prior_class and word_p)
label_u <- classifier(model, u.x)

# M step: re-estimate prior_class, word_p
new_model <- estimate_theta(l.x, l.y, u.x, label_u$label, nV, lambda) new_lik <- log_theta(new_model$prior_class, new_model$word_p) + log_D(new_model$prior_class, new_model$word_p, l.x, prob.ly) + lambda * log_D(new_model$prior_class, new_model$word_p, u.x, label_u$p)

print(paste('iteration=', iter, '; loglikelihood=', new_lik))
if ( (new_lik - lik) > epsilon) {
model <- new_model
lik <- new_lik
} else { break }
}# repeat

return(model)
}

result <- function(model, x, y, top) {
label <- classifier(model, x)$label tb <- table(label, y) accuracy <- sum(diag(tb))/sum(tb) word_p <- model$word_p
word_n_p <- model$word_n_p ratio <- sapply(class, function(c) word_p[c] * log(word_p[c]/word_n_p[c])) top_word <- sapply(class, function(c) word_list[order(-ratio[,c])]) [1:top, ] ratio <- sapply(class, function(c) ratio[order(-ratio[,c]), c]) [1:top, ] list(tb=tb, accuracy=accuracy, top_word=top_word, log_ratio=ratio) } Load packages we need. Then randomly use 16 labeled instances (4 from each class) and 15,000 unlabeled instances. library(dplyr) library(R.matlab) library(ggplot2) library(entropy) library(wordcloud) # ----------------- load data ---------------------------# # available: http://www.cs.nyu.edu/~roweis/data.html data<-readMat("20news_w100.mat") data$document <- as(data$document, 'Matrix') %>% t() word_list <- data$wordlist %>% unlist()
x <- data$document y <- data$newsgroups

# ----------------- set parameter ---------------------#
nC <- data$groupnames %>% unlist() %>% length() nV <- data$wordlist %>% unlist() %>% length()
class <- c(1:4)
l.nD <- 4
u.nD <- 15000
n.split <- 2
lambda <- 1

# ---------------------- sample data -------------------#
set.seed(25)
l.select <- sapply(class, function(c) sample(which(y==c), l.nD)) %>% as.numeric()
l.y <- data$newsgroups[l.select] l.x <- data$document[ l.select, ]
set.seed(25); u.select <- sample((1:nrow(data$document))[-l.select], u.nD) u.x <- data$document[u.select,]
u.y <- data$newsgroups[u.select] ##### 3.2.1 Co-training with fake split Since this dataset is one-view dataset, we make entropy split here into two views. Then obtain the initial view 1 and view 2 classifier, SSL_sp. #---------------------------co training --------------------------------------------# #--------------- fake split (based on entropy) ------------------# order <- apply(x, 2, function(xx) entropy(xx,method='Laplace')) %>% order(-.) split <- lapply(1:n.split, function(i) order[seq(i,100,n.split)]) l.x.sp <- lapply(split, function(s) l.x[,s]) u.x.sp <- lapply(split, function(s) u.x[,s]) SSL_sp <- lapply(1:n.split, function(s) { init1 <- estimate_theta(l.x.sp[[s]], l.y, u.x=NULL, u.y=NULL, length(split[[s]]), lambda) EM(init1, epsilon=0, lambda, l.x.sp[[s]], l.y, l.x.sp[[s]], length(split[[s]])) }) Co-training only uses k-confident unlabel instances to train other feature classifiers. Here, set k=250. #-------------- co-training -----------------------# co_training <- function (l.x.sp, l.y, u.x.sp, split, SSL_sp, k) { iter <- 0 loss <- 10^5 pred_u <- lapply(1:n.split, function(s) classifier(SSL_sp[[s]], u.x.sp[[s]]) ) repeat { iter <- iter + 1 # k-confident-ux k_pred_u <- lapply(1:n.split, function(s) apply(pred_u[[s]]$p, 2, function(c) order(-c))[1:k,] %>% as.numeric()
)

new_SSL_sp <- lapply(1:n.split, function(s) {
new.l <- unlist(k_pred_u)[-seq((s-1)*k +1, s*k)]
pred <- lapply(pred_u, function(u) u$label) %>% unlist() EM(SSL_sp[[s]], epsilon=0, lambda, l.x=rbind(l.x.sp[[s]],u.x.sp[[s]][new.l,]), l.y=c(l.y,pred[new.l]), u.x=rbind(l.x.sp[[s]],u.x.sp[[s]][new.l,]), length(split[[s]]) ) }) pred_u <- lapply(1:n.split, function(s) classifier(new_SSL_sp[[s]], u.x.sp[[s]]) ) pred_l <- lapply(1:n.split, function(s) classifier(new_SSL_sp[[s]], l.x.sp[[s]]) ) mis_pred <- lapply(pred_l, function(pred) ifelse(pred$label==l.y, 0, 1))  %>% unlist() %>% sum()
pred_u_label <- lapply(pred_u, function(c) c$label) pair <- combn(1:n.split, 2) nonagree <- apply(pair, 2, function(r) ifelse(pred_u_label[[r[1]]] == pred_u_label[[r[2]]], 0, 1) ) %>% as.numeric() %>% sum() new_loss <- mis_pred+nonagree print(new_loss) if ( new_loss < loss ) { SSL_sp <- new_SSL_sp loss <- new_loss } else {break} } # repeat return(SSL_sp) } co_SSL<- co_training(l.x.sp, l.y, u.x.sp, split, SSL_sp,k=250) To evaluate the final model, we predict all data by classifier 1 and 2 to obtain its probabilistic labels. Then sum them up, and the label gets the majority probability wins. class_p <- lapply(1:n.split, function(s) classifier(co_SSL[[s]], x[,split[[s]]])$p)
class_p <- sapply(class, function(c)
lapply(class_p, function(p) p[,c]) %>% do.call(cbind, .) %>% apply(., 1, sum)
)
label <- apply(class_p, 1, function(pp) order(-pp)[1])

The accuracy is 0.7067.

##### 3.2.2 Generative model

Run SL model and SSL model (lambda=1). Lambda is adjustable but set one here.

#---------------------------weighted EM -------------------------#
init <- estimate_theta(l.x, l.y, u.x=NULL, u.y=NULL, nV, lambda=1)
SL_EM <- EM(init, 0, lambda, l.x, l.y, l.x, nV)
SSL_EM <- EM(init, 0, lambda, l.x, l.y, u.x, nV)

SL_result <- result(SL_EM, x, y, nV)
SSL_result <- result(SSL_EM, x, y, nV)
c(SL_result$accuracy, SSL_result$accuracy)

SSL model improves accuracy from 0.5195 to 0.7867. Then, we can check the words which have higher predictive power. The predictive power here is weighted log likelihood ratio.

$p(w_{d_i, k}|c_j; \theta) log (\frac{p(w_{d_i, k}|c_j; \theta)}{p(w_{d_i, k}|\neg c_j; \theta)})$

The first term tells the word occurs frequently has higher power. The second term indicates the word occurs frequently in one class but seldom in other classes has higher power. Figure 3 shows SSL can extract important keywords out.

# ---------------top predictive words----------------#
pal2 <- brewer.pal(8,"Dark2")
word_plot <- function(model, class) {
set.seed(343)
wordcloud(model$top_word[,class],model$log_ratio[,class], scale=c(3,1),max.words=100, random.order=FALSE, rot.per=.15, colors=pal2)
}
word_plot(SL_result, 1)

Figure 3

We do not post further code here, but you can try different lambda, and numbers of labeled and unlabeled instances.
When having enough labeled instances, SL model performs almost the same as SSL model. Increasing more unlabeled data cannot improve accuracy significantly anymore.

Figure 4

The different lambdas have the same implication. When having few labeled data, SSL perfroms better, making the full use of unlabeled data ($\lambda=1$). On the contrary, with enough labeled data, SSL does not need to set high $\lambda$.

Figure 5

# Reference

• Balcan, M. F., & Blum, A. (2010). A discriminative model for semi-supervised learning. Journal of the ACM (JACM), 57(3), 19.
• Du, J., Ling, C. X., & Zhou, Z. H. (2011). When does cotraining work in real data?. IEEE Transactions on Knowledge and Data Engineering, 23(5), 788-799.
• Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2-3), 103-134.
• Zhu, X. (2011). Semi-supervised learning. In Encyclopedia of Machine Learning (pp. 892-897). Springer US.

## Hsiao

Taiwanese. Obtained the master's degree in economics in Japan. Use tools like R and SAS.