Combining active learning suggestions

Introduction

Recent advances in sensors and scientific instruments have led to an increasing use of machine learning techniques to manage the data deluge. Supervised learning has become a widely used paradigm in many big data applications. This relies on building a training set of labeled examples, which is time-consuming as it requires manual annotation from human experts.

The most common approach to producing a training set is passive learning, where we randomly select an instance from a large pool of unlabeled data to annotate, and we continue doing this until the training set reaches a certain size or until the classifier makes sufficiently good predictions. Depending on how the underlying data is distributed, this process can be quite inefficient. Alternatively we can exploit the current set of labeled data to identify more informative unlabeled examples to annotate. For instance we can pick examples near the decision boundary of the classifier, where the class probability estimates are uncertain (i.e., we are still unsure which class the example belongs to).

Many active learning heuristics have been developed to reduce the labeling bottleneck without sacrificing the classifier performance. These heuristics actively choose the most informative examples to be labeled based on the predicted class probabilities. “Overview of Active Learning” describes two families of algorithms in detail: uncertainty sampling and version space reduction.

In this paper, we present a survey of how we can combine suggestions from various active learning heuristics. In supervised learning, combining predictors is a well-studied problem. Many techniques such as AdaBoost (Freund & Schapire, 1996) (which averages predictions from a set of models) and decision trees (Breiman et al., 1984) (which select one model for making predictions in each region of the input space) have been shown to perform better than just using a single model. Inspired by this success, we propose to combine active learning suggestions with bandit and rank aggregation methods in “Combining Suggestions.”

The use of bandit algorithms to combine active learners has been studied before (Baram, El-Yaniv & Luz, 2004; Hsu & Lin, 2015). Borda count, a simple rank aggregation method, has been used in the context of multi-task learning for linguistic annotations (Reichart et al., 2008), where we have one active learner selecting examples to improve the performance of multiple related tasks (e.g., part-of-speech tagging and named entity recognition). Borda count has also been used in multi-label learning (Reyes, Morell & Ventura, 2018) to combine uncertainty information from multiple labels. As far as we know, other aggregation methods have not been explored, and our work is the first time that social choice theory has been used to rank and aggregate suggestions from multiple active learners.

This paper makes the following two main contributions:

  1. We empirically compare four bandit and three rank aggregation algorithms in the context of combining active learning heuristics. We apply these algorithms to 11 benchmark datasets from the UCI Machine Learning Repository (Lichman, 2013) and a large dataset from the Sloan Digital Sky Survey (SDSS) (Alam et al., 2015). The experimental setup and discussion are described in “Experimental Protocol, Results, and Discussion.”

  2. We propose two metrics for evaluation: the mean posterior balanced accuracy (MPBA) and the strength of an algorithm. The MPBA extends the metric proposed in Brodersen et al. (2010) from the binary to the multi-class setting. This is an accuracy measure that takes class imbalance into account. The strength measure is a variation on the deficiency measure used in Baram, El-Yaniv & Luz (2004) which evaluates the performance of an active learner or combiner, relative to passive learning. The main difference between our measure and that of Baram, El-Yaniv & Luz (2004) is that ours assigns a higher number for better active learning methods and ensures that it is upper-bounded by 1 for easier comparison across datasets.

Overview of Active Learning

In this paper we consider the binary and multiclass classification settings where we would like to learn a classifier h, which is a function that maps a feature space $\mathcal{X} \subseteq \mathbb{R}^d$ to a probability distribution over a finite label space $\mathcal{Y}$: $h: \mathcal{X} \to p(\mathcal{Y})$.

In other words, we require that the classifier produces class probability estimates for each unlabeled example. For instance, in logistic regression with only two classes, i.e., $\mathcal{Y} = \{0, 1\}$, we can model the probability that an object with feature vector x belongs to the positive class with $h(x; \theta) = p(y = 1 \mid x; \theta) = \frac{1}{1 + e^{-\theta^T x}}$ where the optimal weight vector θ is learned during training. We can further consider kernel logistic regression, where $\mathcal{X}$ is replaced by the feature space corresponding to a given kernel, allowing for non-linear decision functions.

In active learning, we use the class probability estimates from a trained classifier to estimate a score of informativeness for each unlabeled example. In pool-based active learning, where we select an object from a pool of unlabeled examples at each time step, we require that some objects have already been labeled. In practice this normally means that we label a small random sample at the beginning. These become the labeled training set $T \subset \mathcal{X} \times \mathcal{Y}$, and the rest form the unlabeled set $U \subset \mathcal{X}$.

Now consider the problem of choosing the next example in U for querying. Labeling can be a very expensive task, because it requires using expensive equipment or human experts to manually examine each object. Thus, we want to be smart in choosing the next example. This motivates us to come up with a rule s(x; h) that gives each unlabeled example a score based only on its feature vector x and the current classifier h. Recall that the classifier produces $p(\mathcal{Y})$, a probability estimate for each class. We use these probability estimates from the classifier over the unlabeled examples to calculate the scores: $s: p(\mathcal{Y}) \to \mathbb{R}$. The value of s(x; h) indicates the informativeness of example x, where bigger is better. We would then label the example with the largest value of s(x; h). This gives our active learning rule r: $r(U; h) = \arg\max_{x \in U} s(x; h)$

Algorithm 1 outlines the standard pool-based active learning setting.

Algorithm 1:
The pool-based active learning algorithm.
Input: unlabeled set 𝒰, labeled training set T, classifier h(x), and active learner r(U;h).
repeat
 Select the most informative candidate x* from 𝒰 using the active learning rule r(U;h).
 Ask the expert to label x*. Call the label y*.
 Add the newly labeled example to the training set: T ← T ∪ {(x*, y*)}.
 Remove the newly labeled example from the unlabeled set: U ← U \ {x*}.
 Retrain the classifier h(x) using T.
until we have enough training examples.
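
To make the loop concrete, the following Python sketch implements Algorithm 1 with a scikit-learn classifier. The labeling function oracle and the score function score_fn are placeholders for the expert and for any of the heuristics discussed below; they are our illustrative names, not part of the original algorithm.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def pool_based_active_learning(X_train, y_train, X_pool, oracle, score_fn, budget):
        """A sketch of Algorithm 1: repeatedly label the highest-scoring candidate."""
        clf = LogisticRegression().fit(X_train, y_train)
        for _ in range(budget):
            proba = clf.predict_proba(X_pool)         # class probability estimates p(Y)
            best = int(np.argmax(score_fn(proba)))    # most informative candidate x*
            x_star = X_pool[best]
            y_star = oracle(x_star)                   # ask the expert for the label y*
            X_train = np.vstack([X_train, x_star])    # T <- T U {(x*, y*)}
            y_train = np.append(y_train, y_star)
            X_pool = np.delete(X_pool, best, axis=0)  # U <- U \ {x*}
            clf.fit(X_train, y_train)                 # retrain the classifier on T
        return clf, X_train, y_train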

Coming up with an optimal rule is itself a difficult problem, but there have been many attempts to derive good heuristics. Five common ones, which we shall use in our experiments, are described in “Uncertainty Sampling” and “Version Space Reduction.”

There are also heuristics that involve minimizing the variance or maximizing the classifier certainty of the model (Schein & Ungar, 2007), but they are computationally expensive. For example, in the variance minimization heuristic, the score of a candidate example is the expected reduction in the model variance if that example were in the training set. To compute this reduction, we first need to give the example each of the possible labels, add it to the training set, and update the classifier. This is expensive to run since in each iteration, the classifier needs to be retrained k × U times, where k is the number of classes and U is the size of the unlabeled pool. There are techniques to speed this up such as using online training or assigning a score to only a small subset of the unlabeled pool. Preliminary experiments showed that these heuristics do not perform as well as the simpler ones (Tran, 2015), so we do not consider them in this paper.

A more comprehensive treatment of these active learning heuristics can be found in Settles (2012).

Uncertainty sampling

Lewis & Gale (1994) introduced uncertainty sampling, where we select the instance whose class membership the classifier is least certain about. These tend to be points that are near the decision boundary of the classifier. Perhaps the simplest way to quantify uncertainty is the least confidence heuristic (Culotta & McCallum, 2005), where we pick the candidate whose most likely label the classifier is most uncertain about: $r_{LC}(U; h) = \arg\max_{x \in U} \left\{ -\max_{y \in \mathcal{Y}} p(y \mid x; h) \right\}$ where p(y|x; h) is the probability that the object with feature vector x belongs to class y under classifier h. For consistency, we have flipped the sign of the score function so that the candidate with the highest score is picked.

A second option is to calculate the entropy (Shannon, 1948), which measures the amount of information needed to encode a distribution. Intuitively, the closer the class probabilities of an object are to a uniform distribution, the higher its entropy will be. This gives us the heuristic of picking the candidate with the highest entropy of the distribution over the classes: $r_{HE}(U; h) = \arg\max_{x \in U} \left\{ -\sum_{y \in \mathcal{Y}} p(y \mid x; h) \log\left[p(y \mid x; h)\right] \right\}$ As a third option we can pick the candidate with the smallest margin, which is defined as the difference between the two highest class probabilities (Scheffer, Decomain & Wrobel, 2001): $r_{SM}(U; h) = \arg\max_{x \in U} \left\{ -\left(\max_{y \in \mathcal{Y}} p(y \mid x; h) - \max_{z \in \mathcal{Y} \setminus \{y^*\}} p(z \mid x; h)\right) \right\}$ where $y^* = \arg\max_{y \in \mathcal{Y}} p(y \mid x; h)$ and we again flip the sign of the score function. Since the sum of all probabilities must be 1, the smaller the margin is, the harder it is to differentiate between the two most likely labels.

An extension to the above three heuristics is to weight the score with the information density so that we give more importance to instances in regions of high density: $s_{ID}(x; h) = \left(\frac{1}{U} \sum_{k=1}^{U} \mathrm{sim}\left(x, x^{(k)}\right)\right) s(x; h)$ where h is the classifier, s(x; h) is the original score function of the instance with feature vector x, U is the size of the unlabeled pool, and sim(x, x^{(k)}) is the similarity between x and another instance x^{(k)} using the Gaussian kernel with parameter γ: $\mathrm{sim}\left(x, x^{(k)}\right) = \exp\left(-\gamma \left\|x - x^{(k)}\right\|^2\right)$

The information density weighting was proposed by Settles & Craven (2008) to discourage the active learner from picking outliers. Although the class membership of outliers might be uncertain, knowing their labels would probably not affect the classifier performance on the data as a whole.
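
As an illustration, the three uncertainty scores and the information density weight can be computed directly from a matrix of class probabilities. The sketch below uses our own function names; the small constant inside the logarithm only guards against log(0).

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    def least_confidence(proba):
        return -proba.max(axis=1)                     # -max_y p(y|x; h)

    def highest_entropy(proba):
        return -np.sum(proba * np.log(proba + 1e-12), axis=1)

    def smallest_margin(proba):
        top2 = np.sort(proba, axis=1)[:, -2:]         # two highest class probabilities
        return -(top2[:, 1] - top2[:, 0])             # negated margin, so argmax picks the smallest

    def information_density(X_pool, scores, gamma):
        # Mean Gaussian similarity of each candidate to the rest of the pool.
        density = rbf_kernel(X_pool, X_pool, gamma=gamma).mean(axis=1)
        return density * scores                       # weight the original scores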

Version space reduction

Instead of focusing on the uncertainty of individual predictions, we could instead try to constrain the size of the version space, thus allowing the search for the optimal classifier to be more precise. The version space is defined as the set of all possible classifiers that are consistent with the current training set. To quantify the size of this space, we can train a committee of B classifiers, ℬ = {h1, h2, …, hB}, and measure the disagreement among the members of the committee about an object’s class membership. Ideally, each member should be as different from the others as possible but still be in the version space (Melville & Mooney, 2004). In order to have this diversity, we give each member only a subset of the training examples. Since there might not be enough training data, we need to use bootstrapping and select samples with replacement. Hence this method is often called Query by Bagging (QBB).

One way to measure the level of disagreement is to calculate the margin using the class probabilities estimated by the committee (Melville & Mooney, 2004): $r_{QBB\text{-}M}(U; h) = \arg\max_{x \in U} \left\{ -\left(\max_{y \in \mathcal{Y}} p(y \mid x; \mathcal{B}) - \max_{z \in \mathcal{Y} \setminus \{y^*\}} p(z \mid x; \mathcal{B})\right) \right\}$ where $y^* = \arg\max_{y \in \mathcal{Y}} p(y \mid x; \mathcal{B})$ and $p(y \mid x; \mathcal{B}) = \frac{1}{B} \sum_{b=1}^{B} p(y \mid x; h_b)$

This looks similar to one of the uncertainty sampling heuristics, except now we use p(y|x; ℬ) instead of p(y|x; h). That is, we first average the class probabilities predicted by the members before minimizing the margin. McCallum & Nigam (1998) offered an alternative disagreement measure, which involves picking the candidate with the largest mean Kullback–Leibler (KL) divergence from the average: $r_{QBB\text{-}KL}(U; h) = \arg\max_{x \in U} \left\{ \frac{1}{B} \sum_{b=1}^{B} D_{KL}\left(p_b \,\middle\|\, p_{\mathcal{B}}\right) \right\}$ where $D_{KL}(p_b \| p_{\mathcal{B}})$ is the KL divergence of $p_b$ (the distribution predicted by a member $h_b \in \mathcal{B}$) from $p_{\mathcal{B}}$ (the distribution averaged across the committee ℬ): $D_{KL}\left(p_b \| p_{\mathcal{B}}\right) = \sum_{y \in \mathcal{Y}} p(y \mid x; h_b) \ln \frac{p(y \mid x; h_b)}{p(y \mid x; \mathcal{B})}$ For convenience, we summarize the five heuristics discussed above in Table 1.
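
A rough sketch of the two QBB heuristics is given below. It assumes a scikit-learn base classifier and, for simplicity, that every bootstrap sample still contains all classes so that the probability matrices line up; the function names are ours.

    import numpy as np
    from sklearn.base import clone

    def committee_proba(clf, X_train, y_train, X_pool, n_members=7, seed=0):
        """Train a bootstrapped committee and stack each member's class probabilities."""
        rng = np.random.default_rng(seed)
        probas = []
        for _ in range(n_members):
            idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
            member = clone(clf).fit(X_train[idx], y_train[idx])
            probas.append(member.predict_proba(X_pool))
        return np.stack(probas)                       # shape (B, |U|, |Y|)

    def qbb_margin(probas):
        mean = probas.mean(axis=0)                    # p(y|x; B), averaged over the committee
        top2 = np.sort(mean, axis=1)[:, -2:]
        return -(top2[:, 1] - top2[:, 0])

    def qbb_kl(probas):
        mean = probas.mean(axis=0)
        kl = np.sum(probas * np.log((probas + 1e-12) / (mean + 1e-12)), axis=2)
        return kl.mean(axis=0)                        # mean KL divergence from the average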

Table 1:
Summary of active learning heuristics used in our experiments.
Abbreviation | Heuristic | Objective function
confidence | Least confidence | $\arg\max_{x \in U} \left\{ -\max_{y \in \mathcal{Y}} p(y \mid x; h) \right\}$
entropy | Highest entropy | $\arg\max_{x \in U} \left\{ -\sum_{y \in \mathcal{Y}} p(y \mid x; h) \log\left[p(y \mid x; h)\right] \right\}$
margin | Smallest margin | $\arg\max_{x \in U} \left\{ -\left(\max_{y \in \mathcal{Y}} p(y \mid x; h) - \max_{z \in \mathcal{Y} \setminus \{y^*\}} p(z \mid x; h)\right) \right\}$
qbb-margin | Smallest QBB margin | $\arg\max_{x \in U} \left\{ -\left(\max_{y \in \mathcal{Y}} p(y \mid x; \mathcal{B}) - \max_{z \in \mathcal{Y} \setminus \{y^*\}} p(z \mid x; \mathcal{B})\right) \right\}$
qbb-kl | Largest QBB KL | $\arg\max_{x \in U} \left\{ \frac{1}{B} \sum_{b=1}^{B} D_{KL}\left(p_b \,\middle\|\, p_{\mathcal{B}}\right) \right\}$
DOI: 10.7717/peerj-cs.157/table-1

Note:

Notations: p(y|x; h) is the probability that an object with feature vector x has label y under classifier h, ℬ is the set of B classifiers {h1, h2, …, hB}, 𝒴 is the set of possible labels, y* is the most likely label, 𝒰 is the set of unlabeled instances, $D_{KL}(p \| q)$ is the Kullback–Leibler divergence of p from q, and $p_{\mathcal{B}}$ is the class distribution averaged across the classifiers in ℬ. For consistency with the heuristics that use minimization, we flip the sign of the score so that we can always take the argmax to get the best candidate.

Combining Suggestions

Out of the five heuristics discussed, which one should we use in practice when we would like to apply active learning to a particular problem? There have been some attempts in the literature to analyze their performance theoretically. Proofs are however scarce, and when one is available, it normally only holds under restrictive assumptions. For example, Freund et al. (1997) showed that the query by committee algorithm (a slight variant of our two QBB heuristics) guarantees an exponential decrease in the prediction error with the training size, but only when there is no noise. In general, whether any of these heuristics is guaranteed to beat passive learning is still an open question.

Even though we do not know which one is the best, we can still combine suggestions from all of the heuristics. This can be thought of as the problem of prediction with expert advice, where each expert is an active learning heuristic. In this paper we explore two different approaches: we can either consider the advice of only one expert at each time step (with bandit algorithms), or we can aggregate the advice of all the experts (with social choice theory).

Combining suggestions with bandit theory

First let us turn our attention to the multi-armed bandit problem in probability theory (Berry & Fristedt, 1985). The colorful name originates from the situation where a gambler stands in front of a slot machine with R levers. When pulled, each lever gives out a reward according to some unknown distribution. The goal of the game is to come up with a strategy that can maximize the gambler’s lifetime rewards. In the context of active learning, each lever is a heuristic with a different ability to identify the candidate whose labeling information is most valuable.

The main problem in multi-armed bandits is the trade-off between exploring random heuristics and exploiting the best heuristic so far. There are many situations in which we find our previously held beliefs to be completely wrong. By always exploiting, we could miss out on the best heuristic. On the other hand, if we explore too much, it could take us a long time to reach the desired accuracy.

Bandit algorithms do not need to know the internal workings of the heuristics, but only the reward received from using any of them. At each time step, we receive a reward from a heuristic, and based on the history of all the rewards, the bandit algorithm decides which heuristic to pick next. Formally, we need to learn the function $b: \left(J_{\mathcal{R}} \times [0, 1]\right)^n \to J_{\mathcal{R}}$ where b is the bandit algorithm, the reward is normalized between 0 and 1, $J_{\mathcal{R}}$ is the index set over the set of heuristics ℛ, and n is the time horizon.

What would be an appropriate reward w in this setting? We propose using the incremental increase in the performance on the test set after the candidate is added to the training set. This, of course, means that we need to keep a separate labeled test set around, just for the purpose of computing the rewards. We could, as is common practice in machine learning, use cross-validation or the bootstrap on T to estimate the generalization performance. However, for simplicity of presentation we use a separate test set S.

Figure 1 and Algorithm 2 outline how bandits can be used in pool-based active learning. The only difference between the bandit algorithms lies in the Select function that selects which heuristic to use, and the Update function that updates the algorithm’s selection parameters when receiving a new reward.

Figure 1: Active learning pipeline with bandit algorithms.

We need to collect rewards w from the test set S in order to decide which heuristic to choose at each time step. This routine is indicated by the red arrows. Notations: ℛ is the set of heuristics {r1, …, rR}, T is the training set, S is the test set, 𝒰 is the unlabeled set, and p(𝒴) is the predicted class probabilities on the unlabeled data 𝒰.

There have been some attempts to combine active learning suggestions in the literature. Baram, El-Yaniv & Luz (2004) used the EXP4 multi-armed bandit algorithm to automate the selection process. They proposed a reward called the classification entropy maximization, which can be shown to grow at a similar rate to the true accuracy in binary classification with support vector machines (SVMs). We will not compare our results directly with those in Baram, El-Yaniv & Luz (2004) since we would like to evaluate algorithms that can work with both binary and multi-class classification. Our experiments also use logistic regression which produces probability estimates directly, rather than SVMs which can only produce unnormalized scores. Hsu & Lin (2015) studied an improved version of EXP4, called EXP4.P, and used importance weighting to estimate the true classifier performance using only the training set. In this paper, we empirically compare the following four bandit algorithms: Thompson sampling, OC-UCB, kl-UCB, and EXP3++.

Algorithm 2:
Pool-based active learning with bandit theory. Note that in addition to the set of active learning heuristics ℛ and the test set S, some bandit algorithms also need to know n, the maximum size of the training set, in advance.
Input: unlabeled set 𝒰, labeled training set T, labeled test set S, classifier h, desired training size n, set of active learning heuristics ℛ, and bandit algorithm b with two functions Select and Update.
while |T| < n do
  Select a heuristic r* according to Select.
  Select the most informative candidate x* from 𝒰 using the chosen heuristic r* (𝒰; h).
  Ask the expert to label x*. Call the label y*.
  Add the newly labeled example to the training set: T ← T ∪ {(x*, y*)}.
  Remove the newly labeled example from the unlabeled set: U ← U \ {x*}.
  Retrain the classifier h(x) using T.
  Run the updated classifier on the test set S to compute the increase in the performance w.
  Update the parameters of b with Update(w).
end

Thompson sampling

The oldest bandit algorithm is Thompson sampling (Thompson, 1933) which solves the exploration-exploitation trade-off from a Bayesian perspective.

Let Wi be the reward of heuristic ri. Observe that even with the best heuristic, we still might not score perfectly due to having a poor classifier trained on finite data. Conversely, a bad heuristic might be able to pick an informative candidate due to pure luck. Thus there is always a certain level of randomness in the reward received. Let us treat the reward Wi as a normally distributed random variable with mean νi and variance τi²: $(W_i \mid \nu_i) \sim \mathcal{N}\left(\nu_i, \tau_i^2\right)$

If we knew both νi and τi² for all heuristics, the problem would become trivially easy since we just need to always use the heuristic that has the highest mean reward. In practice, we do not know the true mean of the reward νi, so let us add a second layer of randomness and assume that the mean itself follows a normal distribution: $\nu_i \sim \mathcal{N}\left(\mu_i, \sigma_i^2\right)$

To make the problem tractable, let us assume that the variance τi2 in the first layer is a known constant. The goal now is to find a good algorithm that can estimate μi and σi2.

We start with a prior on μi and σi2 for each heuristic ri. The choice of prior does not usually matter in the long run. Since initially we do not have any information about the performance of each heuristic, the appropriate prior value for μi is 0, i.e., there is no evidence (yet) that any of the heuristics offers an improvement to the performance.

In each round, we draw a random sample νi′ from the normal distribution $\mathcal{N}(\mu_i, \sigma_i^2)$ for each i and select the heuristic r* that has the highest sampled value of the mean reward: $r^* = \arg\max_i \nu_i'$

We then use this heuristic to select the object that is deemed to be the most informative, add it to the training set, and retrain the classifier. Next we use the updated classifier to predict the labels of objects in the test set. Let w be the reward observed. We now have a new piece of information that we can use to update our prior belief about the mean μ* and the variance σ*² of the mean reward. Using Bayes' theorem, we can show that the posterior distribution of the mean reward remains normal, $(\nu_* \mid W_* = w) \sim \mathcal{N}\left(\mu_*', \sigma_*'^2\right)$, with the following new mean and variance: $\mu_*' = \frac{\mu_* \tau_*^2 + w \sigma_*^2}{\sigma_*^2 + \tau_*^2}, \qquad \sigma_*'^2 = \frac{\sigma_*^2 \tau_*^2}{\sigma_*^2 + \tau_*^2}$

Algorithm 3 summarizes the Select and Update functions used in Thompson sampling.

Algorithm 3:
Thompson sampling with normally distributed rewards. Notations: ℛ is the set of R heuristics, μ is the mean parameter of the average reward, σ² is the variance parameter of the average reward, τ² is the known variance parameter of the reward, and w is the actual reward received.
function Select()
   for i ∈ {1, 2, …, R} do
     νi′ ← a sample drawn from $\mathcal{N}(\mu_i, \sigma_i^2)$
   end
   Select the heuristic with the highest sampled value: $r^* \leftarrow \arg\max_i \nu_i'$
function Update(w)
   $\mu_* \leftarrow \frac{\mu_* \tau_*^2 + w \sigma_*^2}{\sigma_*^2 + \tau_*^2}, \qquad \sigma_*^2 \leftarrow \frac{\sigma_*^2 \tau_*^2}{\sigma_*^2 + \tau_*^2}$
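
A minimal sketch of Algorithm 3 is shown below, with illustrative prior values (the experiments later use μ = 0.5, σ² = 0.02, and τ² = 0.02).

    import numpy as np

    class ThompsonSampling:
        """Sketch of Thompson sampling with normally distributed rewards (known variance tau2)."""
        def __init__(self, n_heuristics, mu0=0.5, sigma2_0=0.02, tau2=0.02):
            self.mu = np.full(n_heuristics, mu0)           # prior mean of the mean reward
            self.sigma2 = np.full(n_heuristics, sigma2_0)  # prior variance of the mean reward
            self.tau2 = tau2                               # known variance of the reward
            self.last = None

        def select(self):
            nu = np.random.normal(self.mu, np.sqrt(self.sigma2))  # sample nu_i' ~ N(mu_i, sigma_i^2)
            self.last = int(np.argmax(nu))                 # heuristic with the highest sampled mean
            return self.last

        def update(self, w):
            i = self.last
            mu, s2, t2 = self.mu[i], self.sigma2[i], self.tau2
            self.mu[i] = (mu * t2 + w * s2) / (s2 + t2)    # posterior mean
            self.sigma2[i] = (s2 * t2) / (s2 + t2)         # posterior variance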

Upper confidence bounds

Next we consider the Upper Confidence Bound (UCB) algorithms which use the principle of “optimism in the face of uncertainty.” In choosing which heuristic to use, we first estimate the upper bound of the reward (that is, we make an optimistic guess) and pick the one with the highest bound. If our guess turns out to be wrong, the upper bound of the chosen heuristic will decrease, making it less likely to get selected in the next iteration.

There are many different algorithms in the UCB family, e.g., UCB1-TUNED and UCB2 (Auer, Cesa-Bianchi & Fischer, 2002a), UCB-V (Audibert, Munos & Szepesvári, 2009), OC-UCB (Lattimore, 2015), and kl-UCB (Cappé et al., 2013). They differ only in the way the upper bound is calculated. In this paper, we only consider the last two. In Optimally Confident UCB (OC-UCB), Lattimore (2015) suggests that we pick the heuristic that maximizes the following upper bound: $r^* = \arg\max_i \left( \bar{w}_i + \sqrt{\frac{\alpha}{T_i(t)} \ln\left(\frac{\psi n}{t}\right)} \right)$ where $\bar{w}_i$ is the average of the rewards from ri that we have observed so far, t is the time step, Ti(t) is the number of times we have selected heuristic ri before step t, and n is the maximum number of steps that we are going to take. There are two tunable parameters, α and ψ, which the author suggests setting to 3 and 2, respectively.

In kl-UCB, Cappé et al. (2013) suggest that we instead consider the KL-divergence between the distribution of the current estimated reward and that of the upper bound. In the case of normally distributed rewards with known variance σ², the chosen heuristic would be $r^* = \arg\max_i \left( \bar{w}_i + \sqrt{\frac{2\sigma^2 \ln(t)}{T_i(t)}} \right)$

Algorithms 4 and 5 summarize these two UCB approaches. Note that the reward w is not used to update any parameters in Update(w); it only enters the algorithms through the running average $\bar{w}_i$ used in Select.

Algorithm 4:
Optimally Confident UCB. Notations: n is the time horizon (maximum number of time steps), t is the current time step, Ti(t) counts how many times heuristic i has been selected before step t, w is the reward received, and $\bar{w}_i$ is the average of the rewards from ri so far.
function Select()
  $r^* \leftarrow \arg\max_i \left( \bar{w}_i + \sqrt{\frac{3}{T_i(t)} \ln\left(\frac{2n}{t}\right)} \right)$
function Update(w)
  t ← t + 1
  T*(t) ← T*(t−1) + 1
Algorithm 5:
kl-UCB with normally distributed rewards. Notations: σ² is the variance of the rewards, t is the current time step, Ti(t) counts how many times heuristic i has been selected before step t, w is the reward received, and $\bar{w}_i$ is the average of the rewards from ri so far.
function Select()
  $r^* \leftarrow \arg\max_i \left( \bar{w}_i + \sqrt{\frac{2\sigma^2 \ln(t)}{T_i(t)}} \right)$
function Update(w)
  t ← t + 1
  T*(t) ← T*(t−1) + 1
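
For illustration, a sketch of OC-UCB (Algorithm 4) with α = 3 and ψ = 2 is given below; kl-UCB differs only in the bonus term inside select. Pulling each heuristic once before trusting the bound is our addition to keep the averages well defined.

    import numpy as np

    class OCUCB:
        """Sketch of Optimally Confident UCB over a fixed set of heuristics."""
        def __init__(self, n_heuristics, horizon):
            self.n = horizon                        # maximum number of steps n
            self.t = 1
            self.counts = np.zeros(n_heuristics)    # T_i(t)
            self.sums = np.zeros(n_heuristics)      # running sums of the rewards
            self.last = None

        def select(self):
            means = self.sums / np.maximum(self.counts, 1)
            bonus = np.sqrt(3.0 / np.maximum(self.counts, 1) * np.log(2.0 * self.n / self.t))
            index = np.where(self.counts == 0, np.inf, means + bonus)  # try every arm once
            self.last = int(np.argmax(index))
            return self.last

        def update(self, w):
            self.t += 1
            self.counts[self.last] += 1
            self.sums[self.last] += w               # keeps the average reward w_bar up to date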

EXP3++

The exponential-weight algorithm for exploration and exploitation (EXP3) was first developed by Auer et al. (2002b) to solve the non-stochastic bandit problem where we make no statistical assumptions about the reward distribution. This is also often known as the adversarial setting, where we have an adversary who generates an arbitrary sequence of rewards for each heuristic in advance. Like Thompson sampling, the algorithm samples from a probability distribution at each step to pick a heuristic. Here however, we construct the distribution with exponential weighting (hence the name EXP3). We shall test Seldin & Slivkins (2014)’s EXP3++ algorithm (see Algorithm 6). This is a generalization of the original EXP3 and it has been shown to perform well in both the stochastic (where the reward of each heuristic follows an unknown reward distribution) and the adversarial regime.

Algorithm 6:
EXP3++ algorithm. Notations: ℛ is the set of R heuristics, t is the current time step, β is a parameter used to weight the heuristics for selection, ξi and εi are used to compute the loss Li, ρ is the distribution from which a heuristic is sampled, and w is the reward received.
function Select()
  $\beta = \frac{1}{2}\sqrt{\frac{\ln R}{tR}}$
  for i ∈ {1, 2, …, R} do
   $\xi_i = \frac{18 \ln(t)^2}{t \min\left(1, \frac{1}{t}\left(L_i - \min(L)\right)\right)^2}$
   $\varepsilon_i = \min\left(\frac{1}{2R}, \beta, \xi_i\right)$
   $\rho_i = \frac{e^{-\beta L_i}}{\sum_j e^{-\beta L_j}}$
  end
  r* ← a sample drawn from ℛ with probability distribution ρ.
function Update(w)
  t ← t + 1
  T*(t) ← T*(t−1) + 1
  $L_* \leftarrow L_* + \frac{1 - w}{\left(1 - \sum_j \varepsilon_j\right)\rho_* + \varepsilon_*}$

Combining suggestions with social choice theory

A drawback of the above bandit methods is that at each iteration, we can only use one suggestion from one particular heuristic. The EXP4 and EXP4.P algorithms can take advice from all heuristics by maintaining a weight on each of them. However, being bandit methods, they still require designing a reward scheme. If the reward is the performance on a test set, we would need to keep around a separate subset of the data, which is expensive and sometimes impossible to obtain in practice. This leads us to social choice theory, which can combine suggestions like EXP4 and EXP4.P do, while not needing the concept of a reward. Originally developed by political scientists like Nicolas de Condorcet and Jean-Charles de Borda, this field of study is concerned with how we aggregate the preferences of a group of people to determine, for example, the winner in an election (List, 2013). It has the nice property that everyone (or in our context, every active learning heuristic) has a voice.

For each heuristic, we assign a score to every candidate with the score function s(x; h) as before. We are interested neither in the actual raw scores nor in just the candidate with the highest score. Instead, we only need a ranking of the candidates, which is produced by a function k(s, U) that ranks the unlabeled examples according to their scores. For example, k could assign the candidate with the highest score a rank of 1, the next best candidate a rank of 2, and so on. An aggregation function c then combines all the rankings into one: $c: \sigma(J_U)^R \to \sigma(J_U)$ where $\sigma(J_U)$ denotes a permutation over the index set of the unlabeled pool 𝒰 and R is the number of heuristics. From this combined ranking we can pick the highest-ranked candidate to annotate. See Table 2 for an example.

Table 2:
An example of how to convert raw scores into a ranking.
Score s(x; h) 0.1 0.9 0.3 0.8
Rank k(s, U) 4 1 3 2
DOI: 10.7717/peerj-cs.157/table-2

The main difference between this approach and the bandit algorithms is that we do not consider the reward history when combining the rankings. Here each heuristic is assumed to always have an equal weight. A possible extension, which is not considered in this paper, is to use the past performance to re-weight the heuristics before aggregating at each step. Figure 2 and Algorithm 7 provide an overview of how social choice theory is used in pool-based active learning.

Figure 2: Active learning pipeline with rank aggregation methods.

Unlike the bandit pipeline, there is only one cycle, in which we aggregate information from all heuristics. Additional notation: $\sigma(J_U)$ is a permutation (i.e., a ranking) of the index set of the unlabeled data.
Algorithm 7:
Pool-based active learning with social choice theory.
Input: unlabeled set 𝒰, labeled training set T, classifier h, set of active learning heuristics ℛ, ranking function k, and rank aggregator c.
repeat:
  for r ∈ ℛ do
   Rank all the candidates in 𝒰 using the scores from r and the ranking function k.
  end
  Aggregate all the rankings into one ranking using the aggregator c.
  Select the highest-ranked candidate x* from 𝒰.
  Ask the expert to label x*. Call the label y*.
  Add the newly labeled example to the training set: T ← T ∪ {(x*, y*)}.
  Remove the newly labeled example from the unlabeled set: U ← U \ {x*}.
  Retrain the classifier h(x) using T.
until we have enough training examples.

The central question in social choice theory is how we can come up with a good preference aggregation rule. We shall examine three aggregation rules: Borda count, the geometric mean, and the Schulze method.

In the simplest approach, Borda count, we assign an integer number of points to each candidate. The lowest-ranked candidate receives 1 point, and each candidate receives one more point than the candidate ranked below it. To aggregate, we simply add up all the points each candidate receives from every heuristic. The candidate with the most points is declared the winner and is labeled next. We can think of Borda count, then, as ranking the candidates according to the arithmetic mean of their points.

An alternative approach is to use the geometric mean, where instead of adding up the points, we multiply them. Bedö & Ong (2016) showed that the geometric mean maximizes the Spearman correlation between the ranks. Note that this method requires the ranks to be scaled so that they lie strictly between 0 and 1. This can be achieved by simply dividing the ranks by (U + 1), where U is the number of candidates.
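
A minimal sketch of these two aggregators is given below, assuming the scores from all heuristics are stacked into a matrix with one row per heuristic; scipy.stats.rankdata gives higher-scoring candidates more points, matching the Borda convention above.

    import numpy as np
    from scipy.stats import rankdata

    def aggregate(score_matrix, method="borda"):
        """score_matrix has shape (R heuristics, U candidates); returns the index to label next."""
        # Convert each heuristic's scores into points: the best candidate gets U points, the worst 1.
        points = np.vstack([rankdata(s) for s in score_matrix])
        if method == "borda":
            combined = points.sum(axis=0)                      # arithmetic aggregation
        else:
            scaled = points / (score_matrix.shape[1] + 1)      # scale the points into (0, 1)
            combined = scaled.prod(axis=0)                     # geometric aggregation
        return int(np.argmax(combined))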

The third approach we consider is the Schulze method (Schulze, 2011). Out of the three methods considered, this is the only one that fulfills the Condorcet criterion, i.e., the winner chosen by the algorithm is also the winner when compared individually with each of the other candidates. However, the Schulze method is more computationally intensive since it requires examining all pairs of candidates. First we compute the number of heuristics that prefer candidate xi to candidate xj, for all possible pairs (xi, xj). Let us call this d(xi, xj). Let us also define a path from candidate xi to xj as a sequence of candidates, {xi, x1, x2, …, xj}, that starts with xi and ends with xj, where, as we move along the path, the number of heuristics that prefer the current candidate over the next candidate must be strictly decreasing. Intuitively, the path is a ranking of a subset of candidates, where xi is the highest-ranked candidate and xj is the lowest-ranked.

Associated with each path is a strength, which is the minimum of d(xi, xj) over all consecutive pairs xi and xj along the path. The core part of the algorithm involves finding the path of maximal strength from each candidate to every other candidate. Let us call p(xi, xj) the strength of the strongest path from xi to xj. Candidate xi is a potential winner if $p(x_i, x_j) \ge p(x_j, x_i)$ for all other xj. This problem has a similar flavor to the problem of finding the shortest path. In fact, the implementation uses a variant of the Floyd–Warshall algorithm to find the strongest paths. This is the most efficient implementation that we know of, taking cubic time in the number of candidates.
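
The sketch below illustrates the pairwise preference counts and the Floyd–Warshall-style strongest-path computation. It is a plain cubic-time implementation for small pools, not an optimized one.

    import numpy as np

    def schulze_winner(rankings):
        """rankings[r, i] is candidate i's rank under heuristic r (1 = best).
        Returns the index of a Schulze winner."""
        n_heuristics, n_candidates = rankings.shape
        # d[i, j] = number of heuristics that prefer candidate i to candidate j.
        d = np.zeros((n_candidates, n_candidates), dtype=int)
        for i in range(n_candidates):
            for j in range(n_candidates):
                if i != j:
                    d[i, j] = np.sum(rankings[:, i] < rankings[:, j])
        # Strongest path strengths p[i, j], computed with a Floyd-Warshall-style update.
        p = np.where(d > d.T, d, 0)
        for k in range(n_candidates):
            for i in range(n_candidates):
                for j in range(n_candidates):
                    if i != j and i != k and j != k:
                        p[i, j] = max(p[i, j], min(p[i, k], p[k, j]))
        # A potential winner beats or ties every other candidate in strongest-path strength.
        wins = np.all((p >= p.T) | np.eye(n_candidates, dtype=bool), axis=1)
        return int(np.argmax(wins))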

We end this section with a small illustration of how the three aggregation algorithms work in Table 3.

Table 3:
An example of how social choice theory algorithms rank candidates by aggregating three heuristics: r1, r2, and r3.
There are four candidates in the unlabeled pool: A, B, C, and D.
(a) An example of how the three heuristics rank four candidates A, B, C, and D. For instance, heuristic r1 considers B to be the highest-ranked candidate, followed by A, C, and D.
Heuristic Ranking
r1 B A C D
r2 A C B D
r3 B D C A
(b) Aggregated ranking with Borda count and geometric mean. The scores are determined by the relative ranking in each heuristic. For example, A is ranked second by r1, first by r2, and last by r3, thus giving us scores of 3, 4, and 1, respectively. In both methods, candidate B receives the highest aggregated score.
Candidate Borda count Geometric mean
A 3 + 4 + 1 = 8 3 × 4 × 1 = 12
B 4 + 2 + 4 = 10 4 × 2 × 4 = 32
C 2 + 3 + 2 = 7 2 × 3 × 2 = 12
D 1 + 1 + 3 = 5 1 × 1 × 3 = 3
(c) Aggregated ranking with the Schulze method. The table shows the strongest path strength p(xi, xj) between all pairs of candidates. For example, p(B, D) = 3 because the path B → D is the strongest path from B to D, where three heuristics prefer B over D. Candidate B is the winner since p(B, A) > p(A, B), p(B, C) > p(C, B), and p(B, D) > p(D, B).
From/To A B C D
A – 1 2 2
B 2 – 2 3
C 1 1 – 2
D 2 0 1 –
DOI: 10.7717/peerj-cs.157/table-3

Experimental Protocol

We use 11 classification datasets taken from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/) (Lichman, 2013), together with a large multiclass classification dataset that we extracted from the SDSS project (DOI 10.5281/zenodo.58500) (Alam et al., 2015). The code for the experiments can be found on our GitHub repository (https://github.com/chengsoonong/mclass-sky). Table 4 shows the size and the number of classes in each dataset, along with the proportion of the samples belonging to the majority class and the maximum achievable performance using logistic regression. These datasets were chosen such that we have an equal number of binary and multiclass datasets, and a mixture of small and large datasets.

Table 4:
Overview of datasets.
Dataset Size No. of classes No. of features Majority class (%) Max performance (MPBA) (%)
Glass 214 6 10 33 65
Ionosphere 351 2 34 64 89
Iris 150 3 4 33 90
Magic 19,020 2 11 65 84
Miniboone 129,596 2 50 72 88
Pageblock 5,473 5 10 90 79
Pima 733 2 8 66 71
SDSS 2,801,002 3 11 61 90
Sonar 208 2 60 53 78
Vehicle 846 4 18 26 81
Wine 178 3 13 40 94
WPBC 194 2 34 76 58
DOI: 10.7717/peerj-cs.157/table-4

Note:

The following datasets are from the UCI Machine Learning Repository: glass, ionosphere, iris, magic, miniboone, pageblock, pima, sonar, vehicle, wine, and wpbc. In particular, the vehicle dataset comes from the Turing Institute, Glasgow, Scotland. The sdss dataset was extracted from Data Release 12 of SDSS-III.

For each dataset, we use Scikit-learn (Pedregosa et al., 2011) to train a logistic regression model using 10-fold stratified shuffled cross-validation. Here "stratified" means that the proportion of the classes remains constant in each split. We standardize all features to have zero mean and unit variance. Although all examples have already been labeled, we simulate the active learning task by assuming that certain examples do not have any labels. For each fold, the unlabeled pool size is 70% of the data up to a maximum of 10,000 examples, and the test pool consists of the remaining examples up to a maximum of 20,000. We assume all test examples are labeled. We initialize the classifier by labeling 10 random instances and using them as the initial training set. The heuristics are fast enough that we can assign a score to every unlabeled instance at every time step. We use logistic regression with a Gaussian kernel approximation and an L2 regularizer. In the binary case, the loss function is $L = \frac{1}{2}\theta^T\theta + C\sum_{i=1}^{n}\ln\left(1 + \exp\left(-y_i\,\theta^T f(x_i)\right)\right)$ where xi is the feature vector of the ith example, yi ∈ {−1, +1} is the label of xi, and n is the training size. The term $\frac{1}{2}\theta^T\theta$ is the regularization term that ensures the weight vector θ does not become too large, and C is a regularization hyperparameter in [10⁻⁶, 10⁶] which we find using grid search. To speed up training while using the Gaussian kernel, we approximate the feature map of the kernel with Random Kitchen Sinks (Rahimi & Recht, 2008), transforming the raw features xi into a fixed 100-dimensional feature vector f(xi). In the multiclass case, we use the One-vs-Rest strategy, where for every class we build a binary classifier that determines whether a particular example belongs to that class or not. For the QBB algorithms, we train a committee of seven classifiers, where each member is given a sample of at most 100 examples that have already been labeled.
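
A sketch of this classifier pipeline in scikit-learn is shown below. The exact grid over C, the use of OneVsRestClassifier, and the default kernel width are our reading of the setup rather than the exact configuration used in the experiments.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.model_selection import GridSearchCV

    # Standardize, map the features into a fixed 100-dimensional space with Random Kitchen
    # Sinks, then fit an L2-regularized one-vs-rest logistic regression.
    pipeline = make_pipeline(
        StandardScaler(),
        RBFSampler(n_components=100, random_state=0),
        OneVsRestClassifier(LogisticRegression(penalty="l2")),
    )
    param_grid = {"onevsrestclassifier__estimator__C": np.logspace(-6, 6, 13)}
    model = GridSearchCV(pipeline, param_grid, cv=10)
    # model.fit(X_train, y_train); model.predict_proba(X_pool) then feeds the heuristics.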

For the bandit algorithms, we use the increase in the MPBA on the test set as the reward. The MPBA can be thought of as the expected value of the average recall, where we treat the recall as a random variable that follows a Beta distribution. Compared to the raw accuracy score, this metric takes into account class imbalance. This is because we first calculate the recall in each class and then take the average, thus giving each class an equal weight. Refer to Appendix A for the derivation of the MPBA, which extends Brodersen et al. (2010)’s formula from the binary to the multiclass setting.

In total, we test 17 query strategies. This includes passive learning, eight active learning heuristics, five bandit algorithms, and three aggregation methods. The bandit algorithms include the four described in “Combining Suggestions with Bandit Theory” and a baseline called explore which simply selects a random heuristic at each time step. In other words, we ignore the rewards and explore 100% of the time. For all bandit and rank aggregation methods, we take advice from six representative experts: passive, confidence, margin, entropy, qbb-margin, and qbb-kl. We have not explored how adding the heuristics with information density weighting to the bandits would impact the performance. See Table 5 for a list of abbreviations associated with the methods.

Table 5:
Summary of active learning heuristics and combiners used in the experiments.
Abbreviation Type Description
passive Heuristic Passive learning
confidence Heuristic Least confidence heuristic
w-confidence Heuristic Least confidence heuristic with information density weighting
margin Heuristic Smallest margin heuristic
w-margin Heuristic Smallest margin heuristic with information density weighting
entropy Heuristic Highest entropy heuristic
w-entropy Heuristic Highest entropy heuristic with information density weighting
qbb-margin Heuristic Smallest QBB margin heuristic
qbb-kl Heuristic Largest QBB KL-divergence heuristic
explore Bandit Bandit algorithm with 100% exploration
thompson Bandit Thompson sampling
ocucb Bandit Optimally Confident UCB algorithm
klucb Bandit kl-UCB algorithm
exp3++ Bandit EXP3++ algorithm
borda Aggregation Aggregation with Borda count
geometric Aggregation Aggregation with the geometric mean
schulze Aggregation Aggregation with the Schulze method
DOI: 10.7717/peerj-cs.157/table-5

Given that there are 12 datasets, each with 17 learning curves, we need a measure that can summarize in one number how well a particular heuristic or policy does. Building on Baram, El-Yaniv & Luz (2004)'s deficiency measure, we define the strength of an active learner or a combiner relative to passive learning as $\mathrm{Strength}(h; m) = 1 - \frac{\sum_{t=1}^{n}\left(m(\mathrm{max}) - m(h, t)\right)}{\sum_{t=1}^{n}\left(m(\mathrm{max}) - m(\mathrm{passive}, t)\right)}$ where m is a chosen metric (e.g., accuracy rate, MPBA), m(max) is the best possible performance (see the note at the end of this article), and m(h, t) is the performance achieved using the first t examples selected by heuristic h. We can think of the summation as the area between the best possible performance line and the learning curve of h. The better the heuristic is, the faster it approaches this maximum line, and thus the smaller the area. Finally, so that we can compare the performance across datasets, we normalize the measure by the area obtained from using just passive learning. Refer to Fig. 3 for a visualization of the strength measure.
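
Given the learning curve of a heuristic and the passive-learning curve over the same budget, the strength is a few lines of code, for example:

    import numpy as np

    def strength(curve, passive_curve, best):
        """Strength of an active learner relative to passive learning.
        curve[t] and passive_curve[t] are the performances (e.g., MPBA) after t selected
        examples, and best is m(max), the best possible performance."""
        area_active = np.sum(best - np.asarray(curve))
        area_passive = np.sum(best - np.asarray(passive_curve))
        return 1.0 - area_active / area_passive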

Figure 3: An illustration of the MPBA strength measure.

It is proportional to the shaded area between the (red) passive learning curve and the (blue) active learning curve. The bigger the area is, the more the active learner outperforms the passive learner. The top dotted line indicates the maximum performance achieved.

We evaluate the algorithm performance with two metrics: the accuracy score and the MPBA. The accuracy score is the percentage of instances in the test set where the predicted label matches the true label. If a dataset has a dominant class, then the accuracy score of instances within that class will also dominate the overall accuracy score. The MPBA, on the other hand, puts an equal weight on each class and thus favors algorithms that can predict the label of all classes equally well.

The heuristics with information density weighting and Thompson sampling have a few additional hyperparameters. To investigate the effect of these hyperparameters, we pick one multiclass dataset (glass) and one binary dataset (ionosphere). Both of these are small enough to allow us to iterate through many hyperparameter values quickly. With w-confidence, w-margin, and w-entropy, we set γ in the Gaussian kernel to be the inverse of the 95th percentile of all pairwise distances. This appears to work well, as shown in Fig. 4. For thompson, the prior values of μ and σ² and the value of τ² seem to have little effect on the final performance (see Fig. 5). We set the initial μ to 0.5, the initial σ² to 0.02, and τ² to 0.02.

Figure 4: Effect of γ on w-confidence and w-margin using the glass and ionosphere datasets.

We examine six different values for γ: the 50th, 60th, 70th, 90th, 95th, and 99th percentile of the pairwise L1-distance between the data points. For the glass dataset (A), changing the value of γ has minimal effect on the results, while for the ionosphere dataset (B), using the 90th percentile and above seems to work well.

Figure 5: Effect of the initial values of the parameters in thompson.

We test 16 combinations of μ, σ², and τ² on the glass (A) and ionosphere (B) datasets. Varying these values does not seem to affect the final performance.

Results

Figures 6 and 7 show the strengths of all the methods that we consider, while Figs. 8 and 9 provide selected learning curves. Plots for the six small datasets with fewer than 500 examples (glass, ionosphere, iris, sonar, wine, and wpbc) are shown in Figs. 6 and 8. Plots for the two medium-sized datasets (pima and vehicle) and the four large datasets (magic, miniboone, pageblocks, and sdss) are shown in Figs. 7 and 9. Each figure contains two subfigures per dataset, one reporting the raw accuracy score and the other the MPBA score.

Figure 6: Boxplots of the accuracy and MPBA strength of the 16 active learning strategies, relative to passive learning, using the small datasets (glass, ionosphere, iris, sonar, wine, and wpbc).

The more positive the strength is, the better the heuristic/combiner is. Gray boxes represent individual heuristics, blue boxes represent bandit algorithms, and red boxes represent rank aggregation methods. A strategy that is above the zero line is better than passive learning. Each boxplot contains 10 trials. The accuracy score (A, C, E, G, I, and K) is a simple metric that counts the number of correct predictions. The MPBA score (B, D, F, H, J, and L), being the mean of the per-class recalls, gives an equal representation to each class. The boxes represent the quartiles and the whiskers extend to 1.5 times the interquartile range.

Figure 7: Boxplots of the accuracy and MPBA strength of the 16 active learning strategies, relative to passive learning, using medium to the large datasets (magic, miniboone, pageblocks, pima, sdss, and vehicle).

The more positive the strength is, the better the heuristic/combiner is. Gray boxes represent individual heuristics, blue boxes represent bandit algorithms, and red boxes represent rank aggregation methods. A strategy that is above the zero line is better than passive learning. Each boxplot contains 10 trials. The accuracy score (A, C, E, G, I, and K) is a simple metric that counts the number of correct predictions. The MPBA score (B, D, F, H, J, and L), being the mean of the per-class recalls, gives an equal representation to each class. The boxes represent the quartiles and the whiskers extend to 1.5 times the interquartile range.

Figure 8: Selected accuracy and MPBA learning curves for the small datasets (glass, ionosphere, iris, sonar, wine, and wpbc).

As it would get too cluttered to plot 17 learning curves, we only show the accuracy (A, C, E, G, I, and K) and MPBA (B, D, F, H, J, and L) learning curves for passive, confidence, exp3++, and borda. The learning curves are averaged over 10 trials. The dotted horizontal line shows the performance obtained from using the whole training data.

Figure 9: Selected accuracy and MPBA learning curves for the medium to large datasets (magic, miniboone, pageblocks, pima, sdss, and vehicle).

As it would get too cluttered to plot 17 learning curves, we only show the accuracy (A, C, E, G, I, and K) and MPBA (B, D, F, H, J, and L) learning curves for passive, confidence, exp3++, and borda. The learning curves are averaged over 10 trials. The dotted horizontal line shows the performance obtained from using the whole training data.

Active learning methods generally beat passive learning in four of the six small datasets—glass, ionosphere, iris, and wine. This can be seen by the fact that the boxplots are mostly above the zero line in Fig. 6. For sonar and wpbc, the results are mixed—active learning has little to no effect here. The wpbc dataset is particularly noisy—our classifier cannot achieve an MPBA score greater than 60% (see Fig. 8). Thus it is not surprising that active learning does not perform well here since there is not much to learn to begin with.

The advantage of active learning becomes more apparent with the larger datasets like magic, miniboone, pageblocks, and sdss. Here there is a visible gap between the passive learning curve and the active learning curve for most methods. For instance, using a simple heuristic such as confidence in the pageblocks dataset results in an average MPBA score of 74% after 1,000 examples, while passive learning only achieves 67% (see Fig. 9F).

Out of the eight active learning heuristics tested, the heuristics with the information density weighting (w-confidence, w-margin, and w-entropy) generally perform worse than the ones without the weighting. qbb-kl performs the best in pageblocks while it can barely beat passive learning in other datasets. The remaining heuristics—confidence, margin, entropy, and qbb-margin—perform equally well in all datasets.

We find no difference in performance between the bandit algorithms and the rank aggregation methods. Combining active learners does not seem to hurt the performance, even if we include a poorly performing heuristic such as qbb-kl.

For bandit algorithms, it is interesting to note that thompson favors certain heuristics a lot more than others, while the behavior of exp3++, ocucb, and klucb is almost indistinguishable from explore, where we explore 100% of the time (see Fig. 10). Changing the initial values of μ, σ2, and τ2 changes the order of preference slightly, but overall, which heuristics thompson picks seems to correlate with the heuristic performance. For example, as shown in Fig. 11, passive and qbb-kl tend to get chosen less often than others in the ionosphere dataset.

Figure 10: Selection frequencies of heuristics in thompson and exp3++, with the large datasets (magic, miniboone, pageblocks, pima, sdss, and vehicle).

The plots show how often each of the heuristics gets selected over time. The selection frequencies are averaged over 10 trials. thompson (A–F) favors certain heuristics more strongly than others. In contrast, exp3++ (G–L) favors uniform exploration more, sampling each heuristic with roughly equal weights. The plots for ocucb and klucb are not shown here, but they are similar to exp3++.

Figure 11: The effect of the initial values of the parameters in thompson on the heuristic selection frequencies.

We test 16 combinations of μ, σ2, and τ2 on the glass and ionosphere dataset. Which heuristics thompson picks seems to correlate with the heuristic performance. For example, in ionosphere, passive (the dotted purple line) and qbb-kl (the dashed dark blue line) tend to get picked less often than others.

Discussion

The experimental results allow us to answer the following questions:

  1. Can active learning beat passive learning? Yes, active learning can perform much better than passive learning, especially when the unlabeled pool is large (e.g., sdss, miniboone, pageblock). When the unlabeled pool is small, the effect of active learning becomes less apparent, as there are now fewer candidates to choose from. This can be seen in Fig. 12, where we show that artificially reducing the unlabeled pool results in a reduction in the final performance. At the same time, having a small test set also makes the gap between the active learning curve and the passive learning curve smaller (see Figs. 12C and 12F). This further contributes to the poorer performance on the smaller datasets. In any case, when a dataset is small, we can label everything so active learning is usually not needed.

  2. Can active learning degrade performance? Yes, there is no guarantee that active learning will always beat passive learning. For example, w-entropy actually slows down the learning in many datasets. However, this only happens with certain heuristics, like those using the information density weighting.

  3. What is the best single active learning heuristic? All of confidence, margin, entropy, and qbb-margin have a similar performance. However confidence is perhaps the simplest to compute and thus is a good default choice in practice.

  4. What are the challenges in using bandit algorithms?

    1. Designing a good reward scheme is difficult. This paper uses the increase in the classifier performance as the reward. However this type of reward is non-stationary (i.e., it gets smaller after each step as learning saturates) and the rewards will thus eventually go to zero.

    2. In practice, we do not have a representative test set that can be used to compute the reward. As a workaround, Hsu & Lin (2015) computed the reward on the training set and then used importance weighting to remove any potential bias. For this to work, we need to ensure that every training example and every active learning suggestion have a non-zero probability of being selected in each step.

    3. Finally, some bandit algorithms such as Thompson sampling assume that the reward follows a certain distribution (e.g., Gaussian). However, this assumption is unrealistic.

  5. What are the challenges in using rank aggregation algorithms?

    1. We need to compute the scores from all heuristics at every time step. This might not be feasible if there are too many heuristics or if we include heuristics that require a large amount of compute power (e.g., variance minimization).

    2. The Schulze method uses O(n²) space, where n is the number of candidates. This might lead to memory issues if we need to rank a large number of candidates from the unlabeled pool.

    3. Before aggregating the rankings, we throw away the score magnitudes, which could cause a loss of information.

    4. Unlike bandit algorithms, all of the rank aggregators always give each heuristic an equal weight.

  6. Which method should I use in practice to combine active learners? Since there is no difference in performance between the various combiners, we recommend using a simple rank aggregator like Borda count or the geometric mean if we do not want to select a heuristic a priori. Rank aggregators do not need a notion of a reward—we simply give all suggestions an equal weight when combining. Thus we neither need to keep a separate test set, nor do we need to worry about designing a good reward scheme.

Figure 12: Effect of the pool size on the learning curves. We pick two large datasets—pageblocks and sdss—to investigate how the size of the pool affects the performance.

(A) and (D) are the original learning curves from Figs. 9F and 9J (we only show the first 200 examples so that all figures have the same scale). For (B) and (E), we use the same test pool, but the unlabeled pool now only has a maximum of 300 candidates. Finally, for (C) and (F) the combined test pool and training pool have a size of 300.

Our investigation has a few limitations. Firstly, we empirically compare algorithms that only work with single-label classification problems. Nowadays, many problems require multi-label learning, in which each example is allowed to be in more than one class. Our methods can be extended to work with multi-label datasets with the following modifications. We first need a multi-label classifier. This can be as simple as a collection of binary classifiers, each of which produces the probability that an example belongs to a particular class. For each class, we can use an active learning heuristic to assign a score to each unlabeled example as before. However now we need to aggregate the scores among the classes. As suggested by Reyes, Morell & Ventura (2018), we can use any aggregation method like Borda count to combine these scores. In effect, the multi-label learning problem adds an extra layer of aggregation into the pipeline.

Another limitation is that our active learning methods are myopic. That is, in each iteration, we only pick one instance to give to a human expert for labeling. In many practical applications like astronomy, batch-mode active learning is preferred, as it is much more cost efficient to obtain multiple labels simultaneously. One naive extension is to simply choose the m highest-ranked objects using our current methods. However, it is possible to have two unlabeled objects whose class membership we are currently uncertain about, but because they have very similar feature vectors, labeling only one of them would allow us to predict the label of the other one easily. More sophisticated batch-mode active learning approaches have been proposed to take into account other factors such as the diversity of a batch and the representativeness of each batch example. These approaches include looking at the angles between hyperplanes in support vector machines (Brinker, 2003), using cluster analysis (Xu, Akella & Zhang, 2007), and using an evolutionary algorithm (Reyes & Ventura, 2018). How to aggregate suggestions from these approaches is an interesting problem for future work.

Conclusion

In this paper we compared 16 active learning methods with passive learning. Our three main findings are: active learning is better than passive learning; combining active learners does not in general degrade the performance; and social choice theory provides more practical algorithms than bandit theory since we do not need to design a reward scheme.

Appendix A: Posterior Balanced Accuracy

Most real-world datasets are unbalanced. In the SDSS dataset, for example, there are 4.5 times as many galaxies as quasars. The problem of class imbalance is even more severe in the pageblocks dataset, where one class makes up 90% of the data and the remaining four classes only make up 10%. An easy fix is to undersample the dominant class when creating the training and test sets. This, of course, means that the size of these sets is limited by the size of the minority class.

When we do not want to alter the underlying class distribution or when larger training and test sets are desired, we need a performance measure that can correct for the class imbalance. Brodersen et al. (2010) show that the posterior balanced accuracy distribution can overcome the bias in the binary case. We now extend this idea to the multi-class setting.

Suppose we have k classes. For each class i between 1 and k, there are Ni objects in the universe. Given a classifier, we can predict the label of every object and compare our prediction with the true label. Let Gi be the number of objects in class i that are correctly predicted. Then we define the recall Ai of class i as $A_i = \frac{G_i}{N_i}$ The problem is that it is not feasible to get the actual values of Gi and Ni since that would require us to obtain the true label of every object in the universe. Thus we need a method to estimate these quantities when we only have a sample. Initially we have no information about Gi and Ni, so we can assume that each Ai follows a uniform prior distribution between 0 and 1. This is the same as a Beta distribution with shape parameters α = β = 1: $A_i \sim \mathrm{Beta}(1, 1)$

The probability density function (PDF) of Ai is then $f_{A_i}(a) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} a^{\alpha - 1}(1 - a)^{\beta - 1} \propto a^{1 - 1}(1 - a)^{1 - 1}$ where Γ(α) is the gamma function.

After we have trained the classifier, suppose we have a test set containing ni objects in class i. Running the classifier on this test set is the same as conducting k binomial experiments, where, in the ith experiment, the sample size is ni and the probability of success is simply Ai. Let gi be the number of correctly labeled objects belonging to class i in the test set. Then, conditional on the recall rate Ai, gi follows a binomial distribution: $(g_i \mid A_i) \sim \mathrm{Bin}(n_i, A_i)$

The probability mass function of (gi | Ai = a) is thus $p_{g_i \mid A_i}(g_i) = \binom{n_i}{g_i} a^{g_i}(1 - a)^{n_i - g_i} \propto a^{g_i}(1 - a)^{n_i - g_i}$

In the Bayesian setting, the Beta distribution above is the prior and the binomial probability mass function is the likelihood. To get the posterior PDF, we simply multiply the prior by the likelihood: $f_{A_i \mid g}(a) \propto f_{A_i}(a) \times p_{g_i \mid A_i}(g_i) \propto a^{1-1}(1-a)^{1-1} \times a^{g_i}(1-a)^{n_i - g_i} = a^{(1 + g_i) - 1}(1 - a)^{(1 + n_i - g_i) - 1}$ Thus, with respect to the binomial likelihood function, the Beta distribution is conjugate to itself. The posterior recall rate Ai also follows a Beta distribution, now with parameters 1 + gi and 1 + ni − gi: $(A_i \mid g_i) \sim \mathrm{Beta}(1 + g_i, 1 + n_i - g_i)$ Our goal is to have a balanced accuracy rate, A, that puts an equal weight on each class. One way to achieve this is to take the average of the individual recalls: $A = \frac{1}{k}\sum_{i=1}^{k} A_i = \frac{1}{k} A_T$ Here we have defined AT to be the sum of the individual recalls. We call (A | g) the posterior balanced accuracy, where g = (g1, …, gk). Most of the time, we simply want to calculate its expected value: $E[A \mid g] = \frac{1}{k} E[A_T \mid g] = \frac{1}{k}\int a\, f_{A_T \mid g}(a)\, da$ Let us call this the MPBA. Note that there is no closed-form solution for the PDF $f_{A_T \mid g}(a)$. However, assuming that AT is a sum of k independent Beta random variables, $f_{A_T \mid g}(a)$ can be approximated by numerically convolving k Beta distributions. The independence assumption is reasonable here, since there should be little to no correlation between the individual recall rates. For example, knowing that a classifier is really good at recognizing stars does not tell us much about how well that classifier can recognize galaxies.

Having knowledge of $f_{A \mid g}(a)$ will allow us to make violin plots, construct confidence intervals, and do hypothesis tests. To get an expression for it, let us first rewrite the cumulative distribution function as $F_{A \mid g}(a) = P(A \le a \mid g) = P\left(\tfrac{1}{k} A_T \le a \,\middle|\, g\right) = P(A_T \le ka \mid g) = F_{A_T \mid g}(ka)$

Differentiating with respect to a, we obtain the PDF of (A | g): $f_{A \mid g}(a) = \frac{\partial}{\partial a} F_{A_T \mid g}(ka) = k\, f_{A_T \mid g}(ka)$ A Python implementation of the posterior balanced accuracy can be found on our GitHub repository (https://github.com/chengsoonong/mclass-sky).
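
As a rough sketch of this computation (assuming independent per-class recalls and using SciPy's Beta density on a regular grid), the MPBA and the posterior density of the balanced accuracy might be approximated as follows:

    import numpy as np
    from scipy.stats import beta

    def mpba(correct, totals):
        """Mean posterior balanced accuracy: the mean of E[Beta(1 + g_i, 1 + n_i - g_i)]
        over the k classes, where g_i is the number of correct predictions in class i."""
        correct, totals = np.asarray(correct), np.asarray(totals)
        return float(np.mean((1.0 + correct) / (2.0 + totals)))

    def posterior_balanced_accuracy_pdf(correct, totals, grid_size=2001):
        """Approximate the density of (A | g) by numerically convolving the k Beta posteriors."""
        a_grid = np.linspace(0.0, 1.0, grid_size)
        step = a_grid[1] - a_grid[0]
        pdf = None
        for g, n in zip(correct, totals):
            class_pdf = beta.pdf(a_grid, 1 + g, 1 + n - g)
            pdf = class_pdf if pdf is None else np.convolve(pdf, class_pdf) * step
        k = len(correct)
        # pdf now approximates the density of A_T on a grid over [0, k];
        # rescale to obtain the density of A = A_T / k.
        a_values = np.arange(len(pdf)) * step / k
        return a_values, pdf * k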

Note: The best possible performance in each trial is obtained by the higher of: (1) the performance achieved by using all the labeled examples in the training set; and (2) the maximum value of the learning curves of all the methods.