BrownBoost
BrownBoost is a boosting algorithm that is robust to noisy datasets. It is an adaptive version of the boost by majority algorithm. As with all boosting algorithms, BrownBoost is used in conjunction with other machine learning methods. It was introduced by Yoav Freund.[1]
Motivation
AdaBoost performs well on a variety of datasets; however, it can be shown that AdaBoost does not perform well on noisy datasets.[2] This is a result of AdaBoost's focus on examples that are repeatedly misclassified. In contrast, BrownBoost effectively "gives up" on examples that are repeatedly misclassified.
The user of the algorithm can set the amount of error to be tolerated in the training set. Thus, if the training set is noisy (say 10% of all examples are assumed to be mislabeled), the booster can be told to accept a 10% error rate. Since the noisy examples may be ignored, only the true examples will contribute to the learning process.
BrownBoost Learning Algorithm
BrownBoost is unique in that it uses a non-convex potential loss function, so it does not fit into the AnyBoost framework. The non-convex optimization provides a method to avoid overfitting noisy data sets. In contrast to AdaBoost, whose single parameter is the number of rounds of boosting, BrownBoost's analogous parameter is the amount of time the booster runs. This parameter is referred to as <math>c</math>.
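For intuition, the following short sketch (illustrative only; the values chosen for <math>c</math> and the time remaining <math>s</math> are arbitrary, and the weight and potential formulas are the ones defined in the algorithm below) contrasts BrownBoost's weighting, which decays for examples whose margin is far on the wrong side, with AdaBoost's exponential weighting, which grows without bound:

<syntaxhighlight lang="python">
from math import erf, exp, sqrt

c = 4.0   # time budget; an arbitrary illustrative value
s = 1.0   # time remaining; an arbitrary illustrative value

def brownboost_weight(r):
    """BrownBoost weight W(x) = exp(-(r + s)^2 / c) for an example with margin r."""
    return exp(-((r + s) ** 2) / c)

def brownboost_potential(z):
    """BrownBoost potential loss Phi(z) = 1 - erf(z / sqrt(c))."""
    return 1.0 - erf(z / sqrt(c))

def adaboost_weight(r):
    """AdaBoost's exponential weight exp(-r) keeps growing as the margin gets worse."""
    return exp(-r)

for r in (-6.0, -3.0, 0.0, 3.0):
    print(f"margin {r:+.1f}: BrownBoost weight {brownboost_weight(r):.4f}, "
          f"potential {brownboost_potential(r + s):.4f}, "
          f"AdaBoost weight {adaboost_weight(r):.1f}")
</syntaxhighlight>

An example with a very negative margin (one that has been misclassified many times) receives almost no BrownBoost weight, which is the sense in which the algorithm "gives up" on it, whereas its AdaBoost weight keeps growing.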
Input: <math>m</math> training examples <math>(x_{1},y_{1}),\ldots,(x_{m},y_{m})</math> where <math>x_{j} \in X,\, y_{j} \in Y = \{-1, +1\}</math>
Initialise: <math>s=c</math> (the time remaining), and <math>r_1(x_j) = 0</math> for all <math>j</math>.
While <math>s > 0</math>:
- Set the weights of each example: <math>W_{i}(x_j) = e^{- \frac{(r_i(x_j)+s)^2}{c}}</math>, where <math>r_i(x_j)</math> is the margin of example <math>x_j</math>
- Find a classifier <math>h_i : X \to \{-1,+1\}</math> such that <math>\sum_j W_i(x_j) h_i(x_j) y_j > 0</math>
- Find values <math>\alpha^*, t^*</math> that satisfy the equation
<math>\sum_j e^{-\frac{(r_i(x_j)+\alpha h_i(x_j) y_j + s - t)^2}{c}} h_i(x_j) y_j = 0</math>.
(Note this is similar to the condition <math>E_{W_{i+1}}[h_i(x_j) y_j]=0</math> set forth by Schapire and Singer.[3] A numerical sketch of this step is given after the algorithm.)
This update is subject to the constraint
<math>\sum_j \left(\Phi\left(r_i(x_j) + \alpha h_i(x_j) y_j + s - t\right) - \Phi\left(r_i(x_j) + s\right)\right) = 0</math>,
where <math>\Phi(z) = 1-\operatorname{erf}(z/\sqrt{c})</math> is the potential loss for a point with margin <math>r_i(x_j)</math>; in other words, the update must leave the total potential of the training set unchanged.
- Update the margins for each example: <math>r_{i+1}(x_j) = r_i(x_j) + \alpha h_i(x_j) y_j</math>
- Update the time remaining: <math>s = s - t</math>
Output: <math>H(x) = \textrm{sign}\left( \sum_i \alpha_{i} h_{i}(x) \right)</math>
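Below is a minimal, self-contained sketch of the loop above, assuming one-dimensional threshold stumps as the weak learner and SciPy's fsolve for the <math>(\alpha, t)</math> step; the helper names, the round cap, and the clamping of <math>t</math> are conveniences of this sketch rather than part of the algorithm as stated.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import fsolve
from scipy.special import erf

def train_brownboost(X, y, c=2.0, max_rounds=50):
    """Sketch of BrownBoost with 1-D threshold stumps (illustrative, not optimised)."""
    m = len(y)
    r = np.zeros(m)                       # margins r_i(x_j), initially 0
    s = c                                 # time remaining, initialised to c
    hypotheses = []                       # list of (alpha, threshold, sign) triples

    def phi(z):                           # potential loss Phi(z) = 1 - erf(z / sqrt(c))
        return 1.0 - erf(z / np.sqrt(c))

    for _ in range(max_rounds):           # round cap guards this sketch against stalling
        if s <= 1e-9:
            break

        # 1. Weight each example: W(x_j) = exp(-(r_j + s)^2 / c)
        w = np.exp(-((r + s) ** 2) / c)

        # 2. Weak learner: the stump with the largest positive weighted correlation
        best, best_corr = None, 0.0
        for thresh in np.unique(X):
            for sign in (+1, -1):
                h = np.where(X >= thresh, sign, -sign)
                corr = np.sum(w * h * y)
                if corr > best_corr:
                    best, best_corr = (thresh, sign), corr
        if best is None:                  # no hypothesis better than random: stop
            break
        thresh, sign = best
        h = np.where(X >= thresh, sign, -sign)

        # 3. Solve the pair of equations for (alpha, t)
        def equations(v):
            alpha, t = v
            z = r + alpha * h * y + s - t
            eq1 = np.sum(np.exp(-(z ** 2) / c) * h * y)   # new weights uncorrelated with h
            eq2 = np.sum(phi(z) - phi(r + s))             # total potential unchanged
            return [eq1, eq2]

        alpha, t = fsolve(equations, x0=[0.1, 0.1])
        t = float(np.clip(t, 0.0, s))     # sketch-level safeguard: spend at most the time left

        # 4. Update the margins and the time remaining
        r = r + alpha * h * y
        s = s - t
        hypotheses.append((alpha, thresh, sign))

    return hypotheses

def predict(hypotheses, X):
    """H(x) = sign(sum_i alpha_i h_i(x))."""
    score = np.zeros(len(X))
    for alpha, thresh, sign in hypotheses:
        score += alpha * np.where(X >= thresh, sign, -sign)
    return np.sign(score)

# Toy usage: a 1-D problem with 10% of the labels flipped
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, 200)
y = np.where(X > 0.0, 1, -1)
y[rng.random(200) < 0.1] *= -1
model = train_brownboost(X, y, c=2.0)
print("training error:", np.mean(predict(model, X) != y))
</syntaxhighlight>

Each round solves the two simultaneous conditions stated above: the re-weighted examples must be uncorrelated with the chosen weak hypothesis, and the total potential of the training set must be unchanged by the update.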
Empirical Results
In preliminary experiments on noisy datasets, BrownBoost achieved lower generalization error than AdaBoost; however, LogitBoost performed as well as BrownBoost.[4] An implementation of BrownBoost can be found in the open-source software JBoost.
References
- ↑ Yoav Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293–318, June 2001.
- ↑ Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–158, 2000.
- ↑ Robert Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297–336, 1999.
- ↑ Ross A. McDonald, David J. Hand, Idris A. Eckley. An Empirical Comparison of Three Boosting Algorithms on Real Data Sets with Artificial Class Noise. Multiple Classifier Systems, Lecture Notes in Computer Science, pages 35–44, 2003.