Feasibility Study of Automatic Article Importance Classification

In this notebook, we will utilize our existing article importance dataset in order to determine if it is feasible to train a machine learner to automatically classify article importance.

Importing Libraries and Data

# install.packages('data.table');
# install.packages('randomForest');
# install.packages('e1071');
# install.packages('gbm');

library(data.table);
library(randomForest);
library(e1071);
library(gbm);
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Loading required package: survival
Loading required package: lattice
Loading required package: splines
Loading required package: parallel
Loaded gbm 2.1.1

Due to memory constraints, the complete dataset of 3.5 million labels has been processed externally, and a training set of 25,000 labels has been extracted. The code used for the extraction is found further below; it draws a random sample of k articles from each of the five importance categories (in this case, k = 5,000).

## Read in the dataset and show the first rows to verify it was imported correctly.
## impdata = data.table(read.table(gzfile('article_stats.tsv.gz'),
##                                sep='\t', header=T, quote="", encoding="UTF-8",
##                     stringsAsFactors=FALSE));
## head(impdata);

## Read in our predefined training set, based on a random sample of 5,000 articles from
## each of the importance categories in the full dataset mentioned above.
training.set = data.table(read.table(gzfile('importance-trainingset.tsv.gz'),
                                    sep='\t', header=T, quote='', encoding='UTF-8',
                                    stringsAsFactors=FALSE));

Data Examination and Massaging

We are going to use a Random Forest classifier, which prefers having roughly the same number of labels per category (or "class"). Otherwise the underlying calculations for optimizing performance are skewed towards the larger classes and the forest does not perform as well. The code for sampling articles from each importance category is commented out, as the dataset we read in previously is already a sampled set of articles using this approach.
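As a small illustration of the per-category sampling described above, the following sketch draws k rows from each category of a made-up, unbalanced data frame. The `toy` data, its class sizes and the value of `k` are assumptions for demonstration only; the notebook's own sampling code operates on `impdata`.

```r
## Toy stand-in for the full dataset: unbalanced importance labels (made-up data).
set.seed(42);
importance.order = c('Unknown', 'Low', 'Mid', 'High', 'Top');
toy = data.frame(page_id = 1:1500,
                 max_importance = rep(importance.order,
                                      times = c(1000, 300, 120, 60, 20)),
                 stringsAsFactors = FALSE);

## Draw k rows at random from each importance category.
k = 10;
sampled_rows = unlist(lapply(importance.order, function(rating) {
  sample(which(toy$max_importance == rating), k);
}));
toy_sample = toy[sampled_rows, ];

## Every category now contributes exactly k rows.
table(toy_sample$max_importance);
```

Sampling fails if a category has fewer than k rows, so k has to be chosen no larger than the smallest category.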

## This is the code used to generate the already existing training set
## n_per_class = 5000;
## training.set = data.table();
## for(rating in importance.order) {
##   training.set = rbind(training.set,
##                       impdata[sample(which(impdata$max_importance == rating),
##                                      n_per_class)]);
##}
head(training.set);
length(training.set$page_id);
page_id page_title max_importance min_importance mean_importance inlinks direct_inlinks inlinks_from_redirects organic_inlinks organic_direct_inlinks organic_inlinks_from_redirects views direct_views views_from_redirects fold log_orginlinks log_views ordered_imp pred
17757592 Nordsjælland_Håndbold Unknown Unknown 5 41 37 4 28 23 5 209 167 42 1 4.857981 7.714246 Unknown High
11508678 Wayne_Robinson Unknown Unknown 5 92 92 0 0 20 0 225 225 0 2 0.000000 7.820179 Unknown Unknown
6474654 James_A._Reed_(entrepreneur) Unknown Unknown 5 23 4 19 9 3 6 0 0 0 3 3.321928 0.000000 Unknown Unknown
7569171 Rüdiger_Vogler Unknown Unknown 5 52 41 11 44 35 9 580 479 101 4 5.491853 9.182394 Unknown Unknown
37903008 Billotte Unknown Unknown 5 1 1 0 0 0 0 21 21 0 5 0.000000 4.459432 Unknown Low
9129749 Concepts_(album) Unknown Unknown 5 182 179 3 9 6 3 416 244 172 6 3.321928 8.703904 Unknown Unknown
25000

Some of our approaches perform better with Normally distributed data. We suspect that article views and number of inlinks follow a non-Normal distribution. Let's quickly check that assumption.

summary(training.set$organic_inlinks);
summary(training.set$views);
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     0.0      0.0     12.0    230.6     60.0 468400.0 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
      0.0     162.0     507.5    5202.0    2456.0 1107000.0 

In both cases the mean is far larger than the median, indicating a strongly right-skewed distribution and suggesting we should log-transform these variables before using them for training.

## Log-transform variables per comments above
training.set$log_orginlinks = log2(1 + training.set$organic_inlinks);
training.set$log_views = log2(1 + training.set$views);
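To make the effect of this transform concrete, here is a minimal sketch on made-up counts (the `toy_views` values are illustrative, not taken from the dataset). It shows why 1 is added before taking the logarithm and how the gap between mean and median shrinks.

```r
## Made-up, heavily right-skewed counts standing in for views or inlinks.
toy_views = c(0, 3, 12, 60, 500, 468400);

## On the raw scale one extreme value drags the mean far above the median.
mean(toy_views);
median(toy_views);

## log2(1 + x) maps 0 to 0 and compresses large values, so no -Inf appears.
toy_log_views = log2(1 + toy_views);
toy_log_views;

## The gap between mean and median is much smaller after the transform.
mean(toy_log_views) - median(toy_log_views);
```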

Classifier Training

In order to keep the code slightly cleaner, we define a variable containing the names of the two independent variables (the log-transformed number of inlinks and views). We also create an ordered column of importance categories in the training set, as R supports ordinal categorical variables. The pred column in the dataset is read in as a character-type column, so we similarly convert it into an ordinal categorical variable. Lastly, we define the number of folds we are using in our cross-validation.

## These are the columns that we think can be used to predict importance
importance_columns = c('log_orginlinks', 'log_views');

## Ordering the set of importance categories for prettier printout in some confusion
## matrices, and also in case we want to test Ordinal Logistic Regressions.
importance.order = c('Unknown', 'Low', 'Mid', 'High', 'Top');
training.set$ordered_imp = ordered(training.set$max_importance, importance.order);

## Also turn the "pred" column into an ordered factor, as it otherwise is not correct
training.set$pred = ordered(training.set$pred, importance.order);

## Number of folds we use for cross validation:
n_folds = 10;
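The following sketch shows what the conversion to an ordered factor does, using a handful of made-up labels (`toy_labels` is hypothetical). It is meant only to illustrate the behaviour of `ordered()`, not to reproduce the training set.

```r
importance.order = c('Unknown', 'Low', 'Mid', 'High', 'Top');

## Hypothetical character labels, as they would arrive from a TSV file.
toy_labels = c('Top', 'Low', 'Unknown', 'Mid', 'Low');
toy_ordered = ordered(toy_labels, importance.order);

## Levels follow importance.order rather than alphabetical order.
levels(toy_ordered);

## Ordinal comparisons are now meaningful.
toy_ordered > 'Low';

## Tables (and confusion matrices) list categories in the same order.
table(toy_ordered);
```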

We first train a modestly sized Random Forest Classifier on the whole training set to get an understanding of how well it performs on this problem.

imp_rfmodel = randomForest(x=training.set[,importance_columns, with=FALSE],
                           y=training.set$ordered_imp,
                           ntree=101);
imp_rfmodel;
Call:
 randomForest(x = training.set[, importance_columns, with = FALSE],      y = training.set$ordered_imp, ntree = 101) 
               Type of random forest: classification
                     Number of trees: 101
No. of variables tried at each split: 1

        OOB estimate of  error rate: 67.3%
Confusion matrix:
        Unknown  Low  Mid High  Top class.error
Unknown    1559 2046  717  421  257      0.6882
Low        1470 2157  690  392  291      0.5686
Mid        1126 1183 1067  917  707      0.7866
High        699  506  987 1369 1439      0.7262
Top         436  366  778 1398 2022      0.5956

Our first observation is the "OOB estimate of error rate", which is the proportion of training items that are misclassified by the trees that did not see them during training (the "out-of-bag" samples). This gives us an idea of the overall accuracy: an error rate of 67.3% corresponds to an accuracy of 32.7%. We then run the cross-validation to get a more robust estimate of the accuracy.

The Random Forest classifier's confusion matrix also reveals several trends. First, it has difficulty distinguishing between "Unknown" and "Low" importance. Second, "Mid" importance does not carry much signal and is instead spread across all five categories. Lastly, "High" and "Top" importance are also difficult to distinguish. This could suggest strategies for labelling that we can use in later research, e.g. looking into whether humans can meaningfully distinguish between "High" and "Top" importance.
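The error figures in the printout can be recomputed directly from the confusion matrix. The sketch below copies the matrix from the output above and derives the overall error rate and the per-class errors; the variable names are chosen here for illustration.

```r
importance.order = c('Unknown', 'Low', 'Mid', 'High', 'Top');

## Confusion matrix copied from the randomForest printout (rows = actual class).
conf_matrix = matrix(c(1559, 2046,  717,  421,  257,
                       1470, 2157,  690,  392,  291,
                       1126, 1183, 1067,  917,  707,
                        699,  506,  987, 1369, 1439,
                        436,  366,  778, 1398, 2022),
                     nrow = 5, byrow = TRUE,
                     dimnames = list(actual = importance.order,
                                     predicted = importance.order));

## Overall accuracy: correctly labelled articles (the diagonal) over all articles.
accuracy = sum(diag(conf_matrix)) / sum(conf_matrix);
round(100 * (1 - accuracy), 1);   # OOB error rate in percent

## Per-class error: share of each row that falls off the diagonal.
class_error = 1 - diag(conf_matrix) / rowSums(conf_matrix);
round(class_error, 4);
```

The diagonal sums to 8,174 of 25,000 articles, which gives the 32.7% accuracy quoted above.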

## Train a modestly sized Random Forest classifier, once for each fold...
set.seed(42);
for(i in 0:(n_folds-1)) {
  cur_fold = i;
  imp_rfmodel = randomForest(x=training.set[fold != cur_fold,
                                            importance_columns, with=FALSE],
                             y=training.set[fold != cur_fold]$ordered_imp,
                             xtest = training.set[fold == cur_fold,
                                                  importance_columns, with=FALSE],
                             ytest = training.set[fold == cur_fold]$ordered_imp,
                             ntree=101);
  training.set[fold == cur_fold, pred := imp_rfmodel$test$predicted];
  }
training.set[, pred := ordered(pred, importance.order)];
length(training.set[pred == ordered_imp]$page_id)/length(training.set$page_id);
page_id page_title max_importance min_importance mean_importance inlinks direct_inlinks inlinks_from_redirects organic_inlinks organic_direct_inlinks organic_inlinks_from_redirects views direct_views views_from_redirects fold log_orginlinks log_views ordered_imp pred
17757592 Nordsjælland_Håndbold Unknown Unknown 5 41 37 4 28 23 5 209 167 42 1 4.857981 7.714246 Unknown High
11508678 Wayne_Robinson Unknown Unknown 5 92 92 0 0 20 0 225 225 0 2 0.000000 7.820179 Unknown Unknown
6474654 James_A._Reed_(entrepreneur) Unknown Unknown 5 23 4 19 9 3 6 0 0 0 3 3.321928 0.000000 Unknown Unknown
7569171 Rüdiger_Vogler Unknown Unknown 5 52 41 11 44 35 9 580 479 101 4 5.491853 9.182394 Unknown Unknown
37903008 Billotte Unknown Unknown 5 1 1 0 0 0 0 21 21 0 5 0.000000 4.459432 Unknown Low
9129749 Concepts_(album) Unknown Unknown 5 182 179 3 9 6 3 416 244 172 6 3.321928 8.703904 Unknown Unknown
1502004 550_Music Unknown Unknown 5 151 132 19 137 120 17 797 662 135 7 7.108524 9.640245 Unknown Low
21877144 Scapular_of_St._Michael_the_Archangel Unknown Unknown 5 45 44 1 12 11 1 876 865 11 8 3.700440 9.776433 Unknown High
7562553 The_Best_of_British_£1_Notes Unknown Unknown 5 14 13 1 9 9 0 230 220 10 9 3.321928 7.851749 Unknown Unknown
9359317 All_the_Way_to_the_Sun Unknown Unknown 5 44 44 0 15 15 0 317 311 6 0 4.000000 8.312883 Unknown Unknown
29931748 Umbach Unknown Unknown 5 5 5 0 0 1 0 48 48 0 1 0.000000 5.614710 Unknown Low
13930025 Sadi_Gülçelik Unknown Unknown 5 13 13 0 9 9 0 181 168 13 2 3.321928 7.507795 Unknown Unknown
9660482 Tepid_Peppermint_Wonderland:_A_Retrospective Unknown Unknown 5 44 43 1 10 9 1 747 695 52 3 3.459432 9.546894 Unknown Top
17200458 Yank_Porter Unknown Unknown 5 13 13 0 5 5 0 114 113 1 4 2.584963 6.845490 Unknown Low
19831859 Janików,_Kozienice_County Unknown Unknown 5 48 48 0 4 4 0 90 85 5 5 2.321928 6.507795 Unknown Low
15513262 Văn_Lãng_District Unknown Unknown 5 118 115 3 13 12 1 235 156 79 6 3.807355 7.882643 Unknown Mid
5692414 Dugald_Baird Unknown Unknown 5 13 11 2 10 8 2 313 185 128 7 3.459432 8.294621 Unknown Low
34254403 2012_Houston_Texans_season Unknown Unknown 5 122 121 1 68 67 1 2904 2854 50 8 6.108524 11.504322 Unknown Mid
25903502 Educor Unknown Unknown 5 8 8 0 0 3 0 95 95 0 9 0.000000 6.584963 Unknown Unknown
1202612 Guy_Davenport Unknown Unknown 5 111 107 4 76 76 0 1123 1097 26 0 6.266787 10.134426 Unknown Mid
8238232 Johnny_Windhurst Unknown Unknown 5 12 12 0 0 5 0 104 104 0 1 0.000000 6.714246 Unknown Low
10356617 Alive_(P.O.D._song) Unknown Unknown 5 75 74 1 35 35 0 1856 1837 19 2 5.169925 10.858758 Unknown Top
38518471 Perkin_(surname) Unknown Unknown 5 2 2 0 0 1 0 47 47 0 3 0.000000 5.584963 Unknown Low
26576736 Matt_Stephens_(politician) Unknown Unknown 5 23 23 0 0 17 0 83 83 0 4 0.000000 6.392317 Unknown Unknown
30370091 Tomasz_Mateusiak Unknown Unknown 5 6 6 0 0 4 0 65 65 0 5 0.000000 6.044394 Unknown Low
43604881 Rodrigo_de_Osona Unknown Unknown 5 10 8 2 8 7 1 65 62 3 6 3.169925 6.044394 Unknown Unknown
35821018 Sailing_at_the_2012_Summer_Olympics_–_Finn Unknown Unknown 5 133 64 69 57 7 50 358 239 119 7 5.857981 8.487840 Unknown High
9592193 Colin_Frechter Unknown Unknown 5 22 22 0 0 18 0 275 275 0 8 0.000000 8.108524 Unknown Unknown
11311556 4th_Territorial_Army_Corps_(Romania) Unknown Unknown 5 31 19 12 23 13 10 390 261 129 9 4.584963 8.611025 Unknown Mid
27987341 Uruguayan_Spanish Unknown Unknown 5 63 63 0 0 12 0 816 816 0 0 0.000000 9.674192 Unknown Mid
43355228 Latin_music_(genre) Top Unknown 4.3333 8797 8797 0 0 37 0 849 849 0 1 0.000000 9.731319 Top Mid
343131 Kingdom_of_Mutapa Top Top 1.0000 234 123 111 150 66 84 9729 6916 2813 2 7.238405 13.248224 Top High
61899 Phloem Top Top 1.0000 350 337 13 227 219 8 17844 17429 415 3 7.832890 14.123232 Top High
75485 Electrical_discharge_machining Top Top 1.0000 251 221 30 104 84 20 20409 19151 1258 4 6.714246 14.316989 Top Top
1946381 Popover Top Unknown 3.0000 59 58 1 22 21 1 4230 4063 167 5 4.523562 12.046783 Top Top
227108 Leadership_development Top Top 1.0000 169 147 22 100 87 13 5068 4920 148 6 6.658211 12.307485 Top Top
9270045 Water_supply_and_sanitation_in_Brazil Top Unknown 3.0000 672 672 0 20 20 0 1868 1848 20 7 4.392317 10.868051 Top Top
7133037 Toll_roads_in_the_United_States Top Top 1.0000 32 30 2 25 23 2 1680 1635 45 8 4.700440 10.715104 Top Top
774820 List_of_Azerbaijanis Top Top 1.0000 225 207 18 42 36 6 2016 1705 311 9 5.426265 10.977995 Top High
15318 IPv6 Top Unknown 2.3333 1549 1481 68 801 762 39 96623 87490 9133 0 9.647458 16.560094 Top Top
229104 Matter_wave Top Top 1.0000 375 188 187 220 94 126 24648 20355 4293 1 7.787903 14.589241 Top High
13113498 Style_(fiction) Top Top 1.0000 208 196 12 62 55 7 3429 3186 243 2 5.977280 11.743993 Top High
365451 Itamar_Franco Top Unknown 3.6667 168 167 1 90 89 1 1947 1928 19 3 6.507795 10.927778 Top High
253038 Northern_Mindanao Top Top 1.0000 531 517 14 248 233 15 7629 7516 113 4 7.960002 12.897467 Top Top
840510 Sokoto_Caliphate Top Unknown 1.8000 502 367 135 381 322 59 7563 6294 1269 5 8.577429 12.884934 Top Top
262861 Zeroth_law_of_thermodynamics Top Top 1.0000 264 252 12 57 54 3 13054 12626 428 6 5.857981 13.672315 Top High
4415837 Roman_Question Top Unknown 1.8000 144 114 30 82 61 21 2378 1770 608 7 6.375039 11.216140 Top High
3972111 Prime_Minister_of_Haiti Top Top 1.0000 214 147 67 132 85 47 1513 1056 457 8 7.055282 10.564149 Top High
226808 Chinese_Buddhism Top Top 1.0000 2723 2390 333 480 345 135 64153 57157 6996 9 8.909893 15.969252 Top High
31464496 Indian_black_money Top Unknown 3.4000 136 123 13 29 25 4 17751 17340 411 0 4.906891 14.115694 Top Mid
1963578 Will_Champion Top Unknown 3.0000 246 246 0 0 139 0 8215 8215 0 1 0.000000 13.004220 Top High
183525 Guangxi Top Top 1.0000 2635 2335 300 2026 1815 211 16643 13743 2900 2 10.985130 14.022715 Top Top
12118871 Florida_State_University_College_of_Social_Work Top Top 1.0000 80 80 0 6 6 0 186 183 3 3 2.807355 7.546894 Top Unknown
214486 States_of_Austria Top Top 1.0000 3150 3101 49 499 471 28 6720 6062 658 4 8.965784 12.714460 Top Top
241164 Kukës Top Top 1.0000 229 208 21 112 100 12 1847 1594 253 5 6.820179 10.851749 Top High
376545 Cathedral_of_Christ_the_Saviour Top Unknown 3.0000 398 280 118 242 152 90 9281 6876 2405 6 7.924813 13.180220 Top High
41976692 Puerto_Rican_general_election,_1928 Top Top 1.0000 76 76 0 0 5 0 59 59 0 7 0.000000 5.906891 Top Low
14533 India Top Top 1.0000 163458 161379 2079 114949 114274 675 756592 747796 8796 8 16.810647 19.529158 Top Top
22783445 Transformers_(film_series) Top Top 1.0000 881 868 13 168 158 10 108724 106112 2612 9 7.400879 16.730324 Top Top
256305 Santa_Catarina_(state) Top Top 1.0000 1793 1704 89 1451 1375 76 6673 5745 928 0 10.503826 12.704336 Top High
0.32636

Overall, the RF approach has 32.6% accuracy. We have tested larger sizes of forests, and they only achieve a modest improvement. This appears to be a classification problem where Random Forests are not the best solution, if we compare to the SVM and GBM results below.

We next build our two regression models. First, we run a standard least-squares linear regression. Secondly, we run a Random Forest Regression, which is a regression technique built on how random forests work, and one that works well for non-linear problems. In both cases we use the mean_importance column as our dependent variable, which is based on converting importance to a simple linear scale where 5 is "Unknown" and 1 is "Top".
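As a rough sketch of how such a numeric scale might work, the snippet below maps importance labels to numbers and averages them for one hypothetical article. Only the endpoints (5 for "Unknown", 1 for "Top") come from the text; the intermediate values and the averaging over an article's ratings are assumptions made here.

```r
## Numeric scale: the endpoints are from the text, the middle values are assumed.
importance_scale = c(Top = 1, High = 2, Mid = 3, Low = 4, Unknown = 5);

## Hypothetical article with three separate importance ratings (made-up).
toy_ratings = c('Top', 'Mid', 'Unknown');

## Assumed aggregation: average the numeric codes of all ratings.
toy_mean_importance = mean(importance_scale[toy_ratings]);
toy_mean_importance;
```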

## Build a linear model for reference
imp_lmodel = lm(mean_importance ~ log_orginlinks + log_views, data=training.set);
summary(imp_lmodel);

## Train a modestly sized Random Forest regression model
set.seed(42);
imp_rfregression = randomForest(x=training.set[, importance_columns, with=FALSE],
                               y=training.set$mean_importance,
                               ntree=101);
imp_rfregression;
Call:
lm(formula = mean_importance ~ log_orginlinks + log_views, data = training.set)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2921 -0.7404  0.0974  0.8506  3.5660 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.292078   0.029453  179.68   <2e-16 ***
log_orginlinks -0.078351   0.003250  -24.11   <2e-16 ***
log_views      -0.178428   0.003856  -46.27   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.111 on 24997 degrees of freedom
Multiple R-squared:  0.2887,	Adjusted R-squared:  0.2886 
F-statistic:  5072 on 2 and 24997 DF,  p-value: < 2.2e-16
Call:
 randomForest(x = training.set[, importance_columns, with = FALSE],      y = training.set$mean_importance, ntree = 101) 
               Type of random forest: regression
                     Number of trees: 101
No. of variables tried at each split: 1

          Mean of squared residuals: 1.280909
                    % Var explained: 26.15

We can see from our linear model that about 28.9% of the variance is explained (adjusted R-squared of 0.2886), suggesting that there is a reasonably linear relationship between these variables. The Random Forest regression explains a similar share of the variance (26.15%). Both coefficients are negative, and because lower values of mean_importance indicate higher importance, we find that the more important articles have higher numbers of inlinks and views. This finding corresponds to existing research on importance.
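To see what the fitted coefficients imply, the sketch below plugs them into the linear equation by hand. The coefficients are copied from the `summary(imp_lmodel)` output; the helper name `predict_importance` and the example inputs are made up for illustration.

```r
## Coefficients copied from the summary(imp_lmodel) output.
coefs = c(intercept = 5.292078, log_orginlinks = -0.078351, log_views = -0.178428);

## Helper (name chosen here) that applies the fitted linear equation.
predict_importance = function(organic_inlinks, views) {
  unname(coefs['intercept'] +
         coefs['log_orginlinks'] * log2(1 + organic_inlinks) +
         coefs['log_views'] * log2(1 + views));
}

## With no inlinks and no views the prediction equals the intercept,
## which lies just above "Unknown" (5).
predict_importance(0, 0);

## More inlinks and views push the prediction towards "Top" (1).
predict_importance(100, 5000);
```

Because the model is unbounded, predictions for extreme inputs can fall outside the 1 to 5 range of the scale.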

Next we train a Support Vector Machine classifier. This has built-in support for cross-validation, so we do not need any additional code for that. The cost and gamma parameters are set to fixed values recommended in the R literature rather than tuned. Further performance improvements can possibly be achieved by exploring these parameters, but that is outside the scope of this study.

## Train an SVM classifier and run 10-fold cross-validation
imp_svm = svm(ordered_imp ~ log_orginlinks + log_views, data=training.set,
             cost = 100, gamma = 1, cross = 10);
summary(imp_svm);
Call:
svm(formula = ordered_imp ~ log_orginlinks + log_views, data = training.set, 
    cost = 100, gamma = 1, cross = 10)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  100 
      gamma:  1 

Number of Support Vectors:  24308

 ( 4952 4940 4984 4998 4434 )


Number of Classes:  5 

Levels: 
 Unknown Low Mid High Top

10-fold cross-validation on training data:

Total Accuracy: 34.704 
Single Accuracies:
 34.68 34.32 38.16 34.96 33.16 33.88 34.24 33.6 34.2 35.84 


The SVM approach with this untuned configuration has 34.7% accuracy. This is slightly better than the Random Forest approach described earlier.
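The total accuracy reported above is the mean of the ten per-fold accuracies, which the following sketch verifies using the numbers from the printout. The chance baseline is added here as a reference point and assumes a perfectly balanced five-class set, as in this training set.

```r
## Per-fold accuracies (percent) copied from the summary(imp_svm) output.
fold_accuracy = c(34.68, 34.32, 38.16, 34.96, 33.16,
                  33.88, 34.24, 33.6, 34.2, 35.84);

## With equally sized folds, the total accuracy is the mean over folds.
mean(fold_accuracy);

## Spread across folds gives a rough sense of how stable the estimate is.
sd(fold_accuracy);

## Reference point: always guessing one class in a balanced five-class set.
chance_accuracy = 100 / 5;
mean(fold_accuracy) - chance_accuracy;
```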

The final approach is a Gradient Boosting Machine (GBM) classifier. This is similar to a Random Forest in that it is an ensemble of weak learners (decision trees). In the code below we use only a small number of trees (set through n_trees) in order to not run out of memory. The code also shows how to train a classifier with 10,000 trees and use the library's functionality to determine a sensible compromise between the number of trees (and subsequent memory usage) and overall performance. Based on the plot from gbm.perf, a model with 3,000 trees would be preferred.
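The cross-validation code below turns per-class scores into a single predicted category by taking the highest-scoring class for each article. The sketch here shows that step in isolation on a made-up score matrix (`toy_scores` is hypothetical), assuming the columns follow the order of the factor levels.

```r
importance.order = c('Unknown', 'Low', 'Mid', 'High', 'Top');

## Made-up per-class scores for three articles (rows) and five classes (columns).
toy_scores = matrix(c(0.40, 0.30, 0.15, 0.10, 0.05,
                      0.10, 0.15, 0.20, 0.25, 0.30,
                      0.20, 0.20, 0.35, 0.15, 0.10),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(NULL, importance.order));

## which.max() returns the column index of the largest score in each row...
best_index = apply(toy_scores, 1, which.max);

## ...which is then mapped back to a label and turned into an ordered factor.
toy_pred = ordered(importance.order[best_index], importance.order);
toy_pred;
```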

## Train a GBM classifier with 10k trees and call gbm.perf() to get a plot of the error curve.
## This code is commented out so that PAWS doesn't run out of memory.
## imp_gbm = gbm(ordered_imp ~ log_orginlinks + log_views,
##              data=training.set,
##              distribution='multinomial', n.trees=10000)
## gbm.perf(imp_gbm);

## Run 10-fold cross-validation with GBM using a small number of trees (n_trees),
## similarly to what we did for the Random Forest classifier above. Then calculate
## the overall accuracy.
n_trees = 50;
for(i in 0:(n_folds-1)) {
  cur_fold = i;
  imp_gbm = gbm(ordered_imp ~ log_orginlinks + log_views,
                data=training.set[fold != cur_fold], n.trees=n_trees,
                distribution='multinomial', cv.folds=0);
  preds = predict(imp_gbm,
                  training.set[fold == cur_fold],
                  n.trees = n_trees);
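  ## predict() returns one score per class; pick the highest-scoring class and map
  ## its index back to the importance label.
  preds = importance.order[apply(preds, 1, which.max)];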
  training.set[fold == cur_fold, pred := preds];
}
training.set[, pred := ordered(pred, importance.order)];
length(training.set[pred == ordered_imp]$page_id)/length(training.set$page_id);
Warning message in if (nrow(x) != ifelse(class(y) == "Surv", nrow(y), length(y))) {:
“the condition has length > 1 and only the first element will be used”
page_id page_title max_importance min_importance mean_importance inlinks direct_inlinks inlinks_from_redirects organic_inlinks organic_direct_inlinks organic_inlinks_from_redirects views direct_views views_from_redirects fold log_orginlinks log_views ordered_imp pred
17757592 Nordsjælland_Håndbold Unknown Unknown 5 41 37 4 28 23 5 209 167 42 1 4.857981 7.714246 Unknown Low
11508678 Wayne_Robinson Unknown Unknown 5 92 92 0 0 20 0 225 225 0 2 0.000000 7.820179 Unknown Low
6474654 James_A._Reed_(entrepreneur) Unknown Unknown 5 23 4 19 9 3 6 0 0 0 3 3.321928 0.000000 Unknown Low
7569171 Rüdiger_Vogler Unknown Unknown 5 52 41 11 44 35 9 580 479 101 4 5.491853 9.182394 Unknown Mid
37903008 Billotte Unknown Unknown 5 1 1 0 0 0 0 21 21 0 5 0.000000 4.459432 Unknown Low
9129749 Concepts_(album) Unknown Unknown 5 182 179 3 9 6 3 416 244 172 6 3.321928 8.703904 Unknown Unknown
1502004 550_Music Unknown Unknown 5 151 132 19 137 120 17 797 662 135 7 7.108524 9.640245 Unknown High
21877144 Scapular_of_St._Michael_the_Archangel Unknown Unknown 5 45 44 1 12 11 1 876 865 11 8 3.700440 9.776433 Unknown High
7562553 The_Best_of_British_£1_Notes Unknown Unknown 5 14 13 1 9 9 0 230 220 10 9 3.321928 7.851749 Unknown Low
9359317 All_the_Way_to_the_Sun Unknown Unknown 5 44 44 0 15 15 0 317 311 6 0 4.000000 8.312883 Unknown Unknown
29931748 Umbach Unknown Unknown 5 5 5 0 0 1 0 48 48 0 1 0.000000 5.614710 Unknown Low
13930025 Sadi_Gülçelik Unknown Unknown 5 13 13 0 9 9 0 181 168 13 2 3.321928 7.507795 Unknown Low
9660482 Tepid_Peppermint_Wonderland:_A_Retrospective Unknown Unknown 5 44 43 1 10 9 1 747 695 52 3 3.459432 9.546894 Unknown High
17200458 Yank_Porter Unknown Unknown 5 13 13 0 5 5 0 114 113 1 4 2.584963 6.845490 Unknown Low
19831859 Janików,_Kozienice_County Unknown Unknown 5 48 48 0 4 4 0 90 85 5 5 2.321928 6.507795 Unknown Low
15513262 Văn_Lãng_District Unknown Unknown 5 118 115 3 13 12 1 235 156 79 6 3.807355 7.882643 Unknown Low
5692414 Dugald_Baird Unknown Unknown 5 13 11 2 10 8 2 313 185 128 7 3.459432 8.294621 Unknown Unknown
34254403 2012_Houston_Texans_season Unknown Unknown 5 122 121 1 68 67 1 2904 2854 50 8 6.108524 11.504322 Unknown Top
25903502 Educor Unknown Unknown 5 8 8 0 0 3 0 95 95 0 9 0.000000 6.584963 Unknown Low
1202612 Guy_Davenport Unknown Unknown 5 111 107 4 76 76 0 1123 1097 26 0 6.266787 10.134426 Unknown High
8238232 Johnny_Windhurst Unknown Unknown 5 12 12 0 0 5 0 104 104 0 1 0.000000 6.714246 Unknown Low
10356617 Alive_(P.O.D._song) Unknown Unknown 5 75 74 1 35 35 0 1856 1837 19 2 5.169925 10.858758 Unknown Top
38518471 Perkin_(surname) Unknown Unknown 5 2 2 0 0 1 0 47 47 0 3 0.000000 5.584963 Unknown Low
26576736 Matt_Stephens_(politician) Unknown Unknown 5 23 23 0 0 17 0 83 83 0 4 0.000000 6.392317 Unknown Low
30370091 Tomasz_Mateusiak Unknown Unknown 5 6 6 0 0 4 0 65 65 0 5 0.000000 6.044394 Unknown Low
43604881 Rodrigo_de_Osona Unknown Unknown 5 10 8 2 8 7 1 65 62 3 6 3.169925 6.044394 Unknown Low
35821018 Sailing_at_the_2012_Summer_Olympics_–_Finn Unknown Unknown 5 133 64 69 57 7 50 358 239 119 7 5.857981 8.487840 Unknown Low
9592193 Colin_Frechter Unknown Unknown 5 22 22 0 0 18 0 275 275 0 8 0.000000 8.108524 Unknown Unknown
11311556 4th_Territorial_Army_Corps_(Romania) Unknown Unknown 5 31 19 12 23 13 10 390 261 129 9 4.584963 8.611025 Unknown Unknown
27987341 Uruguayan_Spanish Unknown Unknown 5 63 63 0 0 12 0 816 816 0 0 0.000000 9.674192 Unknown High
43355228 Latin_music_(genre) Top Unknown 4.3333 8797 8797 0 0 37 0 849 849 0 1 0.000000 9.731319 Top High
343131 Kingdom_of_Mutapa Top Top 1.0000 234 123 111 150 66 84 9729 6916 2813 2 7.238405 13.248224 Top Top
61899 Phloem Top Top 1.0000 350 337 13 227 219 8 17844 17429 415 3 7.832890 14.123232 Top Top
75485 Electrical_discharge_machining Top Top 1.0000 251 221 30 104 84 20 20409 19151 1258 4 6.714246 14.316989 Top Top
1946381 Popover Top Unknown 3.0000 59 58 1 22 21 1 4230 4063 167 5 4.523562 12.046783 Top Top
227108 Leadership_development Top Top 1.0000 169 147 22 100 87 13 5068 4920 148 6 6.658211 12.307485 Top Top
9270045 Water_supply_and_sanitation_in_Brazil Top Unknown 3.0000 672 672 0 20 20 0 1868 1848 20 7 4.392317 10.868051 Top Top
7133037 Toll_roads_in_the_United_States Top Top 1.0000 32 30 2 25 23 2 1680 1635 45 8 4.700440 10.715104 Top Top
774820 List_of_Azerbaijanis Top Top 1.0000 225 207 18 42 36 6 2016 1705 311 9 5.426265 10.977995 Top Top
15318 IPv6 Top Unknown 2.3333 1549 1481 68 801 762 39 96623 87490 9133 0 9.647458 16.560094 Top Top
229104 Matter_wave Top Top 1.0000 375 188 187 220 94 126 24648 20355 4293 1 7.787903 14.589241 Top Top
13113498 Style_(fiction) Top Top 1.0000 208 196 12 62 55 7 3429 3186 243 2 5.977280 11.743993 Top Top
365451 Itamar_Franco Top Unknown 3.6667 168 167 1 90 89 1 1947 1928 19 3 6.507795 10.927778 Top Top
253038 Northern_Mindanao Top Top 1.0000 531 517 14 248 233 15 7629 7516 113 4 7.960002 12.897467 Top Top
840510 Sokoto_Caliphate Top Unknown 1.8000 502 367 135 381 322 59 7563 6294 1269 5 8.577429 12.884934 Top Top
262861 Zeroth_law_of_thermodynamics Top Top 1.0000 264 252 12 57 54 3 13054 12626 428 6 5.857981 13.672315 Top Top
4415837 Roman_Question Top Unknown 1.8000 144 114 30 82 61 21 2378 1770 608 7 6.375039 11.216140 Top Top
3972111 Prime_Minister_of_Haiti Top Top 1.0000 214 147 67 132 85 47 1513 1056 457 8 7.055282 10.564149 Top Top
226808 Chinese_Buddhism Top Top 1.0000 2723 2390 333 480 345 135 64153 57157 6996 9 8.909893 15.969252 Top Top
31464496 Indian_black_money Top Unknown 3.4000 136 123 13 29 25 4 17751 17340 411 0 4.906891 14.115694 Top Top
1963578 Will_Champion Top Unknown 3.0000 246 246 0 0 139 0 8215 8215 0 1 0.000000 13.004220 Top Top
183525 Guangxi Top Top 1.0000 2635 2335 300 2026 1815 211 16643 13743 2900 2 10.985130 14.022715 Top Top
12118871 Florida_State_University_College_of_Social_Work Top Top 1.0000 80 80 0 6 6 0 186 183 3 3 2.807355 7.546894 Top Low
214486 States_of_Austria Top Top 1.0000 3150 3101 49 499 471 28 6720 6062 658 4 8.965784 12.714460 Top Top
241164 Kukës Top Top 1.0000 229 208 21 112 100 12 1847 1594 253 5 6.820179 10.851749 Top Top
376545 Cathedral_of_Christ_the_Saviour Top Unknown 3.0000 398 280 118 242 152 90 9281 6876 2405 6 7.924813 13.180220 Top Top
41976692 Puerto_Rican_general_election,_1928 Top Top 1.0000 76 76 0 0 5 0 59 59 0 7 0.000000 5.906891 Top Low
14533 India Top Top 1.0000 163458 161379 2079 114949 114274 675 756592 747796 8796 8 16.810647 19.529158 Top Top
22783445 Transformers_(film_series) Top Top 1.0000 881 868 13 168 158 10 108724 106112 2612 9 7.400879 16.730324 Top Top
256305 Santa_Catarina_(state) Top Top 1.0000 1793 1704 89 1451 1375 76 6673 5745 928 0 10.503826 12.704336 Top Top
0.3326

The GBM reports an overall accuracy of 33.26%. This is slightly lower than the SVM classifier, which is likely due to the limited number of trees. When run with 3,000 trees as mentioned above, the two approaches perform comparably.