Xgboost is short for eXtreme Gradient Boosting and is currently one of the best machine learning packages available for R. It is fast, memory efficient, can handle huge amounts of data and has won several Kaggle competitions.
Installation
Xgboost is now on CRAN and can be installed from there.
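If it is not installed yet, run:
install.packages("xgboost")
Then we load it: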
library(xgboost)
The data
As an example data set we use the Adult data from the arules package:
library(arules)
data("AdultUCI")
We also load the caret package for the createDataPartition() function we use later.
library(caret)
This data set is quite large, with over 48000 observations of 15 variables. Unfortunately it contains a large number of missing values, so we have to drop the affected observations. We delete all rows with missing values:
data <- AdultUCI[complete.cases(AdultUCI),]
We still have 30162 observations left. The variable we want to predict is income, which can take two values, small and large, so we are facing a binary classification problem.
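To get a feeling for the class balance, we can, for example, tabulate the target variable:
table(data$income)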
A simple prediction model
We split our data into a training and a test set. Xgboost uses its own data structure called xgb.DMatrix. R's matrix structure also works, but xgb.DMatrix is preferred.
set.seed(41)
idx <- createDataPartition(data$income, p = 0.85, list = FALSE, times = 1)
labels <- seq_along(levels(data$income))[data$income] - 1 # save the labels as 0/1 integers
data$income <- NULL # now delete them
train <- model.matrix(~ . + 0, data[idx,]) # the training set
test <- model.matrix(~ . + 0, data[-idx,]) # the test set
dtrain <- xgb.DMatrix(data = train, label = labels[idx])
dtest <- xgb.DMatrix(data = test, label = labels[-idx])
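As a quick sanity check, we can look at the dimensions of the two sets; train should contain roughly 85% of the 30162 complete observations:
dim(train)
dim(test)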
We then set the Xgboost parameters; note that we use auc as the evaluation metric. Xgboost has some built-in evaluation metrics, which is also the reason why we specified the label parameter for dtest. Usually this information is not available for the test set and can be omitted. For the actual evaluation we will later use the ROCR package, so we keep the label here only for cosmetic reasons.
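For example, if the test labels were unknown, the DMatrix could simply be built without them (the object name below is only illustrative and is not used later):
dtest_unlabelled <- xgb.DMatrix(data = test) # label argument omitted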
param <- list(objective="binary:logistic", eta=0.01,
min_child_weight=4, subsample=0.85, max_delta_step=3,
colsample_bytree=0.3, max_depth=3, scale_pos_weight=1, eval_metric="auc")
num_rounds <- 5000
More information on these parameters can be found in the Xgboost parameter documentation.
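The R help page of xgb.train is another place to look them up:
?xgb.train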
Finally, we can train our model:
mod <- xgb.train(params = param, data = dtrain, nrounds = num_rounds)
We then predict on the test set:
pred <- predict(mod, dtest)
To evaluate the AUROC we use the ROCR package:
library(ROCR)
auc.tmp <- prediction(pred, labels[-idx])
auc <- performance(auc.tmp, "auc")
auc@y.values
## [[1]]
## [1] 0.9302011
We already get a pretty good AUC value of 0.9302011. Here is the ROC curve:
roc <- performance(auc.tmp, "tpr", "fpr")
plot(roc, col = rainbow(10))
A more advanced model
The main problem in boosting is deciding when to stop the boosting iterations; too many iterations can lead to overfitting. xgb.train can follow the progress of the learning process after each round, so if we provide a second, already labelled data set, Xgboost can evaluate the model on it and find the ideal stopping iteration.
To do this, we split the training data set again, into a new (smaller) training set and a validation set. Before, we had 25638 observations in our training set. We now use the first 2000 observations of this set for early stopping:
valid <- train[1:2000,]
dvalid <- xgb.DMatrix(valid, label = labels[idx[1:2000]])
train2 <- train[-(1:2000),]
dtrain2 <- xgb.DMatrix(train2, label = labels[idx[-(1:2000)]])
In addition, we have to specify a watchlist:
watchlist <- list("val"=dvalid, "train"=dtrain2)
Then we train the new model. Because of early stopping we can safely increase the number of boosting iterations:
num_rounds <- 8000
mod2 <- xgb.train(param, dtrain2, nthread=2, nrounds = num_rounds, watchlist = watchlist,
early.stop.round = 200, maximize = TRUE)
We predict again:
pred2 <- predict(mod2, dtest)
We compute the AUROC again:
auc.tmp <- prediction(pred2, labels[-idx])
auc <- performance(auc.tmp, "auc")
auc@y.values
## [[1]]
## [1] 0.930055
We now get an AUC value of 0.930055. This is not an improvement; it is even slightly worse than our prediction from above. Possible reasons are:
- The AUC is already very high. Remember, a perfect value would be 1.0, and we are already close to that.
- We lose some important information in the training set because we hold out 2000 observations for early stopping.
To address the second point, we now build 10 new models with different training/validation splits.
Let’s start from the beginning, otherwise our code will soon become very ugly. idx contains all indices of the training data. We split train into 10 parts (10 folds), so each validation part has roughly 2563 observations. This is easy with the createFolds() function in the caret package:
folds <- createFolds(labels[idx], k = 10)
fmat <- do.call(cbind, folds)
As a result, we get a list of length 10 that holds the required indices of each fold. I converted the folds to a matrix fmat for easier indexing (because the number of observations is not evenly divisible by 10, some columns contain one recycled observation).
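To see the slightly unequal fold sizes that cause this recycling, we can check, for instance:
sapply(folds, length)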
Let’s build our 10 new training and validation sets:
for(i in 1:10) {
tr <- train[as.vector(fmat[,-i]),]
v <- train[as.vector(fmat[,i]),]
eval(parse(text=paste0("dval",i,"<-", "xgb.DMatrix(v, label = labels[idx[as.vector(fmat[,i])]])")))
eval(parse(text=paste0("dtr",i,"<-", "xgb.DMatrix(tr, label = labels[idx[as.vector(fmat[,-i])]])")))
}
We have to build 10 watchlists, too:
for(k in 1:10) {
eval(parse(text=paste0("watchlist",k,"<-list('train'=dtr",k,",'val'=dval",k,")")))
}
Then we train all models:
for(n in 1:10) {
eval(parse(text=paste0("m",n,"<-xgb.train(param, dtr",n,",nthread=2, nrounds = num_rounds, watchlist = watchlist",n,",early.stop.round = 200, maximize = TRUE)")))
}
And predict on the test set:
for(m in 1:10) {
eval(parse(text=paste0("pr",m,"<-predict(m",m,",dtest)")))
}
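As an aside, the same per-fold workflow could also be written without eval(parse()) by keeping everything in lists. The following is only a sketch of such an alternative (the names tr_idx, val_idx, models and preds are introduced here for illustration), not the code that produced the results below:
models <- vector("list", 10)
preds <- vector("list", 10)
for (i in 1:10) {
  tr_idx <- as.vector(fmat[, -i])   # training rows of this fold
  val_idx <- as.vector(fmat[, i])   # validation rows of this fold
  dtr <- xgb.DMatrix(train[tr_idx, ], label = labels[idx[tr_idx]])
  dval <- xgb.DMatrix(train[val_idx, ], label = labels[idx[val_idx]])
  models[[i]] <- xgb.train(param, dtr, nthread = 2, nrounds = num_rounds,
                           watchlist = list("train" = dtr, "val" = dval),
                           early.stop.round = 200, maximize = TRUE)
  preds[[i]] <- predict(models[[i]], dtest)
}
# Reduce(`+`, preds) would then give the same summed prediction as below.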
Because we use AUC as our metric, which only depends on the ranking of the predictions, we can simply add up all predictions (this is equivalent to averaging them) and then evaluate the sum:
finalpred <- pr1 + pr2 + pr3 + pr4 + pr5 + pr6 + pr7 + pr8 + pr9 + pr10
auc.tmp <- prediction(finalpred, labels[-idx])
auc <- performance(auc.tmp, "auc")
auc@y.values
## [[1]]
## [1] 0.9298569
We still could not improve the score, but I hope you get the idea. Be sure to check out Xgboost’s GitHub page, where you can find more examples.