AutoML full workflow with tidymodels

krz · June 29, 2023

AutoML is the process of automating the tasks of applying machine learning to real-world problems. I often use this approach as it requires very little investment in tuning models and the results are usually quite competitive. The following is a notebook for a recent Kaggle competition.

suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(h2o))
suppressMessages(library(agua))
tidymodels_prefer()

Overview

In this notebook I use automated machine learning from the h2o package, wrapped in a tidymodels workflow for preprocessing.

H2O AutoML trains a random grid of algorithms such as GBMs, DNNs, and GLMs over a carefully chosen hyperparameter space. It also trains two stacked ensembles: one built from all models in the grid and one from the best model of each family.

Import data

First, we read in the data, convert Class and EJ to factors, and drop the Id variable.

dat <- read_csv("/kaggle/input/icr-identify-age-related-conditions/train.csv") %>%
    mutate(
        Class = factor(Class),
        EJ = factor(EJ)
    ) %>%
    select(-Id)
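
Before modelling, it is worth confirming the two data issues that motivate later choices: the imbalance of Class (why balance_classes is set below) and the presence of missing values (why the recipe imputes). A minimal sketch:

dat %>% count(Class)          # class imbalance

dat %>%
    summarise(across(everything(), ~ sum(is.na(.x)))) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
    arrange(desc(n_missing))  # variables with the most missing values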

Initialize H2O

Next, we initialize h2o with 25 GB of RAM. By default h2o uses all available cores.

h2o.init(max_mem_size = "25G")
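
h2o.init() also accepts an nthreads argument if you want to cap the number of cores, and h2o.no_progress() silences the progress bars in a notebook; both are optional. For example:

h2o.init(max_mem_size = "25G", nthreads = 4)  # alternative init: cap h2o at 4 threads
h2o.no_progress()                             # suppress progress bars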

Define the model

We try 30 different models (set max_models higher for potentially better results). Note that we also set balance_classes to TRUE because the Class variable is highly imbalanced.

auto_mod <- auto_ml() %>%
  set_engine("h2o",
    max_models = 30,
    stopping_metric = "logloss",
    sort_metric = "logloss",
    balance_classes = TRUE,
    seed = 42
  ) %>%
  set_mode("classification") %>%
  translate()
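
If you prefer a time budget over a fixed model count, h2o.automl() also accepts max_runtime_secs, which the agua engine should pass through like any other engine argument. A sketch (auto_mod_timed is just an illustrative name, not used later):

auto_mod_timed <- auto_ml() %>%
  set_engine("h2o",
    max_runtime_secs = 3600,     # stop the search after one hour
    stopping_metric = "logloss",
    sort_metric = "logloss",
    balance_classes = TRUE,
    seed = 42
  ) %>%
  set_mode("classification")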

Feature Engineering

Now we define the recipe. We impute missing values and remove variables with zero variance. There are other steps that can be included to improve results, for example (a short sketch combining them follows after this list):

  • step_normalize(all_numeric_predictors()) to center and scale numeric predictors
  • step_pca(all_numeric_predictors(), keep_original_cols = TRUE) to add principal components
  • step_poly(all_numeric_predictors()) to add polynomial basis expansions of the predictors

See https://recipes.tidymodels.org/reference/index.html for a list of all steps.
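
As a purely illustrative sketch (these steps are not used in this notebook), the optional steps above could be added to a recipe like this:

extended_recipe <-
  recipe(Class ~ ., data = dat) %>%
  step_impute_bag(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_poly(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), keep_original_cols = TRUE) %>%
  step_zv(all_predictors())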

In this example we only impute missing values with a bagged tree and remove highly correlated predictors, predictors that are linear combinations of others, and predictors with zero variance.

auto_recipe <-
  recipe(Class ~ ., data = dat) %>%
  step_impute_bag(all_predictors()) %>%
  step_corr(all_numeric_predictors()) %>%
  step_lincomb(all_numeric_predictors()) %>%
  step_zv(all_predictors()) 
auto_recipe
── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 56

── Operations

• Bagged tree imputation for: all_predictors()
• Correlation filter on: all_numeric_predictors()
• Linear combination filter on: all_numeric_predictors()
• Zero variance filter on: all_predictors()
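
To check what the preprocessed data looks like before it is handed to h2o, the recipe can be prepped and baked on the training set; a minimal sketch:

auto_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()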

Workflow

This is the full workflow with preprocessing and training. h2o internally evaluates the models with cross-validation (5-fold by default).

auto_workflow <-
  workflow() %>%
  add_model(auto_mod) %>%
  add_recipe(auto_recipe)

auto_res <-
  auto_workflow %>%
  fit(data = dat)

Results

The best model is a stacked ensemble (the best-of-family ensemble) with a logloss of about 0.16. Nice.

print(auto_res)

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: auto_ml()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_impute_bag()
• step_corr()
• step_lincomb()
• step_zv()

── Model ───────────────────────────────────────────────────────────────────────

Leader Algorithm: stackedensemble

Leader ID: StackedEnsemble_BestOfFamily_1_AutoML_1_20230705_70330

═════════════════════════ H2O AutoML Summary: 32 models ════════════════════════

Leader Algorithm: stackedensemble

Leader ID: StackedEnsemble_BestOfFamily_1_AutoML_1_20230705_70330

══════════════════════════════════ Leaderboard ═════════════════════════════════
                                                 model_id   logloss       auc     aucpr mean_per_class_error      rmse        mse
1 StackedEnsemble_BestOfFamily_1_AutoML_1_20230705_70330  0.1585723 0.9683930 0.8946899           0.08179619 0.2117525 0.04483912
2 StackedEnsemble_AllModels_1_AutoML_1_20230705_70330     0.1629910 0.9691752 0.8920815           0.10564469 0.2174633 0.04729029
3 XGBoost_grid_1_AutoML_1_20230705_70330_model_4          0.1869239 0.9615713 0.8663235           0.10999236 0.2258904 0.05102648
4 XGBoost_grid_1_AutoML_1_20230705_70330_model_5          0.1963171 0.9546314 0.8586601           0.11210252 0.2407249 0.05794847
5 XGBoost_2_AutoML_1_20230705_70330                       0.1978969 0.9564051 0.8427151           0.10817325 0.2389873 0.05711494
6 XGBoost_3_AutoML_1_20230705_70330                       0.1987732 0.9570963 0.8366848           0.10677254 0.2428506 0.05897642
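
Beyond the printed summary, the full leaderboard and per-model cross-validation metrics can be pulled out of the fitted workflow. A sketch, assuming agua provides rank_results() and collect_metrics() methods for h2o AutoML fits:

auto_fit <- extract_fit_parsnip(auto_res)      # underlying auto_ml() fit
rank_results(auto_fit)                         # leaderboard-style ranking of all models
collect_metrics(auto_fit, summarize = FALSE)   # per-resample metrics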

Predict

Finally, we predict on the test data and create a submission file. The preprocessing recipe is automatically applied to the test data.

test <- read_csv("/kaggle/input/icr-identify-age-related-conditions/test.csv") %>%
    mutate(
        EJ = factor(EJ)
    )
preds <- predict(auto_res, test, type="prob")
preds
submission <- tibble(Id = test$Id,
                 class_0 = preds$.pred_p0,
                 class_1 = preds$.pred_p1)
submission
write_csv(submission, file="submission.csv")
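
When everything is written out, the h2o cluster can be shut down to release the memory reserved at h2o.init():

h2o.shutdown(prompt = FALSE)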
