Day 14 of #50daysofKaggle

Roadmap to Tidymodels - Part 2

Screening multiple models using workflows
R
kaggle
Author

Ramakant

Published

March 20, 2023

How hard is it to evaluate multiple models when working on a given data problem?

If you’re using the tidymodels package, the answer is surprisingly simple.

Today’s post is an attempt to use the tidymodels framework to screen multiple models. Inspiration comes from Chapter 15 of the Tidy Modeling with R textbook, along with two other noteworthy blogs.

I’m skipping the EDA component as I’ve covered it in the previous posts. Moving on to some boring (but very necessary) sections.

As always, please feel free to leave a comment using the side-bar👉🏼

Loading the data

The test and train data are read in and merged into a single dataframe.

Code
library(tidyverse)
library(tidymodels)


titanic_train <- read.csv("train.csv", header = T)

#adding column to identify source
titanic_train <- titanic_train %>% 
  mutate(source = "train")

titanic_test <- read.csv("test.csv", header = T)
#adding column to identify source
titanic_test <- titanic_test %>% 
  mutate(Survived = NA,
         source = "test")

#merging the data. step1 starts here
titanic_data <- bind_rows(titanic_train, titanic_test)

glimpse(titanic_data)
Rows: 1,309
Columns: 13
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
$ source      <chr> "train", "train", "train", "train", "train", "train", "tra…

Cleaning data

Checking for NA values in the full df tells us that the Age column has 86 missing values in test and 177 in train, while Fare has 1 missing value. The 418 missing values in Survived in test are the ones we need to predict.

Code
titanic_data %>% 
  group_by(source) %>% 
  summarise_all(~ sum(is.na(.)))
# A tibble: 2 × 13
  source Passe…¹ Survi…² Pclass  Name   Sex   Age SibSp Parch Ticket  Fare Cabin
  <chr>    <int>   <int>  <int> <int> <int> <int> <int> <int>  <int> <int> <int>
1 test         0     418      0     0     0    86     0     0      0     1     0
2 train        0       0      0     0     0   177     0     0      0     0     0
# … with 1 more variable: Embarked <int>, and abbreviated variable names
#   ¹​PassengerId, ²​Survived

janitor::clean_names()

Now I’ve got a thing about keeping column names clean, so I'm summoning janitor with a magic wand (and sneaking in a family_count column while I'm at it):

Code
titanic_data <- titanic_data %>% 
  mutate(family_count = SibSp+Parch+1) %>% 
  janitor::clean_names()
names(titanic_data)
 [1] "passenger_id" "survived"     "pclass"       "name"         "sex"         
 [6] "age"          "sib_sp"       "parch"        "ticket"       "fare"        
[11] "cabin"        "embarked"     "source"       "family_count"

Voila. Everything now in lower case and snake case!

Imputing NA values in embarked

Now under embarked there are two rows that aren’t NA but are blank. That’s a bit odd🤨

Code
titanic_data %>% 
  count(embarked, sort = T)
  embarked   n
1        S 914
2        C 270
3        Q 123
4            2

Since there are only 2 such rows, I’ll replace the blank values with the most frequent embarked value. So now: zero empty values in embarked.

Code
mode_embarked <- titanic_data %>% 
  count(embarked, sort = T) %>% 
  select(embarked) %>% 
  head(1) %>% 
#this beautiful func comes from the purrr package. equivalent of .[[1]]
  pluck(1)

titanic_data <- titanic_data %>% 
  mutate(embarked = if_else(embarked == "", mode_embarked, embarked))

titanic_data %>% 
  count(embarked, sort = T)
  embarked   n
1        S 916
2        C 270
3        Q 123

Imputing NA values in age

The age column has a bunch of missing values. My approach today is to substitute them with the median age when grouped by sex and class. Here’s a function written to impute across the test and train data accordingly.

note: Feature engineering functions can also be handled in the tidymodels framework at the recipe stage.
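
For illustration only, here's a minimal, untested sketch of what that recipe-stage alternative could look like. step_impute_median() and step_impute_bag() are real recipes steps, but df here is just a placeholder name; in this post I stick with the manual function below.

Code
# hypothetical sketch: recipe-stage imputation instead of a hand-rolled function
recipe(survived ~ ., data = df) %>%
  # simplest option: fill NAs with the overall median (ignores sex/pclass)
  step_impute_median(age)
  # or, to condition on other columns, a bagged-tree imputation:
  # step_impute_bag(age, impute_with = imp_vars(sex, pclass))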

Code
median_age_calc <- function(df){
  median_ages <- df %>% 
    group_by(sex, pclass) %>% 
    summarise(age = median(age, na.rm = T))
#NB: summarise() orders the groups alphabetically, so rows 1-3 of median_ages
#    are female (pclass 1-3) and rows 4-6 are male (pclass 1-3)
  df %>%
    mutate(age = case_when((sex =="female" & pclass ==1 & is.na(age)) ~ median_ages$age[1],
                           (sex =="female" & pclass ==2 & is.na(age)) ~ median_ages$age[2],
                           (sex =="female" & pclass ==3 & is.na(age)) ~ median_ages$age[3],
                           (sex =="male" & pclass ==1 & is.na(age)) ~ median_ages$age[4],
                           (sex =="male" & pclass ==2 & is.na(age)) ~ median_ages$age[5],
                           (sex =="male" & pclass ==3 & is.na(age)) ~ median_ages$age[6],
                           .default = age)
    )
}

titanic_data <- titanic_data %>% 
  median_age_calc() %>% ungroup()

Are there any NA values in the titanic data now?

Code
titanic_data %>% 
  select(source, survived, sex, pclass, fare, age) %>% 
  group_by(source) %>% 
  summarise_all(~ sum(is.na(.)))
# A tibble: 2 × 6
  source survived   sex pclass  fare   age
  <chr>     <int> <int>  <int> <int> <int>
1 test        418     0      0     1     0
2 train         0     0      0     0     0

Phew. So age is covered, but there's one NA left in fare. What are the median age values now?

Code
titanic_data %>% 
  group_by(pclass, sex) %>%
  summarise(median_age = median(age))
# A tibble: 6 × 3
# Groups:   pclass [3]
  pclass sex    median_age
   <int> <chr>       <dbl>
1      1 female       37.5
2      1 male         37  
3      2 female       28  
4      2 male         28  
5      3 female       25  
6      3 male         22  

Imputing NA values for fare

There’s this one person (passenger_id = 1044, male and pclass = 3) who has an NA value in fare in the test data. Replacing it with the median value.

Code
titanic_data %>% 
  select(source, name, passenger_id, sex, age, pclass, fare) %>% 
  filter(is.na(fare))
  source               name passenger_id  sex  age pclass fare
1   test Storey, Mr. Thomas         1044 male 60.5      3   NA
Code
titanic_data %>% 
  group_by(sex, pclass) %>% 
  summarise(median_fare = median(fare, na.rm = T))
# A tibble: 6 × 3
# Groups:   sex [2]
  sex    pclass median_fare
  <chr>   <int>       <dbl>
1 female      1       80.9 
2 female      2       23   
3 female      3       10.5 
4 male        1       49.5 
5 male        2       13   
6 male        3        7.90

Replacing in the main df

Code
titanic_data <- titanic_data %>% 
  mutate(fare = if_else(is.na(fare), 
#from the above table. Urgh!! hard coding for that 1 guy!! how inelegant.
                        7.8958, fare))

Checking if it has been replaced.

Code
titanic_data %>% 
  select(source, name, passenger_id, sex, age, pclass, fare) %>% 
  filter(is.na(fare))
[1] source       name         passenger_id sex          age         
[6] pclass       fare        
<0 rows> (or 0-length row.names)

Feature Engineering

For this post, I’ve focussed on a different approach to predicting survival probabilities. The guiding principle here is to club together passengers from the same family and/or the same ticket ID. The titanic dataset poses the challenge that people within the same family sometimes purchased different tickets, so surnames alone are not always the best grouping feature.

One such example is the Cacic family:

Code
titanic_data %>% 
  select(name, sex,age, ticket, survived) %>% 
  filter(str_detect(name, "Cacic"))
                  name    sex age ticket survived
1      Cacic, Mr. Luka   male  38 315089        0
2  Cacic, Miss. Marija female  30 315084        0
3   Cacic, Miss. Manda female  21 315087       NA
4 Cacic, Mr. Jego Grga   male  18 315091       NA

How about clubbing by surname only? Well, that also gets tricky, because sometimes people sharing the same ticket have different surnames.

Case in point is ticket 1601:

Code
titanic_data %>% 
  select(name, sex, age, ticket, survived) %>% 
  filter(ticket==1601)
             name  sex age ticket survived
1   Bing, Mr. Lee male  32   1601        1
2   Ling, Mr. Lee male  28   1601        0
3  Lang, Mr. Fang male  26   1601        1
4 Foo, Mr. Choong male  22   1601        1
5    Lam, Mr. Ali male  22   1601        1
6    Lam, Mr. Len male  22   1601        0
7 Chip, Mr. Chang male  32   1601        1
8   Hee, Mr. Ling male  22   1601       NA

In times of crisis, it is expected that groups that travelled together will look out for their own. While this is not a hard rule of thumb, I wanted to create a custom column that combines the ticket and the surnames.

So the first step is finding the surnames. That means using regex. Lovely 🤢

Splitting names

With absolutely no guilt, I confess that this took me almost an entire day to figure out. And am I glad to have done it! The best thing about the tidyverse approach is the emphasis on readable code. For that, I’m grateful to have discovered functions like separate_wider_regex.

Essentially, it splits one column into several, using string patterns as the delimiters. So neat!
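
Before unleashing it on the real names, here's a toy sketch on a made-up two-row tibble so the mechanics are visible (this mini example is mine, not part of the Titanic pipeline):

Code
# toy demo of separate_wider_regex() from tidyr >= 1.3
toy <- tibble(name = c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"))
toy %>% 
  separate_wider_regex(
    name,
    patterns = c(surname = "[A-Za-z]+", # named patterns become columns
                 ", ",                  # unnamed patterns are matched & dropped
                 title = "[A-Za-z]+",
                 "\\. ",
                 given_name = ".+"))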

Code
names_with_splchar <- regex("[A-Za-z]+[\\'\\-\\s]+[A-Za-z]+")
names_with_3words <- regex("[A-Za-z]+\\s[A-Za-z]+\\s[A-Za-z]+")
names_with_1word <- regex("[A-Za-z]+") 
names_with_2words <- regex("[A-Za-z]+\\s+[A-Za-z]+") # for 'the countess'


titanic_data <- titanic_data %>% 
  separate_wider_regex(
    name, 
    patterns = c(
#IMP: ordering of regex patterns changes the outcome
      surname = str_c(c(names_with_splchar, 
                        names_with_3words,
                        names_with_1word), 
                      collapse = "|"),    # picks the first word before comma
      ", ",                               # the comma  
#IMP: ordering of regex patterns changes the outcome
      title = str_c(c(names_with_2words , # two words with special char in between like 'the countess'
                      names_with_1word),  # one word such as Mr Miss Mrs etc
                    collapse = "|"),      
      ". ",                               # the dot
      given_name = ".+"),                 # picks anything else which occurs at least once
    cols_remove = F                       # retains the original column    
  ) 

titanic_data %>% 
  select(name, title, surname, given_name) %>% 
  head(10)
# A tibble: 10 × 4
   name                                                title  surname   given_…¹
   <chr>                                               <chr>  <chr>     <chr>   
 1 Braund, Mr. Owen Harris                             Mr     Braund    Owen Ha…
 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) Mrs    Cumings   John Br…
 3 Heikkinen, Miss. Laina                              Miss   Heikkinen Laina   
 4 Futrelle, Mrs. Jacques Heath (Lily May Peel)        Mrs    Futrelle  Jacques…
 5 Allen, Mr. William Henry                            Mr     Allen     William…
 6 Moran, Mr. James                                    Mr     Moran     James   
 7 McCarthy, Mr. Timothy J                             Mr     McCarthy  Timothy…
 8 Palsson, Master. Gosta Leonard                      Master Palsson   Gosta L…
 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   Mrs    Johnson   Oscar W…
10 Nasser, Mrs. Nicholas (Adele Achem)                 Mrs    Nasser    Nichola…
# … with abbreviated variable name ¹​given_name

What is the break-up of titles now?

Code
titanic_data %>% 
  count(title, sort= T)
# A tibble: 18 × 2
   title            n
   <chr>        <int>
 1 Mr             757
 2 Miss           260
 3 Mrs            197
 4 Master          61
 5 Dr               8
 6 Rev              8
 7 Col              4
 8 Major            2
 9 Mlle             2
10 Ms               2
11 Capt             1
12 Don              1
13 Dona             1
14 Jonkheer         1
15 Lady             1
16 Mme              1
17 Sir              1
18 the Countess     1

Creating a custom grouping

The ticket is going to be broken up into ticket_tail (the last character) and ticket_head (all but the last character). Then we merge surname and ticket_head to create a group_id.

Code
titanic_data <- titanic_data %>% 
  mutate(ticket_head = substr(ticket, 1, nchar(ticket)-1),
         ticket_tail = substr(ticket, nchar(ticket), nchar(ticket)),
         group_id = paste0(surname, "_", ticket_head)) 

Creating columns that indicate the number of people and other flags

Code
titanic_data <- titanic_data %>% 
  add_count(group_id) %>% rename(pax_in_group = n)

titanic_data <- titanic_data %>% 
  mutate(flag = case_when ((family_count==1 & pax_in_group==1) ~ "1_solo",
                           family_count == pax_in_group ~ "2_family_full",
                           !(family_count == pax_in_group) ~ "3_clubbed",
                           .default = "x"))

titanic_data <- titanic_data %>% 
  add_count(ticket_head) %>% 
  rename(pax_in_ticket_head = n)

# how many instances of the same ticket having multiple groups? 
titanic_data <- titanic_data %>% 
  group_by(ticket) %>% 
  mutate(groups_in_ticket = n_distinct(group_id)) %>% ungroup()

# which tickets have more than 1 group in them? 
#     these passengers will have ticket_grouping precedence as they may include 
#     nannies, relatives & friends that don't share the same surname
ticket_with_multiple_groups <- titanic_data %>% 
  filter(!groups_in_ticket==1) %>% 
  count(ticket, sort = T)

titanic_data <- titanic_data %>% 
  mutate(final_grouping = if_else(ticket %in% ticket_with_multiple_groups$ticket, 
                             ticket, group_id),
         final_label = if_else(ticket %in% ticket_with_multiple_groups$ticket,
                         "4_ticket_grouping", flag)) 

Since the code has now become a bit long, I’m creating a checkpoint here with a new df called titanic_data2.

Code
titanic_data2 <- titanic_data %>% 
  select(source, passenger_id, survived,  sex, age, fare, pclass, embarked, 
         family_count, pax_in_group, pax_in_ticket_head, groups_in_ticket,
         final_grouping) %>% 
  mutate(final_grouping = as_factor(final_grouping),
         survived = as_factor(survived),
         pclass = as_factor(pclass),
         sex = as_factor(sex),
         embarked = as_factor(embarked))
glimpse(titanic_data2)
Rows: 1,309
Columns: 13
$ source             <chr> "train", "train", "train", "train", "train", "train…
$ passenger_id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ survived           <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, …
$ sex                <fct> male, female, female, female, male, male, male, mal…
$ age                <dbl> 22, 38, 26, 35, 35, 22, 54, 2, 27, 14, 4, 58, 20, 3…
$ fare               <dbl> 22, 38, 26, 35, 35, 22, 54, 2, 27, 14, 4, 58, 20, 3…
$ pclass             <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, …
$ embarked           <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, …
$ family_count       <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, …
$ pax_in_group       <int> 1, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, …
$ pax_in_ticket_head <int> 5, 14, 2, 8, 1, 1, 6, 5, 4, 4, 3, 14, 1, 22, 8, 1, …
$ groups_in_ticket   <int> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ final_grouping     <fct> Braund_A/5 2117, Cumings_PC 1759, Heikkinen_STON/O2…

Now this is the part I don’t get. Why does Kaggle call them train and test when they could just as easily be called given_data and to_predict?

Code
to_predict <- titanic_data2 %>% 
  filter(source == "test") %>% select(-survived, -source)
given_data <- titanic_data2 %>% 
  filter(source == "train") %>% select(-source)

given_data has all the necessary columns we need to train the algorithms on.

Code
glimpse(given_data)
Rows: 891
Columns: 12
$ passenger_id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ survived           <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, …
$ sex                <fct> male, female, female, female, male, male, male, mal…
$ age                <dbl> 22, 38, 26, 35, 35, 22, 54, 2, 27, 14, 4, 58, 20, 3…
$ fare               <dbl> 22, 38, 26, 35, 35, 22, 54, 2, 27, 14, 4, 58, 20, 3…
$ pclass             <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, …
$ embarked           <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, …
$ family_count       <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, …
$ pax_in_group       <int> 1, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, …
$ pax_in_ticket_head <int> 5, 14, 2, 8, 1, 1, 6, 5, 4, 4, 3, 14, 1, 22, 8, 1, …
$ groups_in_ticket   <int> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ final_grouping     <fct> Braund_A/5 2117, Cumings_PC 1759, Heikkinen_STON/O2…

The to_predict dataframe has everything else except the survived column, which needs to be predicted.

Code
glimpse(to_predict)
Rows: 418
Columns: 11
$ passenger_id       <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 9…
$ sex                <fct> male, female, male, male, female, male, female, mal…
$ age                <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.…
$ fare               <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.…
$ pclass             <fct> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, …
$ embarked           <fct> Q, S, Q, S, S, S, Q, S, C, S, S, S, S, S, S, C, Q, …
$ family_count       <dbl> 1, 2, 1, 1, 3, 1, 1, 3, 1, 3, 1, 1, 2, 2, 2, 2, 1, …
$ pax_in_group       <int> 1, 1, 1, 1, 2, 1, 1, 3, 1, 3, 1, 1, 2, 2, 2, 2, 1, …
$ pax_in_ticket_head <int> 3, 1, 1, 6, 11, 3, 3, 6, 16, 4, 10, 3, 2, 2, 2, 4, …
$ groups_in_ticket   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, …
$ final_grouping     <fct> Kelly_33091, Wilkes_36327, Myles_24027, Wirz_31515,…

Phew. Finally done with the pre-processing. Thanks for sticking around. Here, have a gif.

Model Building

So today I’m going to take different classification models and compare their findings across 15 bootstrap resamples. The 4 selected models are:

  1. Logistic classification
  2. Random Forest
  3. Support Vector Machines
  4. Decision Trees

The Tidymodels textbook is possibly the best place to begin understanding workflow sets. In a nutshell:

  1. Create resampling folds

  2. Describe each model (I’m going to use untuned parameters for the sake of simplicity)

  3. Create a base recipe for given_data

  4. Define a workflow that uses the recipe for each model

  5. Fit the resampled folds on the workflow_set

  6. Find the winning model - since we’re dealing with classification, I'm using roc_auc

  7. Fit the winning model to the given data

  8. Generate predictions on the to_predict df using the best fit

  9. Submit on Kaggle

  10. Profit? (well, not exactly. But you can write another blog and share your new-found knowledge with the world🤓)

So let's get cooking!

Step 1: Creating resampling folds

Code
set.seed(2023)
titanic_folds <- bootstraps(data = given_data, 
                            times = 15)
titanic_folds
# Bootstrap sampling 
# A tibble: 15 × 2
   splits            id         
   <list>            <chr>      
 1 <split [891/318]> Bootstrap01
 2 <split [891/313]> Bootstrap02
 3 <split [891/340]> Bootstrap03
 4 <split [891/313]> Bootstrap04
 5 <split [891/326]> Bootstrap05
 6 <split [891/342]> Bootstrap06
 7 <split [891/316]> Bootstrap07
 8 <split [891/323]> Bootstrap08
 9 <split [891/338]> Bootstrap09
10 <split [891/331]> Bootstrap10
11 <split [891/319]> Bootstrap11
12 <split [891/322]> Bootstrap12
13 <split [891/340]> Bootstrap13
14 <split [891/321]> Bootstrap14
15 <split [891/320]> Bootstrap15

Step 2: Recipe

Creating the base_recipe object, which updates one column role and adds two preprocessing steps:

Code
base_recipe <- recipe(survived ~ ., data = given_data) %>% 
  update_role(passenger_id, new_role = "id_variable") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_numeric_predictors()) 
base_recipe

Or if one would like to see it in a tidy format.

Code
tidy(base_recipe)
# A tibble: 2 × 6
  number operation type      trained skip  id             
   <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
1      1 step      dummy     FALSE   FALSE dummy_FSxXz    
2      2 step      normalize FALSE   FALSE normalize_bu1Ti

Step 3: Model definitions

All four models are defined as:

Code
#logistic regression
glm_model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

#random forest
rf_model <- rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

#support vector machines
svm_model <- svm_rbf() %>% # rbf = radial basis function
  set_engine("kernlab") %>% 
  set_mode("classification")

#decision tree
dt_model <- decision_tree(mode = "classification", 
                          tree_depth = 3) %>% 
  set_engine("rpart")

Step 4: Workflow set

Just as you would with a single model, the workflow_set combines the base recipe with the multiple models; both are passed in as lists when declaring the object.

Code
titanic_wf_set <- workflow_set(
  list(base_recipe),
  list(glm_model, rf_model, svm_model, dt_model),
  cross = T
)
titanic_wf_set
# A workflow set/tibble: 4 × 4
  wflow_id             info             option    result    
  <chr>                <list>           <list>    <list>    
1 recipe_logistic_reg  <tibble [1 × 4]> <opts[0]> <list [0]>
2 recipe_rand_forest   <tibble [1 × 4]> <opts[0]> <list [0]>
3 recipe_svm_rbf       <tibble [1 × 4]> <opts[0]> <list [0]>
4 recipe_decision_tree <tibble [1 × 4]> <opts[0]> <list [0]>

In the table above, the result column shows <list [0]>, implying that it is currently empty. This is because so far we’ve only defined the workflows and models but are yet to pass the data through them.

Obligatory shout out to the phenom Julia Silge, whose YT tutorials and blogs have been my textbooks at every step. These steps are succinctly explained in her Tidy Tuesday post, which I highly recommend.

Step 5: Fitting on resampled folds

Now this part needs to be handled with care. Resampling may take a while…

Code
start_time <- Sys.time()
set.seed(2023)
doParallel::registerDoParallel()
titanic_rs <- workflow_map(
  titanic_wf_set,
  "fit_resamples",
  resamples = titanic_folds
)
end_time <- Sys.time()

… which is …

Code
end_time - start_time
Time difference of 10.71708 mins

Hmm… So moving on. What does the workflow_set object look like now?

Code
titanic_rs
# A workflow set/tibble: 4 × 4
  wflow_id             info             option    result   
  <chr>                <list>           <list>    <list>   
1 recipe_logistic_reg  <tibble [1 × 4]> <opts[1]> <rsmp[+]>
2 recipe_rand_forest   <tibble [1 × 4]> <opts[1]> <rsmp[+]>
3 recipe_svm_rbf       <tibble [1 × 4]> <opts[1]> <rsmp[+]>
4 recipe_decision_tree <tibble [1 × 4]> <opts[1]> <rsmp[+]>

Step 6: Finding the winning model

This is where the tidymodels package all fits together. If you’ve got all the steps covered this far, the decision-making shouldn’t take much effort.

Code
collect_metrics(titanic_rs)
# A tibble: 8 × 9
  wflow_id             .config preproc model .metric .esti…¹  mean     n std_err
  <chr>                <chr>   <chr>   <chr> <chr>   <chr>   <dbl> <int>   <dbl>
1 recipe_logistic_reg  Prepro… recipe  logi… accura… binary  0.675    15 0.0147 
2 recipe_logistic_reg  Prepro… recipe  logi… roc_auc binary  0.695    15 0.0184 
3 recipe_rand_forest   Prepro… recipe  rand… accura… binary  0.824    15 0.00484
4 recipe_rand_forest   Prepro… recipe  rand… roc_auc binary  0.872    15 0.00488
5 recipe_svm_rbf       Prepro… recipe  svm_… accura… binary  0.780    15 0.00706
6 recipe_svm_rbf       Prepro… recipe  svm_… roc_auc binary  0.806    15 0.00704
7 recipe_decision_tree Prepro… recipe  deci… accura… binary  0.809    15 0.00476
8 recipe_decision_tree Prepro… recipe  deci… roc_auc binary  0.811    15 0.00684
# … with abbreviated variable name ¹​.estimator

Since we’re looking at classification, let's see which of the models resulted in the best roc_auc.

Code
collect_metrics(titanic_rs) %>% 
  filter(.metric == "roc_auc") %>% 
  arrange(desc(mean))
# A tibble: 4 × 9
  wflow_id             .config preproc model .metric .esti…¹  mean     n std_err
  <chr>                <chr>   <chr>   <chr> <chr>   <chr>   <dbl> <int>   <dbl>
1 recipe_rand_forest   Prepro… recipe  rand… roc_auc binary  0.872    15 0.00488
2 recipe_decision_tree Prepro… recipe  deci… roc_auc binary  0.811    15 0.00684
3 recipe_svm_rbf       Prepro… recipe  svm_… roc_auc binary  0.806    15 0.00704
4 recipe_logistic_reg  Prepro… recipe  logi… roc_auc binary  0.695    15 0.0184 
# … with abbreviated variable name ¹​.estimator

Looks like the rand_forest is the winner!
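
As a side note, workflowsets also ships a rank_results() helper which, if I read the docs right, produces this ranking in a single call:

Code
# one-call alternative to the filter/arrange combo above
rank_results(titanic_rs, rank_metric = "roc_auc")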

Another wonderful way it all ties together is the visualisation, with a single line of code.

Code
autoplot(titanic_rs)

Pieces of code that just fit into each other feel like a spoonful of ice cream!

Step 7 & 8: Fitting the winner and predictions

From the titanic_rs object we need to extract the winning workflow, fit it on given_data, and then generate predictions for to_predict.

Code
final_fit <- extract_workflow(titanic_rs, 
                              "recipe_rand_forest") %>% 
  fit(given_data)

final_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  1000 
Sample size:                      891 
Number of independent variables:  908 
Mtry:                             30 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.1386875 

Creating a new dataframe for predictions

Code
final_predictions <- predict(object = final_fit, 
                             new_data = to_predict)

final_predictions <- final_predictions %>% 
  rename(Survived = .pred_class) %>% 
  bind_cols(PassengerId = to_predict$passenger_id)
head(final_predictions)
# A tibble: 6 × 2
  Survived PassengerId
  <fct>          <int>
1 0                892
2 0                893
3 0                894
4 0                895
5 0                896
6 0                897

Step 9: Submit to Kaggle

Converting the final_predictions to csv and uploading to kaggle

Code
write.csv(final_predictions, row.names = F, 
          file = "submissions.csv")

Uploaded it to Kaggle and behold…

Not bad!!! That's like… top 12%. Woohoo!

.. And so did you, dear reader!!

Understanding variable importance

Since we’ve got a random forest winner, I wanted to take a look at the importance of the variables used.

Inspired by Julia’s Tidy Tuesday on Ikea Prices. For a ranger model, we need to go back to the model specification itself and update the engine with importance = "permutation" in order to compute feature importance. This means fitting the model one more time.

Code
library(vip)

rf_model2 <- rand_forest(trees = 1000) %>% 
  set_engine("ranger", importance = "permutation") %>% 
  set_mode("classification")

workflow() %>% 
  add_recipe(base_recipe) %>% 
  add_model(rf_model2) %>% 
  fit(given_data) %>% 
  extract_fit_parsnip() %>% #supersedes the deprecated pull_workflow_fit()
  vip(aesthetics = list(alpha= 0.8, fill = "green"))

This is interesting to me. While I was expecting the gender and class predictors to play an important part in the model, the grouping columns like family_count, pax_in_group & pax_in_ticket_head also helped improve the accuracy.

Tailpiece regarding autoplot()

After model creation, I couldn’t help but marvel at the autoplot() function. How can a single function convey such meaningful depth?

Peeking under the hood

I tried to replicate the errorbar plot generated by the function. For me, this was possibly one of the bigger conceptual lessons about the tidymodels framework. Here’s how I did it:

There are 15 bootstrap resamples that were generated. The workflow_set maps the fit_resamples function over the workflows. This implies that each of the 4 models has generated a pair of metrics (accuracy & roc_auc) for each of the 15 resamples.

The collect_metrics function generates the mean value of each metric, allowing us to choose the best one.

The manner in which all of this is done within a single object is sublime. Information is stored in tibbles within tibbles!

For instance, let us look at the class of titanic_rs

Code
class(titanic_rs)
[1] "workflow_set" "tbl_df"       "tbl"          "data.frame"  

It is a tibble with 4 columns, of which result is a list-column consisting of more tibbles.

Code
titanic_rs
# A workflow set/tibble: 4 × 4
  wflow_id             info             option    result   
  <chr>                <list>           <list>    <list>   
1 recipe_logistic_reg  <tibble [1 × 4]> <opts[1]> <rsmp[+]>
2 recipe_rand_forest   <tibble [1 × 4]> <opts[1]> <rsmp[+]>
3 recipe_svm_rbf       <tibble [1 × 4]> <opts[1]> <rsmp[+]>
4 recipe_decision_tree <tibble [1 × 4]> <opts[1]> <rsmp[+]>

What does the first tibble in the result list look like? This corresponds to the logistic regression model, as shown in the table above.

Code
titanic_rs$result[[1]]
# Resampling results
# Bootstrap sampling 
# A tibble: 15 × 4
   splits            id          .metrics         .notes          
   <list>            <chr>       <list>           <list>          
 1 <split [891/318]> Bootstrap01 <tibble [2 × 4]> <tibble [3 × 3]>
 2 <split [891/313]> Bootstrap02 <tibble [2 × 4]> <tibble [3 × 3]>
 3 <split [891/340]> Bootstrap03 <tibble [2 × 4]> <tibble [3 × 3]>
 4 <split [891/313]> Bootstrap04 <tibble [2 × 4]> <tibble [3 × 3]>
 5 <split [891/326]> Bootstrap05 <tibble [2 × 4]> <tibble [3 × 3]>
 6 <split [891/342]> Bootstrap06 <tibble [2 × 4]> <tibble [3 × 3]>
 7 <split [891/316]> Bootstrap07 <tibble [2 × 4]> <tibble [3 × 3]>
 8 <split [891/323]> Bootstrap08 <tibble [2 × 4]> <tibble [3 × 3]>
 9 <split [891/338]> Bootstrap09 <tibble [2 × 4]> <tibble [3 × 3]>
10 <split [891/331]> Bootstrap10 <tibble [2 × 4]> <tibble [3 × 3]>
11 <split [891/319]> Bootstrap11 <tibble [2 × 4]> <tibble [3 × 3]>
12 <split [891/322]> Bootstrap12 <tibble [2 × 4]> <tibble [3 × 3]>
13 <split [891/340]> Bootstrap13 <tibble [2 × 4]> <tibble [3 × 3]>
14 <split [891/321]> Bootstrap14 <tibble [2 × 4]> <tibble [3 × 3]>
15 <split [891/320]> Bootstrap15 <tibble [2 × 4]> <tibble [3 × 3]>

There were issues with some computations:

  - Warning(s) x1 (repeated for each of the 15 resamples), x4, x11 and x15: Column(s) have zero variance so scaling cannot be used: `final_gr... (truncated)

Run `show_notes(.Last.tune.result)` for more information.

What does the .metrics list contain?

Code
titanic_rs$result[[1]]$.metrics[[1]]
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.682 Preprocessor1_Model1
2 roc_auc  binary         0.655 Preprocessor1_Model1

So .metrics is a list of results generated for each of the resamples. The above tibble holds the metrics from the first bootstrap resample of titanic_folds using glm (logistic classification).

If the intent is to recreate the errorbar plot manually, then we’ll need to extract the data from within each of these tibbles.

Now I wasn’t able to dig into the source code of autoplot(), and I’m pretty certain there’s a more elegant method out there. Hit me up in the comments here 👉🏼 if you feel there’s a better way to extract data from within nested tibbles.
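
That said, two likely shortcuts I came across while writing this (untested against the loop below, so treat them as a sketch): collect_metrics() accepts summarize = FALSE to return the raw per-resample metrics, and tidyr's unnest() can flatten the list-columns directly.

Code
# sketch: loop-free alternatives to what follows
# 1. per-resample metrics straight from the workflow_set
collect_metrics(titanic_rs, summarize = FALSE)

# 2. or unnest the nested tibbles by hand
titanic_rs %>% 
  select(wflow_id, result) %>% 
  unnest(result) %>%    # one row per resample
  unnest(.metrics) %>%  # one row per resample x metric
  select(wf = wflow_id, metrics = .metric, values = .estimate)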

The first step is to create an empty tibble consisting of the 4 models, the 2 metrics and 120 empty cells that need to be filled in from the titanic_rs object.

Code
tidy_titanic_rs <- tibble(wf = rep(unique(titanic_rs$wflow_id), 15*2),
                          metrics = rep(c(rep("accuracy", 4), 
                                          rep("roc_auc",4)),
                                        15),
                          values = rep(NA, 4*15*2))
head(tidy_titanic_rs, 8)
# A tibble: 8 × 3
  wf                   metrics  values
  <chr>                <chr>    <lgl> 
1 recipe_logistic_reg  accuracy NA    
2 recipe_rand_forest   accuracy NA    
3 recipe_svm_rbf       accuracy NA    
4 recipe_decision_tree accuracy NA    
5 recipe_logistic_reg  roc_auc  NA    
6 recipe_rand_forest   roc_auc  NA    
7 recipe_svm_rbf       roc_auc  NA    
8 recipe_decision_tree roc_auc  NA    

Dimensions of this empty table?

Code
dim(tidy_titanic_rs)
[1] 120   3

To extract the values from the tibbles in titanic_rs and save them in tidy_titanic_rs, I proceeded to employ a nested for loop. Don't judge, I say!

Extracting the tibbles

pluck has got to be the coolest function I’ve ever come across!

Code
bigtable <- purrr::pluck(titanic_rs, 4) #column 4 is the result list-column
wflow_id_titanic <- unique(titanic_rs$wflow_id)
for(i in 1:length(wflow_id_titanic)){
    
  wflow_id <- wflow_id_titanic[i]
  smalltable <- bigtable[[i]]
  
  for(j in 1:length(smalltable$.metrics)){
    smallertable <- purrr::pluck(smalltable$.metrics, j)
    tidy_titanic_rs$values[(tidy_titanic_rs$wf==wflow_id & 
                              tidy_titanic_rs$metrics=="accuracy")][j] <- smallertable$.estimate[smallertable$.metric == "accuracy"]
    tidy_titanic_rs$values[(tidy_titanic_rs$wf==wflow_id & 
                              tidy_titanic_rs$metrics=="roc_auc")][j] <- smallertable$.estimate[smallertable$.metric == "roc_auc"]
    
  }
}

tidy_titanic_rs2 <- tidy_titanic_rs %>% 
  group_by(wf, metrics) %>% 
  summarise(value_min = mean(values) - 0.5*sd(values),
            value_max = mean(values) + 0.5*sd(values),
            value_mean = mean(values)) %>% ungroup() %>% 
  right_join(tidy_titanic_rs, by = c("wf", "metrics"))

For some reason, the value_max and value_min for the error-bar that best match autoplot() work out to ±0.5 * 𝝈 around the mean of values.

Manually generated plot

With the data in a rectangular, tidy format, the rest of the magic is handled by ggplot.

cracks knuckles

Code
tidy_titanic_rs2 %>% 
ggplot(aes(x = reorder(wf, desc(value_mean)), 
             y = values, 
             color = wf))+
  geom_errorbar(aes(ymax = value_max, ymin = value_min), 
                width = 0.1)+
  geom_point(aes(y= value_mean))+
  scale_y_continuous(breaks = seq(0.65, 0.9, 0.05),
                     limits = c(0.65, 0.9))+
  theme(legend.position = "none")+
  labs(x = NULL, y = NULL)+
  facet_wrap(~metrics)

And here once again is the auto-generated plot. Not bad, eh?

Code
autoplot(titanic_rs)

Oooh yeeah!