Day 11 of #50daysofkaggle

Roadmap to tidymodels

Implementing DT using R
kaggle
R
Author

Ramakant

Published

February 20, 2023

So far I have practiced creating classification predictions on the Titanic dataset using the KNN, DT and SVM algorithms in Python. As per Kaggle, my submission scored 77%. Now I’m going to try these approaches in R.

Steps to do: data reading > cleaning > replacing NA > splitting > modelling using Decision Trees > comparing results

Data reading and cleaning

Loading the necessary libraries and reading train.csv from the zipped Kaggle download, then taking a glimpse of the resulting df.

Code
library(tidyverse)
library(zip)
library(readr)
library(tidymodels)

#reading the Kaggle zip file that I downloaded into an older post's folder
ziplocation <- "D:/Ramakant/Personal/Weekends in Mumbai/Blog/quarto_blog/posts/2022-10-12-day-6-of-50daysofkaggle/titanic.zip"
df <-  read_csv(unz(ziplocation, "train.csv"))
glimpse(df)
Rows: 891
Columns: 12
$ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Reformatting the df to create a new tibble df_n

Code
df_n <- df %>% 
  #selecting only the numerical variables
  select_if(is.numeric) %>% 
  #converting outcome variable into factor for classification 
  mutate(Survived = as.factor(Survived)) %>% 
  #adding back the Sex & Embarked predictors
  bind_cols(Sex = df$Sex, Embarked = df$Embarked) 

head(df_n)
# A tibble: 6 × 9
  PassengerId Survived Pclass   Age SibSp Parch  Fare Sex    Embarked
        <dbl> <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <chr>   
1           1 0             3    22     1     0  7.25 male   S       
2           2 1             1    38     1     0 71.3  female C       
3           3 1             3    26     0     0  7.92 female S       
4           4 1             1    35     1     0 53.1  female S       
5           5 0             3    35     0     0  8.05 male   S       
6           6 0             3    NA     0     0  8.46 male   Q       
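As an aside, `select_if()` still works but has been superseded in recent dplyr versions. A sketch of the same reshaping with the current `where()` idiom (producing a hypothetical `df_n2`, not the `df_n` used below):

```r
# select_if() is superseded in dplyr >= 1.0; where() is the current idiom
df_n2 <- df %>%
  select(where(is.numeric)) %>%
  mutate(Survived = as.factor(Survived)) %>%
  bind_cols(Sex = df$Sex, Embarked = df$Embarked)
```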

Finding the null values in the new df

Code
df_n %>%  
  summarise_all(~ sum(is.na(.)))
# A tibble: 1 × 9
  PassengerId Survived Pclass   Age SibSp Parch  Fare   Sex Embarked
        <int>    <int>  <int> <int> <int> <int> <int> <int>    <int>
1           0        0      0   177     0     0     0     0        2
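Similarly, `summarise_all()` has been superseded; a sketch of the same NA count with `across()`:

```r
# across(everything(), ...) applies the NA count to every column
df_n %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```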

We see that there are 177 missing values in the Age column (and 2 in Embarked). Age will be tackled in the recipe section, where PassengerId will also be excluded from the analysis.

Model Building

Splitting the data

Splitting the data into train & test

Code
df_split <- initial_split(df_n, prop = 0.8)
train <- training(df_split)
test <- testing(df_split)

df_split
<Training/Testing/Total>
<712/179/891>
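One refinement worth noting: `initial_split()` can also stratify on the outcome so that train and test keep a similar survival rate. A hedged sketch (not the split used for the results below):

```r
set.seed(2023)
# strata = Survived keeps the 0/1 proportion similar in both partitions
df_split_strat <- initial_split(df_n, prop = 0.8, strata = Survived)
training(df_split_strat) %>% count(Survived)
```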

Creating the recipe

Code
dt_recipe <- recipe(Survived ~ ., data = df_n) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  #replacing NA values in Age with median Age
  step_mutate_at(Age, fn = ~ replace_na(Age, median(Age, na.rm = T))) %>% 
  #updating the role of the PassengerId to exclude from analysis
  update_role(PassengerId, new_role = "id_variable")

dt_recipe
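To inspect the data the recipe will actually hand to the model, the recipe can be `prep()`-ed and `bake()`-d; a quick sketch:

```r
dt_recipe %>%
  prep(training = train) %>%   # estimate dummy levels, means/sds, medians
  bake(new_data = NULL) %>%    # return the processed training set
  glimpse()
```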

Another way to view the recipe using tidy() function

Code
tidy(dt_recipe)
# A tibble: 3 × 6
  number operation type      trained skip  id             
   <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
1      1 step      dummy     FALSE   FALSE dummy_jD1sy    
2      2 step      normalize FALSE   FALSE normalize_CvfCk
3      3 step      mutate_at FALSE   FALSE mutate_at_xAmJj
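As an aside, recipes also ships a dedicated imputation step. A variant of the recipe (a sketch, not the recipe used for the submission below) could impute Age before normalizing, so the median is taken on the raw Age scale:

```r
dt_recipe2 <- recipe(Survived ~ ., data = df_n) %>%
  update_role(PassengerId, new_role = "id_variable") %>%
  # dedicated median imputation, run before normalization
  step_impute_median(Age) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
```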

Model Creation

Declaring a model dt_model as a Decision Tree with tree_depth = 3 and the rpart engine

Code
dt_model <- decision_tree(mode = "classification", tree_depth = 3) %>% 
  set_engine("rpart")
dt_model %>% translate()
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = 3

Computational engine: rpart 

Model fit template:
rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), 
    maxdepth = 3)

Workflow creation

Workflow = recipe + model

Code
dt_wf <- workflow() %>%
  add_model(dt_model) %>% 
  add_recipe(dt_recipe)

Predicting on test data

Fitting the dt_wf workflow on the train data and using the fitted model to predict on the test data

Code
set.seed(2023)
dt_predict <- predict(fit(dt_wf, data = train), test)
head(dt_predict)
# A tibble: 6 × 1
  .pred_class
  <fct>      
1 0          
2 0          
3 0          
4 1          
5 0          
6 0          
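The same fitted workflow can also return class probabilities instead of hard labels, which is useful for threshold-based metrics. A sketch:

```r
# type = "prob" yields .pred_0 / .pred_1 columns, one per factor level
dt_prob <- predict(fit(dt_wf, data = train), test, type = "prob")
head(dt_prob)
```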

Creating a new tibble called predicted_table by binding the predicted values .pred_class to the test data

Code
predicted_table <- bind_cols(test, dt_predict) %>% 
  rename(dt_yhat = .pred_class) %>% 
  select(Survived, dt_yhat) 
head(predicted_table)
# A tibble: 6 × 2
  Survived dt_yhat
  <fct>    <fct>  
1 0        0      
2 0        0      
3 0        0      
4 0        1      
5 0        0      
6 0        0      

Testing accuracy

As mentioned in the Tidy Modeling with R (TMwR) documentation for binary classification metrics, we will try creating the confusion matrix and checking accuracy

Code
conf_mat(predicted_table, truth = Survived, estimate = dt_yhat)
          Truth
Prediction   0   1
         0 109  22
         1  11  37

Estimating the accuracy of our model. As a sanity check, this matches the matrix above: (109 + 37) / 179 ≈ 0.816

Code
accuracy(predicted_table, truth = Survived, estimate = dt_yhat)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.816

In the tidymodels approach, we can define the required metrics separately with metric_set to check the model's performance

Code
classification_metrics <- metric_set(accuracy, f_meas)
predicted_table %>% 
  classification_metrics(truth = Survived, estimate = dt_yhat)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.816
2 f_meas   binary         0.869
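Probability-based metrics such as roc_auc need the predicted class probabilities rather than the hard labels. A hedged sketch (refitting the workflow, since the earlier fit was not stored in a variable):

```r
dt_fit <- fit(dt_wf, data = train)
prob_table <- bind_cols(test, predict(dt_fit, test, type = "prob"))
# yardstick treats the first factor level ("0") as the event by default,
# so we pass the .pred_0 column
roc_auc(prob_table, truth = Survived, .pred_0)
```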

Submission on Kaggle

When I ran this code on Kaggle, the Decision Tree predictions resulted in a score of 0.7799, exactly the same as the DT code written in Python earlier.

Overall, I’m glad that I was able to wrap my head around the tidymodels workflow.

Next steps

  • Figure out how to compare accuracy of different models (KNN, SVM) that I had coded earlier in python
  • Figure out hyper-parameter tuning using the tune package
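As a head start on that last bullet, a hedged sketch (untested here) of tuning tree_depth and cost_complexity with tune() and cross-validation:

```r
# mark the hyper-parameters to tune with tune() placeholders
dt_tune_model <- decision_tree(mode = "classification",
                               tree_depth = tune(),
                               cost_complexity = tune()) %>%
  set_engine("rpart")

dt_tune_wf <- workflow() %>%
  add_model(dt_tune_model) %>%
  add_recipe(dt_recipe)

set.seed(2023)
folds <- vfold_cv(train, v = 5)

# grid = 10 asks tune_grid to generate 10 candidate combinations
dt_results <- tune_grid(dt_tune_wf, resamples = folds, grid = 10)
show_best(dt_results, metric = "accuracy")
```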