Day 11 of #50daysofkaggle

Roadmap to tidymodels

Implementing DT using R
kaggle
R
Author

Ramakant

Published

February 20, 2023

So far I have practiced creating classification predictions on the Titanic dataset using the KNN, DT and SVM algorithms in Python. As per Kaggle, my submission scored 77%. Now I’m going to try these approaches in R.

Steps to do: data reading > cleaning > replacing NA > splitting > modelling using Decision Trees > comparing results

Data reading and cleaning

Loading the necessary libraries and reading train.csv from the zipped Kaggle download, then taking a glimpse of the resulting df.

Code
library(tidyverse)
library(zip)
library(readr)
library(tidymodels)

#reading the Kaggle zip file that I downloaded into an older post's folder
ziplocation <- "D:/Ramakant/Personal/Weekends in Mumbai/Blog/quarto_blog/posts/2022-10-12-day-6-of-50daysofkaggle/titanic.zip"
df <-  read_csv(unz(ziplocation, "train.csv"))
glimpse(df)
Rows: 891
Columns: 12
$ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Reformatting the df to create a new tibble df_n

Code
df_n <- df %>% 
  #selecting only the numerical variables
  select_if(is.numeric) %>% 
  #converting outcome variable into factor for classification 
  mutate(Survived = as.factor(Survived)) %>% 
  #adding back the Sex & Embarked predictors
  bind_cols(Sex = df$Sex, Embarked = df$Embarked) 

head(df_n)
# A tibble: 6 × 9
  PassengerId Survived Pclass   Age SibSp Parch  Fare Sex    Embarked
        <dbl> <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <chr>   
1           1 0             3    22     1     0  7.25 male   S       
2           2 1             1    38     1     0 71.3  female C       
3           3 1             3    26     0     0  7.92 female S       
4           4 1             1    35     1     0 53.1  female S       
5           5 0             3    35     0     0  8.05 male   S       
6           6 0             3    NA     0     0  8.46 male   Q       
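As an aside, `select_if()` still works but has been superseded in recent dplyr versions. A sketch of the same reshaping with the current `where()` idiom (producing a hypothetical `df_n2`, not the `df_n` used below):

```r
# select_if() is superseded in dplyr >= 1.0; where() is the current idiom
df_n2 <- df %>%
  select(where(is.numeric)) %>%
  mutate(Survived = as.factor(Survived)) %>%
  bind_cols(Sex = df$Sex, Embarked = df$Embarked)
```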

Finding the null values in the new df

Code
df_n %>%  
  summarise_all(~ sum(is.na(.)))
# A tibble: 1 × 9
  PassengerId Survived Pclass   Age SibSp Parch  Fare   Sex Embarked
        <int>    <int>  <int> <int> <int> <int> <int> <int>    <int>
1           0        0      0   177     0     0     0     0        2
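Similarly, `summarise_all()` has been superseded; a sketch of the same NA count with `across()`:

```r
# across(everything(), ...) applies the NA count to every column
df_n %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```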

We see that there are 177 missing values in the Age column (and 2 in Embarked). Age will be tackled in the recipe section, where PassengerId will also be excluded from the analysis.

Model Building

Splitting the data

Splitting the data into train & test

Code
df_split <- initial_split(df_n, prop = 0.8)
train <- training(df_split)
test <- testing(df_split)

df_split
<Training/Testing/Total>
<712/179/891>
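One refinement worth noting: `initial_split()` can also stratify on the outcome so that train and test keep a similar survival rate. A hedged sketch (not the split used for the results below):

```r
set.seed(2023)
# strata = Survived keeps the 0/1 proportion similar in both partitions
df_split_strat <- initial_split(df_n, prop = 0.8, strata = Survived)
training(df_split_strat) %>% count(Survived)
```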

Creating the recipe

Code
dt_recipe <- recipe(Survived ~ ., data = df_n) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  #replacing NA values in Age with median Age
  step_mutate_at(Age, fn = ~ replace_na(Age, median(Age, na.rm = T))) %>% 
  #updating the role of the PassengerId to exclude from analysis
  update_role(PassengerId, new_role = "id_variable")

dt_recipe
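To inspect the data the recipe will actually hand to the model, the recipe can be `prep()`-ed and `bake()`-d; a quick sketch:

```r
dt_recipe %>%
  prep(training = train) %>%   # estimate dummy levels, means/sds, medians
  bake(new_data = NULL) %>%    # return the processed training set
  glimpse()
```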

Another way to view the recipe using tidy() function

Code
tidy(dt_recipe)
# A tibble: 3 × 6
  number operation type      trained skip  id             
   <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
1      1 step      dummy     FALSE   FALSE dummy_jD1sy    
2      2 step      normalize FALSE   FALSE normalize_CvfCk
3      3 step      mutate_at FALSE   FALSE mutate_at_xAmJj
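As an aside, recipes also ships a dedicated imputation step. A variant of the recipe (a sketch, not the recipe used for the submission below) could impute Age before normalizing, so the median is taken on the raw Age scale:

```r
dt_recipe2 <- recipe(Survived ~ ., data = df_n) %>%
  update_role(PassengerId, new_role = "id_variable") %>%
  # dedicated median imputation, run before normalization
  step_impute_median(Age) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
```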

Model Creation

Declaring a model dt_model as a Decision Tree with tree_depth = 3 and the rpart engine

Code
dt_model <- decision_tree(mode = "classification", tree_depth = 3) %>% 
  set_engine("rpart")
dt_model %>% translate()
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = 3

Computational engine: rpart 

Model fit template:
rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), 
    maxdepth = 3)

Workflow creation

Workflow = recipe + model

Code
dt_wf <- workflow() %>%
  add_model(dt_model) %>% 
  add_recipe(dt_recipe)

Predicting on test data

Fitting the dt_wf workflow on the train data and using the fitted model to predict on the test data

Code
set.seed(2023)
dt_predict <- predict(fit(dt_wf, data = train), test)
head(dt_predict)
# A tibble: 6 × 1
  .pred_class
  <fct>      
1 0          
2 0          
3 0          
4 1          
5 0          
6 0          
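The same fitted workflow can also return class probabilities instead of hard labels, which is useful for threshold-based metrics. A sketch:

```r
# type = "prob" yields .pred_0 / .pred_1 columns, one per factor level
dt_prob <- predict(fit(dt_wf, data = train), test, type = "prob")
head(dt_prob)
```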

Creating a new tibble called predicted_table by binding the predicted values .pred_class to the test data

Code
predicted_table <- bind_cols(test, dt_predict) %>% 
  rename(dt_yhat = .pred_class) %>% 
  select(Survived, dt_yhat) 
head(predicted_table)
# A tibble: 6 × 2
  Survived dt_yhat
  <fct>    <fct>  
1 0        0      
2 0        0      
3 0        0      
4 0        1      
5 0        0      
6 0        0      

Testing accuracy

As mentioned in the Tidy Modeling with R (TMwR) documentation for binary classification metrics, we will try creating the confusion matrix and checking accuracy

Code
conf_mat(predicted_table, truth = Survived, estimate = dt_yhat)
          Truth
Prediction   0   1
         0 109  22
         1  11  37

Estimating the accuracy of our model. As a sanity check, this matches the matrix above: (109 + 37) / 179 ≈ 0.816

Code
accuracy(predicted_table, truth = Survived, estimate = dt_yhat)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.816

In the tidymodels approach, we can define the required metrics separately with metric_set to check the model's performance

Code
classification_metrics <- metric_set(accuracy, f_meas)
predicted_table %>% 
  classification_metrics(truth = Survived, estimate = dt_yhat)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.816
2 f_meas   binary         0.869
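Probability-based metrics such as roc_auc need the predicted class probabilities rather than the hard labels. A hedged sketch (refitting the workflow, since the earlier fit was not stored in a variable):

```r
dt_fit <- fit(dt_wf, data = train)
prob_table <- bind_cols(test, predict(dt_fit, test, type = "prob"))
# yardstick treats the first factor level ("0") as the event by default,
# so we pass the .pred_0 column
roc_auc(prob_table, truth = Survived, .pred_0)
```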

Submission on Kaggle

When I ran this code on Kaggle, the Decision Tree predictions resulted in a score of 0.7799, exactly the same as the DT code written in Python earlier.

Overall, I’m glad that I was able to wrap my head around the tidymodels workflow.

Next steps

  • Figure out how to compare accuracy of different models (KNN, SVM) that I had coded earlier in python
  • Figure out hyper-parameter tuning using the tune package
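As a head start on that last bullet, a hedged sketch (untested here) of tuning tree_depth and cost_complexity with tune() and cross-validation:

```r
# mark the hyper-parameters to tune with tune() placeholders
dt_tune_model <- decision_tree(mode = "classification",
                               tree_depth = tune(),
                               cost_complexity = tune()) %>%
  set_engine("rpart")

dt_tune_wf <- workflow() %>%
  add_model(dt_tune_model) %>%
  add_recipe(dt_recipe)

set.seed(2023)
folds <- vfold_cv(train, v = 5)

# grid = 10 asks tune_grid to generate 10 candidate combinations
dt_results <- tune_grid(dt_tune_wf, resamples = folds, grid = 10)
show_best(dt_results, metric = "accuracy")
```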