Day 10 of #50daysofkaggle

Support Vector Machines

Classification through SVM
kaggle
Author

Me

Published

October 16, 2022

Day 10: Titanic Dataset

Part of an ongoing series to get familiar with working on Kaggle

Progress to date:

  • Download the titanic dataset and assign to train & test
  • Rearranging the data
  • EDA (including plots and finding survival rate using .groupby())
  • Modelling
  • Data preparation
    • one-hot encoding the Sex, Pclass & Embarked columns
    • appending these to the numerical columns
    • normalising the data
    • splitting train into X_train, y_train, X_test, y_test
  • Applying the KNN algorithm
    • finding the right K based on accuracy (best at K = 7)
    • calculating the accuracy on the test set
  • Applying the Decision Trees algorithm
    • with criterion = entropy and max_depth = 3
    • slightly better prediction accuracy than KNN

To do today: classification using the Support Vector Machines (SVM) algorithm.

Reading the data

Reading the CSVs from the zip file and selecting the columns needed for EDA and modelling.

Code
import numpy as np
import pandas as pd
import zipfile

#importing the zipfile already saved in the other folder. 
zf = zipfile.ZipFile("../2022-10-12-day-6-of-50daysofkaggle/titanic.zip")
train = pd.read_csv(zf.open("train.csv"))
test = pd.read_csv(zf.open("test.csv"))

#Selecting only the numerical columns
num_col = train.select_dtypes(include=np.number).columns.tolist()

#deselecting 'PassengerId' and 'Survived' 
del num_col[0:2] #.remove() can drop only 1 item, so delete both with a slice
select_col = num_col

#remaining columns
str_col= ["Sex", "Embarked", "Survived"]
#Adding more elements into a list using `extend` and not `append`
select_col.extend(str_col)

train_eda= train[train.columns.intersection(select_col)]
train_eda
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 S
887 1 1 female 19.0 0 0 30.0000 S
888 0 3 female NaN 1 2 23.4500 S
889 1 1 male 26.0 0 0 30.0000 C
890 0 3 male 32.0 0 0 7.7500 Q

891 rows × 8 columns

Cleaning up the data

Checking for NA values in the dataset.

Code
train_eda.isna().sum().sort_values()
Survived      0
Pclass        0
Sex           0
SibSp         0
Parch         0
Fare          0
Embarked      2
Age         177
dtype: int64

Replacing the missing Age values with the median age (28)

Code
median_age = train_eda.Age.median() #28
train_eda.loc[train_eda.Age.isna(), "Age"] = median_age #assigning via .loc avoids chained indexing and the SettingWithCopyWarning
train_eda.isna().sum().sort_values()
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64
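
Embarked still has 2 missing values. They are dropped just before modelling below, but an alternative (not what this post does) would be to fill them with the most frequent port; a minimal sketch:

Code
# alternative to dropping the 2 rows: fill missing Embarked with the most frequent port
most_common_port = train_eda["Embarked"].mode()[0]  # 'S' (Southampton) in the training data
train_eda.loc[train_eda["Embarked"].isna(), "Embarked"] = most_common_port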

Model Building

Separating X & y

Code
train_eda = train_eda.dropna(axis = 0) #removing the 2 rows with missing Embarked

X = train_eda[["Age", "SibSp", "Parch", "Fare"]]
X = pd.concat([X,pd.get_dummies(data = train_eda[["Sex", "Embarked", "Pclass"]], columns = ["Sex", "Embarked", "Pclass"])], axis = 1)

y = train_eda["Survived"].values

Normalising the data

Transforming X and printing the first 5 data points

Code
from sklearn import preprocessing

X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[-0.56367407,  0.43135024, -0.47432585, -0.50023975, -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.66921696,  0.43135024, -0.47432585,  0.78894661,  1.35991138,
        -1.35991138,  2.07163382, -0.30794088, -1.62128697,  1.77600834,
        -0.51087465, -1.11070624],
       [-0.25545131, -0.47519908, -0.47432585, -0.48664993,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.43804989,  0.43135024, -0.47432585,  0.42286111,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395,  1.77600834,
        -0.51087465, -1.11070624],
       [ 0.43804989, -0.47519908, -0.47432585, -0.4841333 , -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807]])

Splitting into Test & Train data

Splitting into test & train sets and checking the dimensions.

Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set\t :', X_train.shape,  y_train.shape,
'\nTest set\t :', X_test.shape,  y_test.shape)
Train set    : (711, 12) (711,) 
Test set     : (178, 12) (178,)
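
Note that the StandardScaler above was fit on the full feature matrix before splitting, so the test rows contribute to the scaling statistics. A minimal leakage-free sketch, assuming X_unscaled holds the one-hot-encoded frame from before scaling (a hypothetical name, since X was overwritten in place above):

Code
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# split the unscaled features first, then fit the scaler on the training rows only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_unscaled, y, test_size=0.2, random_state=4)

scaler = StandardScaler().fit(X_train_raw)  # scaling statistics come from the training split alone
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)       # test rows reuse the training statistics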

Support Vector Machines

Let's check the classification results using SVM. The first 10 predictions vs actuals are as follows:

Code
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 

yhat_svm = clf.predict(X_test)

print("First 10 actual\t\t:", y_test[0:10],"\nFirst 10 predicted\t:", yhat_svm[0:10])
First 10 actual     : [1 1 0 1 1 1 0 0 0 0] 
First 10 predicted  : [0 1 0 1 1 0 0 0 0 0]
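
The RBF kernel was a first choice rather than a tuned one. A quick sketch (results not claimed here) of comparing scikit-learn's built-in kernels on the same split:

Code
from sklearn import svm, metrics

# fit one SVC per built-in kernel on the same split and compare test accuracy
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = svm.SVC(kernel=kernel).fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, model.predict(X_test))
    print(f"{kernel:<8}: {acc:.4f}")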

Confusion matrix using SVM

Code
from sklearn.metrics import classification_report, confusion_matrix

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat_svm)
np.set_printoptions(precision=2)

print (classification_report(y_test, yhat_svm))
              precision    recall  f1-score   support

           0       0.78      0.95      0.85       105
           1       0.90      0.60      0.72        73

    accuracy                           0.81       178
   macro avg       0.84      0.78      0.79       178
weighted avg       0.83      0.81      0.80       178
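
The cnf_matrix computed above is never actually displayed. A minimal sketch to print it, and optionally plot it with scikit-learn's ConfusionMatrixDisplay (assuming matplotlib is available; the 0/1 labels map to died/survived in this dataset):

Code
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

print(cnf_matrix)  # rows = actual class, columns = predicted class

# optional heatmap of the same matrix
ConfusionMatrixDisplay(confusion_matrix=cnf_matrix, display_labels=["Died", "Survived"]).plot(cmap="Blues")
plt.show()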

Checking the accuracy

Code
from sklearn import metrics

print("SVM Accuracy\t:", metrics.accuracy_score(y_test, yhat_svm),"\nRMSE\t\t\t:", metrics.mean_squared_error(y_test,yhat_svm),"\nNormalised RMSE\t:", metrics.mean_squared_error(y_test,yhat_svm)/np.std(y_test))
SVM Accuracy    : 0.8089887640449438 
RMSE            : 0.19101123595505617 
Normalised RMSE : 0.38834957789472496

Achieved 81% accuracy using SVM, with an MSE of 0.191 (for 0/1 labels this is just the misclassification rate, i.e. 1 - accuracy). This is not as good as Decision Trees, which gave 0.168 on the same metric.
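
To make this comparison less dependent on a single 80/20 split, one option would be cross-validating all three models tried so far. A sketch, assuming the hyper-parameters from the earlier posts (K = 7; entropy criterion with max_depth = 3) and the scaled feature matrix X built above:

Code
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

models = {
    "KNN (k=7)": KNeighborsClassifier(n_neighbors=7),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=3),
    "SVM (rbf)": svm.SVC(kernel="rbf"),
}

# 5-fold cross-validated accuracy on the full scaled feature matrix
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:<14}: {scores.mean():.4f} ± {scores.std():.4f}")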

Therefore, after 10 days of struggle, I have come to the conclusion that Decision Trees is the best-performing of the three classifiers tried so far on the Titanic dataset.