Day 10 of #50daysofkaggle

Support Vector Machines

Classification through SVM
kaggle
Author

Me

Published

October 16, 2022

Day 10: Titanic Dataset

Part of an ongoing series to get familiar with working on Kaggle

Progress to date:

  • Download the titanic dataset and assign to train & test
  • Rearranging the data
  • EDA (including plots and finding survival rate using .groupby())
  • Modelling
  • Data preparation
    • one-hot encoding the Sex, Pclass & Embarked columns
    • appending these to the numerical columns
    • normalising the data
    • splitting train into X_train, y_train, X_test, y_test
  • Applying the KNN algorithm
    • finding the right K based on accuracy (best at K = 7)
    • calculating the accuracy on the test set
  • Applying the Decision Trees algorithm
    • with criterion = entropy and max_depth = 3
    • slightly better prediction accuracy than KNN

To do today: classification using the Support Vector Machines (SVM) algorithm.

Reading the data

Reading the CSVs from the zip file and selecting the columns needed for EDA and modelling.

Code
import numpy as np
import pandas as pd
import zipfile

#importing the zipfile already saved in the other folder. 
zf = zipfile.ZipFile("../2022-10-12-day-6-of-50daysofkaggle/titanic.zip")
train = pd.read_csv(zf.open("train.csv"))
test = pd.read_csv(zf.open("test.csv"))

#Selecting only the numerical columns
num_col = train.select_dtypes(include=np.number).columns.tolist()

#deselecting 'PassengerId' and 'Survived' 
del num_col[0:2] #.remove() can drop only 1 item, so delete both with a slice
select_col = num_col

#remaining columns
str_col= ["Sex", "Embarked", "Survived"]
#Adding more elements into a list using `extend` and not `append`
select_col.extend(str_col)

train_eda= train[train.columns.intersection(select_col)]
train_eda
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 S
887 1 1 female 19.0 0 0 30.0000 S
888 0 3 female NaN 1 2 23.4500 S
889 1 1 male 26.0 0 0 30.0000 C
890 0 3 male 32.0 0 0 7.7500 Q

891 rows × 8 columns

Cleaning up the data

Checking for NA values in the dataset.

Code
train_eda.isna().sum().sort_values()
Survived      0
Pclass        0
Sex           0
SibSp         0
Parch         0
Fare          0
Embarked      2
Age         177
dtype: int64

Replacing the missing Age values with the median age (28)

Code
median_age = train_eda.Age.median() #28
train_eda.loc[train_eda.Age.isna(), "Age"] = median_age #assigning via .loc avoids chained indexing and the SettingWithCopyWarning
train_eda.isna().sum().sort_values()
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64
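
Embarked still has 2 missing values. They are dropped just before modelling below, but an alternative (not what this post does) would be to fill them with the most frequent port; a minimal sketch:

Code
# alternative to dropping the 2 rows: fill missing Embarked with the most frequent port
most_common_port = train_eda["Embarked"].mode()[0]  # 'S' (Southampton) in the training data
train_eda.loc[train_eda["Embarked"].isna(), "Embarked"] = most_common_port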

Model Building

Separating X & y

Code
train_eda = train_eda.dropna(axis = 0) #removing the 2 rows with missing Embarked

X = train_eda[["Age", "SibSp", "Parch", "Fare"]]
X = pd.concat([X,pd.get_dummies(data = train_eda[["Sex", "Embarked", "Pclass"]], columns = ["Sex", "Embarked", "Pclass"])], axis = 1)

y = train_eda["Survived"].values

Normalising the data

Transforming X and printing the first 5 data points

Code
from sklearn import preprocessing

X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[-0.56367407,  0.43135024, -0.47432585, -0.50023975, -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.66921696,  0.43135024, -0.47432585,  0.78894661,  1.35991138,
        -1.35991138,  2.07163382, -0.30794088, -1.62128697,  1.77600834,
        -0.51087465, -1.11070624],
       [-0.25545131, -0.47519908, -0.47432585, -0.48664993,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.43804989,  0.43135024, -0.47432585,  0.42286111,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395,  1.77600834,
        -0.51087465, -1.11070624],
       [ 0.43804989, -0.47519908, -0.47432585, -0.4841333 , -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807]])

Splitting into Test & Train data

Splitting into test & train sets and checking the dimensions.

Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set\t :', X_train.shape,  y_train.shape,
'\nTest set\t :', X_test.shape,  y_test.shape)
Train set    : (711, 12) (711,) 
Test set     : (178, 12) (178,)
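
Note that the StandardScaler above was fit on the full feature matrix before splitting, so the test rows contribute to the scaling statistics. A minimal leakage-free sketch, assuming X_unscaled holds the one-hot-encoded frame from before scaling (a hypothetical name, since X was overwritten in place above):

Code
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# split the unscaled features first, then fit the scaler on the training rows only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_unscaled, y, test_size=0.2, random_state=4)

scaler = StandardScaler().fit(X_train_raw)  # scaling statistics come from the training split alone
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)       # test rows reuse the training statistics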

Support Vector Machines

Let's check the classification results using SVM. The first 10 predictions vs actuals are as follows:

Code
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 

yhat_svm = clf.predict(X_test)

print("First 10 actual\t\t:", y_test[0:10],"\nFirst 10 predicted\t:", yhat_svm[0:10])
First 10 actual     : [1 1 0 1 1 1 0 0 0 0] 
First 10 predicted  : [0 1 0 1 1 0 0 0 0 0]
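
The RBF kernel was a first choice rather than a tuned one. A quick sketch (results not claimed here) of comparing scikit-learn's built-in kernels on the same split:

Code
from sklearn import svm, metrics

# fit one SVC per built-in kernel on the same split and compare test accuracy
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = svm.SVC(kernel=kernel).fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, model.predict(X_test))
    print(f"{kernel:<8}: {acc:.4f}")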

Confusion matrix using SVM

Code
from sklearn.metrics import classification_report, confusion_matrix

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat_svm)
np.set_printoptions(precision=2)

print (classification_report(y_test, yhat_svm))
              precision    recall  f1-score   support

           0       0.78      0.95      0.85       105
           1       0.90      0.60      0.72        73

    accuracy                           0.81       178
   macro avg       0.84      0.78      0.79       178
weighted avg       0.83      0.81      0.80       178
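
The cnf_matrix computed above is never actually displayed. A minimal sketch to print it, and optionally plot it with scikit-learn's ConfusionMatrixDisplay (assuming matplotlib is available; the 0/1 labels map to died/survived in this dataset):

Code
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

print(cnf_matrix)  # rows = actual class, columns = predicted class

# optional heatmap of the same matrix
ConfusionMatrixDisplay(confusion_matrix=cnf_matrix, display_labels=["Died", "Survived"]).plot(cmap="Blues")
plt.show()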

Checking the accuracy

Code
from sklearn import metrics

print("SVM Accuracy\t:", metrics.accuracy_score(y_test, yhat_svm),"\nRMSE\t\t\t:", metrics.mean_squared_error(y_test,yhat_svm),"\nNormalised RMSE\t:", metrics.mean_squared_error(y_test,yhat_svm)/np.std(y_test))
SVM Accuracy    : 0.8089887640449438 
RMSE            : 0.19101123595505617 
Normalised RMSE : 0.38834957789472496

Achieved 81% accuracy using SVM, with an MSE of 0.191 (for 0/1 labels this is just the misclassification rate, i.e. 1 - accuracy). This is not as good as Decision Trees, which gave 0.168 on the same metric.
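
To make this comparison less dependent on a single 80/20 split, one option would be cross-validating all three models tried so far. A sketch, assuming the hyper-parameters from the earlier posts (K = 7; entropy criterion with max_depth = 3) and the scaled feature matrix X built above:

Code
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

models = {
    "KNN (k=7)": KNeighborsClassifier(n_neighbors=7),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=3),
    "SVM (rbf)": svm.SVC(kernel="rbf"),
}

# 5-fold cross-validated accuracy on the full scaled feature matrix
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:<14}: {scores.mean():.4f} ± {scores.std():.4f}")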

Therefore, after 10 days of struggle, I have come to the conclusion that Decision Trees is the best-performing of the three classifiers tried so far on the Titanic dataset.