Day 8 of #50daysofkaggle

Decision Tree

Published: October 14, 2022

Day 8: Titanic Dataset

Progress till date:

  • Download the Titanic dataset and assign it to train & test
  • Rearrange the data
  • EDA (including plots and finding the survival rate using .groupby())
  • Modelling
  • Data preparation
    • one-hot encoding the Sex, Pclass & Embarked columns
    • appending these to the numerical columns
    • normalising the data
    • splitting train into X_train, y_train, X_test, y_test
  • Applying the KNN algo
    • finding the right K based on accuracy (best at K = 7)
    • calculating the accuracy on the test set

To do today:

  • Perform Decision Tree classification

Loading the data

Reading and printing the top 5 rows

Code
import numpy as np
import pandas as pd
import zipfile


#importing the zipfile already saved in the other folder. 
zf = zipfile.ZipFile("../2022-10-12-day-6-of-50daysofkaggle/titanic.zip")
train = pd.read_csv(zf.open("train.csv"))
test = pd.read_csv(zf.open("test.csv"))

#Selecting only the numerical columns
num_col = train.select_dtypes(include=np.number).columns.tolist()

#deselecting PassengerId and 'Survived'
del num_col[0:2] #.remove() can remove only 1 item; a name-based alternative is sketched after the output below
select_col = num_col

#remaining columns
str_col= ["Sex", "Embarked", "Survived"]


#Adding more elements into a list using `extend` and not `append`
select_col.extend(str_col)

train_eda= train[train.columns.intersection(select_col)]
train_eda.head()
   Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S
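As the comment above hints, .remove() drops only one item at a time. A list comprehension that filters by name is an alternative to del that doesn't rely on PassengerId and Survived being the first two numeric columns; a minimal sketch:

Code
# Drop unwanted columns by name rather than by position.
num_col = train.select_dtypes(include=np.number).columns.tolist()
num_col = [c for c in num_col if c not in ("PassengerId", "Survived")]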

Cleaning up the data

Checking all NA values in the existing dataset

Code
train_eda.isna().sum().sort_values()
Survived      0
Pclass        0
Sex           0
SibSp         0
Parch         0
Fare          0
Embarked      2
Age         177
dtype: int64

Finding the median

Code
train_eda["Age"].median()
28.0

Replacing NA cells with the median

Code
train_eda["Age"].fillna(value = train_eda["Age"].median(), inplace = True)
train_eda.isna().sum().sort_values()
C:\Users\DELL\AppData\Local\Temp\ipykernel_9908\1076914416.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64

Sidenote: Was getting a weird warning (SettingWithCopyWarning) while using .fillna() to replace NA with the median values. Turns out there’s a difference between calling a view and a copy. One way of avoiding this warning is to use train_eda.loc[:,"Age"] instead of train_eda["Age"], because .loc returns the view (original) while chained subsetting may return a copy. Elegant explanation here. The code below will not throw a warning.

Code
xx = train_eda.copy()
xx.loc[:,"Age"].fillna(value = xx.Age.median(), inplace = True)
xx.isna().sum()
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64
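For completeness, the pattern the pandas docs themselves push is to skip inplace=True on a selection entirely and assign the filled column back; a minimal sketch on another copy:

Code
# Plain assignment sidesteps the view-vs-copy question altogether.
yy = train_eda.copy()
yy["Age"] = yy["Age"].fillna(yy["Age"].median())
yy.isna().sum()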

Model Building

Separating X & y. Here are the first 5 rows of X:

Code
train_eda = train_eda.dropna(axis = 0) #removing all rows with NA

X = train_eda[["Age", "SibSp", "Parch", "Fare"]]
X = pd.concat([X,pd.get_dummies(data = train_eda[["Sex", "Embarked", "Pclass"]], columns = ["Sex", "Embarked", "Pclass"])], axis = 1)

X.head()
    Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S  Pclass_1  Pclass_2  Pclass_3
0  22.0      1      0   7.2500           0         1           0           0           1         0         0         1
1  38.0      1      0  71.2833           1         0           1           0           0         1         0         0
2  26.0      0      0   7.9250           1         0           0           0           1         0         0         1
3  35.0      1      0  53.1000           1         0           0           0           1         1         0         0
4  35.0      0      0   8.0500           0         1           0           0           1         0         0         1
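A side note on pd.get_dummies: keeping every level makes each feature's dummies sum to 1 (Sex_female + Sex_male = 1), which is harmless for trees and KNN but redundant for linear models. If that mattered, drop_first=True keeps k−1 dummies per feature instead of k; an illustrative sketch:

Code
# drop_first=True drops one level per categorical, removing the
# perfectly collinear dummy column. Illustrative only.
X_alt = pd.concat(
    [train_eda[["Age", "SibSp", "Parch", "Fare"]],
     pd.get_dummies(train_eda[["Sex", "Embarked", "Pclass"]],
                    columns=["Sex", "Embarked", "Pclass"],
                    drop_first=True)],
    axis=1)
X_alt.head()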

Here are the first 5 rows of y:

Code
y = train_eda["Survived"].values
y[0:5]
array([0, 1, 1, 1, 0], dtype=int64)

Comparing the shapes of X and y

Code
len(y) #889 after filling the Age NAs; only the 2 rows missing Embarked were dropped (previously 712)
X.shape #(889, 12)
(889, 12)

Normalising the data

Standardising and printing the first 5 datapoints.

Code
from sklearn import preprocessing

X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[-0.56367407,  0.43135024, -0.47432585, -0.50023975, -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.66921696,  0.43135024, -0.47432585,  0.78894661,  1.35991138,
        -1.35991138,  2.07163382, -0.30794088, -1.62128697,  1.77600834,
        -0.51087465, -1.11070624],
       [-0.25545131, -0.47519908, -0.47432585, -0.48664993,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.43804989,  0.43135024, -0.47432585,  0.42286111,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395,  1.77600834,
        -0.51087465, -1.11070624],
       [ 0.43804989, -0.47519908, -0.47432585, -0.4841333 , -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807]])

Splitting into Test & Train data

Splitting into test & train data and comparing the dimensions.

Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set\t :', X_train.shape,  y_train.shape,
'\nTest set\t :', X_test.shape,  y_test.shape)
Train set    : (711, 12) (711,) 
Test set     : (178, 12) (178,)
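One caveat, flagged here rather than fixed: the scaler above was fit on all of X before the split, so the test rows influence the scaling statistics. A leak-free variant (a sketch of a hypothetical reordering, not what this notebook does) would split first and fit the scaler on the training portion only:

Code
# Hypothetical reordering of the pipeline: split the raw features
# first, then fit the scaler on the training portion only so the
# test set never influences the scaling statistics.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)
scaler = StandardScaler().fit(X_tr)
X_tr = scaler.transform(X_tr)
X_te = scaler.transform(X_te)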

Decision Trees

Let’s check the classification results using Decision Trees. The first 10 predictions are as follows:

Code
from sklearn.tree import DecisionTreeClassifier

Dtree = DecisionTreeClassifier(criterion = "entropy", max_depth = 3)
Dtree.fit(X_train,y_train)
y_test_hat = Dtree.predict(X_test)
print("First 10 actual\t\t:", y_test[0:10],"\nFirst 10 predicted\t:", y_test_hat[0:10])
First 10 actual     : [1 1 0 1 1 1 0 0 0 0] 
First 10 predicted  : [1 1 0 1 1 0 0 0 0 0]

Checking Accuracy of DT

Calculating the accuracy of the Decision Tree classification on y_test

Code
from sklearn import metrics

print("Decision Tree Accuracy\t:", metrics.accuracy_score(y_test, y_test_hat),"\nRMSE\t\t\t:", metrics.mean_squared_error(y_test,y_test_hat),"\nNormalised RMSE\t\t:", metrics.mean_squared_error(y_test,y_test_hat)/np.std(y_test))
Decision Tree Accuracy  : 0.8314606741573034 
RMSE            : 0.16853932584269662 
Normalised RMSE     : 0.34266139226005143

Not bad. We find that test accuracy is around 83% for the Decision Tree, with an MSE of 0.168. (Note that mean_squared_error gives the MSE, not the RMSE, and for 0/1 labels the MSE is simply the misclassification rate, i.e. 1 − accuracy.)
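The max_depth = 3 above was a guess. Just as Day 7 swept K for KNN, one could sweep the tree depth and watch test accuracy; a minimal sketch (the 1–9 range is an arbitrary choice):

Code
# Sweep max_depth over an illustrative range and report test accuracy;
# mirrors the K search done for KNN on Day 7.
for depth in range(1, 10):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth)
    clf.fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(X_test))
    print("max_depth =", depth, "\taccuracy =", round(acc, 4))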

Visualising the DT

Here’s a neat little trick to see how the DT actually thinks.

Code
from sklearn import tree
import matplotlib.pyplot as plt

plt.clf()
tree.plot_tree(Dtree)
plt.show()
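By default the plot labels each split X[i], which is hard to map back to the data. plot_tree also accepts feature_names and class_names; a sketch, where feat_names is an assumption: the 12 columns of X in encoding order, rebuilt by hand since scaling replaced the DataFrame with a NumPy array.

Code
# feat_names reconstructs the column order of the one-hot X shown earlier.
feat_names = ["Age", "SibSp", "Parch", "Fare",
              "Sex_female", "Sex_male",
              "Embarked_C", "Embarked_Q", "Embarked_S",
              "Pclass_1", "Pclass_2", "Pclass_3"]

plt.figure(figsize=(16, 8))
tree.plot_tree(Dtree, feature_names=feat_names,
               class_names=["Died", "Survived"], filled=True)
plt.show()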