Day 8 of #50daysofkaggle

Decision Tree

Published: October 14, 2022

Day 8: Titanic Dataset

Progress till date:

  • Download the Titanic dataset and assign it to train & test
  • Rearrange the data
  • EDA (including plots and finding the survival rate using .groupby())
  • Modelling
  • Data preparation
    • one-hot encoding the Sex, Pclass & Embarked columns
    • appending these to the numerical columns
    • normalising the data
    • splitting train into X_train, y_train, X_test, y_test
  • Applying the KNN algo
    • finding the right K based on accuracy (best at K = 7)
    • calculating the accuracy on the test set

To do today:

  • Perform Decision Tree classification

Loading the data

Reading and printing the top 5 rows

Code
import numpy as np
import pandas as pd
import zipfile


#importing the zipfile already saved in the other folder. 
zf = zipfile.ZipFile("../2022-10-12-day-6-of-50daysofkaggle/titanic.zip")
train = pd.read_csv(zf.open("train.csv"))
test = pd.read_csv(zf.open("test.csv"))

#Selecting only the numerical columns
num_col = train.select_dtypes(include=np.number).columns.tolist()

#deselecting PassengerId and 'Survived'
del num_col[0:2] #.remove() can remove only 1 item; a name-based alternative is sketched after the output below
select_col = num_col

#remaining columns
str_col= ["Sex", "Embarked", "Survived"]


#Adding more elements into a list using `extend` and not `append`
select_col.extend(str_col)

train_eda= train[train.columns.intersection(select_col)]
train_eda.head()
   Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0         0       3    male  22.0      1      0   7.2500        S
1         1       1  female  38.0      1      0  71.2833        C
2         1       3  female  26.0      0      0   7.9250        S
3         1       1  female  35.0      1      0  53.1000        S
4         0       3    male  35.0      0      0   8.0500        S
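As the comment above hints, .remove() drops only one item at a time. A list comprehension that filters by name is an alternative to del that doesn't rely on PassengerId and Survived being the first two numeric columns; a minimal sketch:

Code
# Drop unwanted columns by name rather than by position.
num_col = train.select_dtypes(include=np.number).columns.tolist()
num_col = [c for c in num_col if c not in ("PassengerId", "Survived")]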

Cleaning up the data

Checking all NA values in the existing dataset

Code
train_eda.isna().sum().sort_values()
Survived      0
Pclass        0
Sex           0
SibSp         0
Parch         0
Fare          0
Embarked      2
Age         177
dtype: int64

Finding the median

Code
train_eda["Age"].median()
28.0

Replacing NA cells with the median

Code
train_eda["Age"].fillna(value = train_eda["Age"].median(), inplace = True)
train_eda.isna().sum().sort_values()
C:\Users\DELL\AppData\Local\Temp\ipykernel_9908\1076914416.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64

Sidenote: Was getting a weird warning (SettingWithCopyWarning) while using .fillna() to replace NA with the median values. Turns out there’s a difference between calling a view and a copy. One way of avoiding this warning is to use train_eda.loc[:,"Age"] instead of train_eda["Age"], because .loc returns the view (original) while chained subsetting may return a copy. Elegant explanation here. The code below will not throw a warning.

Code
xx = train_eda.copy()
xx.loc[:,"Age"].fillna(value = xx.Age.median(), inplace = True)
xx.isna().sum()
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64
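For completeness, the pattern the pandas docs themselves push is to skip inplace=True on a selection entirely and assign the filled column back; a minimal sketch on another copy:

Code
# Plain assignment sidesteps the view-vs-copy question altogether.
yy = train_eda.copy()
yy["Age"] = yy["Age"].fillna(yy["Age"].median())
yy.isna().sum()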

Model Building

Separating X & y. Here are the first 5 rows of X:

Code
train_eda = train_eda.dropna(axis = 0) #removing all rows with NA

X = train_eda[["Age", "SibSp", "Parch", "Fare"]]
X = pd.concat([X,pd.get_dummies(data = train_eda[["Sex", "Embarked", "Pclass"]], columns = ["Sex", "Embarked", "Pclass"])], axis = 1)

X.head()
    Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S  Pclass_1  Pclass_2  Pclass_3
0  22.0      1      0   7.2500           0         1           0           0           1         0         0         1
1  38.0      1      0  71.2833           1         0           1           0           0         1         0         0
2  26.0      0      0   7.9250           1         0           0           0           1         0         0         1
3  35.0      1      0  53.1000           1         0           0           0           1         1         0         0
4  35.0      0      0   8.0500           0         1           0           0           1         0         0         1
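A side note on pd.get_dummies: keeping every level makes each feature's dummies sum to 1 (Sex_female + Sex_male = 1), which is harmless for trees and KNN but redundant for linear models. If that mattered, drop_first=True keeps k−1 dummies per feature instead of k; an illustrative sketch:

Code
# drop_first=True drops one level per categorical, removing the
# perfectly collinear dummy column. Illustrative only.
X_alt = pd.concat(
    [train_eda[["Age", "SibSp", "Parch", "Fare"]],
     pd.get_dummies(train_eda[["Sex", "Embarked", "Pclass"]],
                    columns=["Sex", "Embarked", "Pclass"],
                    drop_first=True)],
    axis=1)
X_alt.head()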

Here are the first 5 rows of y:

Code
y = train_eda["Survived"].values
y[0:5]
array([0, 1, 1, 1, 0], dtype=int64)

Comparing the shapes of X and y

Code
len(y) #889 after filling the Age NAs; only the 2 rows missing Embarked were dropped (previously 712)
X.shape #(889, 12)
(889, 12)

Normalising the data

Standardising and printing the first 5 datapoints.

Code
from sklearn import preprocessing

X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[-0.56367407,  0.43135024, -0.47432585, -0.50023975, -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.66921696,  0.43135024, -0.47432585,  0.78894661,  1.35991138,
        -1.35991138,  2.07163382, -0.30794088, -1.62128697,  1.77600834,
        -0.51087465, -1.11070624],
       [-0.25545131, -0.47519908, -0.47432585, -0.48664993,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.43804989,  0.43135024, -0.47432585,  0.42286111,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395,  1.77600834,
        -0.51087465, -1.11070624],
       [ 0.43804989, -0.47519908, -0.47432585, -0.4841333 , -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807]])

Splitting into Test & Train data

Splitting into test & train data and comparing the dimensions.

Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set\t :', X_train.shape,  y_train.shape,
'\nTest set\t :', X_test.shape,  y_test.shape)
Train set    : (711, 12) (711,) 
Test set     : (178, 12) (178,)
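One caveat, flagged here rather than fixed: the scaler above was fit on all of X before the split, so the test rows influence the scaling statistics. A leak-free variant (a sketch of a hypothetical reordering, not what this notebook does) would split first and fit the scaler on the training portion only:

Code
# Hypothetical reordering of the pipeline: split the raw features
# first, then fit the scaler on the training portion only so the
# test set never influences the scaling statistics.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)
scaler = StandardScaler().fit(X_tr)
X_tr = scaler.transform(X_tr)
X_te = scaler.transform(X_te)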

Decision Trees

Let’s check the classification results using Decision Trees. The first 10 predictions are as follows:

Code
from sklearn.tree import DecisionTreeClassifier

Dtree = DecisionTreeClassifier(criterion = "entropy", max_depth = 3)
Dtree.fit(X_train,y_train)
y_test_hat = Dtree.predict(X_test)
print("First 10 actual\t\t:", y_test[0:10],"\nFirst 10 predicted\t:", y_test_hat[0:10])
First 10 actual     : [1 1 0 1 1 1 0 0 0 0] 
First 10 predicted  : [1 1 0 1 1 0 0 0 0 0]

Checking Accuracy of DT

Calculating the accuracy of the Decision Tree classification on y_test

Code
from sklearn import metrics

print("Decision Tree Accuracy\t:", metrics.accuracy_score(y_test, y_test_hat),"\nRMSE\t\t\t:", metrics.mean_squared_error(y_test,y_test_hat),"\nNormalised RMSE\t\t:", metrics.mean_squared_error(y_test,y_test_hat)/np.std(y_test))
Decision Tree Accuracy  : 0.8314606741573034 
RMSE            : 0.16853932584269662 
Normalised RMSE     : 0.34266139226005143

Not bad. We find that test accuracy is around 83% for the Decision Tree, with an MSE of 0.168. (Note that mean_squared_error gives the MSE, not the RMSE, and for 0/1 labels the MSE is simply the misclassification rate, i.e. 1 − accuracy.)
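The max_depth = 3 above was a guess. Just as Day 7 swept K for KNN, one could sweep the tree depth and watch test accuracy; a minimal sketch (the 1–9 range is an arbitrary choice):

Code
# Sweep max_depth over an illustrative range and report test accuracy;
# mirrors the K search done for KNN on Day 7.
for depth in range(1, 10):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth)
    clf.fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(X_test))
    print("max_depth =", depth, "\taccuracy =", round(acc, 4))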

Visualising the DT

Here’s a neat little trick to see how the DT actually thinks.

Code
from sklearn import tree
import matplotlib.pyplot as plt

plt.clf()
tree.plot_tree(Dtree)
plt.show()
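By default the plot labels each split X[i], which is hard to map back to the data. plot_tree also accepts feature_names and class_names; a sketch, where feat_names is an assumption: the 12 columns of X in encoding order, rebuilt by hand since scaling replaced the DataFrame with a NumPy array.

Code
# feat_names reconstructs the column order of the one-hot X shown earlier.
feat_names = ["Age", "SibSp", "Parch", "Fare",
              "Sex_female", "Sex_male",
              "Embarked_C", "Embarked_Q", "Embarked_S",
              "Pclass_1", "Pclass_2", "Pclass_3"]

plt.figure(figsize=(16, 8))
tree.plot_tree(Dtree, feature_names=feat_names,
               class_names=["Died", "Survived"], filled=True)
plt.show()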