Day 6 of #50daysofkaggle

Classification using KNN

K-Nearest Neighbours
kaggle
Author

Me

Published

October 12, 2022

Day 6: The Titanic Dataset

Progress till date:

  • Downloading the Titanic dataset and assigning it to train & test
  • Rearranging the data
  • EDA

To do today:

  • Write a function to find the share of survivors for each variable
  • Attempt to create a model

Reading the data

Loading the data using the kaggle library and examining the top rows of the relevant columns.

Code
import requests
import numpy as np
import pandas as pd
import kaggle 
import zipfile 

kaggle.api.authenticate()

kaggle.api.competition_download_files("titanic", path = ".")

zf = zipfile.ZipFile("titanic.zip")
train = pd.read_csv(zf.open("train.csv"))
test = pd.read_csv(zf.open("test.csv"))

#Selecting only the numerical columns
num_col = train.select_dtypes(include=np.number).columns.tolist()

#deselecting 'PassengerId' and 'Survived'
del num_col[0:2] #.remove() can remove only 1 item at a time, so use a slice (or a loop) for more
select_col = num_col

#remaining columns
str_col= ["Sex", "Embarked", "Survived"]

#Adding multiple elements to a list using `extend`, not `append`
select_col.extend(str_col)

train_eda= train[train.columns.intersection(select_col)]
train_eda.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
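As an aside, since `.remove()` deletes only one item at a time, dropping several columns by name (rather than by position, as with the slice above) can be done with a list comprehension. A small sketch using the numeric column names:

Code
cols = ["PassengerId", "Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
drop = {"PassengerId", "Survived"}
kept = [c for c in cols if c not in drop] #keeps order, drops by name
kept #['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']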

How many columns have NA values?

Code
train_eda.isna().sum().sort_values()
Survived      0
Pclass        0
Sex           0
SibSp         0
Parch         0
Fare          0
Embarked      2
Age         177
dtype: int64

177 entries in the Age column have no value. Calculating the median age of the remaining data.

Code
train_eda["Age"].median() #28
28.0

Replacing these with the median age (28) instead of removing them.

Code
train_eda["Age"].fillna(value = train_eda["Age"].median(), inplace = True)
train_eda.isna().sum().sort_values()
C:\Users\DELL\AppData\Local\Temp\ipykernel_5312\1076914416.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64

Today I want to calculate the survival rate for each of these attributes (Pclass, Sex, Embarked).

Code
df_copy2 = pd.DataFrame(columns = ["category", "col", "survive_rate"]) #a list, not a set: sets don't preserve column order

for t in ["Pclass", "Sex", "Embarked"]:
  df_copy = train_eda.groupby([t])["Survived"].mean().reset_index()
  df_copy["category"] = t
  #trying to create a `tidy` version of the data 
  df_copy.rename(columns = {t: "col", "Survived": "survive_rate"}, errors = "raise", inplace = True)
  df_copy = df_copy[["category", "col", "survive_rate"]]
  df_copy2= pd.concat([df_copy2, df_copy], ignore_index = True)


#final table in a tidy format that can be used to create graphs, but I'm keeping that for later
df_copy2[["category", "col", "survive_rate"]]
category col survive_rate
0 Pclass 1 0.62963
1 Pclass 2 0.472826
2 Pclass 3 0.242363
3 Sex female 0.742038
4 Sex male 0.188908
5 Embarked C 0.553571
6 Embarked Q 0.38961
7 Embarked S 0.336957
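As an aside, the same tidy table can be built without the empty starter DataFrame, by concatenating one groupby per attribute; a sketch using the same column names:

Code
tidy = pd.concat(
    [train_eda.groupby(t)["Survived"].mean()
       .rename("survive_rate").rename_axis("col")
       .reset_index().assign(category = t)
     for t in ["Pclass", "Sex", "Embarked"]],
    ignore_index = True)[["category", "col", "survive_rate"]]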

With this, it's pretty clear that within the sex category, males had the least likelihood of surviving, at 19%. The richer class 1 managed a 63% chance of survival while only 24% of the lower class 3 survived. Finally, those who embarked from Cherbourg had a higher survival rate (55%) compared to Southampton (34%).

Model building

Separating the X & y. Here are the first 5 rows of X

Code
train_eda.isna().sum().sort_values()
train_eda = train_eda.dropna(axis = 0) #removing all rows with NA

X = train_eda[["Age", "SibSp", "Parch", "Fare"]]

X = pd.concat([X,pd.get_dummies(data = train_eda[["Sex", "Embarked", "Pclass"]], columns = ["Sex", "Embarked", "Pclass"])], axis = 1)

X.head()
Age SibSp Parch Fare Sex_female Sex_male Embarked_C Embarked_Q Embarked_S Pclass_1 Pclass_2 Pclass_3
0 22.0 1 0 7.2500 0 1 0 0 1 0 0 1
1 38.0 1 0 71.2833 1 0 1 0 0 1 0 0
2 26.0 0 0 7.9250 1 0 0 0 1 0 0 1
3 35.0 1 0 53.1000 1 0 0 0 1 1 0 0
4 35.0 0 0 8.0500 0 1 0 0 1 0 0 1
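Note that `get_dummies` creates one column per category, so pairs like Sex_female and Sex_male carry the same information (one is always 1 minus the other). KNN copes with this fine, but `drop_first = True` gives a leaner matrix if needed; a hypothetical variant:

Code
#hypothetical variant: drop one dummy per category to remove the redundant columns
X_lean = pd.concat([train_eda[["Age", "SibSp", "Parch", "Fare"]],
                    pd.get_dummies(data = train_eda[["Sex", "Embarked", "Pclass"]],
                                   columns = ["Sex", "Embarked", "Pclass"], drop_first = True)],
                   axis = 1)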

First 5 rows of y

Code
y = train_eda["Survived"].values
y[0:5]
array([0, 1, 1, 1, 0], dtype=int64)

Checking dimensions of y & X

Code
len(y) #889 after filling the NA values; previously 712
X.shape #(889, 12)
(889, 12)

Normalising the data

Transforming X and printing the first 5 data points.

Code
from sklearn import preprocessing

X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[-0.56367407,  0.43135024, -0.47432585, -0.50023975, -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.66921696,  0.43135024, -0.47432585,  0.78894661,  1.35991138,
        -1.35991138,  2.07163382, -0.30794088, -1.62128697,  1.77600834,
        -0.51087465, -1.11070624],
       [-0.25545131, -0.47519908, -0.47432585, -0.48664993,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.43804989,  0.43135024, -0.47432585,  0.42286111,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395,  1.77600834,
        -0.51087465, -1.11070624],
       [ 0.43804989, -0.47519908, -0.47432585, -0.4841333 , -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807]])
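StandardScaler rescales each column to z = (x − mean)/std, so a quick sanity check is that every scaled column now has roughly zero mean and unit standard deviation:

Code
#sanity check: each column should be ~0 mean, ~1 standard deviation after scaling
print(X.mean(axis = 0).round(6))
print(X.std(axis = 0).round(6))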

Splitting into Test & Train data

Splitting into test & train data and comparing the dimensions.

Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set\t :', X_train.shape,  y_train.shape,
'\nTest set\t :', X_test.shape,  y_test.shape)
Train set    : (711, 12) (711,) 
Test set     : (178, 12) (178,)
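One caveat with the order above: the scaler was fitted on all of X before the split, so the test rows influence the scaling slightly. A common refinement is to fit the scaler on the training fold only, which a `Pipeline` handles neatly; a sketch, assuming the split is done on the unscaled features:

Code
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

#the pipeline fits the scaler on X_train only, then scales X_test with those statistics
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors = 4))
knn_pipe.fit(X_train, y_train)
knn_pipe.score(X_test, y_test)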

K-Nearest Neighbours

Using KNN at k = 4

Code
from sklearn.neighbors import KNeighborsClassifier
k = 4
neighbours = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neighbours
KNeighborsClassifier(n_neighbors=4)

Predicting the output yhat and checking accuracy

Code
yhat1 = neighbours.predict(X_test)
yhat1[0:5]
array([0, 1, 0, 0, 1], dtype=int64)

Calculating the accuracy at k = 4

Code
from sklearn import metrics

print("Train set Accuracy \t:", metrics.accuracy_score(y_train, neighbours.predict(X_train)), "\nTest set Accuracy \t:", metrics.accuracy_score(y_test, yhat1))
Train set Accuracy  : 0.8509142053445851 
Test set Accuracy   : 0.7584269662921348

(Without replacing the NA values, the previous test accuracy was 78%.)
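Accuracy alone hides how the two classes fare individually, so it can be worth a quick look at the confusion matrix and per-class precision/recall; a sketch using scikit-learn's built-in report:

Code
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, yhat1))
print(classification_report(y_test, yhat1))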

Checking other values of K

Code
from sklearn import metrics

Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

for n in range(1,Ks):
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

mean_acc
array([0.78651685, 0.76404494, 0.7752809 , 0.75842697, 0.78089888,
       0.78651685, 0.80337079, 0.7752809 , 0.78089888])
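The best k can also be read off programmatically (array indices are 0-based, so add 1):

Code
print("Best accuracy", mean_acc.max(), "at k =", mean_acc.argmax() + 1)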

Glad that the IBM Coursera assignments came in handy! Now visualising the accuracy across each K.

Code
import matplotlib.pyplot as plt

plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1,Ks),mean_acc - 3 * std_acc,mean_acc + 3 * std_acc, alpha=0.10,color="green")
plt.legend(('Accuracy ', '+/- 1xstd','+/- 3xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

Looks like the accuracy of KNN is best at 7 neighbours. Previously, without replacing the NA values, the accuracy was highest at k = 5.

Redo with K = 7

Code
k = 7

neighbours_7 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat = neighbours_7.predict(X_test)

print("Train set Accuracy \t:", metrics.accuracy_score(y_train, neighbours_7.predict(X_train)),"\nTest set Accuracy \t:", metrics.accuracy_score(y_test, yhat),"\nRMSE \t\t\t:",metrics.mean_squared_error(y_test, yhat),"\nNormalised RMSE\t\t:",metrics.mean_squared_error(y_test, yhat)/np.std(y_test))
Train set Accuracy  : 0.8509142053445851 
Test set Accuracy   : 0.8033707865168539 
MSE             : 0.19662921348314608 
Normalised MSE      : 0.3997716243033934

We find that the test accuracy is around 80% for KNN[1], with an MSE of 0.197 and a normalised MSE of 40%[2].
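As an alternative to the manual loop above, scikit-learn's GridSearchCV can pick k by cross-validation on the training set instead of by peeking at the test set; a sketch, assuming the same X_train and y_train:

Code
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

#5-fold cross-validation over k = 1..9 on the training data only
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(1, 10))}, cv = 5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)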

Footnotes

  1. Pretty much the same as the previous attempt before replacing the NA values.↩︎

  2. Actually, the normalised MSE is not needed here as all the models are on the same scale. It is typically used to compare models across different scales (log, decimal, etc.).↩︎