Day 6 of #50daysofkaggle

Classification using KNN

K-Nearest Neighbours
kaggle
Author

Me

Published

October 12, 2022

Day 6: The Titanic Dataset

Progress till date:

  • Downloading the Titanic dataset and assigning it to train & test
  • Rearranging the data
  • EDA

To do today:

  • Write a function to find the share of survivors for each variable
  • Attempt to create a model

Reading the data

Loading the data using the kaggle library and examining the top rows of the relevant columns.

Code
import requests
import numpy as np
import pandas as pd
import kaggle 
import zipfile 

kaggle.api.authenticate()

kaggle.api.competition_download_files("titanic", path = ".")

zf = zipfile.ZipFile("titanic.zip")
train = pd.read_csv(zf.open("train.csv"))
test = pd.read_csv(zf.open("test.csv"))

#Selecting only the numerical columns
num_col = train.select_dtypes(include=np.number).columns.tolist()

#deselecting 'PassengerId' and 'Survived'
del num_col[0:2] #.remove() can remove only 1 item at a time, so use a slice (or a loop) for more
select_col = num_col

#remaining columns
str_col= ["Sex", "Embarked", "Survived"]

#Adding multiple elements to a list using `extend`, not `append`
select_col.extend(str_col)

train_eda= train[train.columns.intersection(select_col)]
train_eda.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
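As an aside, since `.remove()` deletes only one item at a time, dropping several columns by name (rather than by position, as with the slice above) can be done with a list comprehension. A small sketch using the numeric column names:

Code
cols = ["PassengerId", "Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
drop = {"PassengerId", "Survived"}
kept = [c for c in cols if c not in drop] #keeps order, drops by name
kept #['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']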

How many columns have NA values?

Code
train_eda.isna().sum().sort_values()
Survived      0
Pclass        0
Sex           0
SibSp         0
Parch         0
Fare          0
Embarked      2
Age         177
dtype: int64

177 entries in the Age column have no value. Calculating the median age of the remaining data.

Code
train_eda["Age"].median() #28
28.0

Replacing these with the median age (28) instead of removing them.

Code
train_eda["Age"].fillna(value = train_eda["Age"].median(), inplace = True)
train_eda.isna().sum().sort_values()
C:\Users\DELL\AppData\Local\Temp\ipykernel_5312\1076914416.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64

Today I want to calculate the survival rate for each of these attributes (Pclass, Sex, Embarked).

Code
df_copy2 = pd.DataFrame(columns = ["category", "col", "survive_rate"]) #a list, not a set: sets don't preserve column order

for t in ["Pclass", "Sex", "Embarked"]:
  df_copy = train_eda.groupby([t])["Survived"].mean().reset_index()
  df_copy["category"] = t
  #trying to create a `tidy` version of the data 
  df_copy.rename(columns = {t: "col", "Survived": "survive_rate"}, errors = "raise", inplace = True)
  df_copy = df_copy[["category", "col", "survive_rate"]]
  df_copy2= pd.concat([df_copy2, df_copy], ignore_index = True)


#final table in a tidy format that can be used to create graphs, but I'm keeping that for later
df_copy2[["category", "col", "survive_rate"]]
category col survive_rate
0 Pclass 1 0.62963
1 Pclass 2 0.472826
2 Pclass 3 0.242363
3 Sex female 0.742038
4 Sex male 0.188908
5 Embarked C 0.553571
6 Embarked Q 0.38961
7 Embarked S 0.336957
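As an aside, the same tidy table can be built without the empty starter DataFrame, by concatenating one groupby per attribute; a sketch using the same column names:

Code
tidy = pd.concat(
    [train_eda.groupby(t)["Survived"].mean()
       .rename("survive_rate").rename_axis("col")
       .reset_index().assign(category = t)
     for t in ["Pclass", "Sex", "Embarked"]],
    ignore_index = True)[["category", "col", "survive_rate"]]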

With this, it's pretty clear that within the sex category, males had the least likelihood of surviving, at 19%. The richer class 1 managed a 63% chance of survival while only 24% of the lower class 3 survived. Finally, those who embarked from Cherbourg had a higher survival rate (55%) compared to Southampton (34%).

Model building

Separating the X & y. Here are the first 5 rows of X

Code
train_eda.isna().sum().sort_values()
train_eda = train_eda.dropna(axis = 0) #removing all rows with NA

X = train_eda[["Age", "SibSp", "Parch", "Fare"]]

X = pd.concat([X,pd.get_dummies(data = train_eda[["Sex", "Embarked", "Pclass"]], columns = ["Sex", "Embarked", "Pclass"])], axis = 1)

X.head()
Age SibSp Parch Fare Sex_female Sex_male Embarked_C Embarked_Q Embarked_S Pclass_1 Pclass_2 Pclass_3
0 22.0 1 0 7.2500 0 1 0 0 1 0 0 1
1 38.0 1 0 71.2833 1 0 1 0 0 1 0 0
2 26.0 0 0 7.9250 1 0 0 0 1 0 0 1
3 35.0 1 0 53.1000 1 0 0 0 1 1 0 0
4 35.0 0 0 8.0500 0 1 0 0 1 0 0 1
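Note that `get_dummies` creates one column per category, so pairs like Sex_female and Sex_male carry the same information (one is always 1 minus the other). KNN copes with this fine, but `drop_first = True` gives a leaner matrix if needed; a hypothetical variant:

Code
#hypothetical variant: drop one dummy per category to remove the redundant columns
X_lean = pd.concat([train_eda[["Age", "SibSp", "Parch", "Fare"]],
                    pd.get_dummies(data = train_eda[["Sex", "Embarked", "Pclass"]],
                                   columns = ["Sex", "Embarked", "Pclass"], drop_first = True)],
                   axis = 1)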

First 5 rows of y

Code
y = train_eda["Survived"].values
y[0:5]
array([0, 1, 1, 1, 0], dtype=int64)

Checking dimensions of y & X

Code
len(y) #889 after filling the NA values; previously 712
X.shape #(889, 12)
(889, 12)

Normalising the data

Transforming X and printing the first 5 data points.

Code
from sklearn import preprocessing

X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[-0.56367407,  0.43135024, -0.47432585, -0.50023975, -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.66921696,  0.43135024, -0.47432585,  0.78894661,  1.35991138,
        -1.35991138,  2.07163382, -0.30794088, -1.62128697,  1.77600834,
        -0.51087465, -1.11070624],
       [-0.25545131, -0.47519908, -0.47432585, -0.48664993,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807],
       [ 0.43804989,  0.43135024, -0.47432585,  0.42286111,  1.35991138,
        -1.35991138, -0.48271079, -0.30794088,  0.61679395,  1.77600834,
        -0.51087465, -1.11070624],
       [ 0.43804989, -0.47519908, -0.47432585, -0.4841333 , -0.73534203,
         0.73534203, -0.48271079, -0.30794088,  0.61679395, -0.56306042,
        -0.51087465,  0.90032807]])
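StandardScaler rescales each column to z = (x − mean)/std, so a quick sanity check is that every scaled column now has roughly zero mean and unit standard deviation:

Code
#sanity check: each column should be ~0 mean, ~1 standard deviation after scaling
print(X.mean(axis = 0).round(6))
print(X.std(axis = 0).round(6))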

Splitting into Test & Train data

Splitting into test & train data and comparing the dimensions.

Code
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set\t :', X_train.shape,  y_train.shape,
'\nTest set\t :', X_test.shape,  y_test.shape)
Train set    : (711, 12) (711,) 
Test set     : (178, 12) (178,)
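One caveat with the order above: the scaler was fitted on all of X before the split, so the test rows influence the scaling slightly. A common refinement is to fit the scaler on the training fold only, which a `Pipeline` handles neatly; a sketch, assuming the split is done on the unscaled features:

Code
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

#the pipeline fits the scaler on X_train only, then scales X_test with those statistics
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors = 4))
knn_pipe.fit(X_train, y_train)
knn_pipe.score(X_test, y_test)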

K-Nearest Neighbours

Using KNN at k = 4

Code
from sklearn.neighbors import KNeighborsClassifier
k = 4
neighbours = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neighbours
KNeighborsClassifier(n_neighbors=4)

Predicting the output yhat and checking accuracy

Code
yhat1 = neighbours.predict(X_test)
yhat1[0:5]
array([0, 1, 0, 0, 1], dtype=int64)

Calculating the accuracy at k = 4

Code
from sklearn import metrics

print("Train set Accuracy \t:", metrics.accuracy_score(y_train, neighbours.predict(X_train)), "\nTest set Accuracy \t:", metrics.accuracy_score(y_test, yhat1))
Train set Accuracy  : 0.8509142053445851 
Test set Accuracy   : 0.7584269662921348

(Without replacing the NA values, the previous test accuracy was 78%.)
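Accuracy alone hides how the two classes fare individually, so it can be worth a quick look at the confusion matrix and per-class precision/recall; a sketch using scikit-learn's built-in report:

Code
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, yhat1))
print(classification_report(y_test, yhat1))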

Checking other values of K

Code
from sklearn import metrics

Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

for n in range(1,Ks):
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

mean_acc
array([0.78651685, 0.76404494, 0.7752809 , 0.75842697, 0.78089888,
       0.78651685, 0.80337079, 0.7752809 , 0.78089888])
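The best k can also be read off programmatically (array indices are 0-based, so add 1):

Code
print("Best accuracy", mean_acc.max(), "at k =", mean_acc.argmax() + 1)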

Glad that the IBM Coursera assignments came in handy! Now visualising the accuracy across each K.

Code
import matplotlib.pyplot as plt

plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1,Ks),mean_acc - 3 * std_acc,mean_acc + 3 * std_acc, alpha=0.10,color="green")
plt.legend(('Accuracy ', '+/- 1xstd','+/- 3xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

Looks like the accuracy of KNN is best at 7 neighbours. Previously, without replacing the NA values, the accuracy was highest at k = 5.

Redo with K = 7

Code
k = 7

neighbours_7 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat = neighbours_7.predict(X_test)

print("Train set Accuracy \t:", metrics.accuracy_score(y_train, neighbours_7.predict(X_train)),"\nTest set Accuracy \t:", metrics.accuracy_score(y_test, yhat),"\nRMSE \t\t\t:",metrics.mean_squared_error(y_test, yhat),"\nNormalised RMSE\t\t:",metrics.mean_squared_error(y_test, yhat)/np.std(y_test))
Train set Accuracy  : 0.8509142053445851 
Test set Accuracy   : 0.8033707865168539 
MSE             : 0.19662921348314608 
Normalised MSE      : 0.3997716243033934

We find that the test accuracy is around 80% for KNN[1], with an MSE of 0.197 and a normalised MSE of 40%[2].
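As an alternative to the manual loop above, scikit-learn's GridSearchCV can pick k by cross-validation on the training set instead of by peeking at the test set; a sketch, assuming the same X_train and y_train:

Code
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

#5-fold cross-validation over k = 1..9 on the training data only
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(1, 10))}, cv = 5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)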

Footnotes

  1. Pretty much the same as the previous attempt before replacing the NA values.↩︎

  2. Actually, the normalised MSE is not needed here as all the models are on the same scale. It is typically used to compare models across different scales (log, decimal, etc.).↩︎