day 1 of #50daysofkaggle

kaggle
Author

Me

Published

October 7, 2022

Introducing my own personal training sprint: “50 Days of Kaggle”.

The task is simple:

  1. Improve my Kaggle score by the end of 50 days.
  2. Work on ML models daily. Try interacting on the portal as much as possible.
  3. Keep ISLR notes handy. Blog over here for revision.
  4. Use Python, or else R tidymodels (neither of which I am currently proficient in).

I want to use this blog to journal my progress. Hopefully by 26th Nov ’22, I’ll have improved from where I’m starting out.

So what do we have for Day 1?

The Titanic Dataset

Everyone’s starting point, and I’m slowly starting to appreciate why. Let’s see if we can read the data directly into this notebook.

Reading the data

First things first, import libraries

Code
import requests
import numpy as np
import pandas as pd
import kaggle

# Looks for API credentials in ~/.kaggle/kaggle.json
kaggle.api.authenticate()

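Note to self: the command below did not work, most likely because dataset_download_files is meant for datasets (referenced as owner/dataset-name), while Titanic is a competition.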

Code
#kaggle.api.dataset_download_files("titanic", path = ".", unzip = True)

However, this one does, as per this link: https://www.kaggle.com/general/138914

Code
kaggle.api.competition_download_files("titanic", path = ".")
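
Since this notebook gets re-run a lot, it may be worth skipping the download when the archive is already on disk. A minimal sketch, assuming the archive lands at <path>/<competition>.zip (which is what happened here); download_once is a hypothetical helper name, not part of the kaggle package:

```python
from pathlib import Path


def download_once(api, competition, path="."):
    """Download a competition archive only if it is not already on disk.

    `api` is expected to be an authenticated client such as `kaggle.api`.
    Assumption: the archive is saved as <path>/<competition>.zip.
    """
    archive = Path(path) / f"{competition}.zip"
    if not archive.exists():
        api.competition_download_files(competition, path=path)
    return archive
```

In this notebook the call would be download_once(kaggle.api, "titanic").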

This pulls the .zip file into the local folder. Because it is a zip file, we need the zipfile module to read it (note to self: zipfile is part of Python’s standard library, so no reticulate::py_install() is needed).

https://stackoverflow.com/a/56786517/7938068
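
Before reading anything, it can help to peek at what is inside the archive. A small sketch using only the standard-library zipfile module; list_csv_members is a hypothetical helper name:

```python
import zipfile


def list_csv_members(archive_path):
    """Return the names of the .csv files inside a zip archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return [name for name in zf.namelist() if name.endswith(".csv")]
```

Calling list_csv_members("titanic.zip") should include train.csv and test.csv, the two files used in this post.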

Reading and checking the first rows of train

Code
import zipfile

# Read both CSVs straight out of the archive, then close it
with zipfile.ZipFile("titanic.zip") as zf:
    train = pd.read_csv(zf.open("train.csv"))
    test = pd.read_csv(zf.open("test.csv"))
train.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Checking the first rows of test

Code
test.head()
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S
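
Comparing the two heads, test looks like train minus one column. A quick way to confirm that, sketched on two tiny stand-in frames (values copied from the first rows of each table, with only a few columns kept); missing_columns is a hypothetical helper name:

```python
import pandas as pd


def missing_columns(train, test):
    """Return the columns present in train but absent from test, in train's order."""
    return [col for col in train.columns if col not in test.columns]


# Stand-in frames with a subset of the Titanic columns
train_demo = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
test_demo = pd.DataFrame({"PassengerId": [892, 893], "Pclass": [3, 3]})

print(missing_columns(train_demo, test_demo))  # ['Survived']
```

Survived is the column to predict, so it makes sense that only train has it. On the real data the call would be missing_columns(train, test).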

This took me a whole day to figure out. End of Day 1 🤷‍♂️