Titanic Kaggle Competition
Project Overview
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: "what sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.).
My Role in the Project
As a data scientist on this Kaggle competition team, I used feature engineering and ensemble methods to build a classifier that finished among the top teams on the private leaderboard, taking turns coding, iterating through strategies, and learning from trial and error. Working with a team of data scientists, I developed the ability to present ideas coherently, listen to and learn from peers, and expand my toolset to reach our goals.
The Process
Feature Engineering
We learned that the Age column has many missing values. To fill them in, we used the average Age within each (Sex, Pclass) group. We then trained several neural networks on different combinations of features alongside Age and learned that Age is not a very good indicator of survivability. Using only Sex and Pclass, a neural network achieved an accuracy of 77.844%; adding Parch to Sex and Pclass achieved 79.041%. We also explored transforming Age and Fare with log and z-score scaling, but accuracy decreased to the 76–77% range.
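A minimal sketch of the imputation and transformations described above, assuming the standard Kaggle Titanic train.csv with its Age, Sex, Pclass, and Fare columns:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")

# Fill missing Age values with the mean Age of each (Sex, Pclass) group.
train["Age"] = train.groupby(["Sex", "Pclass"])["Age"].transform(
    lambda s: s.fillna(s.mean())
)

# The transformations we experimented with (these lowered accuracy for us):
train["LogFare"] = np.log1p(train["Fare"])  # log transform of Fare
train["AgeZ"] = (train["Age"] - train["Age"].mean()) / train["Age"].std()  # z-score
```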
Feature engineering alone failed to increase the accuracy, so we started exploring various ensemble methods.
Ensemble with Majority Voting
We trained four different neural networks, each selecting different features, and made a prediction with each model. Taking the majority result across the four models as the final prediction gave an accuracy of 79.041%.
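A self-contained sketch of the majority vote; synthetic data and the four feature subsets stand in for our actual Titanic features, which are assumptions here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Four neural networks, each trained on a different subset of features,
# mirroring the four feature combinations described above.
feature_sets = [[0, 1], [0, 1, 2], [1, 3, 4], [0, 2, 5]]
models = [
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=i)
    .fit(X[:, f], y)
    for i, f in enumerate(feature_sets)
]

# Stack each model's 0/1 predictions and take the per-row majority.
preds = np.stack([m.predict(X[:, f]) for m, f in zip(models, feature_sets)])
majority = (preds.mean(axis=0) >= 0.5).astype(int)  # 2-2 ties break toward 1
```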
Strategy
Finally, we used a simple combination of all the models above, with heavier weights on the K-nearest neighbors and XGBClassifier models, to obtain the final prediction, resulting in an accuracy of 80.0239% on the public leaderboard and 79.282% on the private leaderboard.
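A sketch of the weighted final ensemble using scikit-learn's soft voting. The exact weights we used are not recorded here, so weights=[1, 2, 2] is an illustrative assumption that simply puts more weight on KNN and XGBClassifier, as described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=0)  # stand-in data

ensemble = VotingClassifier(
    estimators=[
        ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss")),
    ],
    voting="soft",      # average each model's predicted probabilities
    weights=[1, 2, 2],  # heavier weight on KNN and XGBClassifier
).fit(X, y)

final_prediction = ensemble.predict(X)
```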
Discussion
We found that grouping women and children who were likely traveling together, and classifying survival based on the survival of anyone else in each group, was a good indicator of whether someone survived. We first combined the training and test sets and used last name and fare to identify groups traveling together. We then dropped the adult males, since adult males were the least likely to survive, using the title indicators in their names to isolate the women and children. We also grouped by family and friends using ticket and fare, applying a method similar to the women-and-children model: if one person in a group survived, we predicted that the whole group survived, backed by a k-nearest neighbors model.
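A minimal sketch of the woman-child-group rule, assuming the standard Kaggle Titanic columns (Name, Fare, Survived); the test rows have no Survived value, so they contribute NaN to the group statistics:

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
full = pd.concat([train, test], sort=False)

# Extract surname and title from names like "Braund, Mr. Owen Harris".
full["Surname"] = full["Name"].str.split(",").str[0]
full["Title"] = full["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Drop adult males ("Mr") to isolate the women and children.
wc = full[full["Title"] != "Mr"]

# Group likely co-travelers by surname and fare; if any known member of a
# group survived, predict survival for the whole group.
group_survived = wc.groupby(["Surname", "Fare"])["Survived"].transform(
    lambda s: s.max() == 1
)
wc = wc.assign(GroupSurvived=group_survived)
```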
We ran into many issues with overfitting, which is how we figured out that Age wasn't a good predictor of survival. That realization helped us strike a balance between a usefully complex model and one simple enough to generalize to the test set.
We learned that if a person was a woman or child and one of their family members survived, they were more likely to have survived as well.