We all know what happened on the Titanic. Spoiler Alert: Yes, Jack dies.
You may ask: do we need machine learning to predict this? Even the kid next door knows the answer. But jokes apart, can we train a machine to predict who will survive the sinking of the Titanic?
This is a famous competition in the Kaggle data science community. In this article, we’ll look for patterns and correlations in the data and try to predict who will survive the Titanic.
Machine learning can be divided into two main categories.
And you might have already guessed what we are trying to do here.
Yes, we are trying to predict whether a passenger would die or survive. This is a classification problem (dead or survived).
For this article, we will use Google’s Colaboratory to run the samples. Colaboratory is similar to Google Sheets: a place where you can run and share your data science experiments. The article uses Python and the scikit-learn library for the ML.
To begin anything in machine learning, we need data. For this, we will use the RMS Titanic data set. Let’s see what the data looks like.
- survival: Survival (0 = no; 1 = yes)
- class: Passenger class (1 = first; 2 = second; 3 = third)
- name: Name of the passenger
- sex: Sex
- age: Age of the passenger
- sibsp: Number of siblings/spouses aboard
- parch: Number of parents/children aboard
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin
- embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
This is what the sample data set looks like. Let us also look at the count of each attribute below and see whether all the data is in place.
At a glance, we can see that some data is missing: the cabin number, the age, and some other fields have gaps. We all know the golden rule.
But for this, we need to either make some assumptions and fill in the missing data or remove those data points. As the data set has very few records (fewer than 2,000), throwing away data is not an option. So let’s fill the missing ages with the average age from the data set.
We can see the data set contains both numbers and strings (letters). To put things into perspective, most machine learning algorithms (though not all) work best with numbers.
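As a rough sketch of that check, assuming the data has been loaded into a pandas DataFrame: the rows below are made up for illustration, and the column names follow the attribute list above rather than any particular file.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the real data set has many more rows and columns.
passengers = pd.DataFrame({
    "survival": [1, 0, 1, 0],
    "class": [1, 3, 2, 3],
    "sex": ["female", "male", "female", "male"],
    "age": [29.0, np.nan, 35.0, 22.0],
    "cabin": ["B5", None, None, None],
    "embarked": ["S", "S", "C", None],
})

# count() reports the number of non-missing values in each column.
non_missing = passengers.count()
print(non_missing)

# Any column whose count is below the number of rows has gaps.
columns_with_gaps = non_missing[non_missing < len(passengers)].index.tolist()
print(columns_with_gaps)
```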
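Filling the missing ages with the average could look something like this in pandas. The ages here are invented so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps.
passengers = pd.DataFrame({"age": [20.0, np.nan, 40.0, np.nan, 30.0]})

# mean() ignores missing values, so this is the average of the known ages.
mean_age = passengers["age"].mean()

# Replace every missing age with that average.
passengers["age"] = passengers["age"].fillna(mean_age)
print(passengers)
```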
The “sex” and “embarked” fields are both string values that correspond to categories (e.g. “Male” and “Female”), so we will run each through a preprocessor. This preprocessor converts the strings into integer keys, making it easier for the classification algorithms to find patterns. For instance, “Female” and “Male” will be converted to 0 and 1, respectively.
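One way to do this conversion is scikit-learn’s `LabelEncoder`; whether this is the exact preprocessor used here is an assumption. It assigns integer keys in sorted order, which gives female = 0 and male = 1. The rows below are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows containing only the two categorical columns.
passengers = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

# Keep one encoder per column so the integer keys can be mapped back later.
encoders = {}
for column in ["sex", "embarked"]:
    encoder = LabelEncoder()
    passengers[column] = encoder.fit_transform(passengers[column])
    encoders[column] = encoder

print(passengers)
print(list(encoders["sex"].classes_))
```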
The “name” and “ticket” columns consist of non-categorical string values. These are difficult to use in a classification algorithm, so we will drop them from the data set. The algorithm will have a hard time figuring out the connection between Allen, Miss. Elisabeth Walton and Allison, Master. Hudson Trevor. With a small data set, it is hard to derive the social class using the names.
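Dropping the two columns is a one-liner in pandas. In this sketch the names come from the paragraph above, while the ticket values and ages are placeholders.

```python
import pandas as pd

passengers = pd.DataFrame({
    "name": ["Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Trevor"],
    "ticket": ["T-1", "T-2"],   # placeholder ticket numbers
    "age": [29.0, 1.0],         # placeholder ages
})

# Remove the free-text columns before training.
passengers = passengers.drop(columns=["name", "ticket"])
print(list(passengers.columns))
```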
The fun part – Classification with Decision trees
Let’s do an elementary classification algorithm. This will try to build a decision tree based on the data set.
The tree first splits by sex, and then by class, since it has learned during the training phase that these are the two most important features for determining survival. The dark blue boxes indicate passengers who are likely to survive, and the shaded orange boxes represent passengers who are almost certainly doomed. Interestingly, after splitting by class, the main deciding factor determining the survival of women is the ticket fare that they paid, while the deciding factor for men is their age (with children being much more likely to survive).
To create this tree, we first initialize a decision tree classifier. (Here we will set the maximum depth of the tree to 10). Next we “fit” this classifier to our training set, enabling it to learn about how different factors affect the survivability of a passenger. Now that the decision tree is ready, we can “score” it using our test data to determine how accurate it is.
The dt.score call reports the accuracy of the predictions, which is 78%.
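The initialize, fit, and score steps might look like the sketch below. It uses scikit-learn’s `DecisionTreeClassifier` with the maximum depth of 10 mentioned above. Two things are assumptions: the 70/30 train/test split, and the synthetic stand-in data from `make_classification`, which replaces the preprocessed Titanic features so the snippet runs on its own. The accuracy it prints will therefore not match the 78% reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed passenger features and survival labels.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hold out 30% of the rows for scoring (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Initialize the classifier, then fit it to the training set.
dt = DecisionTreeClassifier(max_depth=10, random_state=0)
dt.fit(X_train, y_train)

# score() returns the fraction of test rows that were predicted correctly.
accuracy = dt.score(X_test, y_test)
print(f"Decision tree accuracy: {accuracy:.2f}")
```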
Let’s get more technical
Data science won’t look cool if we don’t use some technical words. So let’s build a random forest classifier and see if we can improve the accuracy. Without getting into the math: a random forest is a collection of decision trees, each built from a random sample of the data, and each tree votes on the final result.
So let’s say we build 3 decision trees. If 2 of them say that a person dies and 1 says that the person survives, the forest goes with the majority and predicts that the person dies.
With a few lines of code, we can implement this, and now the accuracy has improved to 88%. OOB stands for out-of-bag: each tree is scored on the records that were left out of its random sample.
Let’s bring in a neural network and find out if we can improve the accuracy. Some would say it is overkill to use a neural network with so little data. But let’s try it and see.
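The vote itself is just a majority count. A toy illustration with the three hypothetical trees from the example:

```python
from collections import Counter

# One prediction per tree for the same passenger.
tree_votes = ["dies", "dies", "survives"]

# The forest's prediction is the most common vote.
prediction, vote_count = Counter(tree_votes).most_common(1)[0]
print(prediction, vote_count)
```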
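A minimal sketch using scikit-learn’s `RandomForestClassifier`: setting `oob_score=True` asks the forest to estimate its accuracy from the rows each tree did not see during training. The number of trees and the synthetic stand-in data are assumptions, so the printed score will differ from the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed passenger features and survival labels.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 100 trees is an assumed setting, not one taken from the article.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

# oob_score_ is the accuracy measured on the out-of-bag rows.
print(f"OOB score: {rf.oob_score_:.2f}")
```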
With just the basic setup, we are able to get a 78% accuracy.
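The article does not say which library the network was built with. As one possibility, here is a basic setup using scikit-learn’s `MLPClassifier`. The single hidden layer of 16 units, the feature scaling step, and the synthetic stand-in data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed passenger features and survival labels.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Neural networks train more reliably on scaled inputs, so scale first.
nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
)
nn.fit(X_train, y_train)

nn_accuracy = nn.score(X_test, y_test)
print(f"Neural network accuracy: {nn_accuracy:.2f}")
```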
So how can we improve it further? For starters, we can tweak the ML algorithms and try to enhance the results a bit. We can also do something called feature engineering, which is where we bring our human insights to enrich the data. For example, remember that we dropped the names because they were too hard to compute with. Instead, we can write a method to identify social class from the titles in the names.
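A small sketch of that idea, assuming every name follows the “Surname, Title. Given names” pattern of the two examples quoted earlier. The third name is invented for illustration.

```python
import pandas as pd

names = pd.Series([
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Smith, Mrs. Jane",  # hypothetical passenger
])

# Capture the text between the comma and the first period: the title.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

The resulting title column is categorical, so it can be encoded as integers in the same way as “sex” and “embarked”.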
By introducing more feature engineering, we can improve the accuracy to a certain extent.
So, can’t we achieve 100% accuracy?
It is really hard in this scenario because of the outliers. Take the Allison family, for example. Although the machine learning algorithm predicts they would survive, in reality they did not make it. The model says almost 96% of the wealthy women survived, so how did this happen?
It turns out that the Allison family were unable to find their youngest son, Trevor, and were unwilling to leave the ship without him.
From the results, we can see there is a strong correlation in the data: women and children had a much better chance of surviving. But this might be a different case if it happened in 2016, with the LGBT and transgender movements.
So What’s Next?
This article was intended to provide a stepping stone into the data science world. From here, you can fine-tune the machine learning models, try out new ones, and compare their accuracies. You can also apply what you have learned to other interesting data sets in the Kaggle community.