The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people seemed more likely to survive than others, such as women, children, and the upper-class.
More Information: https://en.wikipedia.org/wiki/RMS_Titanic
In this project, we will investigate the dataset and explore who the passengers were and whether some were more likely to survive than others. Some of the questions that this dataset could answer are:
First, let's take a look at what this dataset looks like.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#To disable warnings for this particular notebook
import warnings
warnings.filterwarnings('ignore')
filename = 'titanic-data.csv'
titanic_dataset = pd.read_csv(filename)
titanic_dataset.head(20)
As we can see, many columns are self-explanatory. Here's what the other columns indicate:
Survived: indicates survival (0=No, 1=Yes)
Pclass: indicates ticket class (1=1st, 2=2nd, 3=3rd)
SibSp: indicates no. of siblings/spouse onboard
Parch: indicates no. of parents/children on board
Ticket: Ticket Number
Cabin: Cabin Number
Embarked: Port of Embarkation (S=Southampton, C= Cherbourg, Q=Queenstown)
Variable Notes
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
Before we analyse this dataset, we need to clean up the data. Some values are already visibly missing in the sample above, where they appear as NaN. Let's check which columns are affected.
titanic_dataset.isnull().sum()
The Cabin column seems to have too many missing values to provide useful information during analysis. Also since for this particular analysis it won't be used to answer any posed questions, we can safely delete it.
The Embarked and Ticket columns will not be used in any of our analysis so those can be removed too.
titanic_data = titanic_dataset.drop(['Cabin','Embarked', 'Ticket'], axis=1)
titanic_data.head()
The Age column has 177 missing values. That is a large portion of the dataset (almost 20%), and dropping those rows outright would also discard valid data in the other columns. The missing values in the Age column will not break computations or plots, since pandas skips NaN values by default.
But they may affect counts in the dataset, and we would need to account for the missing values every time. So we will remove the rows with a missing Age and create a new dataset, titanic_df_ages, which will be used for age-related calculations. For all other analyses we will use the original titanic_data dataset.
titanic_df_ages = titanic_data.dropna(subset = ['Age'])
titanic_df_ages['PassengerId'].count()
# Total 891 - Missing 177 = 714 values
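As a rough illustration of the two points above, the sketch below uses a small made-up DataFrame (not the Titanic data) to show that pandas skips NaN when computing a mean, while row counts change once rows with a missing age are dropped. The `toy` frame and its values are assumptions for illustration only; the last lines recompute the share of missing ages from the counts quoted above (177 of 891).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 passengers, 2 of them with an unknown age
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Age': [22.0, np.nan, 30.0, np.nan, 8.0],
})

# Missing values per column, and the share of rows with a missing Age
missing_counts = toy.isnull().sum()
missing_share = toy['Age'].isnull().mean() * 100

# Aggregations skip NaN by default, so the mean uses the 3 known ages only
mean_age = toy['Age'].mean()

# Drop only the rows where Age is missing; other columns are left untouched
toy_ages = toy.dropna(subset=['Age'])

print(missing_counts)
print("Share of missing ages: {:.1f}%".format(missing_share))
print("Mean age (NaN skipped): {:.1f}".format(mean_age))
print("Rows before/after dropna: {} / {}".format(len(toy), len(toy_ages)))

# Share of missing ages in the real dataset, from the counts quoted above
titanic_missing_share = 177 * 100.0 / 891
print("Titanic missing ages: {:.2f}%".format(titanic_missing_share))
```

Keeping a separate age-only frame, as done above, means the denominator in any age-related rate is the number of passengers whose age is actually known.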
Now that we have a reasonably clean dataset, let us find out some basic stats from it. To start with, as we will be dealing with a lot of survival stats, let's first find out what the survival rate is for this dataset.
# Function to compute the survival rate
def survival_stats(data):
    # Survived is coded 0/1, so comparing with 1 and summing counts the survivors
    survived = (data['Survived'] == 1).sum()
    total = data['PassengerId'].count()
    print("Total Passengers: {}".format(total))
    print("Surviving Passengers: {}".format(survived))
    print("Percentage of passengers who survived: {:.2f}%".format(survived*100.00/total))
survival_stats(titanic_data)
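Because `Survived` is coded as 0/1, the same rate can also be read off as the mean of that column. The sketch below checks this on a made-up five-row frame; the values are hypothetical and `toy` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: 2 of 5 passengers survived
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived': [0, 1, 0, 1, 0],
})

# Counting approach: survivors divided by total passengers
survived = (toy['Survived'] == 1).sum()
total = len(toy)
rate_from_counts = survived * 100.0 / total

# Mean approach: the mean of a 0/1 column is the fraction of ones
rate_from_mean = toy['Survived'].mean() * 100

print("{:.2f}% vs {:.2f}%".format(rate_from_counts, rate_from_mean))
```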
titanic_data['Age'].describe()
Mean Age of the passengers aboard the ship: 29.69 years.
Median Age of the passengers aboard the ship: 28 years.
Youngest Passenger: < 1 year old
Oldest Passenger: 80 years
Standard Deviation: 14.52
Let's now plot a histogram of the distribution of the ages.
fig, ax = plt.subplots()
# Draw the histogram on the axes created above
titanic_data['Age'].hist(bins=20, ax=ax)
ax.set_title("Distribution of Age of Passengers")
ax.set_xlabel('Age')
ax.set_ylabel('Number of Passengers')
We'll now explore the other statistics from the dataset to find out if some groups of passengers were more likely to survive than others and how much.
%matplotlib inline
passengers_by_class = titanic_data.groupby('Pclass')['PassengerId'].count()
print('Total number of passengers by class')
print (passengers_by_class)
survivors_by_class = titanic_data.groupby('Pclass')['Survived'].sum()
print('Total number of survivors by class')
print(survivors_by_class)
print ("Percentage of passengers who survived: {}".format(survivors_by_class*100.00/passengers_by_class))
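Dividing the two grouped series works because pandas aligns them on the `Pclass` index. An equivalent shortcut, sketched here on made-up data (the `toy` frame is hypothetical), is to take the group-wise mean of `Survived` directly.

```python
import pandas as pd

# Hypothetical toy data: two passengers per class
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5, 6],
    'Pclass':      [1, 1, 2, 2, 3, 3],
    'Survived':    [1, 1, 1, 0, 0, 0],
})

# Two-step approach: survivors per class divided by passengers per class
passengers = toy.groupby('Pclass')['PassengerId'].count()
survivors = toy.groupby('Pclass')['Survived'].sum()
rate_by_division = survivors * 100.0 / passengers

# One-step approach: the mean of a 0/1 column within each class
rate_by_mean = toy.groupby('Pclass')['Survived'].mean() * 100

print(rate_by_mean)
```

Both series are indexed by `Pclass`, so either can be passed straight to a plotting call.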
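Below is a bar plot showing the proportions.
# sns.factorplot was renamed to sns.catplot in seaborn 0.9
# (newer seaborn versions prefer errorbar=None over ci=None)
s = sns.catplot(x="Pclass", y="Survived",
                data=titanic_data, saturation=.5,
                kind="bar", ci=None, aspect=.9)
(s.set_axis_labels("", "Survival Rate")
  .set_xticklabels(["1st Class", "2nd Class", "3rd Class"]))
# seaborn no longer exposes pyplot as sns.plt, so call plt directly
plt.suptitle('Survival Rate by Class')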
A noticeably larger share of first-class passengers survived, followed by second class and then third. This suggests that first-class passengers may have had priority access to the lifeboats.
Now let's take a look at what role gender played in the likelihood of survival.
# Function to compute the survival rate by gender
def statistics_by_gender(sex):
    # Select the rows for the requested gender with a boolean mask
    titanic_df_gender = titanic_data[titanic_data['Sex'] == sex]
    survival_stats(titanic_df_gender)
print("Statistics for Male Passengers")
statistics_by_gender('male')
print("Statistics for Female Passengers")
statistics_by_gender('female')
Women had a far greater chance of survival than men.
This could be because of the 'women and children first' rule that was followed in those days.
Now these statistics for male and female passengers are for all ages, meaning they also include data for children. Let us consider a child to be a passenger less than 18 years of age for exploration purposes and look at the data for children in more detail.
Note: We will use the dataset titanic_df_ages, where missing values in Age Column have been removed, for analysing Age related data.
print ("Survival percentage for children")
titanic_child_df = titanic_df_ages[titanic_df_ages['Age'] < 18]
survival_stats(titanic_child_df)
How likely were infants to survive? Let's consider children < 1 year of age as infants.
under_age_one_df = titanic_df_ages[titanic_df_ages['Age'] < 1]
survival_stats(under_age_one_df)
That seems like an interesting observation. Does this trend continue?
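Let's visualize this data in the form of a histogram.
children = titanic_df_ages[titanic_df_ages['Age'] < 18]
fig, ax = plt.subplots()
# Plot all children first, then overlay the survivors on the same axes.
# A shared range gives both histograms the same one-year bins.
children['Age'].hist(bins=18, range=(0, 18), ax=ax)
children[children['Survived'] == 1]['Age'].hist(bins=18, range=(0, 18), ax=ax)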
ax.set_title("Distribution of Ages of Children")
ax.set_xlabel('Age')
ax.set_ylabel('Number of Children')
plt.legend(["Total","Surviving"],loc="upper center")
However, there does seem to be a noticeable decrease in the chances of survival for 16 and 17 year olds. One reason could be that the cut-off age of 18 is arbitrary and may not have applied in those days: a 16 year old may have counted as an adult. Even if not, he may have looked like a grown adult and could have been kept from boarding a lifeboat.
Now let's see how the survival rate varies by gender and class together. That is, for each class, what was the survival rate for male and female passengers?
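To compare survival across ages more directly than by overlaying two histograms, one option is to bin the ages and compute a rate per bin. This is only a sketch on made-up data: the bin edges (0, 6, 12, 16, 18), the labels, and the `AgeGroup` column are assumptions chosen for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical toy data: eight children with made-up ages and outcomes
toy_children = pd.DataFrame({
    'Age':      [0.5, 3, 7, 10, 14, 15, 16, 17],
    'Survived': [1,   1, 1, 0,  1,  0,  0,  0],
})

# right=False makes each bin include its left edge:
# [0, 6), [6, 12), [12, 16), [16, 18)
bins = [0, 6, 12, 16, 18]
labels = ['0-5', '6-11', '12-15', '16-17']
toy_children['AgeGroup'] = pd.cut(toy_children['Age'], bins=bins,
                                  labels=labels, right=False)

# Survival rate (in percent) within each age group
rate_by_age_group = (toy_children
                     .groupby('AgeGroup', observed=False)['Survived']
                     .mean() * 100)

print(rate_by_age_group)
```

With real data, small groups deserve caution: a rate computed from a handful of passengers can swing widely.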
# Function to compute the survival rate by gender and travel class
def statistics_by_gender_and_class(sex, Pclass):
    print(sex + "s in class " + str(Pclass))
    titanic_data_by_class_and_gender = titanic_data.groupby(['Sex', 'Pclass']).get_group((sex, Pclass))
    survival_stats(titanic_data_by_class_and_gender)
print("Statistics for Passengers by Gender and Class")
print("\n")
separator = "-" * 117
# Same order as before: men in classes 1-3, then women in classes 1-3
groups = [(sex, pclass) for sex in ('male', 'female') for pclass in (1, 2, 3)]
for i, (sex, pclass) in enumerate(groups):
    if i > 0:
        print(separator)
    statistics_by_gender_and_class(sex, pclass)
# sns.factorplot was renamed to sns.catplot in seaborn 0.9
s = sns.catplot(x="Sex", y="Survived", col="Pclass",
                data=titanic_data, saturation=.5,
                kind="bar", ci=None, aspect=.6)
(s.set_axis_labels("", "Survival Rate")
.set_xticklabels(["Men", "Women"]))
s.fig.suptitle("Survival Rate by Gender in each Class", size=12)
s.fig.subplots_adjust(top=.8)
It's interesting to see from the above analysis that almost all of the women in first class (96.81%) survived. The rate was also high for women in second class (92.11%), but it is far less encouraging for everybody else. In fact, the rates for men traveling in second and third class are comparable (15.74% and 13.54% respectively). In general, though, for both men and women, traveling in a higher class seems to have led to better chances of survival.
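The per-group statistics above can also be collapsed into a single table. A minimal sketch, again on made-up data with a hypothetical `toy` frame: group by both columns, take the mean of `Survived`, and unstack `Pclass` into columns.

```python
import pandas as pd

# Hypothetical toy data: two passengers per (Sex, Pclass) combination
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'male',
                 'female', 'female', 'female', 'female'],
    'Pclass':   [1, 1, 3, 3, 1, 1, 3, 3],
    'Survived': [1, 0, 0, 0, 1, 1, 1, 0],
})

# Rows are genders, columns are classes, cells are survival rates in percent
rate_table = (toy.groupby(['Sex', 'Pclass'])['Survived'].mean() * 100).unstack('Pclass')

print(rate_table)
```

A table like this puts all gender and class combinations side by side, which makes comparisons such as "men in second versus third class" easier to read than separate printouts.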
Let's take a look now at how other factors may have affected survival.
We have information about whether a passenger was traveling alone or in a group, based on the SibSp and Parch column values. For someone traveling alone, both the SibSp and Parch values will be 0. For someone traveling with a spouse or sibling, SibSp is at least 1, and for someone traveling with a parent or child, Parch is at least 1. Based on this information let's compute the survival rate. We will consider only adults (age 18 and over) for this analysis.
Note: Again, we will use the dataset titanic_df_ages, where missing values in Age Column have been removed, for analysing Age related data.
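The filters described above can be written as boolean masks combined with `&` (and), `|` (or), and `~` (not), which avoids indexing the frame several times in a row. The sketch uses made-up rows; `toy`, `is_adult`, and `has_family` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data: four adults and one child
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Age':      [40, 25, 30, 10, 52],
    'SibSp':    [1, 0, 0, 1, 0],
    'Parch':    [0, 0, 2, 2, 0],
    'Survived': [1, 0, 1, 1, 0],
})

# Each comparison needs its own parentheses: & and | bind tighter than > or >=
is_adult = toy['Age'] >= 18
has_family = (toy['SibSp'] > 0) | (toy['Parch'] > 0)

adults_with_someone = toy[is_adult & has_family]
adults_alone = toy[is_adult & ~has_family]

print(adults_with_someone['PassengerId'].tolist())
print(adults_alone['PassengerId'].tolist())
```

Because every mask is built from the same frame, the indexes always line up, so pandas has nothing to reindex.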
# Combine the conditions into single masks with & and | instead of chaining
# [...] filters, which triggers the "Boolean Series key will be reindexed" warning
is_adult = titanic_df_ages['Age'] >= 18
has_family = (titanic_df_ages['Parch'] > 0) | (titanic_df_ages['SibSp'] > 0)
print("Chances of survival of adults traveling with someone")
adults_not_traveling_alone = titanic_df_ages[is_adult & has_family]
survival_stats(adults_not_traveling_alone)
print("Chances of survival of adults traveling alone")
adults_traveling_alone = titanic_df_ages[is_adult & ~has_family]
survival_stats(adults_traveling_alone)
We can speculate that this is because children were given a higher priority for the lifeboats. So could it be that an adult man traveling alone with a child also got access to the lifeboats, which increased his chances of survival?
But it's not possible to examine this separately, as Parch counts both parents and children. Consider the data below, which contains only adult males with no sibling or spouse aboard (SibSp = 0) and Parch > 0. In this sample, a 28 year old could be traveling with either his mother or his daughter; that cannot be determined from this dataset.
titanic_df_ages[titanic_df_ages['Age'] < 18].count()
#titanic_df['Age'].isnull().sum()
# Adult males with no sibling/spouse aboard but at least one parent or child
titanic_df_ages[(titanic_df_ages['Age'] >= 18) & (titanic_df_ages['Sex'] == 'male') & (titanic_df_ages['Parch'] >= 1) & (titanic_df_ages['SibSp'] == 0)]
Let's now explore another group of passengers in this data. According to the variable notes, some children traveled only with a nanny, so Parch = 0 for them.
children_with_nannies = titanic_df_ages[(titanic_df_ages['Age'] < 18) & (titanic_df_ages['Parch'] == 0)]
print ("Children traveling with nannies")
survival_stats(children_with_nannies)
print ("----------------------------------------------------------------")
print ("Children traveling with parents")
children_with_parents = titanic_df_ages[(titanic_df_ages['Age'] < 18) & (titanic_df_ages['Parch'] >= 1)]
survival_stats(children_with_parents)
Exploring the Titanic dataset provides some interesting insights into who the passengers were and how their chances of survival were affected by various factors. The findings of this data analysis, though tentative, could be summarized as follows:
There are, however, a few limitations of this dataset: