# General imports and functions
%autosave 0
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Import csv as DataFrame
titanic_df = pd.read_csv('titanic_data.csv', header=0)
# Make sure it worked...
print titanic_df.tail()
print "\n" + "No. passengers:", titanic_df.shape[0]
# Existing columns
print "No. columns:", titanic_df.shape[1], "\n"
print titanic_df.columns
# Change to desired column names
new_col_names = ['Passenger_ID', 'Survived', 'Class', 'Name', 'Sex', 'Age',
'Siblings_spouses_aboard', 'Parents_children_aboard', 'Ticket', 'Fare', 'Cabin_num', 'Port_of_Embarkation']
titanic_df.columns = new_col_names
print titanic_df.columns
# Transforming 'Survived' column values
titanic_df['Survived'] = titanic_df['Survived'].map({0:'Died', 1:'Survived'})
# Checking that it worked...
print titanic_df['Survived'].head()
# Transforming 'Class' column values
titanic_df['Class'] = titanic_df['Class'].map({1:'First Class', 2:'Second Class', 3:'Third Class'})
# Checking that it worked...
print titanic_df['Class'].head()
# Transforming 'Port' column values
titanic_df['Port_of_Embarkation'] = titanic_df['Port_of_Embarkation'].map({'S':'Southampton', \
'C':'Cherbourg', 'Q':'Queenstown'})
# Checking that it worked...
print titanic_df['Port_of_Embarkation'].head()
To answer my question, I need to first choose passenger characteristics to test against survival. Second, I will want to graph or plot the data; this will give me a preview of the data and allow me to pick out any obvious patterns. If appropriate, I would need to statistically determine if there is any relationship between the variables.
# Check for NaNs in each column
print "NaNs present in columns"
print "-----------------------"
for col in titanic_df.columns:
print str(col) + ":", titanic_df[titanic_df[col].isnull()].shape[0]
I will handle NaNs in the data when I analyse variables that contain them.
The 'Survived' column represents survival as either a 0 (died) or a 1 (survived).
Variables that seem like they might be connected to one's survival aboard the Titanic and that will be investigated:
Variables that seem useless; not investigating:
Variables that may or may not be insightful; not investigating:
# Variables
print titanic_df['Sex'].unique()
print titanic_df['Survived'].unique()
# Plot
sex_v_survival = titanic_df[['Sex', 'Survived']]
plt_clrs = ['green' if (x == 'Survived') else 'red' for x in sex_v_survival['Survived']] # red for dead, green for survived
g = sns.countplot(x='Sex', hue='Survived', data=sex_v_survival, palette=plt_clrs)
# Modify appearance
g.axes.set_title('Passenger Sex and Survival Counts', fontsize=18, fontweight='bold', color="k", alpha=0.8)
g.set_xlabel("Class", size = 14, color="k", alpha=0.8)
g.set_ylabel("Count", size = 14, color="k", alpha=0.8)
g.figure.set_size_inches(6,6)
sns.set_style("whitegrid")
# Get bars and create bar labels
bars_list = g.patches # axis.patches gives the bar objects produced by the plot
male_deaths = len(sex_v_survival[(sex_v_survival['Sex'] == 'male') & (sex_v_survival['Survived'] == 'Died')])
male_survivors = len(sex_v_survival[(sex_v_survival['Sex'] == 'male') & (sex_v_survival['Survived'] == 'Survived')])
female_deaths = len(sex_v_survival[(sex_v_survival['Sex'] == 'female') & (sex_v_survival['Survived'] == 'Died')])
female_survivors = len(sex_v_survival[(sex_v_survival['Sex'] == 'female') & (sex_v_survival['Survived'] == 'Survived')])
bar_labels = [str(freq) for freq in [male_deaths, female_deaths, male_survivors,female_survivors]]
# Place bar labels
for bar, label in zip(bars_list, bar_labels): # iterate through these
g.text((bar.get_x() + bar.get_width()/2), (bar.get_height() + 4), label, ha='center', size=12, fontweight='bold')
# Males
total_males = len(sex_v_survival[sex_v_survival['Sex'] == 'male'])
print 'Total males:', total_males
print 'Male deaths:', male_deaths
print '% dead:', (male_deaths/float(total_males))*100, "\n"
# Females
total_females = len(sex_v_survival[sex_v_survival['Sex'] == 'female'])
print 'Total females:', len(sex_v_survival[sex_v_survival['Sex'] == 'female'])
print 'Female deaths:', len(sex_v_survival[(sex_v_survival['Sex'] == 'female') & (sex_v_survival['Survived'] == 'Died')])
print '% dead:', (female_deaths/float(total_females))*100
The most deaths, in absolute terms and in terms of percent, were of men.
Total males: 577
Male deaths: 468
% dead: 81.1091854419
Female death rate is much lower than that of men.
Total females: 314
Female deaths: 81
% dead: 25.7961783439
Sex seems to affect survival, let's prove it with a Pearson's chi-squared test for goodness of fit!
Hypotheses:
H0: The death rate for males and females is 0.62
H1: The death rate for males and females is not 0.62
Confidence level: 99% ($\alpha$ = 0.01)
# Construct a frequency distribution table
obs_sex = pd.DataFrame(['Non-survivors, male']*male_deaths + ['Survivors, male']*(total_males - male_deaths) +\
['Non-survivors, female']*female_deaths + ['Survivors, female']*(total_females - female_deaths))
obs_sex_table = pd.crosstab(index=obs_sex[0], columns="counts")
print 'Observed counts'
print obs_sex_table['counts'], '\n'
# Get expected frequencies (observed is obs_sex_table)
survivors_total = len(titanic_df[titanic_df['Survived'] == 'Survived'])
deaths_total = len(titanic_df[titanic_df['Survived'] == 'Died'])
passengers_total = total_females + total_males
exp_table = pd.Series({'Non-survivors, female': ((deaths_total/float(passengers_total))*total_females), \
'Non-survivors, male': ((deaths_total/float(passengers_total))*total_males), \
'Survivors, female': ((survivors_total/float(passengers_total))*total_females), \
'Survivors, male': ((survivors_total/float(passengers_total))*total_males)})
print 'Expected counts if sex does not influence'
print exp_table, '\n'
# Pearson's chi-squared test (goodness of fit)
sex_v_surv_chisquared = stats.chisquare(obs_sex_table['counts'], f_exp=exp_table, ddof=1, axis=None)
sex_crit = stats.chi2.ppf(q = 0.99, df = 1)
# Report
print 'Chi-squared statistic:', sex_v_surv_chisquared[0]
print 'Chi-squared critical @ p=0.01:', sex_crit
print 'p-value:', sex_v_surv_chisquared[1]
Test conditions and results:
Chi-squared critical @ p=0.01: 6.63489660102
Chi-squared statistic result: 263.050574071
Calculated p-value: 7.57344733665e-58
Null hypothesis rejected at p=0.01
Interpretation:
Survival based on sex is NOT by random chance-- extremely tiny probability that this pattern was seen by random chance.
If a passenger is female, she is more likely to survive than her male counterparts.
# Plot
class_v_survival = titanic_df[['Class', 'Survived']]
plt_clrs = ['green' if (x == 'Survived') else 'red' for x in class_v_survival['Survived']] #red for dead, green for survived
g = sns.countplot(x='Class', hue='Survived', data=class_v_survival, palette=plt_clrs)
g.figure.set_size_inches(8,4)
# Modify appearance
sns.set_style("whitegrid")
g.axes.set_title('Passenger Class and Survival Counts', fontsize=18, fontweight='bold', color="k", alpha=0.8)
g.set_xlabel("Class", size = 14, color="k", alpha=0.8)
g.set_ylabel("Count", size = 14, color="k", alpha=0.8)
g.figure.set_size_inches(9,6)
# Get bars and create bar labels
bars_list = g.patches # axis.patches gives the bar objects produced by the plot
first_class_deaths = len(class_v_survival[(class_v_survival['Class'] == 'First Class') & \
(class_v_survival['Survived'] == 'Died')])
first_class_survivors = len(class_v_survival[(class_v_survival['Class'] == 'First Class') & \
(class_v_survival['Survived'] == 'Survived')])
second_class_deaths = len(class_v_survival[(class_v_survival['Class'] == 'Second Class') & \
(class_v_survival['Survived'] == 'Died')])
second_class_survivors = len(class_v_survival[(class_v_survival['Class'] == 'Second Class') & \
(class_v_survival['Survived'] == 'Survived')])
third_class_deaths = len(class_v_survival[(class_v_survival['Class'] == 'Third Class') & \
(class_v_survival['Survived'] == 'Died')])
third_class_survivors = len(class_v_survival[(class_v_survival['Class'] == 'Third Class') & \
(class_v_survival['Survived'] == 'Survived')])
label_list = [third_class_deaths, first_class_deaths, second_class_deaths, \
third_class_survivors, first_class_survivors, second_class_survivors]
bar_labels = [str(freq) for freq in label_list]
# Place bar labels
for bar, label in zip(bars_list, bar_labels):
g.text((bar.get_x() + bar.get_width()/2), (bar.get_height() + 4), label, ha='center', size=12, fontweight='bold')
# 1st class
total_first_class = len(class_v_survival[class_v_survival['Class'] == 'First Class'])
print 'Total 1st class:', total_first_class
print '1st class deaths:', first_class_deaths
print '% dead:', (first_class_deaths/float(total_first_class))*100, "\n"
# 2nd class
total_second_class = len(class_v_survival[class_v_survival['Class'] == 'Second Class'])
print 'Total 2nd class:', total_second_class
print '2nd class deaths:', second_class_deaths
print '% dead:', (second_class_deaths/float(total_second_class))*100, "\n"
# 3rd class
total_third_class = len(class_v_survival[class_v_survival['Class'] == 'Third Class'])
print 'Total 3rd class:', total_third_class
print '3rd class deaths:', third_class_deaths
print '% dead:', (third_class_deaths/float(total_third_class))*100, "\n"
1st class survival rate:
% dead: 37.037037037
2nd class survival rate:
% dead: 52.7173913043
3rd class survival rate:
% dead: 75.7637474542
Class appears to correlate with survival, let's demonstrate it with a Pearson's chi-squared test for goodness of fit!!!!
Hypotheses:
H0: The death rate for 1st class passengers, 2nd class passengers, and 3rd class passengers is 0.62 for each class.
H1: The death rate for 1st class passengers, 2nd class passengers, or 3rd class passengers is not 0.62.
Confidence level: 99% ($\alpha$ = 0.01)
# Construct a frequency distribution table
obs_class = pd.DataFrame(['1st class deaths']*first_class_deaths + ['2nd class deaths']*second_class_deaths +\
['3rd class deaths']*third_class_deaths + ['1st class survivors']*(total_first_class - \
first_class_deaths) +\
['2nd class survivors']*(total_second_class - second_class_deaths) +\
['3rd class survivors']*(total_third_class - third_class_deaths))
obs_class_table = pd.crosstab(index=obs_class[0], columns="counts")
print 'Observed rates'
print obs_class_table['counts']
print
# Get expected frequencies
death_ratio = (deaths_total/float(passengers_total))
surv_ratio = (survivors_total/float(passengers_total))
exp_class = pd.Series({'1st class deaths': (death_ratio*total_first_class), \
'1st class survivors': (surv_ratio*total_first_class), \
'2nd class deaths': (death_ratio*total_second_class), \
'2nd class survivors': (surv_ratio*total_second_class), \
'3rd class deaths': (death_ratio*total_third_class), \
'3rd class survivors': (surv_ratio*total_third_class)})
print 'Expected rates if class does not influence'
print exp_class
print
# Pearson's chi-squared test for goodness of fit
class_v_surv_chisquared = stats.chisquare(obs_class_table['counts'], f_exp=exp_class, ddof=2, axis=None)
class_crit = stats.chi2.ppf(q = 0.99, df = 2) # t-critical value
# Report
print 'Chi-squared stat:', class_v_surv_chisquared[0]
print 'Critical value @ p=0.01:', class_crit
print 'p-value:', class_v_surv_chisquared[1]
Test conditions and results:
Critical value @ p=0.01: 9.21034037198
Chi-squared statistic result: 102.888988757
Calculated p-value: 3.71728275658e-22
Null hypothesis rejected at p=0.01
Interpretation:
Survival based on class is NOT by random chance-- extremely tiny probability that this pattern was seen by random chance.
The higher a passenger's class, the more likely that the passenger survives.
# Initial boxplot of fare data
bx_plt = sns.boxplot(x=titanic_df['Survived'], y=titanic_df['Fare'])
bx_plt.figure.set_size_inches(6,8)
The extreme outliers make it difficult to see the boxplot features. Let's remove extreme value(s) (fare >= 200) to improve the visualization.
# Extreme value(s) in survived category
print "No. extreme outliers (fare > 200):", len(titanic_df[titanic_df['Fare'] >= 200]), '\n'
print 'First 5 of 20:' + '\n', titanic_df[titanic_df['Fare'] >= 200].head()
# Removing these values; RUN ONLY ONCE
no_xtreme_outliers_df = titanic_df[titanic_df['Fare'] < 200].reset_index(drop=True)
# Checking that it worked
no_xtreme_outliers_df.shape
Now that these extreme values are removed, let's re-plot the data and get some descriptive statistics.
# Plot
plt_clrs = ['green' if (x == 'Survived') else 'red' for x in no_xtreme_outliers_df['Survived']] \
#red for dead, green for survived
bx_plt = sns.boxplot(x=no_xtreme_outliers_df['Survived'], y=no_xtreme_outliers_df['Fare'], palette=plt_clrs)
# Modify appearance
bx_plt.axes.set_title('Passenger Fare and Survival (Fare < 200)', fontsize=18, fontweight='bold', color="k", alpha=0.8)
bx_plt.set_xlabel("Survived", size = 14, color="k", alpha=0.8)
bx_plt.set_ylabel("Fare", size = 14, color="k", alpha=0.8)
bx_plt.figure.set_size_inches(5,8)
# Descriptive statistics
titanic_by_survival = no_xtreme_outliers_df.groupby('Survived')
print titanic_by_survival['Fare'].describe(percentiles=[0.5])
Surviving passenger fares:
Range: 164.87
Median: 26.00
Mean: 37.80
SD: 36.29
Non-surviving passenger fares:
Range: 153.46
Median: 10.46
Mean: 19.72
SD: 21.56
There appears to be a great deal of overlap amongst fares. I noticed this when I printed out the the first five entries of extreme outliers with fares above 200. All were for first class, but some fares were more than 500 when others were closer to 250.
Are different prices paid for the same class? If so, why? Does this have to do with any other variable present in the data set?
Maybe the place of embarkation has something to do with this; it is feasible that one place charges more or less for a ticket of the same class compared to another place.
# Checking out the ports
print titanic_df['Port_of_Embarkation'].unique()
print titanic_df.shape
Recall: only two NaNs present in 'Port_of_Embarkation' column, so ignore them in exploration by removing them.
# Removing 'Port_of_Embarkation' entries with NaN; RUN ONLY ONCE
cleaned_embark_df = titanic_df[titanic_df['Port_of_Embarkation'].notnull()]
# Checking that it worked
print cleaned_embark_df.shape
print cleaned_embark_df['Port_of_Embarkation'].unique()
With those NaN 'Port_of_Embarkation' column entries removed, let's visualize survival, fare, and port.
# Plot
bx_plt = sns.boxplot(x=cleaned_embark_df['Survived'], y=cleaned_embark_df['Fare'], \
hue=cleaned_embark_df['Port_of_Embarkation'])
# Modify appearance
bx_plt.axes.set_title('Passenger Fare and Survival with respect to \n Ports of Embarkation', \
fontsize=18, fontweight='bold', color="k", alpha=0.8)
bx_plt.set_xlabel("Survived", size = 14, color="k", alpha=0.8)
bx_plt.set_ylabel("Fare", size = 14, color="k", alpha=0.8)
bx_plt.figure.set_size_inches(5,8)
It's hard to see the finer details because of the extreme outliers (Fare > 500) from the survived, Cherbourg group. Let's remove that for a better visualization.
# Removing these values; RUN ONLY ONCE
cleaned_embark_df = cleaned_embark_df[cleaned_embark_df['Fare'] < 500].reset_index(drop=True)
# Checking that it worked...
cleaned_embark_df.shape
Let's replot since those are now removed.
# Plot
bx_plt = sns.boxplot(x=cleaned_embark_df['Survived'], y=cleaned_embark_df['Fare'], \
hue=cleaned_embark_df['Port_of_Embarkation'])
# Modify appearance
bx_plt.axes.set_title('Passenger Fare and Survival with respect to \n Ports of Embarkation (Fare < 500)', \
fontsize=18, fontweight='bold', color="k", alpha=0.8)
bx_plt.set_xlabel("Survived", size = 14, color="k", alpha=0.8)
bx_plt.set_ylabel("Fare", size = 14, color="k", alpha=0.8)
bx_plt.figure.set_size_inches(5,8)
plt.legend(bbox_to_anchor=(1.45, 0.5), fontsize='large')
On first glance, it looks like tickets from Cherbourg are generally the more expensive than both Southampton and Queenstown. Queenstown generally seems to have the cheapest fare. Cherbourg also looks like it has a lot of spread overall, but especially in terms of those who survived.
Let's get some descriptive statistics...
# Grouping
fare_port_groups = cleaned_embark_df.groupby(['Survived', 'Port_of_Embarkation'])
# Checking what I have...
for group in fare_port_groups.groups:
print group
# Getting some descriptive statistics
print fare_port_groups['Fare'].describe(percentiles=[0.5])
Ports and fares of survivors and non-survivors
Although there is a good deal of spread present for both groups (but more so with survivors), survivors generally have a greater fare.
The very large spread within Port C might indicate that many different classes boarded at this port. I can make this (educated) guess owing to the fact that I know upper class generally costs more (so fare and class are related) and because class correlates positive with survival.
This could be further explored.
Port Q:
Survivors generally have a greater fare.
There is little spread in this port (compared to port C and S). This makes me suspicious that most of 3rd class boarded here.
Port S:
Survivors generally have a greater fare.
I am not going to do a hypothesis test on passenger fare because I believe there are too many factors affecting fare to make it overly insightful.
I can say, from an observational viewpoint, that generally survivors have greater fares than non-survivors because the difference are quite large to the naked eye; though I have not statistically demonstrated this.
This one may be tricky because many entries are missing; my approach will ignore entries that have no age listed.
# Remove entries with NaN for age
age_nan_removed_df = titanic_df[titanic_df['Age'].notnull()].reset_index(drop=True)
# Making sure it worked...
print age_nan_removed_df.shape
# Plot
plt_clrs = ['green' if (x == 'Survived') else 'red' for x in age_nan_removed_df['Survived']] \
#red for dead, green for survived
bx_plt = sns.boxplot(x=age_nan_removed_df['Survived'], y=age_nan_removed_df['Age'], palette=plt_clrs)
# Modify appearance
bx_plt.axes.set_title('Age and Survival', fontsize=18, fontweight='bold', color="k", alpha=0.8)
bx_plt.set_xlabel("Survived", size = 14, color="k", alpha=0.8)
bx_plt.set_ylabel("Age", size = 14, color="k", alpha=0.8)
bx_plt.figure.set_size_inches(5,6)
There appears to be no significant difference between survivors and non-survivors. Let's get some descriptive statistics.
# Group age by survival
print age_nan_removed_df.groupby('Survived')['Age'].describe(percentiles=[0.5])
Descriptive statistics summary
Non-survivors:
Median: 28.00
Mean: 30.63
SD: 14.17
Survivors:
Median: 28.00
Mean: 28.34
SD: 14.95
The results are nearly identical, suggesting that there is no significant difference in survival based on age.
Let's do a two sample t-test for independent samples to confirm this hunch.
Hypotheses (two-sided):
H0: $\mu$0 = $\mu$1
H1: $\mu$0 $\neq$ $\mu$1
Where $\mu$0 is the non-survivor population mean and $\mu$1 is the survivor population mean.
Confidence level: 99% ($\alpha$ = 0.01)
# Two sample t-test for independent samples, two-sided
age_surv_ttest = stats.ttest_ind(a=age_nan_removed_df[age_nan_removed_df['Survived'] == 'Died']['Age'], \
b=age_nan_removed_df[age_nan_removed_df['Survived'] == 'Survived']['Age'])
# Results
print 'T-critical value at p=0.01:', u"\u00B1" + str(stats.t.ppf(q=0.995, df=341)) # quantile to check and d.o.f.
print 'T-statistic result:', age_surv_ttest[0]
print 'Calculated p-value:', age_surv_ttest[1]
Test conditions and results:
T-critical values at p=0.01: $\pm$2.590
T statistic result: 2.067
Calculated p-value: 0.039
Null hypothesis rejected at p=0.01 (but not at 0.05 according to calculated p-value).
Interpretation:
Survival based on age appears to be borderline random. There is about a 4% chance that we would see the difference that we do between the groups.
The difference that I can see in the boxplot is that the survived 50% of values are shifted down (very modestly) in age compared to the non-survivor group.
I investigated the four variables (passenger characteristics) in terms of their relationship to survival: sex, class, fare, and age. The analysis leading up to each conclusion is detailed in above sections.
Passenger sex correlates with survival: females were more likely to survive than males.
Passenger class correlates with survival: the higher the passenger's class, the more likely that he or she survived.
Passenger fare likely correlates with survival: confounding variaibles like port and class affect the results, so it is hard to say for certain.
Passenger age weakly correlates with survival: there is an extremely modest difference between surviving and non-surviving passengers with respect to age; so it is difficult to say there is definitely a correlation.