통계분석, 머신러닝을 이용한 데이터 분석

Investigate a Titanic Dataset

01 Apr 2016

I investigated Titanic dataset using NumPy and Pandas. I went through the entire data analysis process, starting by posing a question and finishing by sharing my findings. In this report, the passengers survival rate is analyzed according to passenger class, age, and sex.

Questions

Variables

Reading Titanic data

%pylab inline
import seaborn as sns
import numpy as np
import pandas as pd

titanic_data_df = pd.read_csv('titanic_data.csv')
Populating the interactive namespace from numpy and matplotlib

Statistics of Titanic data

titanic_data_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
print 'Total number of passengers :', titanic_data_df['PassengerId'].count()
print 'Total number of male/female :', titanic_data_df.groupby('Sex').count()['PassengerId']['male'], '/', titanic_data_df.groupby('Sex').count()['PassengerId']['female']
print 'Statistics: '
print titanic_data_df.describe()
Total number of passengers : 891
Total number of male/female : 577 / 314
Statistics: 
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

Does pclass affect to survival rate?

survived_passengers_by_class = titanic_data_df.groupby('Pclass').sum()['Survived']
passengers_by_class = titanic_data_df.groupby('Pclass').count()['PassengerId']

def survived_percentage(passengers, survived):
    return survived / passengers


survived_passengers =  survived_percentage(passengers_by_class, survived_passengers_by_class)

survived_passengers.plot(kind='bar', title='Survival Rate by Pclass')

png

What age were more likely to survive?

def correlation(x, y):
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    
    return (std_x * std_y).mean()

print 'Pearson\'s r:', correlation(titanic_data_df['Survived'], titanic_data_df['Age'])
print 'Age and Survival have negative correlation. Younger passengers were more likely to survive.'
print ' '

avg_age_by_survived = titanic_data_df.groupby('Survived').mean()['Age']
avg_age_by_survived.plot(kind='bar', title='Average Age of Survivors')
Pearson's r: -0.0779826784139
Age and Survival have negative correlation. Younger passengers were more likely to survive.

png

ranged_age_of_survivors = titanic_data_df.groupby(pd.cut(titanic_data_df['Age'], np.arange(0, 90, 10))).mean()
print 'Titanic data with age range (pandas groupby range is refered http://stackoverflow.com/questions/21441259/pandas-groupby-range-of-values)'
ranged_age_of_survivors
Titanic data with age range (pandas groupby range is refered http://stackoverflow.com/questions/21441259/pandas-groupby-range-of-values)
PassengerId Survived Pclass Age SibSp Parch Fare
Age
(0, 10] 430.843750 0.593750 2.640625 4.268281 1.843750 1.421875 30.434439
(10, 20] 447.660870 0.382609 2.530435 17.317391 0.591304 0.391304 29.529531
(20, 30] 428.682609 0.365217 2.386957 25.423913 0.321739 0.239130 28.306719
(30, 40] 468.690323 0.445161 2.090323 35.051613 0.374194 0.393548 42.496100
(40, 50] 483.500000 0.383721 1.918605 45.372093 0.372093 0.430233 41.163181
(50, 60] 449.809524 0.404762 1.523810 54.892857 0.309524 0.309524 44.774802
(60, 70] 430.882353 0.235294 1.529412 63.882353 0.176471 0.352941 45.910782
(70, 80] 438.200000 0.200000 1.800000 73.300000 0.000000 0.000000 25.936680
ranged_age_of_survivors.plot(kind='line', x='Age', y='Survived', title='Survival Rate by Age Range')

png

Does sex affect to survival rate?

survivors_by_sex = titanic_data_df.groupby('Sex').sum()['Survived']
print survivors_by_sex
survivors_by_sex.plot(kind='bar', title='Number of Survivors by sex')

Sex
female    233
male      109
Name: Survived, dtype: int64

png

comments powered by Disqus