Enron was once a successful and dominant company in the energy business, but it collapsed after trying to hide losses and avoid taxes through made-up entities, a scheme known since as the Enron scandal. The most infamous name associated with the scandal is Kenneth Lay, who went on trial in January 2006 on 11 counts ranging from insider trading to bank fraud.
The goal of this project is to identify Enron employees who may have committed fraud, based on the public Enron financial and email dataset. To find persons of interest (POIs), we can explore many features and apply specific algorithms, then validate performance with evaluation metrics. Through this process, we can gain useful insights from the dataset.
- 1. Data exploration
- 2. Outlier Investigation
- 3. Feature Selection
- 4. Algorithms
- 5. Evaluation
- Appendix
1. Data exploration
I looked through the dataset to describe its overall statistics and to find out which features can help identify fraud from the Enron email and financial records. The records are labeled with who is a POI. I categorized the 21 features as follows:
Label (1 feature) | Financial: payments (10 features) | Financial: stock value (4 features) | Email (6 features)
---|---|---|---
poi | salary, bonus, long_term_incentive, deferral_payments, expenses, deferred_income, director_fees, loan_advances, other, total_payments | restricted_stock, exercised_stock_options, restricted_stock_deferred, total_stock_value | to_messages, from_messages, from_poi_to_this_person, from_this_person_to_poi, shared_receipt_with_poi, email_address
The first five rows of the dataset:
# %run ./exploration.py
#!/usr/bin/python
import sys
import pickle
sys.path.append("../tools/")
from tester import test_classifier, dump_classifier_and_data
features_list = ['poi', 'salary', 'bonus','long_term_incentive','deferral_payments',
'expenses','deferred_income','director_fees','loan_advances','other',
'restricted_stock', 'exercised_stock_options','restricted_stock_deferred',
'shared_receipt_with_poi']
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(data_dict, orient='index', dtype=np.float)
df.reset_index(level=0, inplace=True)
columns = list(df.columns)
columns[0] = 'name'
df.columns = columns
df.fillna(0, inplace=True)
df.head()
name | salary | to_messages | deferral_payments | total_payments | exercised_stock_options | bonus | restricted_stock | shared_receipt_with_poi | restricted_stock_deferred | ... | loan_advances | from_messages | other | from_this_person_to_poi | poi | director_fees | deferred_income | long_term_incentive | email_address | from_poi_to_this_person | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ALLEN PHILLIP K | 201955 | 2902 | 2869717 | 4484442 | 1729541 | 4175000 | 126027 | 1407 | -126027 | ... | 0 | 2195 | 152 | 65 | 0 | 0 | -3081055 | 304805 | phillip.allen@enron.com | 47 |
1 | BADUM JAMES P | 0 | 0 | 178980 | 182466 | 257817 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0 |
2 | BANNANTINE JAMES M | 477 | 566 | 0 | 916197 | 4046157 | 0 | 1757552 | 465 | -560222 | ... | 0 | 29 | 864523 | 0 | 0 | 0 | -5104 | 0 | james.bannantine@enron.com | 39 |
3 | BAXTER JOHN C | 267102 | 0 | 1295738 | 5634343 | 6680544 | 1200000 | 3942714 | 0 | 0 | ... | 0 | 0 | 2660303 | 0 | 0 | 0 | -1386055 | 1586055 | NaN | 0 |
4 | BAY FRANKLIN R | 239671 | 0 | 260455 | 827696 | 0 | 400000 | 145796 | 0 | -82782 | ... | 0 | 0 | 69 | 0 | 0 | 0 | -201641 | 0 | frank.bay@enron.com | 0 |
5 rows × 22 columns
Overall statistics of the dataset:
df.describe().transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
bonus | 146 | 1333474.232877 | 8094029.239637 | 0 | 0.00 | 300000.0 | 800000.00 | 97343619 |
deferral_payments | 146 | 438796.520548 | 2741325.337926 | -102500 | 0.00 | 0.0 | 9684.50 | 32083396 |
deferred_income | 146 | -382762.205479 | 2378249.890202 | -27992891 | -37926.00 | 0.0 | 0.00 | 0 |
director_fees | 146 | 19422.486301 | 119054.261157 | 0 | 0.00 | 0.0 | 0.00 | 1398517 |
email_address | 146 | 0.000000 | 0.000000 | 0 | 0.00 | 0.0 | 0.00 | 0 |
exercised_stock_options | 146 | 4182736.198630 | 26070399.807568 | 0 | 0.00 | 608293.5 | 1714220.75 | 311764000 |
expenses | 146 | 70748.267123 | 432716.319438 | 0 | 0.00 | 20182.0 | 53740.75 | 5235198 |
from_messages | 146 | 358.602740 | 1441.259868 | 0 | 0.00 | 16.5 | 51.25 | 14368 |
from_poi_to_this_person | 146 | 38.226027 | 73.901124 | 0 | 0.00 | 2.5 | 40.75 | 528 |
from_this_person_to_poi | 146 | 24.287671 | 79.278206 | 0 | 0.00 | 0.0 | 13.75 | 609 |
loan_advances | 146 | 1149657.534247 | 9649342.029695 | 0 | 0.00 | 0.0 | 0.00 | 83925000 |
long_term_incentive | 146 | 664683.945205 | 4046071.990875 | 0 | 0.00 | 0.0 | 375064.75 | 48521928 |
other | 146 | 585431.794521 | 3682344.576631 | 0 | 0.00 | 959.5 | 150606.50 | 42667589 |
poi | 146 | 0.123288 | 0.329899 | 0 | 0.00 | 0.0 | 0.00 | 1 |
restricted_stock | 146 | 1749257.020548 | 10899953.192164 | -2604490 | 8115.00 | 360528.0 | 814528.00 | 130322299 |
restricted_stock_deferred | 146 | 20516.369863 | 1439660.966040 | -7576788 | 0.00 | 0.0 | 0.00 | 15456290 |
salary | 146 | 365811.356164 | 2203574.963717 | 0 | 0.00 | 210596.0 | 270850.50 | 26704229 |
shared_receipt_with_poi | 146 | 692.986301 | 1072.969492 | 0 | 0.00 | 102.5 | 893.50 | 5521 |
to_messages | 146 | 1221.589041 | 2226.770637 | 0 | 0.00 | 289.0 | 1585.75 | 15149 |
total_payments | 146 | 4350621.993151 | 26934479.950729 | 0 | 93944.75 | 941359.5 | 1968286.75 | 309886585 |
total_stock_value | 146 | 5846018.075342 | 36246809.190047 | -44093 | 228869.50 | 965955.0 | 2319991.25 | 434509511 |
Number of persons of interest (poi = 1) and non-POIs (poi = 0):
bypoi = df.groupby(['poi'])
print bypoi['poi'].aggregate([len])
len
poi
0 128
1 18
2. Outlier Investigation
There were one outlier and two corrections in the dataset.
- Outlier: the data point with salary 26,704,229 is a clear outlier. It is the 'TOTAL' row, the sum of all salaries, which appears to have been parsed by accident from the financial PDF, so I deleted it.
del data_dict['TOTAL']
- Incorrect numbers: the BELFER ROBERT and BHATNAGAR SANJAY records were inconsistent with their totals, so I updated their values according to the Enron Statement of Financial Affairs PDF.
payment_cols = ['salary', 'bonus','long_term_incentive','deferral_payments','expenses','deferred_income','director_fees','loan_advances','other']
stock_cols = ['restricted_stock', 'exercised_stock_options','restricted_stock_deferred']
def check_consistency(df):
    consistency = pd.DataFrame()
    consistency['name'] = df['name']
    consistency['total1'] = df[payment_cols].sum(axis=1)
    consistency['total2'] = df[stock_cols].sum(axis=1)
    consistency['consistent_payments'] = (consistency['total1'] == df['total_payments'])
    consistency['consistent_stockvalue'] = (consistency['total2'] == df['total_stock_value'])
    checks = consistency[(consistency['consistent_payments'] == False) | (consistency['consistent_stockvalue'] == False)]['name'].tolist()
    return checks
check_names = check_consistency(df)
print check_names
#if len(check_names) > 0:
payment_cols.append('total_payments')
df[df['name'].isin(check_names)][payment_cols]
['BELFER ROBERT', 'BHATNAGAR SANJAY']
salary | bonus | long_term_incentive | deferral_payments | expenses | deferred_income | director_fees | loan_advances | other | total_payments | |
---|---|---|---|---|---|---|---|---|---|---|
8 | 0 | 0 | 0 | -102500 | 0 | 0 | 3285 | 0 | 0 | 102500 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 137864 | 0 | 137864 | 15456290 |
#if len(check_names) > 0:
stock_cols.append('total_stock_value')
df[df['name'].isin(check_names)][stock_cols]
restricted_stock | exercised_stock_options | restricted_stock_deferred | total_stock_value | |
---|---|---|---|---|
8 | 0 | 3285 | 44093 | -44093 |
11 | -2604490 | 2604490 | 15456290 | 0 |
payment_cols.remove('total_payments')
stock_cols.remove('total_stock_value')
data_dict['BELFER ROBERT']['deferred_income'] = -102500
data_dict['BELFER ROBERT']['deferral_payments'] = 0
data_dict['BELFER ROBERT']['expenses'] = 3285
data_dict['BELFER ROBERT']['director_fees'] = 102500
data_dict['BELFER ROBERT']['total_payments'] = 3285
data_dict['BHATNAGAR SANJAY']['total_payments'] =137864
data_dict['BHATNAGAR SANJAY']['expenses'] =137864
data_dict['BHATNAGAR SANJAY']['other'] = 0
data_dict['BHATNAGAR SANJAY']['director_fees'] = 0
data_dict['BELFER ROBERT']['exercised_stock_options'] = 0
data_dict['BELFER ROBERT']['restricted_stock'] = 44093
data_dict['BELFER ROBERT']['restricted_stock_deferred'] = -44093
data_dict['BELFER ROBERT']['total_stock_value'] = 0
data_dict['BHATNAGAR SANJAY']['exercised_stock_options'] = 15456290
data_dict['BHATNAGAR SANJAY']['restricted_stock'] = 2604490
data_dict['BHATNAGAR SANJAY']['restricted_stock_deferred'] = -2604490
data_dict['BHATNAGAR SANJAY']['total_stock_value'] = 15456290
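After these corrections, re-running the consistency check should return an empty list. A quick sketch, reusing the DataFrame construction from section 1 on the corrected data_dict:
df = pd.DataFrame.from_dict(data_dict, orient='index', dtype=np.float)
df.reset_index(level=0, inplace=True)
columns = list(df.columns)
columns[0] = 'name'
df.columns = columns
df.fillna(0, inplace=True)
print check_consistency(df)  # expect an empty list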
3. Feature Selection
- new features (fraction_from_poi, fraction_to_poi)

I created two new features, fraction_from_poi and fraction_to_poi: the number of messages exchanged with a POI divided by the total number of messages to or from this person. The left figure, a scatter plot of the original features, shows POIs (red points) spread sparsely; in the right figure, based on the new fraction features, the POIs are denser.
def compute_fraction(poi_messages, all_messages):
    """ given the number of messages to/from a POI (numerator)
        and the number of all messages to/from a person (denominator),
        return the fraction of messages to/from that person
        that are from/to a POI
    """
    import math
    if poi_messages == 0 or all_messages == 0 or math.isnan(float(poi_messages)) or math.isnan(float(all_messages)):
        return 0.
    return float(poi_messages) / float(all_messages)
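As a quick check of the helper, SKILLING JEFFREY K (shown further below) received 88 of his 3627 incoming messages from POIs:
print compute_fraction(88, 3627)  # 0.02426..., stored rounded as fraction_from_poi = 0.024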
submit_dict = {}
for name in data_dict:
    data_point = data_dict[name]
    from_poi_to_this_person = data_point["from_poi_to_this_person"]
    to_messages = data_point["to_messages"]
    fraction_from_poi = compute_fraction(from_poi_to_this_person, to_messages)
    data_point["fraction_from_poi"] = round(fraction_from_poi, 3)
    from_this_person_to_poi = data_point["from_this_person_to_poi"]
    from_messages = data_point["from_messages"]
    fraction_to_poi = compute_fraction(from_this_person_to_poi, from_messages)
    submit_dict[name] = {"from_poi_to_this_person": fraction_from_poi,
                         "from_this_person_to_poi": fraction_to_poi}
    data_point["fraction_to_poi"] = round(fraction_to_poi, 3)
## append the two new features to the list
if 'fraction_from_poi' not in features_list:
    features_list.append('fraction_from_poi')
if 'fraction_to_poi' not in features_list:
    features_list.append('fraction_to_poi')
%pylab inline
import matplotlib.pyplot as plt
def graph_scatter_with_poi(var1, var2):
    for name in data_dict:
        point = data_dict[name]
        poi = point['poi']
        x = point[var1]
        y = point[var2]
        if poi:
            plt.scatter(x, y, color='red')
        else:
            plt.scatter(x, y, color='blue')
    plt.xlabel(var1)
    plt.ylabel(var2)
plt.figure(1, figsize=(16, 5))
plt.subplot(1,2,1)
graph_scatter_with_poi('from_poi_to_this_person', 'from_this_person_to_poi')
plt.subplot(1,2,2)
graph_scatter_with_poi('fraction_from_poi', 'fraction_to_poi')
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['random']
`%matplotlib` prevents importing * from pylab and numpy
An example of an employee record with the new features:
data_dict['SKILLING JEFFREY K']
{'bonus': 5600000,
'deferral_payments': 'NaN',
'deferred_income': 'NaN',
'director_fees': 'NaN',
'email_address': 'jeff.skilling@enron.com',
'exercised_stock_options': 19250000,
'expenses': 29336,
'fraction_from_poi': 0.024,
'fraction_to_poi': 0.278,
'from_messages': 108,
'from_poi_to_this_person': 88,
'from_this_person_to_poi': 30,
'loan_advances': 'NaN',
'long_term_incentive': 1920000,
'other': 22122,
'poi': True,
'restricted_stock': 6843672,
'restricted_stock_deferred': 'NaN',
'salary': 1111258,
'shared_receipt_with_poi': 2042,
'to_messages': 3627,
'total_payments': 8682716,
'total_stock_value': 26093672}
Intelligently select features
- automated feature selection

I transformed the features with feature scaling (MinMaxScaler) and feature selection (SelectKBest); SelectKBest is also used for tuning in section 4.
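As a brief illustration of why scaling matters before algorithms such as SVC (a toy example with made-up rows loosely based on the records above, not part of the project pipeline), MinMaxScaler maps each column to [0, 1] so that salaries in the millions and fractions below one become comparable:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X_demo = np.array([[1111258., 0.278],  # salary, fraction_to_poi
                   [477., 0.0],
                   [267102., 0.139]])
print MinMaxScaler().fit_transform(X_demo)  # every column now spans [0, 1]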
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from tester import dump_classifier_and_data
from feature_format import featureFormat, targetFeatureSplit
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, f_regression
from sklearn import cross_validation
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
import pprint
### Store to my_dataset for easy export below.
my_dataset = data_dict
### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
folds = 1000
random = 13
cv = StratifiedShuffleSplit(labels, folds, random_state=random)
mdf = []
combined_features = FeatureUnion( [
('scaler', MinMaxScaler())
])
I did not include to_messages and from_messages in features_list, because the new features fraction_from_poi and fraction_to_poi are more relevant for predicting POI, as the table below shows.
features_list_fs = list(features_list)
if 'to_messages' not in features_list_fs:
    features_list_fs.append('to_messages')
if 'from_messages' not in features_list_fs:
    features_list_fs.append('from_messages')
data_fs = featureFormat(my_dataset, features_list_fs, sort_keys = True)
labels_fs, features_fs = targetFeatureSplit(data_fs)
pipeline = Pipeline([
("features", combined_features),
('kbest', SelectKBest(k='all', score_func=f_classif)),
('DecisionTree', DecisionTreeClassifier(random_state=random, min_samples_split=20, criterion='entropy', max_features=None))
])
pipeline.fit(features_fs, labels_fs)
scores = pipeline.named_steps['kbest'].scores_
df = pd.DataFrame(data = list(zip(features_list_fs[1:], scores)), columns=['Feature', 'Score'])
df
Feature | Score | |
---|---|---|
0 | salary | 18.575703 |
1 | bonus | 21.060002 |
2 | long_term_incentive | 10.072455 |
3 | deferral_payments | 0.221214 |
4 | expenses | 5.550684 |
5 | deferred_income | 11.561888 |
6 | director_fees | 2.112762 |
7 | loan_advances | 7.242730 |
8 | other | 4.219888 |
9 | restricted_stock | 8.958540 |
10 | exercised_stock_options | 22.610531 |
11 | restricted_stock_deferred | 0.761863 |
12 | shared_receipt_with_poi | 8.746486 |
13 | fraction_from_poi | 3.230112 |
14 | fraction_to_poi | 16.642573 |
15 | to_messages | 1.698824 |
16 | from_messages | 0.164164 |
Also, I confirmed that precision and recall were higher without the original message-count features. This means that without them, I have a better chance of finding real POIs and a lower chance that non-POIs get flagged.
Without to_messages, from_messages: Accuracy: 0.86073 Precision: 0.47707 Recall: 0.46300 F1: 0.46993 F2: 0.46575
With to_messages, from_messages: Accuracy: 0.85860 Precision: 0.46870 Recall: 0.45300 F1: 0.46072 F2: 0.45606
4. Algorithms
I fed the transformed features to each of the following algorithms:
- SVC (svm)
- DecisionTreeClassifier (tree)
- GaussianNB (naive bayes)
SVM
pipeline = Pipeline([("features", combined_features), ('svc', SVC())])
param_grid = {
'svc__kernel': [ 'sigmoid', 'poly','rbf'],
#'svc__C': [0.1, 1, 10],
'svc__gamma': ['auto'],
'svc__class_weight' :[None, 'balanced']
}
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=cv, scoring='f1', verbose=1) # f1 for binary targets
grid_search.fit(features, labels)
print grid_search.best_score_
print grid_search.best_params_
Fitting 1000 folds for each of 6 candidates, totalling 6000 fits
0.419783910534
{'svc__gamma': 'auto', 'svc__class_weight': 'balanced', 'svc__kernel': 'rbf'}
[Parallel(n_jobs=1)]: Done 6000 out of 6000 | elapsed: 25.4s finished
clf_fin = pipeline.set_params(**grid_search.best_params_)
test_classifier(clf_fin, my_dataset, features_list)
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)))],
transformer_weights=None)), ('svc', SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))])
Accuracy: 0.80380 Precision: 0.34893 Recall: 0.54450 F1: 0.42531 F2: 0.48961
Total predictions: 15000 True positives: 1089 False positives: 2032 False negatives: 911 True negatives: 10968
DecisionTreeClassifier
pipeline = Pipeline([("features", combined_features), ('DecisionTree', DecisionTreeClassifier(random_state=random))])
param_grid = {
'DecisionTree__min_samples_split':[20, 30, 40],
'DecisionTree__max_features': [None, 'auto', 'log2'],
'DecisionTree__criterion': ['gini', 'entropy']
}
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=cv, scoring='f1', verbose=1) # f1 for binary targets
grid_search.fit(features, labels)
print grid_search.best_score_
print grid_search.best_params_
Fitting 1000 folds for each of 18 candidates, totalling 18000 fits
0.427048412698
{'DecisionTree__criterion': 'entropy', 'DecisionTree__min_samples_split': 20, 'DecisionTree__max_features': None}
[Parallel(n_jobs=1)]: Done 18000 out of 18000 | elapsed: 1.1min finished
fi = grid_search.best_estimator_.named_steps['DecisionTree'].feature_importances_
df = pd.DataFrame(data = list(zip(features_list[1:], fi)), columns=['Feature', 'Importance'])
df
Feature | Importance | |
---|---|---|
0 | salary | 0.000000 |
1 | bonus | 0.093646 |
2 | long_term_incentive | 0.000000 |
3 | deferral_payments | 0.000000 |
4 | expenses | 0.230564 |
5 | deferred_income | 0.000000 |
6 | director_fees | 0.000000 |
7 | loan_advances | 0.000000 |
8 | other | 0.484556 |
9 | restricted_stock | 0.000000 |
10 | exercised_stock_options | 0.000000 |
11 | restricted_stock_deferred | 0.000000 |
12 | shared_receipt_with_poi | 0.000000 |
13 | fraction_from_poi | 0.000000 |
14 | fraction_to_poi | 0.191234 |
clf_fin = pipeline.set_params(**grid_search.best_params_)
test_classifier(clf_fin, my_dataset, features_list)
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)))],
transformer_weights=None)), ('DecisionTree', DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=20, min_weight_fraction_leaf=0.0,
presort=False, random_state=13, splitter='best'))])
Accuracy: 0.86073 Precision: 0.47707 Recall: 0.46300 F1: 0.46993 F2: 0.46575
Total predictions: 15000 True positives: 926 False positives: 1015 False negatives: 1074 True negatives: 11985
GaussianNB
pipeline = Pipeline([("features", combined_features), ('GaussianNB', GaussianNB())])
param_grid = {
## no params
}
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=cv, scoring='f1', verbose=1) # f1 for binary targets
grid_search.fit(features, labels)
print grid_search.best_score_
print grid_search.best_params_
Fitting 1000 folds for each of 1 candidates, totalling 1000 fits
0.265048897508
{}
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed: 1.8s finished
clf_fin = pipeline.set_params(**grid_search.best_params_)
test_classifier(clf_fin, my_dataset, features_list)
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)))],
transformer_weights=None)), ('GaussianNB', GaussianNB())])
Accuracy: 0.34073 Precision: 0.15589 Recall: 0.89350 F1: 0.26547 F2: 0.45908
Total predictions: 15000 True positives: 1787 False positives: 9676 False negatives: 213 True negatives: 3324
Overall result
SVM
Accuracy: 0.80380 Precision: 0.34893 Recall: 0.54450 F1: 0.42531 F2: 0.48961
Accuracy: 0.80380 Precision: 0.34893 Recall: 0.54450 F1: 0.42531 F2: 0.48961
Accuracy: 0.80380 Precision: 0.34893 Recall: 0.54450 F1: 0.42531 F2: 0.48961
DecisionTreeClassifier
Accuracy: 0.86120 Precision: 0.47880 Recall: 0.46300 F1: 0.47077 F2: 0.46608
Accuracy: 0.86067 Precision: 0.47690 Recall: 0.46450 F1: 0.47062 F2: 0.46693
Accuracy: 0.86093 Precision: 0.47790 Recall: 0.46500 F1: 0.47136 F2: 0.46752
GaussianNB
Accuracy: 0.34073 Precision: 0.15589 Recall: 0.89350 F1: 0.26547 F2: 0.45908
Accuracy: 0.34073 Precision: 0.15589 Recall: 0.89350 F1: 0.26547 F2: 0.45908
Accuracy: 0.34073 Precision: 0.15589 Recall: 0.89350 F1: 0.26547 F2: 0.45908
I tested each algorithm three times to make sure its performance was stable. As a result, DecisionTreeClassifier was the best.
Tune the algorithm
Tuning means adjusting an algorithm's parameters to improve its performance. Without careful tuning, overfitting can occur: the model classifies the training data correctly, but its predictions do not generalize. I can control this problem through the algorithm's parameters, so I tuned them with GridSearchCV.
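To make the overfitting risk concrete, here is a small sketch using the features and labels extracted above with a single hold-out split (rather than the full cross-validation): an unconstrained tree typically scores 1.0 on its own training data, while limiting max_depth usually narrows the train/test gap.
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.3, random_state=13)
for depth in [None, 3]:
    t = DecisionTreeClassifier(max_depth=depth, random_state=13).fit(X_tr, y_tr)
    print 'max_depth=%s: train %.3f, test %.3f' % (depth, t.score(X_tr, y_tr), t.score(X_te, y_te))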
Since DecisionTreeClassifier showed the best performance, I decided to tune additional parameters to improve it further. I added feature selection (SelectKBest) to the pipeline and found that selecting k = 12 features provides the best score.
%%time
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('kbest', SelectKBest()),
('dtree', DecisionTreeClassifier(random_state=random))])
param_grid = {
#'kbest__k':[1, 2, 3, 4, 5],
#'kbest__k':[6,7,8,9,10],
'kbest__k':[11,12,13,14,15],
'dtree__max_features': [None, 'auto'],
'dtree__criterion': ['entropy'],
'dtree__max_depth': [None, 3, 5],
'dtree__min_samples_split': [2, 1, 3],
'dtree__min_samples_leaf': [1, 2],
'dtree__min_weight_fraction_leaf': [0, 0.5],
'dtree__class_weight': [{1: 1, 0: 1}, {1: 0.8, 0: 0.3}, {1:0.7, 0:0.4}],
'dtree__splitter': ['best', 'random']
}
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=cv, scoring='f1', verbose=1) # f1 for binary targets
grid_search.fit(features, labels)
print grid_search.best_score_
print grid_search.best_params_
# k = 1,2,3,4,5
# Fitting 1000 folds for each of 2160 candidates, totalling 2160000 fits
# 0.36747950938
# {'dtree__min_samples_leaf': 2, 'dtree__min_samples_split': 2, 'kbest__k': 5, 'dtree__splitter': 'random',
#'dtree__max_features': None, 'dtree__max_depth': 5, 'dtree__min_weight_fraction_leaf': 0,
#'dtree__class_weight': {0: 0.3, 1: 0.8}, 'dtree__criterion': 'entropy'}
# CPU times: user 3h 15min 23s, sys: 8.36 s, total: 3h 15min 32s
# Wall time: 3h 15min 26s
# k = 6,7,8,9,10
# Fitting 1000 folds for each of 2160 candidates, totalling 2160000 fits
# 0.472646031746
# {'dtree__min_samples_leaf': 2, 'dtree__min_samples_split': 2, 'kbest__k': 10, 'dtree__splitter': 'random', 'dtree__max_features': 'auto',
#'dtree__max_depth': 3, 'dtree__min_weight_fraction_leaf': 0, 'dtree__class_weight': {0: 0.3, 1: 0.8}, 'dtree__criterion': 'entropy'}
# CPU times: user 3h 16min 43s, sys: 8.48 s, total: 3h 16min 51s
# Wall time: 3h 16min 44s
[Parallel(n_jobs=1)]: Done 2160000 out of 2160000 | elapsed: 195.4min finished
Fitting 1000 folds for each of 2160 candidates, totalling 2160000 fits
0.509383333333
{'dtree__min_samples_leaf': 1, 'dtree__min_samples_split': 3, 'kbest__k': 12, 'dtree__splitter': 'best', 'dtree__max_features': None, 'dtree__max_depth': 3, 'dtree__min_weight_fraction_leaf': 0, 'dtree__class_weight': {0: 0.3, 1: 0.8}, 'dtree__criterion': 'entropy'}
CPU times: user 3h 15min 23s, sys: 7.57 s, total: 3h 15min 31s
Wall time: 3h 15min 23s
bi = grid_search.best_estimator_.named_steps['kbest'].get_support()
df = pd.DataFrame(data = list(zip(features_list[1:], bi)), columns=['Feature', 'Selected?'])
selected_features = df[df['Selected?']]['Feature'].tolist()
fi = grid_search.best_estimator_.named_steps['dtree'].feature_importances_
df = pd.DataFrame(data = list(zip(selected_features, fi)), columns=['Feature', 'Importance'])
df
Feature | Importance | |
---|---|---|
0 | salary | 0.000000 |
1 | bonus | 0.000000 |
2 | long_term_incentive | 0.000000 |
3 | expenses | 0.460502 |
4 | deferred_income | 0.000000 |
5 | loan_advances | 0.000000 |
6 | other | 0.132632 |
7 | restricted_stock | 0.000000 |
8 | exercised_stock_options | 0.000000 |
9 | shared_receipt_with_poi | 0.124567 |
10 | fraction_from_poi | 0.000000 |
11 | fraction_to_poi | 0.282299 |
clf_fin = pipeline.set_params(**grid_search.best_params_)
test_classifier(clf_fin, my_dataset, features_list)
Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('kbest', SelectKBest(k=12, score_func=<function f_classif at 0x7f1d1ebf17d0>)), ('dtree', DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.8}, criterion='entropy',
max_depth=3, max_features=None, max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=3,
min_weight_fraction_leaf=0, presort=False, random_state=13,
splitter='best'))])
Accuracy: 0.86240 Precision: 0.48533 Recall: 0.52950 F1: 0.50646 F2: 0.52004
Total predictions: 15000 True positives: 1059 False positives: 1123 False negatives: 941 True negatives: 11877
%%time
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('kbest', SelectKBest()),
('dtree', DecisionTreeClassifier(random_state=random))])
param_grid = {
'kbest__k':[12],
'dtree__max_features': [None],
'dtree__criterion': ['entropy'],
'dtree__max_depth': [3],
'dtree__min_samples_split': [1],
'dtree__min_samples_leaf': [1],
'dtree__min_weight_fraction_leaf': [0],
'dtree__class_weight': [{1: 0.8, 0: 0.3}, {1: 0.8, 0: 0.35}, {1: 0.8, 0: 0.25}, {1: 0.9, 0: 0.2}, {1: 0.85, 0: 0.15}],
'dtree__splitter': ['best']
}
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=cv, scoring='f1', verbose=1) # f1 for binary targets
grid_search.fit(features, labels)
print grid_search.best_score_
print grid_search.best_params_
Fitting 1000 folds for each of 5 candidates, totalling 5000 fits
0.509187445887
{'dtree__min_samples_leaf': 1, 'dtree__min_samples_split': 1, 'kbest__k': 12, 'dtree__splitter': 'best', 'dtree__max_features': None, 'dtree__max_depth': 3, 'dtree__min_weight_fraction_leaf': 0, 'dtree__class_weight': {0: 0.25, 1: 0.8}, 'dtree__criterion': 'entropy'}
CPU times: user 25.9 s, sys: 52 ms, total: 26 s
Wall time: 26 s
[Parallel(n_jobs=1)]: Done 5000 out of 5000 | elapsed: 26.0s finished
clf_fin = pipeline.set_params(**grid_search.best_params_)
test_classifier(clf_fin, my_dataset, features_list)
Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('kbest', SelectKBest(k=12, score_func=<function f_classif at 0x7f1d1ebf17d0>)), ('dtree', DecisionTreeClassifier(class_weight={0: 0.25, 1: 0.8}, criterion='entropy',
max_depth=3, max_features=None, max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=1,
min_weight_fraction_leaf=0, presort=False, random_state=13,
splitter='best'))])
Accuracy: 0.86133 Precision: 0.48203 Recall: 0.53650 F1: 0.50781 F2: 0.52464
Total predictions: 15000 True positives: 1073 False positives: 1153 False negatives: 927 True negatives: 11847
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.
clf = clf_fin
dump_classifier_and_data(clf, my_dataset, features_list)
5. Evaluation
Validation strategy
Validation is a way of assessing whether my algorithm is actually doing what I want it to do. First and foremost, splitting the dataset into a training set and a testing set is important, because it gives an estimate of performance on an independent dataset and serves as a check on overfitting.
A classic mistake lies in how the dataset is split. One wants as many data points as possible in the training set to get the best learning results, and also the maximum number of data points in the test set to get the best validation, but there is an inherent trade-off between the sizes of the training and test sets. In addition, the original dataset might be ordered so that classes appear in big lumps of the same label; in that case, we need to shuffle the dataset when splitting.
I used StratifiedShuffleSplit, which keeps the same target distribution in the training and testing sets. This is particularly important for imbalanced datasets like this one, with few POIs among the targets, and it reduces the variability in the model's predictive performance across train/test splits. When using cross-validation for model selection, all folds should have the same target distribution, which StratifiedKFold provides.
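A quick sanity check of this property, as a standalone sketch using the 18 POI / 128 non-POI counts from section 1:
from sklearn.cross_validation import StratifiedShuffleSplit
labels_demo = [0] * 128 + [1] * 18
sss = StratifiedShuffleSplit(labels_demo, n_iter=3, test_size=0.3, random_state=13)
for train_idx, test_idx in sss:
    poi_in_test = sum(labels_demo[i] for i in test_idx)
    print 'POI ratio in test fold: %.3f' % (float(poi_in_test) / len(test_idx))
# each fold keeps roughly the overall 18/146 = 0.123 POI ratio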
Evaluation metrics
As a result, I got 86% accuracy in correctly classifying POI versus non-POI based on insider payments and email statistics.
In this project, there is a small number of POIs among a large number of employees. Because of this asymmetry, it is important to check both precision and recall.
Precision is a measure of exactness, e.g., how many selected employees are indeed POIs? Precision was 48%, meaning that out of 100 employees the model flags as POI, about 48 are real POIs.
Recall is a measure of completeness, e.g., how many of the POIs are found? Recall was 53%, meaning that given 100 POIs in a dataset, the model finds about 53 of them.
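These percentages follow directly from the final confusion matrix reported above (true positives 1073, false positives 1153, false negatives 927, true negatives 11847):
tp, fp, fn, tn = 1073, 1153, 927, 11847
print 'precision: %.5f' % (float(tp) / (tp + fp))                 # 0.48203
print 'recall:    %.5f' % (float(tp) / (tp + fn))                 # 0.53650
print 'accuracy:  %.5f' % (float(tp + tn) / (tp + fp + fn + tn))  # 0.86133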
References
- http://stackoverflow.com/questions/18837262/convert-python-dict-into-a-dataframe
- http://stackoverflow.com/questions/27934885/how-to-hide-code-from-cells-in-ipython-notebook-visualized-with-nbviewer
- https://discussions.udacity.com/t/final-project-text-learning/28169/13
- http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
- http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html
- https://jaycode.github.io/enron/identifying-fraud-from-enron-email.html
- http://stats.stackexchange.com/questions/62621/recall-and-precision-in-classification
Appendix
- text learning with Enron email

To access all of the emails in the Enron corpus, I looped through all of the files in the 'maildir' folder. I saved word_data, name_data, email_data, and poi_data into enron_word_data.pkl: word_data holds the words in each person's emails, name_data the owner of the emails, email_data whether the mail is from or to that person, and poi_data each person's POI flag.
There were similar cases in the [forum](https://discussions.udacity.com/t/final-project-text-learning/28169/13). However, one coach said that adding features to the dataset based on the original text corpus goes beyond the expectations of the project. I nevertheless tried predicting POI from the email text. Accuracy was 0.777778, but precision, recall, and F1 were poor; I think more advanced text-learning techniques would be required to improve performance.
from os import listdir
from os.path import isfile
sys.path.append("../tools/")
from parse_out_email_text import parseOutText
word_data = []
name_data = []
email_data = []
poi_data = []
email_dataset_file = 'enron_word_data.pkl'
if isfile(email_dataset_file):
    email_dataset = pickle.load(open(email_dataset_file, "r"))
    word_data, name_data, email_data, poi_data = zip(*email_dataset)
else:
    def parse_email(email_list, word_data):
        for path in email_list:
            path = '..' + path[19:-1]
            email = open(path, "r")
            text = parseOutText(email)
            word_data.append(text)
            email.close()
        return word_data

    email_columns = ['name', 'email_address', 'poi']
    df_email = df[email_columns]
    for idx, row in df_email.iterrows():
        name = row['name']
        email = row['email_address']
        poi = row['poi']
        print name, email, poi
        if email != 'NaN':
            email_files = ['emails_by_address/from_' + email + '.txt',
                           'emails_by_address/to_' + email + '.txt']
            for index, f in enumerate(email_files):
                if isfile(f):
                    email_list = open(f, 'r')
                    word_data = parse_email(email_list, word_data)
                    name_data.append(name)
                    poi_data.append(poi)
                    if index == 0:
                        email_data.append('from')
                    else:
                        email_data.append('to')
                    email_list.close()
    email_dataset = zip(word_data, name_data, email_data, poi_data)
    pickle.dump(email_dataset, open('enron_word_data.pkl', 'w'))
# if False:
# email_messages = []
# for idx, message in enumerate(word_data):
# key = name_data[idx]
# if email_data[idx] == 'from':
# if 'from_messages_texts' not in data_dict[key].keys():
# data_dict[key]['from_messages_texts'] = []
# data_dict[key]['from_messages_texts'].append(message)
# elif email_data[idx] == 'to':
# if 'to_messages_texts' not in data_dict[key].keys():
# data_dict[key]['to_messages_texts'] = []
# data_dict[key]['to_messages_texts'].append(message)
# # For people without email messages:
# for key, value in data_dict.items():
# if 'from_messages_texts' not in value.keys():
# data_dict[key]['from_messages_texts'] = []
# if 'to_messages_texts' not in value.keys():
# data_dict[key]['to_messages_texts'] = []
# features_list.append('to_messages_texts')
# features_list.append('from_messages_texts')
# data_dict['SKILLING JEFFREY K']
from sklearn import cross_validation
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.feature_extraction.text import TfidfVectorizer
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, poi_data, test_size=0.1, random_state=42)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)
selector = SelectPercentile(f_classif, percentile=20)
selector.fit(features_train, labels_train)
features_train = selector.transform(features_train).toarray()
features_test = selector.transform(features_test).toarray()
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_split=40)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy = clf.score(features_test, labels_test)
i = 0
vocab_list = vectorizer.get_feature_names()
for c in clf.feature_importances_:
    #if c >= .2:
    if c > 0.0:
        print('feature importance: ', c, 'index: ', i, 'voca: ', vocab_list[i])
    i += 1
result = pd.DataFrame([[accuracy, precision_score(labels_test, pred), recall_score(labels_test,pred), f1_score(labels_test,pred)]], \
columns=['accuracy', 'precision', 'recall', 'f1'])
result