I have been studying ML and DL for a while now. Going forward I plan to use Jianshu to record my Kaggle learning and practice, as a way to accumulate technical notes, and also to collect some AI-related articles.
Titanic competition page:
https://www.kaggle.com/c/titanic/kernels?sortBy=voteCount&group=everyone&pageSize=20&language=Python&competitionId=3136
I studied two of the kernels there:
1.https://www.kaggle.com/startupsci/titanic-data-science-solutions
The first kernel is quite detailed: it explains an approach to the problem, then validates it step by step, building up the feature engineering and visually inspecting every feature it constructs; finally it trains several machine-learning models, predicts, and compares the results.
It is therefore well suited to readers who are new to hands-on machine learning.
Below is a summary of what I learned from it:
# Summary:
# 1. Split the features into numeric and categorical (string) ones, then look at each group's distribution and its relationship with Survived; describe() shows the overall distribution
# 2. Also look at how features relate to Survived jointly
#    After inspecting the distributions you can form hypotheses; in this example, Survived looks strongly related to Sex, Pclass, Age, etc.
# 3. Drop features: e.g. those with many missing values, or those barely related to Survived, such as Name and PassengerId
# 4. Fill missing values: for numeric features the median is recommended; for categorical features use the most frequent category
# 5. Create new features; this takes experience, e.g. this example adds FamilySize, IsAlone, etc.
# 6. Map categorical features to numeric values
# Tips
# Statistics for numeric columns: conceptually each column is sorted, then the min, 25th percentile, 50th percentile, etc. are computed
# train_df.describe()
# Statistics for string (object) columns: count, unique, top, freq
# train_df.describe(include=['O'])
# sns.FacetGrid
# Use histograms to analyse numeric features and point plots (sns.pointplot) to analyse categorical ones, as in the sketch below
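To make the tips concrete, here is a minimal sketch of those inspection calls (my own illustration, assuming train_df is loaded from the Kaggle train.csv):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
train_df = pd.read_csv('./data/train.csv')
# min / 25% / 50% / 75% / max for the numeric columns
print(train_df.describe())
# count / unique / top / freq for the string (object) columns
print(train_df.describe(include=['O']))
# Histogram of a numeric feature, split by Survived
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
# Point plot of a categorical feature against survival rate
sns.pointplot(x='Pclass', y='Survived', hue='Sex', data=train_df)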
2.https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
The second kernel mostly lets the code speak for itself. Its feature engineering resembles the first kernel's, so after getting familiar with the first one the second is easy to follow, and it is also a good opportunity to reflect on how to study such kernels.
Overall, building good feature engineering requires being comfortable with visual analysis.
The code published online has a few errors; my corrected feature-engineering code follows:
# Load in our libraries
import pandas as pd
import numpy as np
import random as rnd
import re
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# machine learning
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, ExtraTreesClassifier)
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import KFold
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')
combine = [train_df, test_df]
train_df['Name_length'] = train_df['Name'].apply(len)
test_df['Name_length'] = test_df['Name'].apply(len)
# Feature that tells whether a passenger had a cabin on the Titanic
# (a missing Cabin is NaN, which has type float, hence the type check)
train_df['Has_Cabin'] = train_df["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test_df['Has_Cabin'] = test_df["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
# train_df.info()
# test_df.info()
# The unused columns (Cabin, PassengerId, Ticket) are dropped per-dataset inside the loop below
# print(train_df.describe())
# print(train_df.info())
def get_title(name):
    # A title such as 'Mr.' sits between a space and a dot
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it
    if title_search:
        return title_search.group(1)
    return ""
full_df = []
for dataset in combine:
    # --- string (object) columns ---
    dataset = dataset.drop(['Cabin', 'PassengerId', 'Ticket'], axis=1)
    # Title
    dataset['Title'] = dataset['Name'].apply(get_title)
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    # Mapping titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    dataset = dataset.drop('Name', axis=1)
    dataset['Sex'] = dataset['Sex'].map({'male': 0, 'female': 1}).astype(int)
    # Embarked: fill the missing values with the most frequent port, 'S'
    dataset['Embarked'] = dataset['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
    # --- numeric columns ---
    dataset['Pclass'] = dataset['Pclass'].fillna(dataset['Pclass'].mean())
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
    dataset = dataset.drop(['SibSp'], axis=1)
    # Fare: describe() gives 7.9104, 14.4542 and 31 as the 25%/50%/75% quantiles,
    # which serve as the bin edges below
    # sns.distplot(train_df['Fare'], bins=10, rug=True)
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].dropna().mean())
    dataset.loc[dataset['Fare'] < 7.9104, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] >= 7.9104) & (dataset['Fare'] < 14.4542), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] >= 14.4542) & (dataset['Fare'] < 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] >= 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    # Age: fill missing values with the median, then bin into five groups
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].dropna().median())
    # sns.distplot(dataset['Age'], bins=10, rug=True)
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
    dataset['Age'] = dataset['Age'].astype(int)
    full_df.append(dataset)
# The two processed frames, in the same order as combine
train, test = full_df
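# Sanity check (my own addition): after the loop both frames should be fully
# numeric with no missing values
assert not train.isnull().values.any()
assert not test.isnull().values.any()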
# Some useful parameters which will come in handy later on
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0 # for reproducibility
NFOLDS = 5 # set folds for out-of-fold prediction
# KFold from model_selection: shuffle must be on for random_state to take effect
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)
# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    'warm_start': True,
    # 'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'verbose': 0
}
# Extra Trees parameters
et_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    # 'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}
# AdaBoost parameters
ada_params = {
    'n_estimators': 500,
    'learning_rate': 0.75
}
# Gradient Boosting parameters
gb_params = {
    'n_estimators': 500,
    # 'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}
# Support Vector Classifier parameters
svc_params = {
    'kernel': 'linear',
    'C': 0.025
}
# Class to extend the Sklearn classifier
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        # Return the importances (rather than just printing them) so the
        # assignments at the bottom of the script actually capture them
        return self.clf.fit(x, y).feature_importances_
def get_oof(clf, x_train, y_train, x_test):
    """Generate out-of-fold predictions for the training set and an
    averaged prediction for the test set."""
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))
    # kf.split replaces iterating the KFold object directly (old sklearn API)
    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]
        clf.train(x_tr, y_tr)
        # Each training row is predicted by the model that did NOT see it
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
    # Average the NFOLDS test-set predictions
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
print('*' * 50)
print(rf_params)
# Create 5 objects, one per first-level model
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)
# Create Numpy arrays of train, test and target ( Survived) dataframes to feed into our models
y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values # Creates an array of the train data
x_test = test.values # Creates an array of the test data
# Create our OOF train and test predictions. These base results will be used as new features
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier
print("Training is complete")
rf_feature = rf.feature_importances(x_train,y_train)
et_feature = et.feature_importances(x_train, y_train)
ada_feature = ada.feature_importances(x_train, y_train)
gb_feature = gb.feature_importances(x_train,y_train)
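The code above stops at the base models. For completeness, the original kernel goes on to concatenate the out-of-fold predictions into a new feature matrix and trains a second-level XGBoost model on it. A minimal sketch of that step (the XGBClassifier parameters are the kernel's, quoted from memory, so treat them as an assumption):
import xgboost as xgb
# Each *_oof_train is shape (ntrain, 1); side by side they form the
# second-level training matrix, one column per base model
x_train2 = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train,
                           gb_oof_train, svc_oof_train), axis=1)
x_test2 = np.concatenate((et_oof_test, rf_oof_test, ada_oof_test,
                          gb_oof_test, svc_oof_test), axis=1)
gbm = xgb.XGBClassifier(n_estimators=2000, max_depth=4, min_child_weight=2,
                        gamma=0.9, subsample=0.8, colsample_bytree=0.8,
                        nthread=-1, scale_pos_weight=1)
gbm.fit(x_train2, y_train)
predictions = gbm.predict(x_test2)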