我们使用均值漂移,继续聚类和非监督学习的话题,这次将其用于我们的泰坦尼克数据集。
这里有一些随机度,所以你的结果可能并不相同,然而你可以重新运行程序来获取相似结果,如果你没有得到相似结果的话。
我们打算通过均值漂移聚类来看一看泰坦尼克数据集。我们感兴趣的是,是否均值漂移能够自动将乘客分离为分组。如果能,检查它创建的分组就很有趣了。第一个明显的兴趣点就是,所发现分组的幸存率,但是,我们也会深入这些分组的属性,来观察我们是否能够理解,均值漂移为什么决定了特定的分组。
首先,我们使用已经看过的代码:
import numpy as np
from sklearn.cluster import MeanShift, KMeans
from sklearn import preprocessing, cross_validation
import pandas as pd
import matplotlib.pyplot as plt
'''
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival Survival (0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare (British pound)
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home.dest Home/Destination
'''
# https://pythonprogramming.net/static/downloads/machine-learning-data/titanic.xls
df = pd.read_excel('titanic.xls')
original_df = pd.DataFrame.copy(df)
df.drop(['body','name'], 1, inplace=True)
df.fillna(0,inplace=True)
def handle_non_numerical_data(df):
# handling non-numerical data: must convert.
columns = df.columns.values
for column in columns:
text_digit_vals = {}
def convert_to_int(val):
return text_digit_vals[val]
#print(column,df[column].dtype)
if df[column].dtype != np.int64 and df[column].dtype != np.float64:
column_contents = df[column].values.tolist()
#finding just the uniques
unique_elements = set(column_contents)
# great, found them.
x = 0
for unique in unique_elements:
if unique not in text_digit_vals:
# creating dict that contains new
# id per unique string
text_digit_vals[unique] = x
x+=1
# now we map the new "id" vlaue
# to replace the string.
df[column] = list(map(convert_to_int,df[column]))
return df
df = handle_non_numerical_data(df)
df.drop(['ticket','home.dest'], 1, inplace=True)
X = np.array(df.drop(['survived'], 1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])
clf = MeanShift()
clf.fit(X)
除了两个例外,一个是original_df = pd.DataFrame.copy(df)
,在我们将csv
文件读取到df
对象之后。另一个是从sklearn.cluster
导入MeanShift
,并且用其作为我们的聚类器。我们生成了副本,以便之后引用原始非数值形式的数据。
既然我们创建了拟合,我们可以从clf
对象获取一些属性。
labels = clf.labels_
cluster_centers = clf.cluster_centers_
下面,我们打算向我们的原始数据帧添加新的一项。
original_df['cluster_group']=np.nan
现在,我们可以迭代标签,并向空列添加新的标签。
for i in range(len(X)):
original_df['cluster_group'].iloc[i] = labels[i]
现在我们可以检查每个分组的幸存率:
n_clusters_ = len(np.unique(labels))
survival_rates = {}
for i in range(n_clusters_):
temp_df = original_df[ (original_df['cluster_group']==float(i)) ]
#print(temp_df.head())
survival_cluster = temp_df[ (temp_df['survived'] == 1) ]
survival_rate = len(survival_cluster) / len(temp_df)
#print(i,survival_rate)
survival_rates[i] = survival_rate
print(survival_rates)
如果我们执行它,我们会得到:
{0: 0.3796583850931677, 1: 0.9090909090909091, 2: 0.1}
同样,你可能获得更多分组。我这里获得了三个,但是我在这个数据集上获得过六个分组。现在,我们看到分组 0 的幸存率是 38%,分组 1 是 91%,分组 2 是 10%。这就有些奇怪了,因为我们知道船上有三个真实的“乘客分类”。我想知道是不是 0 就是二等舱,1 就是头等舱, 2 是三等舱。船上的舱是,3 等舱在最底下,头等舱在最上面,底部首先淹没,然后顶部是救生船的地方。我可以深入看一看:
print(original_df[ (original_df['cluster_group']==1) ])
我们获取cluster_group
为 1 的original_df
。
打印出来:
pclass survived name \
17 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput)
49 1 1 Cardeza, Mr. Thomas Drake Martinez
50 1 1 Cardeza, Mrs. James Warburton Martinez (Charlo...
66 1 1 Chaudanson, Miss. Victorine
97 1 1 Douglas, Mrs. Frederick Charles (Mary Helene B...
116 1 1 Fortune, Mrs. Mark (Mary McDougald)
183 1 1 Lesurer, Mr. Gustave J
251 1 1 Ryerson, Miss. Susan Parker "Suzette"
252 1 0 Ryerson, Mr. Arthur Larned
253 1 1 Ryerson, Mrs. Arthur Larned (Emily Maria Borie)
302 1 1 Ward, Miss. Anna
sex age sibsp parch ticket fare cabin embarked \
17 female 50.0 0 1 PC 17558 247.5208 B58 B60 C
49 male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C
50 female 58.0 0 1 PC 17755 512.3292 B51 B53 B55 C
66 female 36.0 0 0 PC 17608 262.3750 B61 C
97 female 27.0 1 1 PC 17558 247.5208 B58 B60 C
116 female 60.0 1 4 19950 263.0000 C23 C25 C27 S
183 male 35.0 0 0 PC 17755 512.3292 B101 C
251 female 21.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C
252 male 61.0 1 3 PC 17608 262.3750 B57 B59 B63 B66 C
253 female 48.0 1 3 PC 17608 262.3750 B57 B59 B63 B66 C
302 female 35.0 0 0 PC 17755 512.3292 NaN C
boat body home.dest cluster_group
17 6 NaN Montreal, PQ 1.0
49 3 NaN Austria-Hungary / Germantown, Philadelphia, PA 1.0
50 3 NaN Germantown, Philadelphia, PA 1.0
66 4 NaN NaN 1.0
97 6 NaN Montreal, PQ 1.0
116 10 NaN Winnipeg, MB 1.0
183 3 NaN NaN 1.0
251 4 NaN Haverford, PA / Cooperstown, NY 1.0
252 NaN NaN Haverford, PA / Cooperstown, NY 1.0
253 4 NaN Haverford, PA / Cooperstown, NY 1.0
302 3 NaN NaN 1.0
很确定了,整个分组就是头等舱。也就是说,这里实际上只有 11 个人。让我们看看分组 0,它看起来有些不同。这一次,我们使用 Pandas 的.describe()
方法。
print(original_df[ (original_df['cluster_group']==0) ].describe())
pclass survived age sibsp parch \
count 1288.000000 1288.000000 1027.000000 1288.000000 1288.000000
mean 2.300466 0.379658 29.668614 0.496118 0.332298
std 0.833785 0.485490 14.395610 1.047430 0.686068
min 1.000000 0.000000 0.166700 0.000000 0.000000
25% 2.000000 0.000000 21.000000 0.000000 0.000000
50% 3.000000 0.000000 28.000000 0.000000 0.000000
75% 3.000000 1.000000 38.000000 1.000000 0.000000
max 3.000000 1.000000 80.000000 8.000000 4.000000
fare body cluster_group
count 1287.000000 119.000000 1288.0
mean 30.510172 159.571429 0.0
std 41.511032 97.302914 0.0
min 0.000000 1.000000 0.0
25% 7.895800 71.000000 0.0
50% 14.108300 155.000000 0.0
75% 30.070800 255.500000 0.0
max 263.000000 328.000000 0.0
这里有 1287 个人,我们可以看到平均等级是二等舱,但是这里从头等到三等都有。
让我们检查最后一个分组,2,它的预期是全都是三等舱:
print(original_df[ (original_df['cluster_group']==2) ].describe())
pclass survived age sibsp parch fare \
count 10.0 10.000000 8.000000 10.000000 10.000000 10.000000
mean 3.0 0.100000 39.875000 0.800000 6.000000 42.703750
std 0.0 0.316228 1.552648 0.421637 1.632993 15.590194
min 3.0 0.000000 38.000000 0.000000 5.000000 29.125000
25% 3.0 0.000000 39.000000 1.000000 5.000000 31.303125
50% 3.0 0.000000 39.500000 1.000000 5.000000 35.537500
75% 3.0 0.000000 40.250000 1.000000 6.000000 46.900000
max 3.0 1.000000 43.000000 1.000000 9.000000 69.550000
body cluster_group
count 2.000000 10.0
mean 234.500000 2.0
std 130.814755 0.0
min 142.000000 2.0
25% 188.250000 2.0
50% 234.500000 2.0
75% 280.750000 2.0
max 327.000000 2.0
很确定了,我们是对的,这个分组全是三等舱,所以有最坏的幸存率。
足够有趣,在查看所有分组的时候,分组 2 的票价范围的确是最低的,从 29 到 69 磅。
在我们查看簇 0 的时候,票价最高为 263 磅。这是最大的组,幸存率为 38%。
当我们回顾簇 1 时,它全是头等舱,我们看到这里的票价范围是 247 ~ 512 磅,均值为 350。尽管簇 0 有一些头等舱的乘客,这个分组是最精英的分组。
出于好奇,分组 0 的头等舱的生存率,与整体生存率相比如何呢?
>>> cluster_0 = (original_df[ (original_df['cluster_group']==0) ])
>>> cluster_0_fc = (cluster_0[ (cluster_0['pclass']==1) ])
>>> print(cluster_0_fc.describe())
pclass survived age sibsp parch fare \
count 312.0 312.000000 273.000000 312.000000 312.000000 312.000000
mean 1.0 0.608974 39.027167 0.432692 0.326923 78.232519
std 0.0 0.488764 14.589592 0.606997 0.653100 60.300654
min 1.0 0.000000 0.916700 0.000000 0.000000 0.000000
25% 1.0 0.000000 28.000000 0.000000 0.000000 30.500000
50% 1.0 1.000000 39.000000 0.000000 0.000000 58.689600
75% 1.0 1.000000 49.000000 1.000000 0.000000 91.079200
max 1.0 1.000000 80.000000 3.000000 4.000000 263.000000
body cluster_group
count 35.000000 312.0
mean 162.828571 0.0
std 82.652172 0.0
min 16.000000 0.0
25% 109.500000 0.0
50% 166.000000 0.0
75% 233.000000 0.0
max 307.000000 0.0
>>>
很确定了,它们的幸存率更高,约为 61%,但是仍然低于精英分组(根据票价和幸存率)的 91%。花费一些时间来深入挖掘,看看你是否能发现一些东西。