Source: barefootgirl - kesci.com
Original post: Ctrip user room-type booking prediction
Dataset download: Ctrip room-type product user behavior dataset
Problem introduction
The dataset records Ctrip users booking Ctrip room types and consists of three parts: user data, hotel data, and room-type data. The task is to mine users' room-type preferences from their historical information and predict which sellable room type (roomid) each user ultimately books.
Business scenario: for each order, predict which of the candidate room types (roomid) the user ends up booking. This notebook first does a simple analysis of the basic fields in the training set, and then builds a prediction model for this task.
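To make the prediction target concrete, the snippet below builds a toy frame (hypothetical ids and prices, not taken from the dataset) with the same per-order layout: each orderid appears once per candidate roomid, and exactly one candidate per order carries orderlabel = 1.
import pandas as pd
# Toy illustration of the per-order candidate layout (hypothetical values).
toy = pd.DataFrame({
    'orderid':      ['ORDER_1', 'ORDER_1', 'ORDER_1', 'ORDER_2', 'ORDER_2'],
    'roomid':       ['ROOM_a', 'ROOM_b', 'ROOM_c', 'ROOM_d', 'ROOM_e'],
    'price_deduct': [320, 410, 560, 280, 300],
    'orderlabel':   [0, 1, 0, 0, 1],
})
# The model's job: for each orderid, rank the candidate roomids and pick the
# one most likely to be the row with orderlabel = 1.
print(toy[toy['orderlabel'] == 1][['orderid', 'roomid']])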
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
file = '../input/ctrip/competition_train.txt'
I. A simple analysis of the basic fields of the training set
First, extract the 13 basic fields from the training set for analysis: orderid, uid, orderdate, hotelid, basicroomid, roomid, orderlabel, star, rank, returnvalue, price_deduct, basic_minarea and basic_maxarea.
basic_colnames = ['orderid','uid','orderdate','hotelid','basicroomid','roomid', 'orderlabel', 'star',
'rank', 'returnvalue', 'price_deduct', 'basic_minarea', 'basic_maxarea']
basic_data = pd.read_csv(file, sep='\t', header=0, usecols=basic_colnames)
#basic_data.info()
#basic_data['orderid'].value_counts()
#basic_data['orderlabel'].value_counts()
#basic_data['uid'].value_counts()
#basic_data['hotelid'].value_counts()
#basic_data['basicroomid'].value_counts()
#basic_data['roomid'].value_counts()
#basic_data['star'].value_counts()
#basic_data['orderdate'].value_counts()
The training set contains 7,724,875 records in total;
there are 211,580 distinct order ids (orderid); ORDER_12473322 has the most records, 686;
211,580 records have orderlabel = 1, i.e. one successfully booked room per order;
there are 187,162 users (uid); USER_615018 has the most records, 2,105;
there are 51,209 hotels (hotelid); HOTEL_135032 has the most records, 34,376;
there are 277,373 physical room types (basicroomid); BASIC_205234 has the most records, 6,589;
there are 2,924,400 sellable room types (roomid); ROOM_18387063 has the most records, 204;
the hotel star level takes only four values (5, 7, 9, 11), with record counts increasing in that order;
the orders span one week, from April 14 to April 20, 2013.
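The commented-out value_counts() calls above are where these figures come from; a compact sketch that reproduces them from basic_data:
# Sketch: reproduce the summary figures above from basic_data.
for col in ['orderid', 'uid', 'hotelid', 'basicroomid', 'roomid']:
    vc = basic_data[col].value_counts()
    print(col, '- unique:', vc.size, '- most frequent:', vc.index[0], vc.iloc[0])
print('records with orderlabel == 1:', int((basic_data['orderlabel'] == 1).sum()))
print('order dates from', basic_data['orderdate'].min(), 'to', basic_data['orderdate'].max())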
basic_data.describe()
The describe() output reveals anomalies: basic_minarea and basic_maxarea contain missing values, and their minimum values are negative, which is not a valid room area.
#basic_data.isnull().any()
#basic_data[basic_data['basic_minarea']<=0]
#basic_data[basic_data['basic_maxarea']<=0]
Whenever basic_minarea is negative, basic_maxarea is negative as well, and both equal -1, which suggests -1 was used to fill missing values. Drop all records with such invalid values, then keep only the successfully booked records (orderlabel = 1).
basic_data1 = basic_data.loc[(basic_data['basic_minarea']>0) & (basic_data['basic_maxarea']>0)]
basic_data1 = basic_data1[basic_data1['orderlabel']==1]
#basic_data1.info()
#basic_data1.describe()
#basic_data1['uid'].value_counts()
#basic_data1['hotelid'].value_counts()
#basic_data1['basicroomid'].value_counts()
#basic_data1['roomid'].value_counts()
#basic_data1['star'].value_counts()
After cleaning, 179,667 valid successfully-booked records remain;
they cover 160,471 users (uid); USER_609501 booked most often, 48 times;
41,837 hotels (hotelid); HOTEL_132727 was booked most often, 225 times;
76,861 physical room types (basicroomid); BASIC_463407 was booked most often, 209 times;
134,512 sellable room types (roomid); ROOM_22089993 was booked most often, 76 times.
Next, look at the price distribution of the booked room types.
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.hist(basic_data1['price_deduct'])
plt.xlabel('Price')
plt.ylabel('Number of orders')
plt.title('Price distribution of booked room types')
# Prices are mostly below 5000; zoom in on that range.
price_data = basic_data1[basic_data1['price_deduct'] <= 5000]['price_deduct']
plt.subplot(1, 2, 2)
plt.hist(price_data, bins=100)
plt.xlabel('Price')
plt.ylabel('Number of orders')
plt.title('Price distribution of booked room types (<= 5000)')
plt.show()
Booked room prices are mostly below 5,000, with the largest concentration around 1,000.
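A quick quantile check backs this up (a sketch on the cleaned frame basic_data1):
# Sketch: where the bulk of booked prices sits.
print(basic_data1['price_deduct'].quantile([0.25, 0.5, 0.75, 0.95, 0.99]))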
# Star-level distribution of booked hotels
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
basic_data1['star'].value_counts().plot(kind='pie', autopct='%.1f%%')
plt.title('Hotel star-level distribution')
# Relationship between hotel star level and price
star = [5, 7, 9, 11]
price = []
for i in star:
    tmp = basic_data1.loc[basic_data1['star'] == i]['price_deduct'].mean()
    price.append(tmp)
plt.subplot(1, 2, 2)
plt.bar(star, price)
for (a, b) in zip(star, price):
    plt.text(a, b + 0.001, '%.1f' % b, ha='center', va='bottom', fontsize=10)
plt.xlabel('Hotel star level')
plt.xticks(star)
plt.ylabel('Average booked price')
plt.title('Hotel star level vs. booked price')
plt.show()
The higher the hotel's star level, the higher the average booking price.
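The per-star averages plotted above can also be read off directly with a groupby, as a one-line alternative to the loop:
# One-line alternative to the star/price loop above.
print(basic_data1.groupby('star')['price_deduct'].mean())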
II. Room-type prediction model
The analysis above is purely descriptive and only uses the basic fields; it does not take the users' historical behavior data into account. For training, a much larger set of features has to be considered.
1. Data selection (training and test sets)
The official competition test set does not contain the orderlabel column, so there is no way to know which record of each orderid has orderlabel = 1, and hence no way to measure accuracy on it. Both our training and test sets are therefore drawn from 'competition_train.txt'. The descriptive analysis above showed that this file has nearly 8 million records, which is more than we can process here, so we take only the first 500,000 records and split them 8:2 into training and test sets (a quick sanity check on this sample follows the read below).
import pandas as pd
import numpy as np
import re
import lightgbm as lgb
# Read the first 500,000 rows of the training file (same path as in Part I).
data = pd.read_csv(file, sep='\t', nrows=500000)
#data.info()
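A quick sanity check on this 500,000-row sample (a sketch): how many distinct orders it covers and how many rows are actual bookings.
# Sketch: size of the sample in terms of rows, orders and booked rows.
print('rows:', len(data))
print('orders:', data['orderid'].nunique())
print('booked rows (orderlabel == 1):', int((data['orderlabel'] == 1).sum()))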
2. Data cleaning
The dataset is large, with 157 feature fields, so the main data-cleaning tasks are:
1. drop any column whose values are entirely missing;
2. fill the remaining missing values with -1;
3. correct mislabeled columns in the dataset;
4. to simplify training, strip letters and other text from the data and convert everything to int or float.
#data.describe()
#data.isnull().all()
# Drop columns whose values are entirely missing
data = data.dropna(how='all', axis=1)
#data.info()
#data.isnull().any()
# Fill the remaining missing values with -1
data = data.fillna(-1)
#data.head()
A total of 25 columns were dropped:
'orderbehavior_3_ratio_1month','orderbehavior_4_ratio_1month','orderbehavior_5_ratio_1month','orderid_lastord',
'orderdate_lastord','hotelid_lastord','roomid_lastord','basicroomid_lastord','rank_lastord','return_lastord',
'price_last_lastord','roomservice_2_lastord','roomservice_3_lastord','roomservice_4_lastord','roomservice_5_lastord',
'roomservice_6_lastord','roomservice_8_lastord','roomtag_2_lastord','roomtag_3_lastord','roomtag_4_lastord',
'roomtag_5_lastord','roomtag_6_lastord','star_lastord','hotel_minprice_lastord','basic_minprice_lastord'
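This list can be recovered programmatically by diffing the raw header against the cleaned frame's columns (a sketch; it re-reads only the header row of the file):
# Sketch: recover the dropped (all-NaN) columns by comparing column sets.
raw_cols = pd.read_csv(file, sep='\t', nrows=0).columns
dropped = sorted(set(raw_cols) - set(data.columns))
print(len(dropped), dropped)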
# One column is mislabeled: roomservice_5 only takes the values 0, 1 and 2, so user_roomservice_5_345ratio should actually be user_roomservice_8_345ratio
data["user_roomservice_8_345ratio"]=data["user_roomservice_5_345ratio"]
del data["user_roomservice_5_345ratio"]
# Data processing: strip letters and other text from the id fields
def remove(x):
    try:
        return re.search(r"\d+", x).group()
    except:
        return 0
# Convert orderdate to day-of-year
dt1 = pd.to_datetime(data["orderdate"])
data["orderdate"] = dt1.dt.dayofyear
for c in ["orderid", "uid", "hotelid", "basicroomid", "roomid"]:
    data[c] = data[c].apply(remove)
data = data.apply(pd.to_numeric, errors='ignore')
3. Feature construction
After cleaning, 132 of the original 157 feature fields remain. On top of these we construct new features to improve the prediction model.
# Median price_deduct of each basicroomid within an order
def df_median(df):
    add = pd.DataFrame(df.groupby(["orderid", "basicroomid"]).price_deduct.median()).reset_index()
    add.columns = ["orderid", "basicroomid", "basicroomid_price_deduct_median"]
    df = df.merge(add, on=["orderid", "basicroomid"], how="left")
    return df
# Minimum price_deduct of each basicroomid within an order
def df_min(df):
    add = pd.DataFrame(df.groupby(["orderid", "basicroomid"]).price_deduct.min()).reset_index()
    add.columns = ["orderid", "basicroomid", "basicroomid_price_deduct_min"]
    df = df.merge(add, on=["orderid", "basicroomid"], how="left")
    return df
# Minimum price_deduct within each orderid
def df_min_orderid(df):
    add = pd.DataFrame(df.groupby(["orderid"]).price_deduct.min()).reset_index()
    add.columns = ["orderid", "orderid_price_deduct_min"]
    df = df.merge(add, on=["orderid"], how="left")
    return df
# Rank-based features
def df_rank_mean(df):
    add = pd.DataFrame(df.groupby(["basicroomid"]).orderid_price_deduct_min_rank.mean()).reset_index()
    add.columns = ["basicroomid", "orderid_price_deduct_min_rank_mean"]
    df = df.merge(add, on=["basicroomid"], how="left")
    return df
def df_roomrank_mean(df):
    add = pd.DataFrame(df.groupby(["roomid"]).basicroomid_price_rank.mean()).reset_index()
    add.columns = ["roomid", "basicroomid_price_rank_mean"]
    df = df.merge(add, on=["roomid"], how="left")
    return df
# Generic helpers: aggregate `value` within the `columns` groups and merge the
# result back onto every row as a new column `cname`.
def merge_count(df, columns, value, cname):
    add = pd.DataFrame(df.groupby(columns)[value].count()).reset_index()
    add.columns = columns + [cname]
    df = df.merge(add, on=columns, how="left")
    return df
def merge_nunique(df, columns, value, cname):
    add = pd.DataFrame(df.groupby(columns)[value].nunique()).reset_index()
    add.columns = columns + [cname]
    df = df.merge(add, on=columns, how="left")
    return df
def merge_median(df, columns, value, cname):
    add = pd.DataFrame(df.groupby(columns)[value].median()).reset_index()
    add.columns = columns + [cname]
    df = df.merge(add, on=columns, how="left")
    return df
def merge_mean(df, columns, value, cname):
    add = pd.DataFrame(df.groupby(columns)[value].mean()).reset_index()
    add.columns = columns + [cname]
    df = df.merge(add, on=columns, how="left")
    return df
def merge_sum(df, columns, value, cname):
    add = pd.DataFrame(df.groupby(columns)[value].sum()).reset_index()
    add.columns = columns + [cname]
    df = df.merge(add, on=columns, how="left")
    return df
def merge_max(df, columns, value, cname):
    add = pd.DataFrame(df.groupby(columns)[value].max()).reset_index()
    add.columns = columns + [cname]
    df = df.merge(add, on=columns, how="left")
    return df
def merge_min(df, columns, value, cname):
    add = pd.DataFrame(df.groupby(columns)[value].min()).reset_index()
    add.columns = columns + [cname]
    df = df.merge(add, on=columns, how="left")
    return df
def merge_std(df, columns, value, cname):
    add = pd.DataFrame(df.groupby(columns)[value].std()).reset_index()
    add.columns = columns + [cname]
    df = df.merge(add, on=columns, how="left")
    return df
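All of these helpers follow the same group-aggregate-and-merge pattern. A minimal toy example (hypothetical values) of what merge_mean produces:
# Toy example (hypothetical values) of the group-aggregate-and-merge pattern.
toy = pd.DataFrame({'basicroomid': [1, 1, 2], 'price_deduct': [100, 200, 50]})
print(merge_mean(toy, ['basicroomid'], 'price_deduct', 'price_mean'))
# Every row receives its group's mean:
#    basicroomid  price_deduct  price_mean
# 0            1           100       150.0
# 1            1           200       150.0
# 2            2            50        50.0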
all = data.copy()
all["user_roomservice_2_0ratio"]=1-all["user_roomservice_2_1ratio"]
all["user_roomservice_3_0ratio"]=1-all["user_roomservice_3_123ratio"]
all["user_roomservice_5_0ratio"]=1-all["user_roomservice_5_1ratio"]
all["user_roomservice_7_1ratio"]=1-all["user_roomservice_7_0ratio"]
all["user_roomservice_8_2ratio"]=1-all["user_roomservice_8_345ratio"]-all["user_roomservice_8_1ratio"]
all["user_roomservice_4_1ratio_3month"] = 1 - all["user_roomservice_4_0ratio_3month"] - all["user_roomservice_4_2ratio_3month"] - all["user_roomservice_4_3ratio_3month"] - all["user_roomservice_4_4ratio_3month"] - all["user_roomservice_4_5ratio_3month"]
all["user_roomservice_4_1ratio_1month"] = 1 - all["user_roomservice_4_0ratio_1month"] - all["user_roomservice_4_2ratio_1month"] - all["user_roomservice_4_3ratio_1month"] - all["user_roomservice_4_4ratio_1month"] - all["user_roomservice_4_5ratio_1month"]
all["user_roomservice_4_1ratio_1week"] = 1 - all["user_roomservice_4_0ratio_1week"] - all["user_roomservice_4_2ratio_1week"] - all["user_roomservice_4_3ratio_1week"] - all["user_roomservice_4_4ratio_1week"] - all["user_roomservice_4_5ratio_1week"]
all["user_roomservice_2_max"] = np.argmax(all[["user_roomservice_2_%sratio" % i for i in range(2)]].values, axis=1)
all["user_roomservice_3_max"] = np.argmax(all[["user_roomservice_3_%sratio" % i for i in [0,123]]].values, axis=1)
all["user_roomservice_4_max"] = np.argmax(all[["user_roomservice_4_%sratio" % i for i in range(6)]].values, axis=1)
all["user_roomservice_5_max"] = np.argmax(all[["user_roomservice_5_%sratio" % i for i in range(2)]].values, axis=1)
all["user_roomservice_6_max"] = np.argmax(all[["user_roomservice_6_%sratio" % i for i in range(3)]].values, axis=1)
all["user_roomservice_7_max"] = np.argmax(all[["user_roomservice_7_%sratio" % i for i in range(2)]].values, axis=1)
all["user_roomservice_8_max"] = np.argmax(all[["user_roomservice_8_%sratio" % i for i in [1,2,345]]].values, axis=1)
all["user_roomservice_4_max_1week"]=np.argmax(all[["user_roomservice_4_%sratio_1month"%i for i in range(6)]].values,axis=1)
all["user_roomservice_4_max_1month"]=np.argmax(all[["user_roomservice_4_%sratio_1month"%i for i in range(6)]].values,axis=1)
all["user_roomservice_4_max_3month"]=np.argmax(all[["user_roomservice_4_%sratio_3month"%i for i in range(6)]].values,axis=1)
all["roomservice_8"]=all["roomservice_8"].apply(lambda x:2 if x>2 else x-1)
all["roomservice_3"]=all["roomservice_3"].apply(lambda x:1 if x>0 else 0)
for i in range(2, 9):
    all["service_equal_%s" % i] = list(map(lambda x, y: 1 if x == y else 0, all["roomservice_%s" % i], all["user_roomservice_%s_max" % i]))
del all["user_roomservice_2_0ratio"]
del all["user_roomservice_3_0ratio"]
del all["user_roomservice_5_0ratio"]
del all["user_roomservice_7_1ratio"]
# Conversion-rate features
# Conversion rate of each basicroomid
feature_df = all[["orderid", "basicroomid", "orderlabel"]].copy()
feature_df = feature_df.sort_values("orderlabel")  # sort so that keep="last" below retains the booked (orderlabel = 1) row if any
feature_df = feature_df.drop_duplicates(["orderid", "basicroomid"], keep="last")
basicroom_mean=pd.DataFrame(feature_df.groupby("basicroomid").orderlabel.mean()).reset_index()
basicroom_mean.columns=["basicroomid","basicroomid_mean"]
basicroom_sum=pd.DataFrame(feature_df.groupby("basicroomid").orderlabel.sum()).reset_index()
basicroom_sum.columns=["basicroomid","basicroomid_sum"]
all = all.merge(basicroom_mean, on="basicroomid", how="left").fillna(0)
all = all.merge(basicroom_sum, on="basicroomid", how="left").fillna(0)
all=df_median(all)
all=df_min(all)
all=df_min_orderid(all)
all["basicroomid_price_rank"] = all['price_deduct'].groupby([all['orderid'], all['basicroomid']]).rank()
all["orderid_price_deduct_min_rank"] = all['orderid_price_deduct_min'].groupby(all['orderid']).rank()
all = df_rank_mean(all)
all = df_roomrank_mean(all)
all=merge_mean(all,["basicroomid"],"basic_week_ordernum_ratio","basic_week_ordernum_ratio_mean")
all=merge_mean(all,["basicroomid"],"basic_recent3_ordernum_ratio","basic_recent3_ordernum_ratio_mean")
all=merge_mean(all,["basicroomid"],"basic_comment_ratio","basic_comment_ratio_mean")
all=merge_mean(all,["basicroomid"],"basic_30days_ordnumratio","basic_30days_ordnumratio_mean")
all=merge_mean(all,["basicroomid"],"basic_30days_realratio","basic_30days_realratio_mean")
all=merge_mean(all,["roomid"],"room_30days_ordnumratio","room_30days_ordnumratio_mean")
all=merge_mean(all,["roomid"],"room_30days_realratio","room_30days_realratio_mean")
all["city_num"]=all["user_ordernum"]/all["user_citynum"]
all["area_price"]=all["user_avgprice"]/all["user_avgroomarea"]
all["price_max_min_rt"]=all["user_maxprice"]/all["user_minprice"]
all["basicroomid_price_deduct_min_minprice_rt"]=all["basicroomid_price_deduct_min"]/all["user_minprice"]
all["price_dif"]=all["basicroomid_price_deduct_min"]-all["price_deduct"]
all["price_dif_rt"]=all["basicroomid_price_deduct_min"]/all["price_deduct"]
all["price_dif_hotel"]=all["orderid_price_deduct_min"]-all["price_deduct"]
all["price_dif_hotel_rt"]=all["orderid_price_deduct_min"]/all["price_deduct"]
all["order_basic_minprice_dif"]=all["basicroomid_price_deduct_min"]-all["orderid_price_deduct_min"]
all["order_basic_minprice_rt"]=all["basicroomid_price_deduct_min"]/all["orderid_price_deduct_min"]
all["price_tail1"]=all["price_deduct"]%10
all["price_tail1"]=list(map(lambda x:1 if x==4 or x==7 else 0,all["price_tail1"]))
all["price_tail2"]=all["price_deduct"]%100
all["price_ori"] = list(map(lambda x, y:x+y, all["price_deduct"], all["returnvalue"]))
for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]:
    all["ordertype_%s_num" % i] = list(map(lambda x, y: x * y, all["ordertype_%s_ratio" % i], all["user_ordernum"]))
    del all["ordertype_%s_ratio" % i]
# Overall order-behavior ratios -> absolute counts
for c in ["orderbehavior_1_ratio", "orderbehavior_2_ratio", "orderbehavior_6_ratio", "orderbehavior_7_ratio"]:
    all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordernum"]))
# Last-week ratios
for c in ["orderbehavior_3_ratio_1week", "orderbehavior_4_ratio_1week", "orderbehavior_5_ratio_1week"]:
    all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordnum_1week"]))
# Last-month ratios (these columns were dropped as all-NaN above, so this block stays commented out)
#for c in ["orderbehavior_3_ratio_1month", "orderbehavior_4_ratio_1month", "orderbehavior_5_ratio_1month"]:
#    all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordnum_1month"]))
# Last-3-months ratios
for c in ["orderbehavior_3_ratio_3month", "orderbehavior_4_ratio_3month", "orderbehavior_5_ratio_3month"]:
    all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordnum_3month"]))
all["price_star"] = all["price_deduct"]/(all["star"]-1)
all["star_dif"] = all["user_avgstar"]-all["star"]
all["price_ave_dif_rt"] = all["price_deduct"]/all["user_avgdealprice"]
all["price_ave_star_dif"] = all["price_deduct"]/all["user_avgprice_star"]
all["price_h_w_rt"] = all["user_avgdealpriceholiday"] / all["user_avgdealpriceworkday"]
all["price_ave_dif"] = all["price_deduct"] - all["user_avgdealprice"]
all["user_roomservice_4_32_rt"] = all["user_roomservice_4_3ratio"] / all["user_roomservice_4_2ratio"]
all["user_roomservice_4_43_rt"] = all["user_roomservice_4_4ratio"] / all["user_roomservice_4_3ratio"]
4. Model building
This project uses the LightGBM (lgb) algorithm, trained on 400,000 of the 500,000 rows.
train = all.iloc[:400000]
test = all.iloc[400000:]
# Prepare the training features and labels
train_data = train.copy()
train_y = train_data["orderlabel"].values
del train_data["orderlabel"]
# LightGBM training
train_data1 = lgb.Dataset(train_data, label=train_y)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'min_child_weight': 1.5,
    'num_leaves': 2 ** 5,
    'lambda_l2': 10,
    'subsample': 0.7,           # alias of bagging_fraction; only takes effect if bagging_freq > 0
    'colsample_bytree': 0.7,    # alias of feature_fraction
    'colsample_bylevel': 0.7,   # XGBoost-style name; LightGBM ignores unknown parameters with a warning
    'learning_rate': 0.05,
    'tree_method': 'exact',     # XGBoost-style name; LightGBM ignores unknown parameters with a warning
    'seed': 2019,
    'nthread': 12}
num_round = 500
model = lgb.train(params, train_data1, num_round)
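num_round is fixed at 500 here. If one wants to tune it instead, a slice of the training rows can be held out for validation with early stopping; the sketch below is optional and not part of the original pipeline (it assumes a LightGBM version that exposes the lgb.early_stopping callback).
# Optional sketch: hold out the last 10% of training rows for early stopping.
tr, va = train_data.iloc[:360000], train_data.iloc[360000:]
tr_y, va_y = train_y[:360000], train_y[360000:]
dtrain = lgb.Dataset(tr, label=tr_y)
dvalid = lgb.Dataset(va, label=va_y, reference=dtrain)
model_es = lgb.train(params, dtrain, num_boost_round=2000,
                     valid_sets=[dvalid],
                     callbacks=[lgb.early_stopping(stopping_rounds=50)])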
5. Prediction results
The remaining 100,000 rows are used for testing. The model predicts the probability that orderlabel is 1 for each candidate row; the rows are then grouped by orderid, and within each order the roomid with the highest predicted probability is taken as the final prediction.
test_data = test.copy()
del test_data['orderlabel']
test_result = model.predict(test_data.values)
test_result = pd.DataFrame(test_result)
test_result.columns = ["prob"]
test_result["orderid"] = test_data["orderid"].values
test_result["pre_roomid"] = test_data["roomid"].values
result = test_result.sort_values(by=['orderid',"prob"],ascending = False)
result = result.drop_duplicates("orderid", keep="first")
# Predicted roomid per order
test_predict = result.pivot_table(index='orderid', values='pre_roomid').copy()
# Actual booked roomid per order
test_data_tmp = test[test['orderlabel'] == 1][['orderid','roomid']]
test_truth = test_data_tmp.pivot_table(index='orderid', values='roomid').copy()
# i counts the orderids in the test set (with a booked room); j counts those whose roomid was predicted correctly.
i = 0; j = 0
orderid = test_data_tmp['orderid']
for k in orderid:
    i = i + 1
    if test_predict['pre_roomid'][k] == test_truth['roomid'][k]:
        j = j + 1
print(i, j)
2924 1452
The test set contains 2,924 orderids, and the booked room type was predicted correctly for 1,452 of them, an accuracy of 49.66%.
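The same per-order accuracy can be computed more compactly by joining the predicted and true roomids on orderid (a sketch using the test_predict and test_truth frames built above):
# Sketch: per-order accuracy via an index join instead of an explicit loop.
check = test_predict.join(test_truth, how='inner')
hits = (check['pre_roomid'] == check['roomid'])
print(len(check), int(hits.sum()), round(hits.mean(), 4))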