This is a typical business requirement for a sales-driven company, and it generally falls under decision-support data analytics. The most common task is forecasting sales volume at some point in time from historical sales data, with variants depending on the specific need: seasonal distribution statistics, growth-rate forecasting, or sales under promotions, for example estimating the return on additional promotion spend, or testing whether promotion expenditure significantly lifts sales and by how much.
Here we work through a classic example: Kaggle's well-known time series forecasting competition, which uses a distributor's historical product sales to predict sales in the near future. The goal is to predict the monthly sales of each product in each shop.
https://www.kaggle.com/c/competitive-data-science-predict-future-sales
The competition provides the following data files:
- sales_train.csv - the training set: daily historical data from January 2013 to October 2015.
- test.csv - the test set: forecast the November 2015 sales for these shops and products.
- sample_submission.csv - a sample submission file.
- items.csv - supplemental information about the items.
- item_categories.csv - supplemental information about the item categories.
- shops.csv - supplemental information about the shops.
The specific data fields:
- ID - an ID representing a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of an item category
- item_cnt_day - number of products sold per day
- item_price - current price of an item
- date - date in dd/mm/yyyy format
- date_block_num - a consecutive month number, used to index months: January 2013 is 0, February 2013 is 1, ..., October 2015 is 33 (see the sketch after this list)
- item_name - name of the item
- shop_name - name of the shop
- item_category_name - name of the item category
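For intuition, date_block_num is just a zero-based month counter that can be derived from the year and month; a minimal sketch (to_date_block_num is a hypothetical helper introduced here, not part of the dataset):
# date_block_num is a zero-based month counter starting at January 2013
def to_date_block_num(year, month):
    return (year - 2013) * 12 + (month - 1)

print(to_date_block_num(2013, 1))   # 0
print(to_date_block_num(2013, 2))   # 1
print(to_date_block_num(2015, 10))  # 33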
## Data Import
# -*- coding: UTF-8 -*-
"""
Merge the raw tables into one wide table.
"""
# Keep the script compatible with Python 3
from __future__ import print_function
import os  # for locating the data files
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
warnings.filterwarnings("ignore")
def readData(path):
    """
    Read a CSV file with pandas.
    """
    return pd.read_csv(path)
if __name__ == "__main__":
    # Widen pandas' display so wide tables print on one line
    pd.set_option('display.width', 1000)
    homePath = os.path.dirname(os.path.abspath('__file__'))
    # os.path.join picks the correct separator on both Windows and Linux
    df_category = readData(os.path.join(homePath, "sales", "item_categories.csv"))
    df_items = readData(os.path.join(homePath, "sales", "items.csv"))
    df_train = readData(os.path.join(homePath, "sales", "sales_train_v2.csv"))
    df_shop = readData(os.path.join(homePath, "sales", "shops.csv"))
    df_test = readData(os.path.join(homePath, "sales", "test.csv"))
Take a look at the actual fields:
df_train.head()
The category table holds the broad product categories. They are in Russian, and machine translation is not great, so we will simply work with the IDs. The training set fields:
- date - the date of the sale;
- date_block_num - the month counter, e.g. February 2013 is month 1 and February 2014 is month 13;
- shop_id - the shop's ID;
- item_id - the item's ID;
- item_price - the item's price;
- item_cnt_day - the number of units sold that day.
Merging the data
As in stock analysis, a single time series by itself is rather one-sided; in many cases we want more features and derived features. Merge the tables into one wide table so the features are easier to work with.
df_train = pd.merge(df_train, df_items, on='item_id', how='inner')
df_train = pd.merge(df_train, df_category,
on='item_category_id', how='inner')
df_train = pd.merge(df_train, df_shop, on='shop_id', how='inner')
df_test = pd.merge(df_test, df_items, on='item_id', how='inner')
df_test = pd.merge(df_test, df_category, on='item_category_id', how='inner')
df_test = pd.merge(df_test, df_shop, on='shop_id', how='inner')
df_train.head()
## Exploratory Visualization
Plot the distribution of daily sales.
plt.figure(figsize=(14, 4))
# Smooth with a log transform, so keep only values greater than 0
g = sns.distplot(
    np.log(df_train[df_train['item_cnt_day'] > 0]['item_cnt_day']))
g.set_title("Item Sold Count Distribution", fontsize=18)
g.set_ylabel("Frequency", fontsize=12)
plt.figure(figsize=(14, 4))
# Smooth with a log transform, so keep only values greater than 0
g2 = sns.distplot(np.log(df_train[df_train['item_price'] > 0]['item_price']))
g2.set_title("Items Price Log Distribution", fontsize=18)
g2.set_xlabel("")
g2.set_ylabel("Frequency", fontsize=15)
# Daily revenue per item: units sold times unit price
df_train['total_amount'] = df_train['item_price'] * df_train['item_cnt_day']
# Describe the distributions of price, daily sales and revenue with quantiles
def quantiles(df, columns):
    for name in columns:
        print(name + " quantiles")
        # Print selected quantiles
        print(df[name].quantile([.01, .25, .5, .75, .99]))
        print("")

quantiles(df_train, ['item_cnt_day', 'item_price', 'total_amount'])
Plot each shop's total sales amount.
import plotly.graph_objs as go
from plotly.offline import iplot

# Total sales amount per shop; keep the 25 largest to match the chart title
temp = df_train.groupby('shop_name')['total_amount'].sum().nlargest(25)
# Bar chart
trace = [go.Bar(x=temp.index, y=temp.values)]
# Title and axis labels
layout = go.Layout(
    title="TOP 25 Shop Name by Total Amount Sold",
    yaxis=dict(title='Total Sold')
)
# Render the figure
fig = go.Figure(data=trace, layout=layout)
iplot(fig, filename='shopTotalAmount')
## Feature Engineering
# The task: predict the monthly sales of a given item in a given shop
df_test.head()
# Re-extract the time information, keeping only year and month
df_train['date'] = pd.to_datetime(df_train['date'], format='%d.%m.%Y')
df_train['month'] = df_train['date'].dt.month
df_train['year'] = df_train['date'].dt.year
# Drop the columns we no longer need (total_amount must go too,
# otherwise it would become a grouping key below)
df_train1 = df_train.drop(['date', 'item_price', 'total_amount'], axis=1)
# Sum daily sales into monthly totals over the remaining key columns
df_train1 = df_train1.groupby([c for c in df_train1.columns if c not in [
    'item_cnt_day']], as_index=False)[['item_cnt_day']].sum()
df_train2 = df_train1.rename(columns={'item_cnt_day': 'item_cnt_month'})
df_train2.head()
# Average monthly sales of each item in each shop
shop_item_monthly_mean = df_train2[['shop_id', 'item_id', 'item_cnt_month']].groupby(
    ['shop_id', 'item_id'], as_index=False)[['item_cnt_month']].mean()
shop_item_monthly_mean1 = shop_item_monthly_mean.rename(
    columns={'item_cnt_month': 'item_cnt_month_mean'})
shop_item_monthly_mean1.head()
# Join it back as a new feature column
df_train3 = pd.merge(df_train2, shop_item_monthly_mean1,
                     how='left', on=['shop_id', 'item_id'])
df_train3.head()
# Apply the same treatment to the test set
df_test['month'] = 11
df_test['year'] = 2015
df_test['date_block_num'] = 34
df_test.head()
df_test1 = pd.merge(df_test, shop_item_monthly_mean1,
                    how='left', on=['shop_id', 'item_id'])
df_test1.head()
# Shop/item pairs unseen in training have no mean; fill with 0
df_test1 = df_test1.fillna(0.)
df_test1.head()
df_train3.to_csv('train_mod.csv', index=False)
df_test1.to_csv('test_mod.csv', index=False)
## Train/Validation Split
# Keep only the numeric feature columns; the raw name columns can't be fed to the models
feature_list = [c for c in df_train3.columns
                if c not in ['item_cnt_month', 'item_name', 'shop_name', 'item_category_name']]
# Months 0-32 for training, month 33 (October 2015) for validation
x_train = df_train3[df_train3['date_block_num'] < 33]
# clip assigns out-of-range values to the boundaries, i.e. caps the target to [0, 20]
# (the competition's target range); log1p then compresses the long tail
y_train = np.log1p(x_train['item_cnt_month'].clip(0., 20.))
x_train = x_train[feature_list]
x_val = df_train3[df_train3['date_block_num'] == 33]
y_val = np.log1p(x_val['item_cnt_month'].clip(0., 20.))
x_val = x_val[feature_list]
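A quick standalone look at what the clip-then-log1p transform does (the values here are made up for illustration):
import numpy as np

counts = np.array([-3., 0.5, 7., 42.])
print(counts.clip(0., 20.))            # [ 0.   0.5  7.  20. ] - out-of-range values snap to the bounds
print(np.log1p(counts.clip(0., 20.)))  # log(1 + x) compresses the long right tail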
## Modeling
### Random Forest
A time series problem can be treated as a plain regression problem, so here we model it with a random forest regressor.
# Hyper-parameter tuning with cross-validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {'min_samples_split': [3, 6, 9],
              'n_estimators': [10, 50, 100]}
grid = GridSearchCV(RandomForestRegressor(), param_grid=param_grid, cv=3)
grid.fit(x_train, y_train)
grid.cv_results_, grid.best_params_, grid.best_score_
Train on the full data and predict the test set:
# Refit with the best parameters found by the grid search
model = RandomForestRegressor(**grid.best_params_)
model.fit(df_train3[feature_list], df_train3['item_cnt_month'].clip(0., 20.))
df_test1['item_cnt_month'] = model.predict(
    df_test1[feature_list]).clip(0., 20.)
df_test1.head()
### LightGBM
# Training
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.03,
    num_leaves=32,
    colsample_bytree=0.9497036,
    subsample=0.8715623,
    max_depth=8,
    reg_alpha=0.04,
    reg_lambda=0.073,
    min_split_gain=0.0222415,
    min_child_weight=40)
lgbm.fit(x_train, y_train)
### LightGBM, native API
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

# Convert to LightGBM's Dataset format
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_val, y_val, reference=lgb_train)
# Parameters
params = {
    'task': 'train',
    'boosting_type': 'gbdt',     # boosting type
    'objective': 'regression',   # objective function
    'metric': {'l2', 'rmse'},    # evaluation metrics (regression metrics; AUC only applies to classification)
    'num_leaves': 31,            # number of leaves
    'learning_rate': 0.05,       # learning rate
    'feature_fraction': 0.9,     # fraction of features sampled per tree
    'bagging_fraction': 0.8,     # fraction of samples used per tree
    'bagging_freq': 5,           # perform bagging every k iterations
    'verbose': 1                 # <0 fatal only, =0 errors/warnings, >0 info
}
# Train
gbm = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=lgb_eval, early_stopping_rounds=5)
# Save the model
gbm.save_model('model.txt')
# Load the model
gbm = lgb.Booster(model_file='model.txt')
# Predict
y_pred = gbm.predict(x_val, num_iteration=gbm.best_iteration)
# Evaluate
print('The rmse of prediction is:', mean_squared_error(y_val, y_pred) ** 0.5)
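Note that y_train and y_val were log1p-transformed, so the RMSE above is measured in log space. To recover monthly sales counts from the predictions, invert the transform with expm1; a minimal sketch, assuming the split from the previous section:
import numpy as np
# Predictions are on the log1p scale; invert, then clip back to the [0, 20] target range
y_pred_counts = np.expm1(y_pred).clip(0., 20.)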
### LightGBM, sklearn API
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
import joblib  # sklearn.externals.joblib is deprecated; use the standalone package
# Train
gbm = LGBMRegressor(objective='regression', num_leaves=31, learning_rate=0.05, n_estimators=20)
gbm.fit(x_train, y_train, eval_set=[(x_val, y_val)], eval_metric='l1', early_stopping_rounds=5)
# Persist the model
joblib.dump(gbm, 'sales_model.pkl')
# Load it back
gbm = joblib.load('sales_model.pkl')
# Predict
y_pred = gbm.predict(x_val, num_iteration=gbm.best_iteration_)
# Evaluate
print('The rmse of prediction is:', mean_squared_error(y_val, y_pred) ** 0.5)
# Feature importances
print('Feature importances:', list(gbm.feature_importances_))
# Grid search for hyper-parameter tuning
estimator = LGBMRegressor(num_leaves=31)
param_grid = {
'learning_rate': [0.01, 0.1, 1],
'n_estimators': [20, 40]
}
gbm = GridSearchCV(estimator, param_grid)
gbm.fit(x_train, y_train)
print('Best parameters found by grid search are:', gbm.best_params_)
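As a follow-up, the selected parameters can be used to refit a final model before predicting; a short sketch (best_gbm is a name introduced here for illustration):
from lightgbm import LGBMRegressor
# Refit on the training data with the best hyper-parameters found above
best_gbm = LGBMRegressor(num_leaves=31, **gbm.best_params_)
best_gbm.fit(x_train, y_train)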
### LSTM
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# Custom RMSE metric
def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))

# Build the network; inputs are reshaped to (samples, timesteps=1, features)
model_lstm = Sequential()
model_lstm.add(LSTM(60, input_shape=(1, len(feature_list))))
model_lstm.add(Dense(1))  # fully-connected output layer
model_lstm.compile(loss='mean_squared_error', optimizer='adam', metrics=[rmse])

# Scale features to [-1, 1]; fit the scaler on the training set only, then reuse it
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
x_train_scaled = scaler.fit_transform(x_train)
x_valid_scaled = scaler.transform(x_val)
x_test_scaled = scaler.transform(df_test1[feature_list])
x_train_reshaped = x_train_scaled.reshape((x_train_scaled.shape[0], 1, x_train_scaled.shape[1]))
x_val_reshaped = x_valid_scaled.reshape((x_valid_scaled.shape[0], 1, x_valid_scaled.shape[1]))
x_test_reshaped = x_test_scaled.reshape((x_test_scaled.shape[0], 1, x_test_scaled.shape[1]))

# Train
history = model_lstm.fit(x_train_reshaped, y_train, validation_data=(x_val_reshaped, y_val),
                         epochs=10, batch_size=4000, verbose=2, shuffle=False)
# Predict; clip the predictions (not the inputs) to the target range
y_pre = model_lstm.predict(x_val_reshaped)
df_test1['item_cnt_month'] = model_lstm.predict(x_test_reshaped).clip(0., 20.).ravel()
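Whichever model produced the predictions, the last step is to write them out in the sample_submission.csv format (columns ID and item_cnt_month). A minimal sketch, assuming df_test1 still carries the test set's ID column:
# One row per test (shop, item) pair, matching sample_submission.csv
submission = df_test1[['ID', 'item_cnt_month']]
submission.to_csv('submission.csv', index=False)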
References:
https://www.shiyanlou.com/courses/1363/learning/
https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts
https://www.kaggle.com/sanket30/predicting-sales-using-lightgbm
https://www.kaggle.com/lilisako/time-series-covering-all-basic-models
https://www.kaggle.com/karanjakhar/simple-and-easy-aprroach-using-lstm
https://www.kaggle.com/bombatkarvivek/pyspark-tensorflow-predict-future-sales-v1