2019年11月自行车业务分析报告——PART1

前言

2019年11月自行车业务分析报告

目录：

一、自行车整体销售表现
二、2019年11月自行车地域销售表现
三、2019年11月自行车产品销售表现
四、用户行为分析
五、2019年11月热品销售分析

计算结果存入数据库_对应表名：

自行车整体销售表现:pt_overall_sale_performance_1
2019年11月自行车地域销售表现:pt_bicy_november_territory_2、pt_bicy_november_october_city_3
2019年11月自行车产品销售表现：pt_bicycle_product_sales_month_4、pt_bicycle_product_sales_order_month_4、pt_bicycle_product_sales_order_month_11
用户行为分析：pt_user_behavior_november
2019年11月热品销售分析：pt_hot_products_november

#导入模块
import pandas as pd
import numpy as np
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine

一、自行车整体销售表现

1.1、从数据库读取源数据：dw_customer_order

1.1.0 把数据从远程的数据库同步到本地
在navicat中可以把数据同步到本地的数据库，而且表结构都会自动的帮我们同步过来。

image.png

导过来之后，从本地sql导入的时间从5分钟缩短到30秒，大大的提高了效率。
1.1.1 连接数据库

#读取源数据。不同城市，每天产品销售信息
#创建数据库引擎
engine = create_engine(r'mysql://root@localhost:3306/adventure_local?charset=gbk',echo=False)
gather_customer_order=pd.read_sql_query("select * from dw_customer_order", engine)

#查看源数据前5行，观察数据，判断数据是否正常识别
gather_customer_order.head()

image-20200630194940701

1.1.2查看源数据类型：dw_customer_order

gather_customer_order.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302726 entries, 0 to 302725
Data columns (total 16 columns):
create_date           302726 non-null object
product_name          302726 non-null object
cpzl_zw               302726 non-null object
cplb_zw               302726 non-null object
order_num             302726 non-null int64
customer_num          302726 non-null int64
sum_amount            302726 non-null float64
is_current_year       302726 non-null object
is_last_year          302726 non-null object
is_yesterday          302726 non-null object
is_today              302726 non-null object
is_current_month      302726 non-null object
is_current_quarter    302726 non-null object
chinese_province      302726 non-null object
chinese_city          302726 non-null object
chinese_territory     302726 non-null object
dtypes: float64(1), int64(2), object(13)
memory usage: 37.0+ MB

1.1.3利用create_date字段增加create_year_month月份字段

#增加create_year_month月份字段。按月维度分析时使用  strftime("%Y-%m")
gather_customer_order['create_year_month']=gather_customer_order.create_date.apply(lambda x:x.strftime("%Y-%m"))

gather_customer_order['create_year_month'].loc[ 163496:163500]

163496    2020-03
163497    2020-03
163498    2020-03
163499    2020-03
163500    2020-03
Name: create_year_month, dtype: object

1.1.4、筛选产品类型cplb_zw中的自行车作为新的gather_customer_order，目前存在如图情况，你需要剔除掉其余2个

#筛选产品类别为自行车的数据
# print(gather_customer_order.cplb_zw.unique())

gather_customer_order = gather_customer_order.loc[ gather_customer_order.cplb_zw=='自行车']
gather_customer_order.head()

image-20200630200125505

1.1.5、总结知识点：

1.python连接mysql的几种方式

create_engine建立连接
pd.read_sql_query在连接上执行sql语句

2.查看数据类型的几种方式

type, info

3.时间转字符串类型

时间转为字符串：strftime("%Y-%m-%d %H%M%s")

数字转为时间：pd.to_datetime(x, format="%Y%m%d")

日期只保留月份：d.astype('datetime64[M]')

查看时间差距几天 (d1-d2)/np.deltatime64(1, 'D')

4.可能DataFrame中存在科学计数法，请自行探索如何解决

sql中是CAST('12.5' AS decimal(9,2)) CAST(12.5 AS int)

pandas中是pd.set_option('display.float_format',lambda x:'%.3f' % x)

1.2、自行车整体销售量表现

1.2.1、聚合每月订单数量和销售金额，具体groupby创建一个新的对象，需要将order_num、sum_amount求和，对日期降序排序，记得重置索引

#每月订单数量和销售金额，用groupby创建一个新的对象，需要将order_num、sum_amount求和
overall_sales_performance = gather_customer_order.groupby('create_year_month').agg({'order_num':'sum', 'sum_amount':'sum'}).reset_index()
overall_sales_performance.head()

image-20200630201311522

#按日期降序排序，方便计算环比
overall_sales_performance.sort_values('create_year_month', ascending=False, inplace=True)
overall_sales_performance.reset_index(inplace=True)
del overall_sales_performance['index']
overall_sales_performance.head()

image-20200630201335285

1.2.2、新增一列order_num_diff，此为每月自行车销售订单量环比，本月与上月对比，例如本期2019-02月销售额与上一期2019-01月销售额做对比

步骤：

1.利用diff()函数

2.使用列表形式方便加工

3.加工形成同样长度的列表，转为Series或DataFrame与本身数据合并


#求每月自行车销售订单量环比，观察最近一年数据变化趋势
#环比是本月与上月的对比，例如本期2019-02月销售额与上一期2019-01月销售额做对比
order_num_diff = list((overall_sales_performance.order_num.diff() * -1 / overall_sales_performance.order_num))    #s.diff() = s - s.shift()
order_num_diff.pop(0) #删除列表中第一个元素
order_num_diff.append(0) #将0新增到列表末尾
order_num_diff

[4.2236591053635788,
 1.7730746619635509,
 -0.76109550561797756,
 0.85271922976841008,
 0.21460176991150443,
 0.032637075718015669,
 -0.075995174909529548,
 0.0710594315245478,
 0.060637204522096609,
 -0.052580331061343723,
 -0.01027947317699968,
 0.029771749917300694,
 -0.02263174911089557,
 0.038267875125881166,
 -0.063796354494028915,
 0.12358757062146893,
 -0.067193675889328064,
 0]

#将环比转化为DataFrame:
overall_sales_performance_ret = pd.concat([overall_sales_performance, pd.DataFrame({'order_num_diff_222':order_num_diff})], axis=1)

overall_sales_performance_ret

image-20200630201432115

1.2.3、新增一列sum_amount_diff，此为每月自行车销售金额环比，原理一样，但是所需字段不同，最后形成按照日期升序排列

#求每月自行车销售金额环比
sum_amount_diff = list(overall_sales_performance.sum_amount.diff() * -1/ overall_sales_performance.sum_amount)
sum_amount_diff.pop(0) #删除列表中第一个元素
sum_amount_diff.append(0) #将0新增到列表末尾
sum_amount_diff
#将环比转化为DataFrame
overall_sales_performance_ret2 = pd.concat( [overall_sales_performance_ret, 
                                       pd.DataFrame({'sum_amount_dif':sum_amount_diff})], axis=1)

overall_sales_performance_ret2.head()

image-20200630201719444

#销量环比字段名order_diff，销售金额环比字段名amount_diff
#按照日期排序，升序
overall_sales_performance_ret2.sort_values('create_year_month', ascending=True, inplace=True)
overall_sales_performance_ret2.head()

image-20200630201741535

#查看每月自行车订单量、销售金额、环比、前5行
overall_sales_performance_ret2.head(5)

image-20200630202152956

字段注释：

create_year_month：时间，order_num：本月累计销售数量，sum_amount：本月累计销售金额，order_diff：本月销售数量环比，

sum_amount_diff：本月销售金额环比，dw_customer_order:用户订单表

1.2.4、将最终的overall_sales_performance的DataFrame存入Mysql的pt_overall_sale_performance_1当中，请使用追加存储。

#将数据存入数据库
engine_write = create_engine(r'mysql://root@localhost:3306/adventure_write?charset=gbk',echo=False)
overall_sales_performance_ret2.to_sql('pt_overall_sale_performance_1', con=engine_write, if_exists='append')

s = pd.Series(np.arange(1, 5))
s.diff()
# s - s.shift()

0    NaN
1    1.0
2    1.0
3    1.0
dtype: float64

1.2.5、总结知识点：

1、计算上下两行之间相除或相乘等一系列方法的函数使用方法罗列

s.mul(s.shift()) s.div(s.shift())

2、总结DataFrame与Series类型区别

Series:键值对，key就是索引,key可以是一个数值
DataFrame：value应该是一个list，series，ndarray

3、总结DataFrame专用的排序函数和列表的排序函数

sort_index(), sort_values()

list1.sort(reverse=True)

4、总结存储到mysql的几种形式（覆盖、追加等）

{'fail', 'replace', 'append'}

二、2019年11月自行车地域销售表现

2.1、源数据dw_customer_order，数据清洗筛选10月11月数据

#gather_customer_order在分析自行车整体表现时已从数据库导入表（dw_customer_order），并筛选仅自行车数据，这里不再导入
gather_customer_order.head()

image-20200630202446479

2.1.1 筛选10、11月的自行车数据，赋值变量为gather_customer_order_10_11

#筛选10月11月自行车数据
#gather_customer_order.head().create_date.apply(lambda x:x.month==10) 

gather_customer_order_10_11 = gather_customer_order.loc[ gather_customer_order.create_date.apply(lambda x:(x.month==10 or x.month==11))]
gather_customer_order_10_11.tail()

image-20200630202539985

2.1.2 查看10月、11月的自行车订单数量

#10月11月自行车订单数据共6266条
len(gather_customer_order_10_11)

2.2、2019年10月和11月自行车区域销售量表现

2.2.1 请按照'chinese_territory','create_year_month'，区域、月份分组，订单量求和、销售金额求和，赋予变量gather_customer_order_10_11_group，记得重置索引

#按照区域、月分组,订单量求和，销售金额求和
gather_customer_order_10_11_group= gather_customer_order_10_11.groupby( by=['chinese_territory', 
                                                                            'create_year_month']).agg({'order_num':'sum',
                                                                                                    'sum_amount':'sum'}).reset_index()

gather_customer_order_10_11_group.head(6)

image-20200630202745724

2.2.2 求不同区域10月11月的环比

步骤

1、获得去重区域的列表region_list
2、利用for循环区域列表，结合loc定位符合区域
3、利用pct_change()函数实现环比效果
4、形成新的Series
5、利用Series带的索引，合并到gather_customer_order_10_11_group中

#将区域存为列表
region_list=gather_customer_order_10_11_group.chinese_territory.unique()

region_list

array(['东北', '华东', '华中', '华北', '华南', '台港澳', '西北', '西南'], dtype=object)

#pct_change()当前元素与先前元素的相差百分比,求不同区域10月11月环比
order_x = pd.Series([])
amount_x = pd.Series([])
for i in region_list:
    a=gather_customer_order_10_11_group.loc[gather_customer_order_10_11_group['chinese_territory']==i]['order_num']
    a = a.diff()/a.shift()
    a.fillna(0, inplace=True)
    b=gather_customer_order_10_11_group.loc[gather_customer_order_10_11_group['chinese_territory']==i]['sum_amount']
    b = b.diff()/b.shift()
    b.fillna(0, inplace=True)
    order_x=order_x.append(a)
    amount_x = amount_x.append(b)

gather_customer_order_10_11_group['order_diff']=order_x
gather_customer_order_10_11_group['amount_diff']=amount_x

#10月11月各个区域自行车销售数量、销售金额环比
gather_customer_order_10_11_group.head()

image-20200630203303370

字段注释：

chinese_territory：区域，create_year_month：时间，order_num：区域销售数量，sum_amount：区域销售金额，order_diff：本月销售数量环比，

amount_diff：本月销售金额环比

2.2.3、将最终的gather_customer_order_10_11_group的DataFrame存入Mysql的pt_bicy_november_territory_2当中，请使用追加存储。

#将数据存入数据库
gather_customer_order_10_11_group.to_sql('pt_bicy_november_territory_2', con=engine_write, if_exists='append')

l = [1, 2, 4]
dict = {'a':l}
print(dict)
l.append(5)
print(dict)

{'a': [1, 2, 4]}
{'a': [1, 2, 4, 5]}

2.2.4、总结知识点：

1、总结DataFrame的去重方法，多种

groupby, unique,

2、总结Series转化为列表的方式

list(s) s.tolist()

3、总结Series与列表的区别

Series是带索引的，不可变变量
List是不带索引的，在内存中是可变变量。

4、总结append\extend等列表追加元素的区别及适用范围

image-20200630203606346

2.3、2019年11月自行车销售量TOP10城市环比

2.3.1、筛选11月自行车交易数据赋予变量为gather_customer_order_11

#筛选11月自行车交易数据
gather_customer_order_11 = gather_customer_order.loc[ gather_customer_order.create_date.apply(lambda x:x.month==11)]

2.3.2、将gather_customer_order_11按照chinese_city城市分组，求和销售数量order_num，最终查看11月自行车销售数量前十城市，赋予变量gather_customer_order_city_head

gather_customer_order_city_11= gather_customer_order_11.groupby('chinese_city').agg({'order_num':'sum'}).reset_index()

#11月自行车销售数量前十城市
gather_customer_order_city_head = gather_customer_order_city_11.sort_values('order_num', ascending=False).head(10)

#查看11月自行车销售数量前十城市
gather_customer_order_city_head

image-20200630203654708

2.3.3、根据gather_customer_order_city_head的前十城市，查看10月11月自行车销售数据gather_customer_order_10_11，赋予变量gather_customer_order_10_11_head

#筛选销售前十城市，10月11月自行车销售数据
interested_cities = gather_customer_order_city_head.chinese_city.unique()

gather_customer_order_10_11_head = gather_customer_order_10_11.loc[ gather_customer_order_10_11.chinese_city.apply(lambda x:
                                                                                                     x in interested_cities)]

gather_customer_order_10_11_head.head(10)

image-20200630203728947

2.3.4、根据gather_customer_order_10_11_head，分组计算前十城市，自行车销售数量销售金额，赋予变量gather_customer_order_city_10_11

#分组计算前十城市，自行车销售数量销售金额
gather_customer_order_city_10_11 = gather_customer_order_10_11_head.groupby(by=
                                                                            ['chinese_city', 'create_year_month']).agg({'order_num':'sum', 
                                                                                                                    'sum_amount':'sum'}).reset_index()

gather_customer_order_city_10_11

image-20200630203806774

2.3.5、根据gather_customer_order_city_10_11，计算前10的销售金额及销售量环比，最终形成结果gather_customer_order_city_10_11如图,方法参照作业2.2.2

#计算前十城市环比
city_top_list = gather_customer_order_city_10_11.chinese_city.unique()
order_top_x = pd.Series([])
amount_top_x = pd.Series([])
for i in city_top_list:
    #print(i)
    a=gather_customer_order_city_10_11[ gather_customer_order_city_10_11.chinese_city==i].order_num
    a = a.diff() / a.shift()
    a.fillna(0, inplace=True)
    b = gather_customer_order_city_10_11[ gather_customer_order_city_10_11.chinese_city==i].sum_amount
    b= b.diff() / b.shift()    
    b.fillna(0, inplace=True)
    order_top_x=order_top_x.append(a)
    amount_top_x =amount_top_x.append(b)

#order_diff销售数量环比，amount_diff销售金额环比
gather_customer_order_city_10_11['order_diff']=order_top_x
gather_customer_order_city_10_11['amount_diff']=amount_top_x

gather_customer_order_city_10_11.head(5)

image-20200630203834334

字段注释

chinese_city：城市，create_year_month：时间，order_num：本月销售数量，sum_amount：本月销售金额，order_diff：本月销售数量环比，

amount_diff：本月销售金额环比

2.3.6、将最终的gather_customer_order_city_10_11的DataFrame存入Mysql的pt_bicy_november_october_city_3当中，请使用追加存储。

#11月销售数量前十城市。销售数量、销售金额环比
#print(gather_customer_order_city_10_11.head())

#存入数据库, ife
engine2 = create_engine(r'mysql://root@localhost/adventure_write?charset=gbk')
gather_customer_order_city_10_11.to_sql('pt_bicy_november_october_city_3', con=engine2, if_exists='append')

2.3.7、总结知识点

1、函数agg的用法及适用范围

groupby分类汇总的时候

2、DataFrame中的loc,iloc,at,iat,ix的用法和区别

loc,at 使用索引查找
iloc,iat 使用数字，位置查找
loc,iloc除了跟单一的值，还可以跟切片，list，true/false的series
at,iat 只能跟单一的数值
ix 是比前面的几个更强大，但是也容易混淆

3、筛选函数的总结，如between \ & \ ~ \ isin \ ==

df.loc[df.s.isin(['lama'])] # isin
h1 = df1[(df1['nat_actn_2_3'].isin['100','102']) | ((df1['nat_acctn_2_3'] > 500) & (df1['nat_actn_2_3'] < 599))] #& |
df.loc[~df.s.isin(['lama'])] #~
df.loc[df.s.between(1, 4) ]

4、个人总结第二部分目的及需求
1. 对于区域的表现我们是先unique得到区域的集合，然后再集合上面迭代筛选这个区域内的数据，在这个数据里面计算环比
2. 对于TOPS10的问题，其实和区域类似，唯一的不同在于先要筛选前十名的城市，其中用到了排序以及筛选然后再unique得到区域的集合。然后就可以筛选数据并且分类汇总
5、容易出错的地方

排序之后需要reset_index否则后面concat的时候顺序又会搞错。

最后编辑于：2020.06.30 21:05:49

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 199,711评论 5赞 468
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 83,932评论 2赞 376
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 146,770评论 0赞 330
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 53,799评论 1赞 271
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 62,697评论 5赞 359
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,069评论 1赞 276
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,535评论 3赞 390
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,200评论 0赞 254
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,353评论 1赞 294
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,290评论 2赞 317
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,331评论 1赞 329
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,020评论 3赞 315
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,610评论 3赞 303
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,694评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 30,927评论 1赞 255
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 42,330评论 2赞 346
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 41,904评论 2赞 341

2019年11月自行车业务分析报告——PART1

前言