The dataset is from Bike Sharing Demand.
The code examples are from EDA & Ensemble Model (Top 10 Percentile).
1. Data observation and processing
1.1 Observation
shape
data.head(2)
data type
1.2 feature engineering
1.2.1 create new features from "DateTime"
Split datetime into date, hour, weekday, and month.
For example:
dailyData["hour"] = dailyData.datetime.apply(lambda x : x.split()[1].split(":")[0])
dailyData["weekday"] = dailyData.date.apply(lambda dateString : calendar.day_name[datetime.strptime(dateString,"%Y-%m-%d").weekday()])
strptime: parses a date/time string into a datetime object (str to time)
weekday(): returns 0–6, meaning Monday through Sunday
calendar.day_name: maps the weekday number to its name (Monday/Tuesday/...)
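As a small self-contained sketch of the two lines above, the snippet below derives date, hour, weekday, and month from a datetime string column. The two sample rows are made up for illustration. The excerpt uses `dailyData.date` without showing how it was created, so building it by splitting on the space is an assumption, as is the `month` line.

```python
import calendar
from datetime import datetime

import pandas as pd

# Two made-up rows in "YYYY-MM-DD HH:MM:SS" string format (illustrative only)
dailyData = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2011-01-03 17:00:00"]})

# Assumption: "date" is the part of the string before the space
dailyData["date"] = dailyData.datetime.apply(lambda x: x.split()[0])

# Hour stays a string such as "05", exactly as in the line from the notes
dailyData["hour"] = dailyData.datetime.apply(lambda x: x.split()[1].split(":")[0])

# strptime parses the string, weekday() gives 0-6, day_name turns that into a name
dailyData["weekday"] = dailyData.date.apply(
    lambda dateString: calendar.day_name[datetime.strptime(dateString, "%Y-%m-%d").weekday()]
)

# Assumption: month is derived the same way through calendar.month_name
dailyData["month"] = dailyData.date.apply(
    lambda dateString: calendar.month_name[datetime.strptime(dateString, "%Y-%m-%d").month]
)

print(dailyData)
```

Note that the hour column holds strings here, which is harmless once it is later converted to a categorical feature.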
Difference between apply, map, and applymap: these are ways to apply a function to an element, a column, or a whole DataFrame.
Map: iterates over each element of a Series.
df['column1'].map(lambda x: 10+x) adds 10 to each element of column1.
df['column2'].map(lambda x: 'AV'+x) prepends "AV" to each element of column2 (the column must hold strings).
Apply: as the name suggests, applies a function along an axis of the DataFrame.
df[['column1','column2']].apply(sum) returns the sum of each of the two columns (one value per column).
ApplyMap: applies a function to each element of a DataFrame.
func = lambda x: x+2
df.applymap(func) adds 2 to each element of the DataFrame (all columns must be numeric).
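The three calls can be tried side by side on a toy DataFrame. This is only a sketch: the column values are invented, and the `hasattr` check is there because `DataFrame.applymap` was renamed to `DataFrame.map` in pandas 2.1, so the snippet picks whichever the installed version provides.

```python
import pandas as pd

# Made-up data: one numeric column and one string column
df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})

# Series.map: element-wise over a single column
plus_ten = df["column1"].map(lambda x: 10 + x)
prefixed = df["column2"].map(lambda x: "AV" + x)

# DataFrame.apply: the function receives one whole column at a time (axis=0)
nums = pd.DataFrame({"column1": [1, 2, 3], "column2": [10, 20, 30]})
col_sums = nums.apply(sum)  # a Series with one sum per column

# Element-wise over every cell of the DataFrame.
# Newer pandas calls this DataFrame.map; older versions only have applymap.
elementwise = nums.map if hasattr(nums, "map") else nums.applymap
plus_two = elementwise(lambda x: x + 2)

print(plus_ten.tolist(), prefixed.tolist())
print(col_sums.to_dict())
print(plus_two)
```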
1.2.2. Convert features that should be categorical (e.g. season, holiday, working day) to the category dtype
categoryVariableList = ["hour","weekday","month","season","weather","holiday","workingday"]
for var in categoryVariableList:
    dailyData[var] = dailyData[var].astype("category")
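To see what the conversion changes, here is a minimal sketch on made-up rows. Only two of the listed columns are included, and the values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Made-up rows; season and holiday are stored as integers, like in the raw CSV
dailyData = pd.DataFrame(
    {"season": [1, 2, 2, 4], "holiday": [0, 0, 1, 0], "count": [16, 40, 32, 13]}
)

for var in ["season", "holiday"]:
    dailyData[var] = dailyData[var].astype("category")

# The integer codes are now treated as labels rather than magnitudes
print(dailyData.dtypes)
print(list(dailyData["season"].cat.categories))
```

After the conversion, the distinct values of each column are available through `.cat.categories`, and the untouched `count` column stays numeric.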
1.2.3. Missing-value handling
The library used is missingno, a quite handy library to quickly visualize variables for missing values.
msno.matrix(dailyData,figsize=(12,5))
1.2.4. Handling skewness by removing outliers
Box plots
fig, axes = plt.subplots(nrows=2,ncols=2) # 2x2 grid of subplots
fig.set_size_inches(12, 10)
sn.boxplot(data=dailyData,y="count",orient="v",ax=axes[0][0])
sn.boxplot(data=dailyData,y="count",x="season",orient="v",ax=axes[0][1])
#orient : "v" | "h", optional, Orientation of the plot (vertical or horizontal).
Removing outliers
dailyDataWithoutOutliers = dailyData[np.abs(dailyData["count"]-dailyData["count"].mean())<=(3*dailyData["count"].std())]
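This statement is neat and worth borrowing: it keeps only the rows whose count lies within three standard deviations of the mean of the count column. After this step, the box plots should still be redrawn to check the skewness.
This step presumably serves the same goal as a log transform; which of the two works better can be checked later.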
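One way to run that comparison is to compute the skewness before and after each treatment. The sketch below uses made-up data (the values 1 to 100 plus one extreme value), so the numbers say nothing about which option is better on the real dataset. `np.log1p` is just one possible choice of log transform.

```python
import numpy as np
import pandas as pd

# Made-up data: counts 1..100 plus a single extreme value
dailyData = pd.DataFrame({"count": list(range(1, 101)) + [1000]})

# Option 1: keep only rows within three standard deviations of the mean
within_3_std = np.abs(dailyData["count"] - dailyData["count"].mean()) <= (
    3 * dailyData["count"].std()
)
dailyDataWithoutOutliers = dailyData[within_3_std]

# Option 2: keep every row but compress the scale with log(1 + x)
logCount = np.log1p(dailyData["count"])

# Skewness near 0 means a roughly symmetric distribution
skew_raw = dailyData["count"].skew()
skew_filtered = dailyDataWithoutOutliers["count"].skew()
skew_log = logCount.skew()

print(len(dailyData), len(dailyDataWithoutOutliers))
print(skew_raw, skew_filtered, skew_log)
```

Filtering drops rows while the log transform keeps them all, so the two options also differ in how much training data survives.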
1.2.5. Correlation analysis
corrMatt = dailyData[["temp","atemp","casual","registered","humidity","windspeed","count"]].corr()
mask = np.array(corrMatt)
mask[np.tril_indices_from(mask)] = False
fig,ax= plt.subplots()
fig.set_size_inches(20,10)
sn.heatmap(corrMatt, mask=mask,vmax=.8, square=True,annot=True)
A few open questions:
- Why does dailyData[[XX]] need double brackets when computing corrMatt?
- mask[np.tril_indices_from(mask)] = False sets everything on and below the diagonal to 0. Why?
- How exactly is the heatmap function used?
- "Casual" and "Registered" are also not taken into account since they are leakage variables in nature and need to be dropped during model building. Why?
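The first two questions can be explored without plotting anything. In the sketch below (toy values, not from the dataset), indexing with a list of labels returns a DataFrame, which is what `.corr()` needs to produce a matrix. The mask starts out as a copy of the correlation values, so after zeroing the lower triangle and the diagonal, only the upper triangle is still non-zero. seaborn's `heatmap` hides the cells where the mask is true, so the plot shows the lower triangle and the diagonal once instead of a mirrored matrix.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
df = pd.DataFrame(
    {
        "temp": [1.0, 2.0, 3.0, 4.0],
        "humidity": [4.0, 3.0, 2.5, 1.0],
        "count": [2.0, 4.0, 5.0, 9.0],
    }
)

# One label gives a Series; a list of labels (double brackets) gives a DataFrame
one_col = df["temp"]
sub_df = df[["temp", "count"]]

corrMatt = df[["temp", "humidity", "count"]].corr()

# Same trick as in the notes: start from the correlation values themselves
mask = np.array(corrMatt)
mask[np.tril_indices_from(mask)] = False  # zeroes the lower triangle AND the diagonal

# Cells that are still non-zero are the ones heatmap(mask=...) would hide
hidden = mask.astype(bool)
print(hidden)
```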
Plotting the correlation heatmap leads to a few conclusions:
- temp and humidity are positively and negatively correlated with count, respectively. Although neither correlation is very prominent, count still shows some dependency on "temp" and "humidity".
- windspeed is not going to be a really useful numerical feature, as is visible from its correlation value with "count".
- "atemp" is not taken into account since "atemp" and "temp" are strongly correlated with each other. During model building one of the two has to be dropped, since together they introduce multicollinearity.
1.2.6. regression plot
1.2.7 Visualizing Count Vs (Month,Season,Hour,Weekday,Usertype)
Bar plot
monthAggregated = pd.DataFrame(dailyData.groupby("month")["count"].mean()).reset_index()
monthSorted = monthAggregated.sort_values(by="count",ascending=False)
sn.barplot(data=monthSorted,x="month",y="count",ax=ax1,order=sortOrder)
ax1.set(xlabel='Month', ylabel='Average Count',title="Average Count By Month")
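The aggregation step that feeds the bar plot can be checked on its own before any plotting. A minimal sketch with made-up rows follows; the month names and counts are illustrative.

```python
import pandas as pd

# Made-up rows: two observations per month
dailyData = pd.DataFrame(
    {
        "month": ["January", "January", "June", "June"],
        "count": [10, 20, 100, 200],
    }
)

# Mean count per month, turned back into a regular two-column DataFrame
monthAggregated = pd.DataFrame(dailyData.groupby("month")["count"].mean()).reset_index()

# Highest average first
monthSorted = monthAggregated.sort_values(by="count", ascending=False)
print(monthSorted)
```

The `reset_index()` call matters here: without it, `month` would be the index rather than a column, and `x="month"` in the plotting call would not find it.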
Point plot
hourAggregated = pd.DataFrame(dailyData.groupby(["hour","season"],sort=True)["count"].mean()).reset_index()
sn.pointplot(x=hourAggregated["hour"], y=hourAggregated["count"],hue=hourAggregated["season"], data=hourAggregated, join=True,ax=ax2)
ax2.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Season",label='big')