Data Preprocessing
The task for Day 1 is data preprocessing. Let's get started!
Step 1: Import the libraries
NumPy contains the mathematical functions we need; Pandas is used to import and manage the dataset.
If, like me, you know next to nothing about Pandas, this article is a good starting point: pandas学习
The code is as follows:
#Step 1: Import the libraries
import numpy as np
import pandas as pd
Step 2: Import the dataset
Datasets usually come in CSV format, with one record per row. After read_csv, the CSV data is held in a DataFrame.
We then separate the independent variables and the dependent variable from the DataFrame, as a matrix and a vector respectively.
The code is as follows:
#Step 2: Import the dataset
dataset = pd.read_csv('../datasets/Data.csv')
X = dataset.iloc[ : , :-1].values   # independent variables: every column except the last
Y = dataset.iloc[ : , 3].values     # dependent variable: the fourth (last) column
print("Step 2: Importing dataset")
print("X")
print(X)
print("Y")
print(Y)
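Since Data.csv may not be at hand, here is a minimal sketch of the same slicing on a small hand-built DataFrame. The column names and values are made up for illustration; they mimic the layout the tutorial assumes (three feature columns plus a Yes/No label):

```python
import pandas as pd

# A tiny stand-in for Data.csv (hypothetical values)
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
Y = dataset.iloc[:, -1].values   # all rows, the last column only

print(X.shape)  # (4, 3)
print(Y.shape)  # (4,)
```

`iloc` selects by integer position, so the same two lines work on any CSV whose last column is the label.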
Step 3: Handling the missing data
Real-world data is rarely clean, so we usually need to handle missing values so that bad data does not mislead the model during training. scikit-learn's SimpleImputer handles this (the old Imputer class was removed in scikit-learn 0.22), and we typically replace missing values with the mean or median of the column. In this example, missing values appear as NaN.
The code is as follows:
#Step 3: Handling the missing data
from sklearn.impute import SimpleImputer   # replaces the removed Imputer class
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])
print("---------------------")
print("Step 3: Handling the missing data")
print("X")
print(X)
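To see what the imputer actually does, here is a self-contained sketch on a single column with one missing value (the numbers are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

col = np.array([[1.0], [np.nan], [3.0]])
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(col)
print(filled.ravel())  # the NaN is replaced by the mean of 1.0 and 3.0, i.e. 2.0
```

Switching `strategy` to `"median"` or `"most_frequent"` changes the replacement value accordingly.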
Step 4: Encoding categorical data
Categorical data cannot stay as text labels; it must be converted to numbers. In our example the dependent variable takes the values Yes and No, and we use the LabelEncoder class to convert it.
- LabelEncoder: encodes labels as integers between 0 and n_classes-1; it can also turn non-numeric labels (as long as they are comparable) into numeric ones.
- OneHotEncoder: encodes categorical integer features using a one-of-K scheme.
The code is as follows:
#Step 4: Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0])
#Creating dummy variables for the first column.
#(OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer now selects which columns to encode, and sparse_threshold=0
# forces a dense array.)
ct = ColumnTransformer([("onehot", OneHotEncoder(), [0])], remainder = "passthrough", sparse_threshold = 0)
X = ct.fit_transform(X)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
print("---------------------")
print("Step 4: Encoding categorical data")
print("X")
print(X)
print("Y")
print(Y)
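The two encoders are easiest to tell apart on a tiny made-up label array. LabelEncoder maps each class to a single integer (sorted alphabetically), while OneHotEncoder expands each class into its own 0/1 column:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(["No", "Yes", "No"])

le = LabelEncoder()
encoded = le.fit_transform(labels)  # alphabetical order: No -> 0, Yes -> 1
print(encoded)                      # [0 1 0]

ohe = OneHotEncoder()               # expects a 2-D array of shape (n_samples, 1)
onehot = ohe.fit_transform(labels.reshape(-1, 1)).toarray()
print(onehot)                       # one 0/1 column per class
```

This is why one-hot encoding is used for the unordered Country feature: integer codes alone would impose a spurious ordering between countries.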
Step 5: Splitting the dataset into training and test sets
The dataset is split into two parts: a training set, used to fit the model, and a test set, used to evaluate the trained model's performance. An 80:20 split is the usual rule of thumb.
The code is as follows:
#Step 5: Splitting the datasets into training sets and Test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)
print("---------------------")
print("Step 5: Splitting the datasets into training sets and Test sets")
print("X_train")
print(X_train)
print("X_test")
print(X_test)
print("Y_train")
print(Y_train)
print("Y_test")
print(Y_test)
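The split sizes follow directly from test_size. A minimal sketch with ten made-up samples, using the same parameters as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
Y = np.arange(10)                 # 10 labels

# test_size=0.2 reserves 20% of the samples for testing;
# random_state=0 makes the shuffle reproducible across runs
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 8 2
```

Fixing `random_state` matters in a tutorial: without it, every run produces a different split and different downstream numbers.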
Step 6: Feature Scaling
In many machine learning algorithms, features with large magnitudes dominate features with small magnitudes (for example, a salary in the tens of thousands will swamp an age in the tens when computing distances).
We solve this with feature standardization (the Z-score): each feature is rescaled as x' = (x - μ) / σ, giving it zero mean and unit variance.
The code is as follows:
#Step 6: Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)   # fit the scaler on the training set only
X_test = sc_X.transform(X_test)         # apply the same scaling to the test set
print("---------------------")
print("Step 6: Feature Scaling")
print("X_train")
print(X_train)
print("X_test")
print(X_test)
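The effect of standardization is easy to check on a made-up column: after scaling, the column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print(scaled.mean())  # approximately 0.0
print(scaled.std())   # approximately 1.0 (StandardScaler uses the population std)
```

Note that in the main code above the scaler is fit only on X_train and then reused on X_test; fitting it on the test set too would leak test-set statistics into preprocessing.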