1 Numeric feature processing
1.1 Mean removal
For each feature, the values across different samples are shifted and rescaled so that the final mean is 0 and the standard deviation is 1.
import numpy as np
from sklearn import preprocessing
# each column is a sample and features stack vertically,
# i.e., there are 4 samples, each with 3 features
data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])
# mean removal along each row (axis=1)
data_standardized = preprocessing.scale(data, axis=1)
print(data_standardized)
print("\nMean = ", data_standardized.mean(axis=1))
print("Std deviation", data_standardized.std(axis=1))
Output:
[[ 1.05366545 -0.31079341 0.75045237 -1.49332442]
[-0.8340361 1.46675314 -1.00659529 0.37387825]
[ 0.51284962 1.31254733 -0.49546489 -1.32993207]]
Mean = [ -5.55111512e-17 -1.11022302e-16 0.00000000e+00]
Std deviation [ 1. 1. 1.]
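What `scale` computes along `axis=1` can be reproduced with plain NumPy, which makes the formula explicit: subtract each row's mean, then divide by its standard deviation. A sketch over the same `data` array:

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# subtract each row's mean and divide by its standard deviation
mean = data.mean(axis=1, keepdims=True)
std = data.std(axis=1, keepdims=True)
manual = (data - mean) / std

print(manual.mean(axis=1))  # approximately 0 for every row
print(manual.std(axis=1))   # 1 for every row
```

`keepdims=True` keeps the per-row statistics as column vectors so that broadcasting divides each row by its own values.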
1.2 Range scaling (MinMaxScaler)
For each column, subtract its minimum and divide by (maximum - minimum), so that the original maximum maps to 1 and the original minimum maps to 0.
# scaling
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)
print(data_scaled)
Output:
[[ 1. 0. 1. 0. ]
[ 0. 1. 0.41025641 1. ]
[ 0.33333333 0.87272727 0. 0.14666667]]
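The formula behind `MinMaxScaler`, (x - min) / (max - min) applied per column, can be checked directly with NumPy (a sketch using the same `data` array):

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# per-column (axis=0) min-max scaling into [0, 1]
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual_scaled = (data - col_min) / (col_max - col_min)

print(manual_scaled)  # matches MinMaxScaler's output above
```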
1.3 Normalization
Normalization preserves the sign and relative proportions of the values while shrinking each sample to unit norm.
data_normalized_l1 = preprocessing.normalize(data, norm='l1', axis=1)
data_normalized_l2 = preprocessing.normalize(data, norm='l2', axis=1)
print("L1 norm")
print(data_normalized_l1)
print("\nL2 norm")
print(data_normalized_l2)
Output:
L1 norm
[[ 0.25210084 -0.12605042 0.16806723 -0.45378151]
[ 0. 0.625 -0.046875 0.328125 ]
[ 0.0952381 0.31428571 -0.18095238 -0.40952381]]
L2 norm
[[ 0.45017448 -0.22508724 0.30011632 -0.81031406]
[ 0. 0.88345221 -0.06625892 0.46381241]
[ 0.17152381 0.56602858 -0.32589524 -0.73755239]]
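What `normalize` computes per row can be verified by hand: each sample is divided by its L1 norm (sum of absolute values) or its L2 norm (Euclidean length). A sketch:

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# per-row norms, kept as column vectors for broadcasting
l1 = np.abs(data).sum(axis=1, keepdims=True)
l2 = np.sqrt((data ** 2).sum(axis=1, keepdims=True))

print(data / l1)  # absolute values of each row sum to 1
print(data / l2)  # each row has Euclidean length 1
```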
1.4 Binarization
Binarization converts a numeric feature vector into boolean values: entries greater than the threshold become 1, all others become 0.
data_binarized = preprocessing.Binarizer(threshold=0.4).transform(data)
print("\nBinarized data:")
print(data_binarized)
Output:
Binarized data:
[[ 1. 0. 1. 0.]
[ 0. 1. 0. 1.]
[ 1. 1. 0. 0.]]
Of course, NumPy itself supports elementwise comparison (producing boolean arrays), so the code below gives the same result. Note that the comparison must be `>`, since Binarizer maps values greater than the threshold to 1:
(data > 0.4).astype(np.int32)
2 Encoding non-numeric data
2.1 Label encoding
Strings are encoded as integers from 0 to n-1 (in sorted order):
label_encoder = preprocessing.LabelEncoder()
input_classes = ['audi', 'ford', 'toyota', 'ford', 'bwm']
label_encoder.fit(input_classes)
print("\nClass mapping:")
for i, item in enumerate(label_encoder.classes_):
    print(item, "-->", i)
The resulting mapping:
Class mapping:
audi --> 0
bwm --> 1
ford --> 2
toyota --> 3
Encoding new data (note: use `transform`, not `fit_transform`, so that the mapping learned above is reused rather than refit on the new data):
labels = ['toyota', 'ford', 'audi']
encoded_labels = label_encoder.transform(labels)
print("Labels: ", labels)
print("Encoded Labels: ", encoded_labels)
Output:
Labels: ['toyota', 'ford', 'audi']
Encoded Labels: [3 2 0]
Decoding is done with inverse_transform:
encoded_labels = [2, 1, 0, 3, 1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("Encoded Labels: ", encoded_labels)
print("Decoded Labels: ", decoded_labels)
Output:
Encoded Labels: [2, 1, 0, 3, 1]
Decoded Labels: ['ford' 'bwm' 'audi' 'toyota' 'bwm']
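One caveat: `transform` raises a ValueError for labels the encoder never saw during `fit`, so unseen categories must be handled explicitly. A minimal sketch (the `'nissan'` label is a made-up example of an unseen category):

```python
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(['audi', 'ford', 'toyota', 'ford', 'bwm'])

# 'nissan' was never seen during fit, so transform refuses it
try:
    label_encoder.transform(['nissan'])
    seen = True
except ValueError:
    seen = False
print("unseen label rejected:", not seen)
```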
2.2 One-hot encoding
One-hot encoding turns categorical (non-numeric) features into binary vectors, so that after encoding all categories are equidistant in ordinary Euclidean space. Note that the encoder works column by column: each column of the input is treated as a separate categorical feature.
data = np.array([[0, 2, 1, 12],
                 [1, 3, 5, 3],
                 [2, 3, 2, 12],
                 [1, 2, 4, 3]])
encoder = preprocessing.OneHotEncoder()
encoder.fit(data)
encoder_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print(encoder_vector)
Output:
[[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
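The length of the encoded vector follows from counting the distinct values in each column of the training data: 3 + 2 + 4 + 2 = 11. This can be checked with NumPy (a sketch):

```python
import numpy as np

data = np.array([[0, 2, 1, 12],
                 [1, 3, 5, 3],
                 [2, 3, 2, 12],
                 [1, 2, 4, 3]])

# distinct values per column determine the width of each one-hot segment
widths = [len(np.unique(col)) for col in data.T]
print(widths)       # [3, 2, 4, 2]
print(sum(widths))  # 11
```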