A Practical Summary of Feature Engineering with tf.feature_column

An overview of tf.feature_column's feature-processing capabilities

tf.feature_column mainly targets categorical and continuous features. Broadly speaking, it covers four scenarios:

(1) Discretizing continuous features: bucketing, hashing, one-hot, multi-hot
(2) Mapping high-dimensional categorical features into low-dimensional dense vectors
(3) Normalizing continuous features
(4) Crossing categorical features

(figure: feature_column.png, an overview diagram of the tf.feature_column capabilities)

Representing categorical features

tf.feature_column provides four interfaces for categorical variables: direct one-hot of integer IDs, one-hot via a vocabulary list, one-hot after hash bucketing, and embedding (turning a categorical feature back into a continuous one). The first three turn string/integer features into 0/1 one-hot vectors; the last, embedding_column, maps the one-hot result further into a low-dimensional dense vector. The interfaces are:

tf.feature_column.categorical_column_with_identity
tf.feature_column.categorical_column_with_vocabulary_list
tf.feature_column.categorical_column_with_hash_bucket
tf.feature_column.embedding_column


1. Mapping integer-valued features directly to categorical features: tf.feature_column.categorical_column_with_identity

If a categorical column is already encoded as consecutive integers starting from 0, it can be mapped to a categorical variable directly. The number of distinct values is specified up front, and out-of-range values are filled with the default value. This suits features that are already integer-ID encoded with a modest value range. If each input row is a list with multiple elements, this produces a multi-hot encoding, but all lists must have the same number of elements.

key: feature name
num_buckets: the number of distinct values of the categorical feature
default_value=None: fill value used when an out-of-range value appears

# One-hot encode column col1
one_categorical_feature = tf.feature_column.categorical_column_with_identity("col1", 10)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
features = {"col1": [[4], [1], [0]]}  # values must be integers >= 0; negatives are not allowed
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
with tf.Session() as sess:
    print(net.eval())
# [[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

# test multi-hot
one_categorical_feature = tf.feature_column.categorical_column_with_identity("col1", 10)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
features = {"col1": [[4, 2], [1, 0], [0, 1]]}  # each row's list contains multiple elements
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
with tf.Session() as sess:
    print(net.eval())
# [[0. 0. 1. 0. 1. 0. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
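The multi-hot behavior above can be mimicked in plain Python: each row's ID counts become that row of the indicator matrix. This is a conceptual sketch of what indicator_column computes, not TF code.

```python
# Conceptual sketch of multi-hot encoding for an identity column:
# each row lists integer IDs, and the indicator row counts each ID.
def multi_hot(rows, num_buckets):
    out = []
    for ids in rows:
        row = [0.0] * num_buckets
        for i in ids:
            row[i] += 1.0  # repeated IDs accumulate, matching indicator_column
        out.append(row)
    return out

print(multi_hot([[4, 2], [1, 0], [0, 1]], 10))
```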

2. Mapping string categorical features via a vocabulary: tf.feature_column.categorical_column_with_vocabulary_list

If a categorical column holds strings (integers also work) with a small set of distinct values, use this interface and enumerate the possible values. Unseen values can either be assigned new indices or mapped onto an existing index. Vocabulary mapping also supports multi-hot.

key: feature name
vocabulary_list: list of all vocabulary values
dtype=None
default_value=-1: handles unseen values by mapping them onto an existing index; the default -1 means they are ignored (an all-zero one-hot row), since valid indices start at 0
num_oov_buckets=0: handles unseen values by mapping them to extra indices; sets how many extra indices to reserve

# One-hot encode column col1
from tensorflow.python.ops import lookup_ops

# num_oov_buckets reserves 2 extra columns for unseen values; with the 4 vocabulary columns the one-hot has 6 columns in total
one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
            'col1', vocabulary_list=['a', 'x', 'ca', '我'], num_oov_buckets=2)
# the one-hot index order follows vocabulary_list, i.e. ['a', 'x', 'ca', '我'], so fixing the vocabulary order fixes the output columns
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)

features = {'col1': [['a'], ['我'], ['a'], ['ca']]}
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
with tf.Session() as sess:
    sess.run(lookup_ops.tables_initializer())
    print(net.eval())
# [[1. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0.]
#  [1. 0. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0.]]

If num_oov_buckets=0 (or is left at its default), unseen values are simply ignored and produce an all-zero row

one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
            'col1', vocabulary_list=['a', 'x', 'ca', '我'], num_oov_buckets=0)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)  

features = {'col1': [['a'], ['他'], ['a'], ['ca']]}
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
with tf.Session() as sess:
    sess.run(lookup_ops.tables_initializer())
    print(net.eval())
# [[1. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 1. 0.]]

Setting default_value instead also handles unseen values, mapping them all onto one existing index, e.g. an "other" slot

one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
            'col1', vocabulary_list=['a', 'x', 'ca', '我'], default_value=1)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)  

features = {'col1': [['a'], ['他'], ['a'], ['ca']]}
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
with tf.Session() as sess:
    sess.run(lookup_ops.tables_initializer())
    print(net.eval())
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 1. 0.]]

Testing multi-hot: if the same element appears multiple times in one row, its count exceeds 1

from tensorflow.python.ops import lookup_ops

one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
            'col1', vocabulary_list=['a', 'x', 'ca', '我'], num_oov_buckets=2)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)  

features = {'col1': [['a', 'a'], ['我', 'a'], ['a', 'ca'], ['ca', 'x']]}
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
with tf.Session() as sess:
    sess.run(lookup_ops.tables_initializer())
    print(net.eval())
# [[2. 0. 0. 0. 0. 0.]
#  [1. 0. 0. 1. 0. 0.]
#  [1. 0. 1. 0. 0. 0.]
#  [0. 1. 1. 0. 0. 0.]]
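The three unseen-value behaviors (ignore, default_value, num_oov_buckets) can be sketched in plain Python. Note that TF assigns OOV buckets with its own fingerprint hash; the crc32 below is only an illustrative stand-in, so the chosen bucket may differ from TF's.

```python
import zlib

# Conceptual sketch of vocabulary lookup with OOV handling.
# The crc32 hash stands in for TF's fingerprint; bucket choices may differ.
def vocab_lookup(value, vocab, default_value=-1, num_oov_buckets=0):
    if value in vocab:
        return vocab.index(value)
    if num_oov_buckets > 0:
        # unseen values land in one of the extra buckets after the vocabulary
        return len(vocab) + zlib.crc32(value.encode("utf-8")) % num_oov_buckets
    return default_value  # -1 means "ignored": an all-zero one-hot row

vocab = ['a', 'x', 'ca', '我']
print(vocab_lookup('a', vocab))                     # known word: index 0
print(vocab_lookup('new', vocab))                   # unseen, ignored: -1
print(vocab_lookup('new', vocab, default_value=1))  # unseen, reuses index of 'x'
```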

3. Mapping string categorical features to buckets via hashing: tf.feature_column.categorical_column_with_hash_bucket

If a categorical column is a string (integers are also supported) with many distinct values, such as an ID column, use this interface.

key: feature name
hash_bucket_size: an integer of at least 2; the number of hash buckets
dtype=tf.string: input feature type; strings and integers are supported

Hash bucketing reduces a high-cardinality categorical feature to a lower-dimensional categorical representation:

one_categorical_feature = tf.feature_column.categorical_column_with_hash_bucket(
            'col1', hash_bucket_size=3)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
features = {'col1': [['a'], ['x'], ['a'], ['b'], ['d'], ['h'], ['c'], ['k']]}
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
with tf.Session() as sess:    
    print(net.eval())
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]

If the input is integer-valued, set dtype=tf.int32; internally the integers are converted to strings before hashing

one_categorical_feature = tf.feature_column.categorical_column_with_hash_bucket(
            'col1', hash_bucket_size=3, dtype=tf.int32)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
features = {'col1': [[1], [2]]}
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
with tf.Session() as sess:    
    print(net.eval())
# [[1. 0. 0.]
#  [0. 1. 0.]]
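Conceptually, hash bucketing is `bucket = hash(str(value)) % hash_bucket_size`. The sketch below uses crc32 as an illustrative stand-in hash; TF uses Fingerprint64 internally, so it reproduces the idea but not TF's exact bucket assignments.

```python
import zlib

# Conceptual sketch of hash bucketing; TF uses Fingerprint64,
# so actual bucket indices differ from this crc32 stand-in.
def hash_bucket(value, hash_bucket_size):
    s = str(value)  # integers are stringified first, as with dtype=tf.int32
    return zlib.crc32(s.encode("utf-8")) % hash_bucket_size

buckets = [hash_bucket(v, 3) for v in ['a', 'x', 'a', 'b']]
# identical inputs always land in the same bucket
assert buckets[0] == buckets[2]
print(buckets)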

4. Embedding after discretization: tf.feature_column.embedding_column

After the identity, vocabulary, or hash mappings above, indicator_column produces a one-hot encoding directly. Going one step further, embedding_column looks the one-hot result up in a randomly initialized embedding table to produce a dense vector. By default the embedding is trained along with the model, i.e. trainable=True. For multi-hot inputs, embedding_column supports the combiners mean, sqrtn, and sum.

1. Embedding after hashing

# hash first
one_categorical_feature = tf.feature_column.categorical_column_with_hash_bucket(
            'col1', hash_bucket_size=3, dtype=tf.string)
# then embed the hashed column
embedding_cols = tf.feature_column.embedding_column(one_categorical_feature, dimension=3)
# for comparison: one-hot the hashed column directly
one_hot_cols = tf.feature_column.indicator_column(one_categorical_feature)
# the key col1 in the features dict must match the key used by one_categorical_feature
features = {"col1": [['a'], ['x'], ['a'], ['ca']]}
# inspect both the embedding and the one-hot results
net = tf.feature_column.input_layer(features, embedding_cols)
net2 = tf.feature_column.input_layer(features, one_hot_cols)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(net.eval())
    print(net2.eval())
# [[-0.34977508  0.6984099   0.3818953 ]
#  [-0.15186837  0.8362309   0.03452474]
#  [-0.34977508  0.6984099   0.3818953 ]
#  [-0.15186837  0.8362309   0.03452474]]
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]]

2. Embedding after vocabulary one-hot

one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
            'col1', vocabulary_list=['a', 'x', 'ca', '我'], num_oov_buckets=2)
embedding_cols = tf.feature_column.embedding_column(one_categorical_feature, dimension=3)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)  

features = {'col1': [['a'], ['我'], ['a'], ['ca']]}
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
net2 = tf.feature_column.input_layer(features, embedding_cols)

with tf.Session() as sess:
    sess.run(lookup_ops.tables_initializer())
    sess.run(tf.global_variables_initializer())
    print(net.eval())
    print(net2.eval())

3. Embedding after integer identity one-hot

one_categorical_feature = tf.feature_column.categorical_column_with_identity("col1", 10)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
embedding_cols = tf.feature_column.embedding_column(one_categorical_feature, dimension=3)

features = {"col1": [[4], [1], [0]]}
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
net2 = tf.feature_column.input_layer(features, embedding_cols)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(net.eval())
    print(net2.eval())
# [[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
# [[-0.25737718  0.7426282  -0.12589012]
#  [ 0.4047859   1.0059739   0.38896823]
#  [ 0.07904201 -0.10592438 -0.10732928]]

Embedding a multi-hot input

one_categorical_feature = tf.feature_column.categorical_column_with_identity("col1", 10)
one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
embedding_cols = tf.feature_column.embedding_column(one_categorical_feature, dimension=3)  # the default combiner is mean

features = {"col1": [[4, 2], [1, 0], [0, 3]]}
net = tf.feature_column.input_layer(features, one_categorical_feature_show)
net2 = tf.feature_column.input_layer(features, embedding_cols)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(net.eval())
    print(net2.eval())
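For a multi-hot row, the mean combiner averages the embedding vectors of all IDs in the row. A minimal pure-Python sketch with a deterministic toy table (not TF's random initialization):

```python
# Conceptual sketch of embedding lookup with the "mean" combiner:
# each row's IDs index into the embedding table and the vectors are averaged.
def embed_mean(rows, table):
    out = []
    dim = len(table[0])
    for ids in rows:
        vecs = [table[i] for i in ids]
        out.append([sum(v[d] for v in vecs) / len(vecs) for d in range(dim)])
    return out

# toy 10 x 3 embedding table with deterministic values, purely illustrative
table = [[float(i), float(i) / 2, -float(i)] for i in range(10)]
print(embed_mean([[4, 2], [1, 0]], table))
```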

Representing continuous features

tf.feature_column has two interfaces for continuous variables: mapping continuous values directly into continuous features, and bucketizing continuous values into categorical features. The interfaces are:

tf.feature_column.numeric_column
tf.feature_column.bucketized_column


1. Mapping continuous values directly to continuous features: tf.feature_column.numeric_column

one_continuous_feature = tf.feature_column.numeric_column("col1")
feature = {"col1": [[1.], [5.]]}
net = tf.feature_column.input_layer(feature, one_continuous_feature)
with tf.Session() as sess:
    print(sess.run(net))
# [[1.]
#  [5.]]

normalizer_fn can normalize the continuous column; you supply a custom function:

one_continuous_feature = tf.feature_column.numeric_column("col1", normalizer_fn=lambda x: (x - 1.0) / 4.0)
feature = {"col1": [[1.], [5.]]}
net = tf.feature_column.input_layer(feature, one_continuous_feature)
with tf.Session() as sess:
    print(sess.run(net))
# [[0.]
#  [1.]]

2. Bucketizing continuous values into categorical features: tf.feature_column.bucketized_column

one_continuous_feature = tf.feature_column.numeric_column("col1")
# no tf.feature_column.indicator_column needed; bucketized_column maps directly to one-hot
bucket = tf.feature_column.bucketized_column(one_continuous_feature, boundaries=[3, 8])  # <3, [3, 8), >=8
features = {"col1": [[2], [7], [13]]}
net = tf.feature_column.input_layer(features, [bucket])
with tf.Session() as sess:
    print(net.eval())
# with 2 boundaries the one-hot has 3 columns
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
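The bucket index equals the number of boundaries less than or equal to the value, which in Python is bisect_right. A conceptual sketch:

```python
import bisect

# Conceptual sketch of bucketized_column: bucket index = number of
# boundaries <= value, i.e. bisect.bisect_right.
def bucketize(value, boundaries):
    return bisect.bisect_right(boundaries, value)

print([bucketize(v, [3, 8]) for v in (2, 7, 13)])  # [0, 1, 2]
```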

A bucketized continuous column can be followed by an embedding

one_continuous_feature = tf.feature_column.numeric_column("col1")
bucket = tf.feature_column.bucketized_column(one_continuous_feature, boundaries=[3, 8])  # <3, [3, 8), >=8
embedding_cols = tf.feature_column.embedding_column(bucket, dimension=3)

features = {"col1": [[2], [7], [13]]}
net = tf.feature_column.input_layer(features, [bucket])
net2 = tf.feature_column.input_layer(features, embedding_cols)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(net.eval())
    print(net2.eval())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
# [[ 0.16381341 -1.0849875  -1.0715116 ]
#  [ 0.02688654  0.4091867   0.4175648 ]
#  [ 0.61752146 -0.605965   -0.04298744]]    

Crossing categorical features

tf.feature_column.crossed_column crosses categorical features with each other, increasing the representational power of the model's features.

# sex: 0/1 for male/female; degree: 0/1/2 for education levels; crossing yields 2 * 3 = 6 combinations
feature_a = tf.feature_column.categorical_column_with_identity("sex", num_buckets=2)
feature_b = tf.feature_column.categorical_column_with_identity("degree", num_buckets=3)
# hash_bucket_size is required and must be an integer greater than 1
feature_cross = tf.feature_column.crossed_column([feature_a, feature_b], hash_bucket_size=6)
feature_cross_show = tf.feature_column.indicator_column(feature_cross)

features = {"sex": [[1], [1], [0]], "degree": [[1], [0], [2]]}

net = tf.feature_column.input_layer(features, feature_cross_show)
with tf.Session() as sess:
    print(net.eval())

# [[0. 0. 0. 1. 0. 0.]
#  [0. 0. 0. 0. 0. 1.]
#  [0. 0. 1. 0. 0. 0.]]
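Conceptually, crossed_column combines the values of the crossed features and hashes the combination into hash_bucket_size buckets. The sketch below uses crc32 and a hypothetical key format as stand-ins; TF's fingerprint-based crossing assigns different concrete buckets.

```python
import zlib

# Conceptual sketch of feature crossing: join the two values into one
# key and hash into hash_bucket_size buckets (TF's fingerprint differs).
def cross(sex, degree, hash_bucket_size=6):
    key = "{}_x_{}".format(sex, degree)  # illustrative key format
    return zlib.crc32(key.encode("utf-8")) % hash_bucket_size

rows = [(1, 1), (1, 0), (0, 2)]
buckets = [cross(s, d) for s, d in rows]
# buckets are in range, and equal input pairs always share a bucket
assert all(0 <= b < 6 for b in buckets)
print(buckets)
```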

Attempting to cross a continuous feature with a categorical feature fails with an error

feature_a = tf.feature_column.categorical_column_with_identity("sex", num_buckets=2)
feature_b = tf.feature_column.categorical_column_with_identity("degree", num_buckets=3)
feature_c = tf.feature_column.numeric_column("age")
feature_cross = tf.feature_column.crossed_column([feature_b, feature_c], hash_bucket_size=6)
feature_cross_show = tf.feature_column.indicator_column(feature_cross)

features = {"sex": [[1], [1], [0]], "degree": [[1], [0], [2]], "age": [[1], [2], [3]]}

net = tf.feature_column.input_layer(features, feature_cross_show)
with tf.Session() as sess:
    print(net.eval())
# Unsupported key type. All keys must be either string, or categorical column except HashedCategoricalColumn. Given: 
# NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)

Defining a mix of feature types

When defining multiple features, put every key in the features dict and pass the feature-column objects as a list. The order of columns inside that list does not affect the combined result: the concatenation order is determined by the string order of each column's name attribute (e.g. feature_a.name).

feature_a = tf.feature_column.numeric_column("col1")
feature_b = tf.feature_column.categorical_column_with_hash_bucket(
        "col2", hash_bucket_size=3)
feature_c = tf.feature_column.embedding_column(
        feature_b, dimension=3)
feature_d = tf.feature_column.indicator_column(feature_b)
print(feature_a.name)
print(feature_c.name)
print(feature_d.name)
features = {
        "col1": [[9], [10]],
        "col2": [["x"], ["yy"]]
        }
net = tf.feature_column.input_layer(features, [feature_d, feature_c, feature_a])  # the order of this list does not matter
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(net.eval())

# col1
# col2_embedding
# col2_indicator
# [[ 9.          0.3664888   0.81272614 -0.45234588  0.          0.     1.        ]
#  [10.          0.9593401  -0.18347593 -0.06236586  0.          1.     0.        ]]
# col1 first, then col2_embedding, then col2_indicator: 7 feature columns in total
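The concatenation order can be reproduced by sorting the column names as strings, a conceptual sketch of what input_layer does:

```python
# Conceptual sketch: input_layer concatenates feature columns in the
# string order of their .name, not in the order of the list passed in.
cols = {
    "col2_indicator": [0.0, 0.0, 1.0],
    "col1": [9.0],
    "col2_embedding": [0.1, 0.2, 0.3],
}
order = sorted(cols)  # ['col1', 'col2_embedding', 'col2_indicator']
row = []
for name in order:
    row += cols[name]
print(row)  # col1 values first, then the embedding, then the indicator
```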