3_feature_columns

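This article covers feature columns. Feature columns are the bridge between raw data and an Estimator. They form a rich toolkit that converts many kinds of raw data into a format an Estimator can use. A deep neural network can only operate on numbers, but raw input in practice is often not numeric: it may be categorical or some other non-numeric data. To let a deep neural network accept and process such data, use the tf.feature_column module to create feature columns the model can consume. This article introduces the nine functions shown in the figure below: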

feature_columns

1. Numeric column

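tf.feature_column.numeric_column handles raw data that is real-valued (tf.float32 by default). The model can use such feature values directly, without any further conversion. It is called as follows: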

numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength")

The call above uses tf.float32 as the data type by default. You can specify another data type with dtype, as shown below:

numeric_feature_column = tf.feature_column.numeric_column(
    key="SepalLength",
    dtype=tf.float64)

By default, a numeric column creates a single value (a scalar). Use the shape argument to specify another shape, as shown below:

# Represent a 10-element vector in which each cell contains a tf.float32.
vector_feature_column = tf.feature_column.numeric_column(
    key="Bowling",
    shape=10,
)
# Represent a 10x5 matrix in which each cell contains a tf.float32.
matrix_feature_column = tf.feature_column.numeric_column(
    key="MyMatrix",
    shape=[10, 5],
)

2. Bucketized column

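When you need to split a numeric value into categories based on ranges, create a bucketized column with tf.feature_column.bucketized_column. Take a person's age as an example: instead of using the actual age as the feature value, we divide ages into the following buckets:

Age range	Feature representation
< 18	0
>= 18 and < 30	1
>= 30 and < 45	2
>= 45 and < 60	3
>= 60	4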

For example:

# First, convert the raw data into a numeric column.
numeric_feature_column = tf.feature_column.numeric_column("age")
# Then, convert the numeric column into a bucketized column.
bucketized_feature_column = tf.feature_column.bucketized_column(
    source_column = numeric_feature_column,
    boundaries = [18, 30, 45, 60]
)

Note that splitting the values into 5 buckets requires only 4 boundaries.
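The relationship between boundaries and buckets can be sketched in plain Python. This is only an illustration of the idea, not TensorFlow's implementation; the helper name `bucketize` and the sample ages are made up for this example, and the half-open intervals follow the age table above.

```python
from bisect import bisect_right

# The same four boundaries as in the age example above.
AGE_BOUNDARIES = [18, 30, 45, 60]


def bucketize(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    With n sorted boundaries there are n + 1 buckets:
    (-inf, b0), [b0, b1), ..., [b(n-1), +inf).
    """
    return bisect_right(boundaries, value)


# Hypothetical sample ages, chosen to hit every bucket and every boundary.
ages = [5, 18, 29, 30, 44, 59, 60, 80]
bucket_ids = [bucketize(age, AGE_BOUNDARIES) for age in ages]
num_buckets = len(AGE_BOUNDARIES) + 1

for age, bucket_id in zip(ages, bucket_ids):
    print(age, "->", bucket_id)
```

A value equal to a boundary lands in the bucket to its right, which is why age 18 maps to bucket 1 rather than bucket 0.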

3. Categorical identity column

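tf.feature_column.categorical_column_with_identity implements a categorical identity column. It can be viewed as a special case of a bucketized column: in a bucketized column each bucket represents a range of raw values, whereas in a categorical identity column each bucket represents a single, unique integer. The column uses the integer input itself as the category ID, so inputs must lie in the range [0, num_buckets). For example, with num_buckets = 4 the integers 0, 1, 2, and 3 are represented as shown in the table below. Arbitrary integers such as 2, 4, 6, 7 would first have to be mapped into that range.

Raw value	Feature representation
0	0
1	1
2	2
3	3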

Sample code:

identity_feature_column = tf.feature_column.categorical_column_with_identity(
    key = "my_feature",
    num_buckets = 4,
)
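As a rough sketch of what "identity" means here, the snippet below assumes (as the TensorFlow documentation describes) that the integer input is used directly as the category ID and must fall in [0, num_buckets). The function name `identity_category` is hypothetical, and raising an error for out-of-range input is a simplification chosen for this example.

```python
def identity_category(value, num_buckets):
    """Use the integer input itself as the category ID.

    Raising on out-of-range input is a simplification for illustration.
    """
    if not 0 <= value < num_buckets:
        raise ValueError(f"{value} is outside [0, {num_buckets})")
    return value


category_ids = [identity_category(v, num_buckets=4) for v in [0, 1, 2, 3]]

# An input such as 7 does not fit into 4 buckets.
try:
    identity_category(7, num_buckets=4)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```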

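4. Categorical vocabulary column

Sometimes the raw input of a feature is a string. The model cannot process strings directly, so each string must be mapped to a number or categorical value. A categorical vocabulary column maps each string to an integer category ID, which can then be represented as a one-hot vector, as shown in the table below:

Raw string	Feature representation
kitchenware	0
ele	1
sports	2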

There are two main interfaces for this:

tf.feature_column.categorical_column_with_vocabulary_list
tf.feature_column.categorical_column_with_vocabulary_file

Sample code:

vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key = "my_feature",
    vocabulary_list = ["kitchenware", "ele", "sports"]
)

When the vocabulary is very long, vocabulary_list becomes a very long list. In that case, use tf.feature_column.categorical_column_with_vocabulary_file instead, and TensorFlow reads the vocabulary from a separate file:

vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_file(
    key = "my_feature",
    vocabulary_file = "vocabulary_file.txt",
    vocabulary_size = 3,
)

In vocabulary_file.txt each word occupies one line, for 3 lines in total:

kitchenware
ele
sports
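The lookup that a vocabulary column performs can be approximated in plain Python. This is a sketch under assumptions: the helper names are made up, the file is simulated with an in-memory string so the example is self-contained, and returning -1 for a word outside the vocabulary is a choice made here for illustration.

```python
import io

# Stand-in for the contents of vocabulary_file.txt (one word per line).
VOCABULARY_TEXT = "kitchenware\nele\nsports\n"


def load_vocabulary(handle):
    """Map each non-empty line to its zero-based position in the file."""
    words = [line.strip() for line in handle if line.strip()]
    return {word: index for index, word in enumerate(words)}


def lookup(word, vocabulary, default_value=-1):
    """Return the category ID for `word`, or `default_value` if unknown."""
    return vocabulary.get(word, default_value)


vocabulary = load_vocabulary(io.StringIO(VOCABULARY_TEXT))
ids = [lookup(w, vocabulary) for w in ["sports", "kitchenware", "ele", "toys"]]
```

With a real file, `io.StringIO(VOCABULARY_TEXT)` would be replaced by `open("vocabulary_file.txt", encoding="utf-8")`.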

5. Hashed Column

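The columns above handle a relatively small number of categories. When the number of categories is very large, it is impractical to give every vocabulary word its own one-hot slot. In that case, use tf.feature_column.categorical_column_with_hash_bucket to map the categories to a fixed number of integers. This approach inevitably maps some different categories to the same integer, although in practice this often has little negative effect on the model. The procedure is: first compute a hash of the raw value; then take that hash modulo hash_bucket_size, which maps the raw value into one of hash_bucket_size integer categories. This is illustrated in the figure below: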

categorical hash bucket

The code is as follows:

hash_feature_column = tf.feature_column.categorical_column_with_hash_bucket(
    key = "my_feature",
    hash_bucket_size = 100,
)
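The hash-then-modulo step can be sketched as follows. Note the assumptions: MD5 from the standard library is used here only because it is deterministic across runs, and it is not the hash function TensorFlow uses; the function name `hash_bucket` and the sample categories are invented for this example.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to an integer in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the modulo.
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


# Ten hypothetical categories squeezed into only five buckets.
categories = [f"category_{i}" for i in range(10)]
bucket_ids = [hash_bucket(c, hash_bucket_size=5) for c in categories]

# More categories than buckets, so some categories must share a bucket.
distinct_buckets = len(set(bucket_ids))
```

Because there are more categories than buckets, at least two categories are guaranteed to collide; the model then has to rely on other features to tell them apart.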

6. Crossed column

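tf.feature_column.crossed_column combines multiple features into a single feature. Suppose we want a model to estimate real estate prices in Atlanta, Georgia. Prices in this city vary greatly by location. For capturing how price depends on location, latitude and longitude as separate features are of little use; combining them into a single feature, however, pinpoints a location. Suppose we represent Atlanta as a 100x100 grid of rectangular blocks, and identify each of the 10,000 blocks by the cross of its latitude and longitude. With this feature cross, the model can learn pricing conditions for each individual block, which is a much stronger signal than latitude or longitude alone. The code is as follows: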

def make_dataset(latitude, longitude, labels):
    assert latitude.shape == longitude.shape == labels.shape

    features = {'latitude': latitude.flatten(),
                'longitude': longitude.flatten()}
    labels=labels.flatten()

    return tf.data.Dataset.from_tensor_slices((features, labels))

# Bucketize the latitude and longitude using the `edges`.
latitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('latitude'),
    list(atlanta.latitude.edges))

longitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('longitude'),
    list(atlanta.longitude.edges))

# Cross the bucketized columns, using 5000 hash bins.
crossed_lat_lon_fc = tf.feature_column.crossed_column(
    [latitude_bucket_fc, longitude_bucket_fc], 5000)

fc = [
    latitude_bucket_fc,
    longitude_bucket_fc,
    crossed_lat_lon_fc]

# Build and train the Estimator.
est = tf.estimator.LinearRegressor(fc, ...)

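A crossed column can combine any features except categorical_column_with_hash_bucket, because crossed_column already hashes its input. In short, it combines the original features, computes a hash of the combination, and then takes that hash modulo hash_bucket_size.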
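The combine, hash, and modulo sequence described above might look like this in plain Python. It is a sketch only: the way the two bucket indices are joined into one key, the use of MD5, and the names `cross` and `HASH_BUCKET_SIZE` are all assumptions for illustration rather than TensorFlow's actual scheme.

```python
import hashlib

HASH_BUCKET_SIZE = 5000  # same number of hash bins as in the example above


def cross(latitude_bucket, longitude_bucket, hash_bucket_size):
    """Combine two bucket indices into one crossed category ID."""
    # Step 1: combine the original features into a single key.
    key = f"{latitude_bucket}_X_{longitude_bucket}"
    # Step 2: hash the combined key.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Step 3: take the hash modulo the number of hash bins.
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


# Cross every cell of a 100x100 grid (10,000 cells) into 5,000 bins.
crossed_ids = {
    (lat, lon): cross(lat, lon, HASH_BUCKET_SIZE)
    for lat in range(100)
    for lon in range(100)
}
```

Since 10,000 grid cells are hashed into 5,000 bins, some cells necessarily share a crossed ID; a larger hash_bucket_size reduces such collisions at the cost of more model parameters.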

7. Indicator and embedding column

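Indicator columns and embedding columns do not process features directly. Instead, they take categorical columns as input.
An indicator column converts each category into a one-hot vector, as shown in the table below:

Category ID	One-hot representation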
0	[1, 0, 0, 0]
1	[0, 1, 0, 0]
2	[0, 0, 1, 0]
3	[0, 0, 0, 1]

Sample code:

categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key = "my_feature",
    vocabulary_list = ["kitchenware", "ele", "sports"]
)

indicator_column = tf.feature_column.indicator_column(categorical_column)
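The one-hot encoding from the table above can be reproduced with a few lines of plain Python. This is an illustrative sketch with a made-up helper name, not the way TensorFlow builds the tensor internally.

```python
def one_hot(category_id, num_categories):
    """Return a list of zeros with a single 1 at position `category_id`."""
    vector = [0] * num_categories
    vector[category_id] = 1
    return vector


# Reproduce the four rows of the table above.
rows = [one_hot(category_id, 4) for category_id in range(4)]
for category_id, row in enumerate(rows):
    print(category_id, row)
```

The length of each vector equals the number of categories, which is why this representation becomes impractical when the number of categories is very large.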

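When the input has millions or even hundreds of millions of categories, an indicator column is no longer suitable, and the model cannot handle it at all (the vectors no longer fit in memory). In that case, an embedding column solves the problem. An embedding column represents the high-dimensional one-hot vector as a lower-dimensional ordinary vector whose elements are not limited to 0 and 1 but can be any number. An embedding column has far fewer dimensions than the corresponding indicator column, and it can express the categories more richly.
Next, let's look at the difference between an embedding column and an indicator column. Suppose each input sample is a word taken from a limited vocabulary of only 81 words, and suppose the data set provides the following input words in 4 separate samples: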

dog
spoon
scissors
guitar

The figure below illustrates the processing flow of the embedding column and the indicator column:

embedding_vs_indicator

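The values in an embedding column are assigned during training. An embedding column can make the model more capable, because during training the model learns additional relationships between categories. Embeddings will be covered in more detail later. The dimension of an embedding column is chosen based on the number of original categories, and a common rule of thumb is: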

embedding_dimensions =  number_of_categories**0.25
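As a worked example of this rule of thumb, the 81-word vocabulary mentioned earlier gives 81**0.25 = 3 dimensions. The sketch below also builds a toy lookup table to show the shape of an embedding. Rounding to the nearest integer and the random initial values are assumptions made for this example; in a real model the values are learned during training.

```python
import random


def embedding_dimension(number_of_categories):
    """Rule of thumb: the fourth root of the number of categories.

    Rounding to the nearest integer is an assumption for this example.
    """
    return round(number_of_categories ** 0.25)


def make_embedding_table(number_of_categories, dimension, seed=0):
    """Create one vector per category, filled with random placeholder values."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(dimension)]
        for _ in range(number_of_categories)
    ]


dimension = embedding_dimension(81)
table = make_embedding_table(81, dimension)

# Looking up a category is just indexing into the table.
vector_for_category_5 = table[5]
```

Compared with an 81-element one-hot vector, each category is now described by only 3 numbers.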

Sample code:

categorical_column = ... # Create any categorical column

# Represent the categorical column as an embedding column.
# This means creating a dense, lower-dimensional vector for each category.
embedding_column = tf.feature_column.embedding_column(
    categorical_column=categorical_column,
    dimension=dimension_of_embedding_vector)
