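Graph Neural Networks: A Complete Walkthrough of the GCN Source Code (TensorFlow)

Summary: graph neural networks, GCN, scipy

This post analyzes the implementation of the GCN project that ranks first when searching for "gcn" on GitHub.

Quick start

After cloning the repository with git clone and making a few small changes for debugging, run train.py: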

root@ubuntu:/home/git/gcn/gcn# python train.py
Epoch: 0001 train_loss= 1.95334 train_acc= 0.10000 val_loss= 1.95048 val_acc= 0.16400 time= 0.68464
Epoch: 0002 train_loss= 1.94804 train_acc= 0.27857 val_loss= 1.94673 val_acc= 0.37000 time= 0.01166
...
Epoch: 0200 train_loss= 0.56547 train_acc= 0.97857 val_loss= 1.04744 val_acc= 0.78600 time= 0.01720
Optimization Finished!
Test set results: cost= 1.00715 accuracy= 0.81600 time= 0.00546

It runs. The local TensorFlow version is 1.14.0.

Data source analysis

train.py first defines the configurable parameters and then loads the data:

adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask = load_data(FLAGS.dataset)

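Look at the load_data method. The project ships three datasets, each made up of 8 data files identified by their file suffix. Taking the default cora dataset as an example, the files include:

x: scipy.sparse matrix, shape (140, 1433), the feature vectors of the 140 training nodes; a sparse format is used because each feature vector is a mostly-zero binary word-occurrence vector
y: numpy array, shape (140, 7), the one-hot labels of the 140 training nodes, with 7 classes
tx: scipy.sparse matrix, shape (1000, 1433), the feature vectors of the 1000 test nodes
ty: numpy array, shape (1000, 7), the one-hot labels of the 1000 test nodes
graph: the graph structure as a dict, where each key is a node and the value is its list of neighbors

The Cora dataset consists of machine learning papers. Each paper belongs to one of seven classes: case-based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, theory. Only papers that cite or are cited by at least one other paper are kept (so every node has a link), which leaves 2708 papers in the corpus. After stemming and stop-word removal, and after dropping all words with document frequency below 10, 1433 unique words remain; this is the feature dimension. The goal of running GCN on this data is to predict the class of each paper (node classification) from the citation links (the graph) and the binary word-occurrence matrix (the feature vectors).
Internally, load_data stacks all the data together and splits it into training, validation and test sets. The training indices are 0 to 139, the validation indices 140 to 639, and the test indices 1708 to 2707, as in the following code: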

    # Indices of the three splits within the full feature matrix
    idx_test = test_idx_range.tolist()
    idx_train = range(len(y))
    idx_val = range(len(y), len(y) + 500)

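Finally it returns the adjacency matrix of all nodes (built with nx.adjacency_matrix), the feature vectors of the 2708 nodes (a lil_matrix sparse matrix), the label matrices for training, validation and test (with the rows outside each split masked out), and the training, validation and test masks.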
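The split described above can be sketched in NumPy. This is a minimal illustration rather than the repository's code: make_mask is a hypothetical helper, and the test range 1708 to 2707 is taken from the index ranges quoted above instead of being read from the test.index file.

```python
import numpy as np

def make_mask(indices, size):
    """Return a boolean vector of length `size` that is True at `indices`."""
    mask = np.zeros(size, dtype=bool)
    mask[list(indices)] = True
    return mask

num_nodes = 2708
idx_train = range(140)            # nodes 0..139
idx_val = range(140, 140 + 500)   # nodes 140..639
idx_test = range(1708, 2708)      # nodes 1708..2707 (assumed contiguous here)

train_mask = make_mask(idx_train, num_nodes)
val_mask = make_mask(idx_val, num_nodes)
test_mask = make_mask(idx_test, num_nodes)

print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 140 500 1000
```

The three masks do not overlap, and the nodes between 640 and 1707 belong to none of them.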

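Node feature processing

Next comes the following code. The default model is gcn, and this block finishes processing the node feature vectors before the model is built:

# Row-normalize the sparse feature matrix, convert it to COO, and return (coords, values, shape)
features = preprocess_features(features)
if FLAGS.model == 'gcn':
    # Symmetric normalization: D^-0.5 * A * D^-0.5
    support = [preprocess_adj(adj)]
    num_supports = 1
    # Use GCN as the model
    model_func = GCN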
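To make the normalization in the comment concrete, here is a small dense NumPy sketch. It assumes, following the renormalization trick of the GCN paper, that self-loops are added before normalizing. The name normalize_adjacency is made up for this example, and the dense representation is used only for readability; the project works on sparse matrices.

```python
import numpy as np

def normalize_adjacency(adj):
    """Compute D^-1/2 (A + I) D^-1/2 for a dense adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops (assumption)
    degree = a_tilde.sum(axis=1)              # row sums = node degrees
    d_inv_sqrt = np.power(degree, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0    # guard against zero degrees
    # Scale row i by d_i^-1/2 and column j by d_j^-1/2.
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# A path graph 0 - 1 - 2
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
a_hat = normalize_adjacency(adj)
print(np.round(a_hat, 3))
```

The result is symmetric, and entry (i, j) equals 1 / sqrt(d_i * d_j) wherever nodes i and j are connected or i equals j.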

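Start with preprocess_features. Its purpose is to L1-normalize each row of the node feature matrix so that every row sums to 1. It does this by building a diagonal matrix from the reciprocals of the row sums and multiplying it with the feature matrix (the same idea as multiplying X by the inverse degree matrix to average over neighbors).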

def preprocess_features(features):
    """Row-normalize feature matrix and convert to tuple representation"""
    rowsum = np.array(features.sum(1))
    # reciprocal of each row sum
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0.
    r_mat_inv = sp.diags(r_inv)  # diagonal matrix, (2708, 2708)
    features = r_mat_inv.dot(features)  # left-multiply to normalize each row
    return sparse_to_tuple(features)
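A toy run of the same steps may help. The sketch below mirrors the logic of preprocess_features on a made-up 3 x 4 matrix and then extracts the (coords, values, shape) triple; the data and variable names are illustrative only.

```python
import numpy as np
import scipy.sparse as sp

# Toy word-occurrence features: 3 nodes, 4 words; the last node has no words.
features = sp.lil_matrix(np.array([[1., 1., 0., 0.],
                                   [0., 1., 1., 2.],
                                   [0., 0., 0., 0.]]))

rowsum = np.array(features.sum(1)).flatten()
with np.errstate(divide="ignore"):
    r_inv = np.power(rowsum, -1.0)   # 1 / row sum; inf for the empty row
r_inv[np.isinf(r_inv)] = 0.0         # leave empty rows as all zeros
normalized = sp.diags(r_inv).dot(features).tocoo()

# The (coords, values, shape) triple that a sparse placeholder expects.
coords = np.vstack((normalized.row, normalized.col)).transpose()
values = normalized.data
shape = normalized.shape
print(normalized.toarray())
```

Every non-empty row now sums to 1, and the empty row stays all zeros instead of turning into inf or nan.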

After normalization, sparse_to_tuple is called to convert the features into a tuple. Follow this function:

def sparse_to_tuple(sparse_mx):
    """Convert sparse matrix to tuple representation."""
    def to_tuple(mx):
        if not sp.isspmatrix_coo(mx):
            # Convert to a COO sparse matrix,
            # which exposes row/column coordinates and values
            mx = mx.tocoo()
        coords = np.vstack((mx.row, mx.col)).transpose()
        values = mx.data
        shape = mx.shape
        return coords, values, shape

    if isinstance(sparse_mx, list):
        for i in range(len(sparse_mx)):
            sparse_mx[i] = to_tuple(sparse_mx[i])
    else:
        sparse_mx = to_tuple(sparse_mx)

    return sparse_mx

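Jump to the line sparse_mx = to_tuple(sparse_mx) and then look at to_tuple. It converts features from csr_matrix to coo_matrix and returns the three components coords, values, shape (the coordinates of the stored entries, their values, and the matrix shape) as a tuple. Two matrices are involved here, the adjacency matrix and the node feature matrix. Both are sparse, so both are kept in scipy sparse formats. The adjacency matrix is a csr_matrix, which is convenient for computing the symmetric normalization. The feature matrix starts as a lil_matrix, which is convenient for row slicing, and ends up as a coo_matrix. The reason for the final conversion is that the matrix enters the model through a placeholder, and the sparse placeholder tf.sparse_placeholder takes (row/column indices, values, shape), which maps directly onto coo_matrix. As the placeholders dictionary in train.py shows, support (the normalized adjacency matrix) is also fed through a tf.sparse_placeholder.
With features processed, two more lines remain: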

num_supports = 1
model_func = GCN

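The second line simply selects the GCN class as the model. The first is hard-coded to 1 in GCN mode, so there is nothing more to say about it.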

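Differences between the scipy.sparse matrix formats

This section looks at the three sparse matrix formats used in the code: csr_matrix, lil_matrix and coo_matrix.

csr_matrix: Compressed Sparse Row matrix; commonly used for sparse matrix arithmetic, with efficient row slicing
lil_matrix: row-based list-of-lists sparse matrix; efficient for adding, removing and looking up elements, and also for row slicing
coo_matrix: coordinate-format matrix; converts quickly to and from other sparse formats, but does not support element access, item assignment or slicing, so once created it is usually converted to another format before further work

Starting with coo_matrix in practice: the values, the coordinates and the shape are enough to define a sparse matrix. Storing these pieces separately makes conversion to other formats straightforward, but slicing is not supported.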

import scipy.sparse as sp
data = [1, 1, 2]
row = [0, 1, 1]
col = [0, 1, 2]
matrix = sp.coo_matrix((data, (row, col)), shape=(3, 3))
matrix.todense()
# Output
matrix([[1, 0, 0],
        [0, 1, 2],
        [0, 0, 0]])

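The second is csr_matrix. data holds the non-zero values, and indices holds the column index of each non-zero value, in the same order. indptr is the cumulative count of non-zero values: its first element is 0, and the non-zero values of row i are data[indptr[i]:indptr[i+1]], so indptr[i+1] - indptr[i] is the number of non-zero values in row i. Together with data, this fully determines the sparse matrix.

import numpy as np
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
matrix = sp.csr_matrix((data, indices, indptr), shape=(3, 3))
matrix.todense()
# Output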
matrix([[1, 0, 2],
        [0, 0, 3],
        [4, 5, 6]])
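The role of indptr is easiest to see by decoding the three arrays by hand. The following pure-Python sketch (csr_to_dense is a hypothetical helper, not a scipy function) rebuilds the dense matrix from the arrays used in the example above.

```python
def csr_to_dense(data, indices, indptr, num_cols):
    """Expand CSR arrays into a dense list of rows (pure Python, for illustration)."""
    dense = []
    for i in range(len(indptr) - 1):
        row = [0] * num_cols
        # Non-zero entries of row i live in the half-open slice indptr[i]:indptr[i+1].
        for k in range(indptr[i], indptr[i + 1]):
            row[indices[k]] = data[k]
        dense.append(row)
    return dense

dense = csr_to_dense(data=[1, 2, 3, 4, 5, 6],
                     indices=[0, 2, 2, 0, 1, 2],
                     indptr=[0, 2, 3, 6],
                     num_cols=3)
print(dense)  # [[1, 0, 2], [0, 0, 3], [4, 5, 6]]
```

Because each row is a contiguous slice of data, row slicing is cheap in CSR, while inserting a new non-zero value forces the arrays behind it to shift.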

The third is lil_matrix. I did not find an example that constructs one directly, so look instead at how well it handles updating and adding data after slicing:

import scipy.sparse as sp
data = [1, 1, 2]
row = [0, 1, 1]
col = [0, 1, 2]
matrix = sp.coo_matrix((data, (row, col)), shape=(3, 3))
# Output
        [[1, 0, 0],
        [0, 1, 2],
        [0, 0, 0]]
# Convert to lil_matrix
matrix = matrix.tolil()
# Change one element: set row 0, column 2 to the value at row 1, column 1
matrix[0, 2] = matrix[1, 1] 
matrix.todense()
# Output
        [[1, 0, 1],
        [0, 1, 2],
        [0, 0, 0]]
# Update several rows at once
matrix[[0, 2]] = matrix[1]
matrix.todense()
# Output
        [[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]]

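Try whether the other sparse formats can perform the same update:

# Comment out the conversion to lil_matrix
# matrix = matrix.tolil()
matrix[0, 2] = matrix[1, 1]
TypeError: 'coo_matrix' object is not subscriptable

coo_matrix fails because it does not support indexing. csr_matrix can complete the same task, but compare its efficiency with lil_matrix: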

import time
t1 = time.time()
for i in range(1000):
    data = [1, 1, 2, 10, 100]
    row = [0, 1, 1, 4, 4]
    col = [0, 1, 2, 3, 4]
    matrix = sp.coo_matrix((data, (row, col)), shape=(5, 5))
    # matrix = matrix.tolil()
    matrix = matrix.tocsr()
    matrix[0, 2] = matrix[1, 1]
    matrix[[0, 2]] = matrix[1]
t2 = time.time()
print(t2 - t1)
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_intXint(row, col, x.flat[0])

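It takes longer than lil_matrix and also triggers a warning that changing the sparsity structure of a csr_matrix is expensive and that lil_matrix is more efficient. This is why the author uses the lil_matrix format when rearranging rows of the feature matrix.

The tf.sparse_placeholder sparse placeholder

Next, look at tf.sparse_placeholder and how it works together with coo_matrix.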

row = np.array([0, 0, 1, 3])  # row indices
col = np.array([0, 2, 1, 3])  # column indices
data = np.array([4, 9, 7, 5])  # values
tmp = sp.coo_matrix((data, (row, col)), shape=(4, 4))

x = tf.sparse_placeholder(tf.float32)  # dtype of the input
with tf.Session() as sess:
    indices = np.mat([tmp.tocoo().row, tmp.tocoo().col]).transpose()
    values = tmp.tocoo().data
    shape = tmp.tocoo().shape
    # feed_dict takes a triple (coordinates, non-zero values, shape)
    sp_ten = sess.run(x, feed_dict={x: (indices, values, shape)})
    print("-----------tf.sparse_placeholder output")
    print(sp_ten)
    dense_tensor = tf.sparse_tensor_to_dense(sp_ten)
    print("-----------tf.sparse_placeholder converted to dense")
    print(sess.run(dense_tensor))

-----------tf.sparse_placeholder output
SparseTensorValue(indices=array([[0, 0],
       [0, 2],
       [1, 1],
       [3, 3]]), values=array([4., 9., 7., 5.], dtype=float32), dense_shape=array([4, 4]))
-----------tf.sparse_placeholder converted to dense
[[4. 0. 9. 0.]
 [0. 7. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 5.]]

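The conclusion is that a coo_matrix converted to the triple format can be fed directly into tf.sparse_placeholder, and the author's code does exactly this. The feeding happens in this line: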

feed_dict_val = construct_feed_dict(features, support, labels, mask, placeholders)

Follow the construct_feed_dict function:

    feed_dict = dict()
    feed_dict.update({placeholders['labels']: labels})
    feed_dict.update({placeholders['labels_mask']: labels_mask})
    feed_dict.update({placeholders['features']: features})

Now look at placeholders['features'], which is defined at module level in train.py:

'features': tf.sparse_placeholder(tf.float32, shape=tf.constant(features[2], dtype=tf.int64))

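This connects tf.sparse_placeholder with coo_matrix.

Model construction

Once the data has been split and converted, model training begins. The first step is to define the placeholders: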

placeholders = {
    'support': [tf.sparse_placeholder(tf.float32) for _ in range(num_supports)],
    'features': tf.sparse_placeholder(tf.float32, shape=tf.constant(features[2], dtype=tf.int64)),
    'labels': tf.placeholder(tf.float32, shape=(None, y_train.shape[1])),
    'labels_mask': tf.placeholder(tf.int32),
    'dropout': tf.placeholder_with_default(0., shape=()),
    'num_features_nonzero': tf.placeholder(tf.int32)  # helper variable for sparse dropout
}

The author defines placeholders as a dictionary of key-value pairs. First look at how values are passed in further down:

feed_dict = construct_feed_dict(features, support, y_train, train_mask, placeholders)
outs = sess.run([model.opt_op, model.loss, model.accuracy], feed_dict=feed_dict)

The two lines above build feed_dict and run one training step. Look at construct_feed_dict:

feed_dict = dict()
feed_dict.update({placeholders['labels']: labels})
feed_dict.update({placeholders['labels_mask']: labels_mask})

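construct_feed_dict receives the placeholders dictionary defined in train.py. For each string key it looks up the corresponding value in placeholders (a TensorFlow tensor object), uses that tensor as the key of feed_dict, and stores the concrete data as the value. One key-value pair in feed_dict looks like this: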

{<tf.Tensor 'Placeholder_5:0' shape=(?, 7) dtype=float32>: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])}

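The usual pattern assigns each placeholder to a named variable and uses that variable as the feed_dict key. Here the author instead looks the tensor object up in the placeholders dictionary and uses it directly as the key. Reading from the same dictionary guarantees that the model and feed_dict refer to the same tensor object. A newly created tensor would not match, even if it were created by an identical statement. Taking dropout as an example, look at how it is used inside the model and how the value is fed in from outside: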

class GraphConvolution(Layer):
    """Graph convolution layer."""
    def __init__(self, input_dim, output_dim, placeholders, dropout=0.,
                 sparse_inputs=False, act=tf.nn.relu, bias=False,
                 featureless=False, **kwargs):
        super(GraphConvolution, self).__init__(**kwargs)

        if dropout:
            self.dropout = placeholders['dropout']
        else:
            self.dropout = 0.

Here the GCN convolution layer sets self.dropout to the tf.placeholder_with_default(0., shape=()) placeholder. The corresponding key-value pair in feed_dict is:

<tf.Tensor 'PlaceholderWithDefault:0' shape=() dtype=float32>: 0.5}

The tensor 'PlaceholderWithDefault:0' with shape=() is obtained through placeholders['dropout']; see this line:

feed_dict.update({placeholders['dropout']: FLAGS.dropout})

So placeholders['dropout'] is the same tensor object in both places, and it is this shared reference that serves as the feed_dict key.
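The identity requirement can be illustrated without TensorFlow. In the sketch below FakeTensor is a hypothetical stand-in for a placeholder; like any plain Python object, it is hashed and compared by identity, which is the property the feed_dict lookup relies on here.

```python
class FakeTensor:
    """Hypothetical stand-in for a placeholder; hashed by object identity."""
    def __init__(self, name):
        self.name = name

placeholders = {"dropout": FakeTensor("dropout")}

# The model keeps a reference to the object stored in the dictionary ...
model_dropout = placeholders["dropout"]

# ... and the training loop uses the very same object as the feed_dict key.
feed_dict = {placeholders["dropout"]: 0.5}
print(model_dropout in feed_dict)          # True: same object

# A new object built the same way is a different key.
print(FakeTensor("dropout") in feed_dict)  # False
```

Passing one placeholders dictionary to both the model and construct_feed_dict is what keeps the two sides pointing at the same objects.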
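Now go through the purpose of each placeholder. features and labels are self-explanatory; the remaining ones are:

support: the symmetrically normalized adjacency matrix, given as a list of sparse inputs. The list can hold several sparse matrices, with the count controlled by num_supports; in GCN num_supports is 1. In the model, support is multiplied with XW.
labels_mask: the mask over the labels, which removes the influence of labels outside the current split on loss and accuracy. During training train_mask is fed in; it holds 2708 booleans of which the first 140 are True, so the labels of all non-training nodes are masked out. It is used when computing loss and accuracy, as analyzed below.
dropout: dropout is not described separately in the GCN formulation. In the layers it is applied to the node feature matrix X, that is, to the input H of each layer.
num_features_nonzero: the number of non-zero values in the node feature matrix. The value fed in is features[1].shape, which is (49216,) here. It is a helper used to generate a mask that matches the stored entries of the sparse matrix, in combination with tf.sparse_retain, as analyzed below.

Next, the model is built: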

# Create model
model = model_func(placeholders, input_dim=features[2][1], logging=True)

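The model is instantiated with placeholders, input_dim and logging:

placeholders: the placeholder dictionary, so that the model can fetch each placeholder and assign it to the corresponding attribute
input_dim: the dimension of the node feature vectors, 1433 in this example; it sets the input dimension of the first weight matrix W
logging: a boolean switch that controls whether tf.summary.histogram records statistics during training

The build module

Next look at model_func. In the code above model_func is set to GCN. The GCN class inherits from Model and overrides its _loss, _accuracy, predict and _build methods. Start with the GCN constructor: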

    def __init__(self, placeholders, input_dim, **kwargs):
        super(GCN, self).__init__(**kwargs)

        self.inputs = placeholders['features']
        self.input_dim = input_dim
        # self.input_dim = self.inputs.get_shape().as_list()[1]  # To be supported in future Tensorflow versions
        self.output_dim = placeholders['labels'].get_shape().as_list()[1]
        self.placeholders = placeholders

        self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)

        self.build()

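This code gives the model access to all placeholders, binds the node feature matrix to inputs, sets the feature dimension input_dim and the output dimension output_dim, stores placeholders on the model (it contains every placeholder, including the adjacency matrix), and defines the optimizer. Finally it calls the base class's build method to construct all the internal graph nodes of the GCN. Look at build in the base class: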

    def _build(self):
        # Not implemented in the base class; subclasses must override it, otherwise NotImplementedError is raised
        raise NotImplementedError

    def build(self):
        """ Wrapper for _build() """
        with tf.variable_scope(self.name):
            # Layers are defined by the subclass
            self._build()

        # Build sequential layer model
        self.activations.append(self.inputs)  # placeholders['features']
        # Iterate over the layers defined by the subclass
        for layer in self.layers:
            hidden = layer(self.activations[-1])  # GraphConvolution takes the previous layer's output as input
            self.activations.append(hidden)  # store the output returned by _call
        self.outputs = self.activations[-1]  # output of the last layer

        # Store model variables for easy access
        variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
        self.vars = {var.name: var for var in variables}

        # Build metrics
        # loss is defined by the subclass
        self._loss()
        # accuracy is defined by the subclass
        self._accuracy()

        self.opt_op = self.optimizer.minimize(self.loss)

build in the base class first calls _build, which is overridden in the subclass. Look at the subclass's _build:

    def _build(self):

        self.layers.append(GraphConvolution(input_dim=self.input_dim,  # 1433
                                            output_dim=FLAGS.hidden1,  # 16
                                            placeholders=self.placeholders,
                                            act=tf.nn.relu,
                                            dropout=True,
                                            sparse_inputs=True,
                                            logging=self.logging))

        self.layers.append(GraphConvolution(input_dim=FLAGS.hidden1,  # 16
                                            output_dim=self.output_dim,  # 7
                                            placeholders=self.placeholders,
                                            act=lambda x: x,  # no activation function
                                            dropout=True,
                                            logging=self.logging))

The subclass's _build does the heavy lifting: it defines two GraphConvolution layer objects. self.layers is initialized as an empty list in the base class:

self.layers = []

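So _build fills the base class's empty layers list with two graph convolution layers; the author's model is a two-layer GCN. Continue with build in the base class:

        # Build sequential layer model
        self.activations.append(self.inputs)  # placeholders['features']
        # Iterate over the layers defined by the subclass
        for layer in self.layers:
            hidden = layer(self.activations[-1])  # GraphConvolution takes the previous layer's output as input
            self.activations.append(hidden)  # store the output returned by _call
        self.outputs = self.activations[-1]  # output of the last layer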

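activations holds the node feature matrix at each layer. The first line appends the raw node features to activations as the first entry, which is X. The loop then iterates over layers, each of which is a GraphConvolution object. Passing self.activations[-1] (the previous layer's node features) to the object invokes its __call__ method, which in turn runs GraphConvolution._call. Take a quick look at the base class Layer: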

    def __call__(self, inputs):
        with tf.name_scope(self.name):
            if self.logging and not self.sparse_inputs:
                tf.summary.histogram(self.name + '/inputs', inputs)
            outputs = self._call(inputs)
            if self.logging:
                tf.summary.histogram(self.name + '/outputs', outputs)
            return outputs

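__call__ makes an instance callable like a function: calling the instance runs the body of __call__. Inside it the author calls _call, so hidden = layer(self.activations[-1]) computes the node feature matrix of the current layer and appends it to activations for the next layer to use. The final node output is the last element of activations, which is assigned to outputs.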
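The same pattern can be reproduced in a few lines of plain Python. Scale is a made-up layer used only to show the call chain; the real layers operate on tensors, not integers.

```python
class Layer:
    """Base class: __call__ wraps the subclass-specific _call."""
    def _call(self, inputs):
        return inputs

    def __call__(self, inputs):
        # Hooks such as logging would go around this call.
        return self._call(inputs)

class Scale(Layer):
    """Hypothetical layer that multiplies its input by a constant."""
    def __init__(self, factor):
        self.factor = factor

    def _call(self, inputs):
        return inputs * self.factor

layers = [Scale(2), Scale(5)]
activations = [3]                      # plays the role of self.inputs
for layer in layers:
    hidden = layer(activations[-1])    # invokes Layer.__call__ -> Scale._call
    activations.append(hidden)
outputs = activations[-1]
print(activations, outputs)            # [3, 6, 30] 30
```

Keeping every intermediate result in activations, rather than only the last one, makes each layer's output available for inspection later.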
Next the code collects all the graph variables, which are used later when saving and loading model ckpt files:

        # Store model variables for easy access
        variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
        self.vars = {var.name: var for var in variables}

Print self.vars to see which variables the network actually trains:

{'gcn/graphconvolution_1_vars/weights_0:0': 
<tf.Variable 'gcn/graphconvolution_1_vars/weights_0:0' shape=(1433, 16) dtype=float32_ref>, 
'gcn/graphconvolution_2_vars/weights_0:0': 
<tf.Variable 'gcn/graphconvolution_2_vars/weights_0:0' shape=(16, 7) dtype=float32_ref>}

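The only parameters are the W matrices of the two convolution layers, with shapes (1433, 16) and (16, 7). There is no fully connected layer: the last convolution layer already has dimension 7, matching the labels, so softmax can be applied directly.

The loss module

Next the loss is defined: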

        self._loss()
        self._accuracy()
        self.opt_op = self.optimizer.minimize(self.loss)

_loss is overridden in the subclass:

    def _loss(self):
        # Weight decay loss
        for var in self.layers[0].vars.values():
            # L2 loss on the weights W
            self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)

        # Cross entropy error
        self.loss += masked_softmax_cross_entropy(self.outputs, self.placeholders['labels'],
                                                  self.placeholders['labels_mask'])

Look at self.layers[0].vars. The entries of layers are GraphConvolution objects, which inherit self.vars = {} from the base class Layer. This dictionary is filled when GraphConvolution is initialized:

        with tf.variable_scope(self.name + '_vars'):
            for i in range(len(self.support)):
                # create W, 1433 × 16 for the first layer
                self.vars['weights_' + str(i)] = glorot([input_dim, output_dim],
                                                        name='weights_' + str(i))
            if self.bias:
                # the formula D^-0.5 A D^-0.5 X W has no bias term
                self.vars['bias'] = zeros([output_dim], name='bias')

Since len(self.support) is 1, vars gets a single weight entry, weights_0, which is the tensor returned by glorot([input_dim, output_dim],name='weights_' + str(i)). Follow glorot:

def glorot(shape, name=None):
    """Glorot & Bengio (AISTATS 2010) init."""
    init_range = np.sqrt(6.0/(shape[0]+shape[1]))
    initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32)
    return tf.Variable(initial, name=name)
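As a quick worked example, the range of the uniform distribution for the two layer shapes in this model can be computed directly. glorot_range is a hypothetical helper that reproduces only the init_range formula from the function above.

```python
import math

def glorot_range(fan_in, fan_out):
    """Half-width of the uniform interval used by Glorot initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))

print(round(glorot_range(1433, 16), 4))  # first layer, 1433 -> 16
print(round(glorot_range(16, 7), 4))     # second layer, 16 -> 7
```

The wider first layer gets a much narrower interval than the second, which is the point of scaling the range by fan-in plus fan-out.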

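In short this is Glorot initialization, with shape (1433, 16) for the first layer and (16, 7) for the second. If bias is enabled, zero-initialized biases of shape [16] and [7] are added as well. Now look at the name scope: the block above opens with tf.variable_scope(self.name + '_vars'), where self.name is set in the constructor of the base class Layer: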

        if not name:
            layer = self.__class__.__name__.lower()
            name = layer + '_' + str(get_layer_uid(layer))

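self.__class__.__name__.lower() returns the same string (the lowercased class name) for every instance, so the author appends a numeric suffix to the name (graphconvolution). A global dictionary counts how many times each name has been used, and the count becomes the suffix: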

def get_layer_uid(layer_name=''):
    """Helper function, assigns unique layer IDs."""
    if layer_name not in _LAYER_UIDS:
        _LAYER_UIDS[layer_name] = 1
        return 1
    else:
        _LAYER_UIDS[layer_name] += 1
        return _LAYER_UIDS[layer_name]
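A small TensorFlow-free sketch shows the resulting names. The GraphConvolution class here is a stub containing only the naming logic, and get_layer_uid is rewritten compactly with dict.get; both are simplifications for illustration.

```python
_LAYER_UIDS = {}

def get_layer_uid(layer_name=''):
    """Return 1 for the first instance of a name, 2 for the second, and so on."""
    _LAYER_UIDS[layer_name] = _LAYER_UIDS.get(layer_name, 0) + 1
    return _LAYER_UIDS[layer_name]

class GraphConvolution:
    """Stub with only the naming logic; no TensorFlow involved."""
    def __init__(self, name=None):
        if not name:
            layer = self.__class__.__name__.lower()
            name = layer + '_' + str(get_layer_uid(layer))
        self.name = name

names = [GraphConvolution().name for _ in range(2)]
print(names)  # ['graphconvolution_1', 'graphconvolution_2']
```

Because the counter is global, the suffix depends on how many layers of that class have been created in the process so far, not on the position of the layer within a model.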

Combine this with the name scope in the base class Model:

    def build(self):
        """ Wrapper for _build() """
        with tf.variable_scope(self.name):
            # Layers are defined by the subclass
            self._build()

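With the two nested scopes, the final variable names are gcn/graphconvolution_1_vars/weights_0:0 and gcn/graphconvolution_2_vars/weights_0:0. Back to the loss: the author adds an L2 penalty on the weights, self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var). Note that the loop runs over self.layers[0].vars, so only the W of the first layer is regularized. Then comes the main loss term, the cross entropy between the outputs and the labels: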

 self.loss += masked_softmax_cross_entropy(self.outputs, self.placeholders['labels'],
                                                  self.placeholders['labels_mask'])

Follow masked_softmax_cross_entropy:

def masked_softmax_cross_entropy(preds, labels, mask):
    """Softmax cross-entropy loss with masking."""
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=preds, labels=labels)
    mask = tf.cast(mask, dtype=tf.float32)
    mask /= tf.reduce_mean(mask)
    loss *= mask
    return tf.reduce_mean(loss)

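The author first uses tf.nn.softmax_cross_entropy_with_logits to compute the softmax cross entropy of every row: the output of the second convolution layer, of shape (2708, 7), is passed through softmax and compared with the (2708, 7) label matrix. The rows that do not belong to the training set are then masked so that they do not contribute to the loss. placeholders['labels_mask'] (train_mask during training) is cast from [True, True, ..., False] to [1, 1, 1, ..., 0], where the first 140 elements are 1 and belong to the training set. The final tf.reduce_mean averages over all 2708 rows, masked ones included, so mask /= tf.reduce_mean(mask) scales the kept entries up to compensate. The mask becomes [19.34, 19.34, 19.34, ..., 0]: masked entries stay 0 and kept entries are divided by 140/2708. Each row's cross entropy is multiplied by its mask value, and the mean of the result equals the mean loss over the 140 training nodes. This completes the loss module.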
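The rescaling trick is easy to verify in NumPy. The sketch below uses made-up per-node losses for 5 nodes, of which only the first 2 are training nodes; masked_mean is a hypothetical name for the mask arithmetic shown above.

```python
import numpy as np

def masked_mean(per_node_values, mask):
    """NumPy version of the mask / mean(mask) trick."""
    mask = mask.astype(np.float32)
    mask = mask / mask.mean()          # kept entries become N / num_kept
    return (per_node_values * mask).mean()

# Hypothetical per-node losses; the large values belong to masked-out nodes.
loss = np.array([0.2, 0.4, 9.0, 9.0, 9.0], dtype=np.float32)
mask = np.array([True, True, False, False, False])

print(masked_mean(loss, mask))   # close to 0.3
print(loss[mask].mean())         # the same value, computed directly
```

Dividing by mean(mask) and then averaging over all N rows is equivalent to summing the kept rows and dividing by the number of kept rows, so the large losses of the masked nodes have no effect.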

The accuracy module

Next look at _accuracy as overridden in the subclass:

    def _accuracy(self):
        self.accuracy = masked_accuracy(self.outputs, self.placeholders['labels'],
                                        self.placeholders['labels_mask'])

The structure is the same as masked_softmax_cross_entropy:

def masked_accuracy(preds, labels, mask):
    """Accuracy with masking."""
    correct_prediction = tf.equal(tf.argmax(preds, 1), tf.argmax(labels, 1))
    accuracy_all = tf.cast(correct_prediction, tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    mask /= tf.reduce_mean(mask)
    accuracy_all *= mask
    return tf.reduce_mean(accuracy_all)

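preds and labels are passed in as arguments, so the function works for training as well as for validation and test. It first checks, row by row, whether the index of the largest value in preds (shape=(2708,7)) equals that in labels (shape=(2708,7)). In tf.argmax(preds, 1) the 1 is the axis, so the maximum is taken across the 7 classes of each row. The booleans are then cast to 1 and 0, mask is divided by 140/2708 (taking the training set as an example), and reduce_mean averages the product. The result is the prediction accuracy over the 140 labeled nodes.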
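A tiny NumPy example shows why the mask matters for accuracy too. The numbers are made up; the label rows outside the split are all zeros here, which is an assumption about how the masked label matrices look.

```python
import numpy as np

preds = np.array([[2.0, 0.1, 0.3],    # predicted class 0
                  [0.2, 1.5, 0.1],    # predicted class 1
                  [0.3, 0.2, 0.9],    # predicted class 2
                  [1.1, 0.4, 0.2]])   # predicted class 0
labels = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 0],         # rows outside the split are all zeros
                   [0, 0, 0]])
mask = np.array([True, True, False, False])

correct = (preds.argmax(axis=1) == labels.argmax(axis=1)).astype(np.float32)
weights = mask.astype(np.float32)
weights /= weights.mean()
accuracy = (correct * weights).mean()
print(accuracy)  # 0.5: one of the two masked-in nodes is correct
```

The last row would count as correct without the mask, because argmax of an all-zero label row is 0. Masking removes such spurious matches.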

The optimizer module

The optimizer module is a single line:

self.opt_op = self.optimizer.minimize(self.loss)

The optimizer is declared in the subclass and is Adam:

self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)

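The GCN convolution module

With the overall model clear, turn back to the convolution itself in the GraphConvolution class. The key method is _call; __call__ in the base class Layer calls _call to produce the output: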

    def _call(self, inputs):
        x = inputs

        # dropout on the input X
        if self.sparse_inputs:
            x = sparse_dropout(x, 1-self.dropout, self.num_features_nonzero)
        else:
            x = tf.nn.dropout(x, 1-self.dropout)

        # convolve
        supports = list()
        for i in range(len(self.support)):
            if not self.featureless:
                # X × W
                pre_sup = dot(x, self.vars['weights_' + str(i)],
                              sparse=self.sparse_inputs)
            else:
                pre_sup = self.vars['weights_' + str(i)]
            # symmetrically normalized A × (X × W)
            support = dot(self.support[i], pre_sup, sparse=True)
            supports.append(support)
        output = tf.add_n(supports)

        # bias
        if self.bias:
            output += self.vars['bias']

        return self.act(output)  # relu in the first layer, identity in the second

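The input of this function (that is, of the instantiated layer) is inputs, the node feature matrix of the current layer; for the first layer it is features (X). The function first checks whether the input is sparse: it is for the first layer and not for the second. The author then applies dropout to the input. For dense input, tf.nn.dropout is called directly. During training self.dropout is FLAGS.dropout, 0.5 by default, so keep_prob is 1 - 0.5 = 0.5: about half of the entries of the input matrix are set to 0 and the rest are divided by keep_prob, that is, multiplied by 2. The rescaling keeps the expected value of the output (and therefore of its sum) unchanged after dropout. Now look at the dropout implementation for sparse input: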

def sparse_dropout(x, keep_prob, noise_shape):
    """Dropout for sparse tensors."""
    random_tensor = keep_prob
    random_tensor += tf.random_uniform(noise_shape)  # 49216 random numbers in [0, 1)
    dropout_mask = tf.cast(tf.floor(random_tensor), dtype=tf.bool)
    pre_out = tf.sparse_retain(x, dropout_mask)
    return pre_out * (1./keep_prob)

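noise_shape is 49216, the number of stored values in the sparse matrix. The author adds 49216 uniform random numbers in [0, 1) to keep_prob and takes the floor, which yields 0 or 1 with the probability of 1 equal to keep_prob. The key step is tf.sparse_retain, which keeps the selected stored values of a sparse tensor and drops the others, so that they read as 0. Its input is again the triple (coordinates, values, shape). A quick test: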

import tensorflow as tf

a = [[0, 0], [1, 0], [2, 1], [3, 1]]
b = [1, 2, 3, 4]
shape = [4, 2]
c = tf.sparse_placeholder(tf.float32)
d = tf.sparse_retain(c, tf.convert_to_tensor([1, 0, 1, 1]))

with tf.Session() as sess:
    print(sess.run(c, feed_dict={c: (a, b, shape)}))
    print(sess.run(d, feed_dict={c: (a, b, shape)}))
    print(sess.run(tf.sparse_tensor_to_dense(d), feed_dict={c: (a, b, shape)}))

In the test code above, d is the sparse matrix c after applying the mask [True, False, True, True], which is what dropout does to it. The result:

SparseTensorValue(indices=array([[0, 0],
       [1, 0],
       [2, 1],
       [3, 1]]), values=array([1., 2., 3., 4.], dtype=float32), dense_shape=array([4, 2]))
SparseTensorValue(indices=array([[0, 0],
       [2, 1],
       [3, 1]]), values=array([1., 3., 4.], dtype=float32), dense_shape=array([4, 2]))
[[1. 0.]
 [0. 0.]
 [0. 3.]
 [0. 4.]]

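The value at the second position (False) is dropped, so it is 0 in the dense view. Note that the mask is indexed by stored values, not by rows of the input matrix; the TensorFlow documentation specifies a mask with one entry per stored value, so its length should match the number of values. Finally, pre_out * (1./keep_prob) scales up the remaining non-zero values, again to keep the expected output unchanged.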
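The floor trick and the rescaling can be checked numerically. This NumPy sketch works on the array of stored values only, which is an analogy to sparse_dropout rather than a reimplementation; the function name and the all-ones data are made up.

```python
import numpy as np

def sparse_dropout_values(values, keep_prob, rng):
    """Apply inverted dropout to the stored values of a sparse matrix."""
    # floor(keep_prob + U[0, 1)) is 1 with probability keep_prob, else 0.
    keep = np.floor(keep_prob + rng.uniform(size=values.shape)).astype(bool)
    # Dropped entries are removed; survivors are rescaled by 1 / keep_prob.
    return keep, values[keep] / keep_prob

rng = np.random.default_rng(0)
values = np.ones(100000, dtype=np.float32)
keep, kept_values = sparse_dropout_values(values, keep_prob=0.5, rng=rng)

print(keep.mean())                       # close to 0.5
print(kept_values.sum() / values.sum())  # close to 1.0
```

About half of the values survive and each survivor is doubled, so the total stays roughly the same, which is the expectation-preserving property described above.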
Continue with the convolution itself. Look at this line:

pre_sup = dot(x, self.vars['weights_' + str(i)],
                              sparse=self.sparse_inputs)

This line computes X*W. Look at the dot function:

def dot(x, y, sparse=False):
    """Wrapper for tf.matmul (sparse vs dense)."""
    if sparse:
        res = tf.sparse_tensor_dense_matmul(x, y)
    else:
        res = tf.matmul(x, y)
    return res

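It simply branches on the sparse flag (self.sparse_inputs here): sparse input goes through tf.sparse_tensor_dense_matmul and dense input through tf.matmul. The first argument of tf.sparse_tensor_dense_matmul is a sparse matrix and the second is a dense matrix. A quick test: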

a = [[0, 0], [1, 0], [1, 1], [2, 1], [3, 1]]
b = [1, 2, 2, 3, 4]
shape = [4, 2]
c = tf.sparse_placeholder(tf.float32)
d = tf.convert_to_tensor([[10.0, 1.0], [5.0, 2.0]])

with tf.Session() as sess:
    print(sess.run(tf.sparse_tensor_to_dense(c), feed_dict={c: (a, b, shape)}))
    print(sess.run(tf.sparse_tensor_dense_matmul(c, d), feed_dict={c: (a, b, shape)}))

The output is as follows; the sparse matrix multiplies the dense matrix as expected:

[[1. 0.]
 [2. 2.]
 [0. 3.]
 [0. 4.]]
[[10.  1.]
 [30.  6.]
 [15.  6.]
 [20.  8.]]

Continue with the GCN convolution computation:

            # symmetrically normalized A × (X × W)
            support = dot(self.support[i], pre_sup, sparse=True)

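In GCN the author fixes len(support)=1, so this step is simply the symmetrically normalized A multiplied by X × W. After the convolution comes the optional bias; if enabled, it is a zero-initialized vector whose length equals the output dimension of the layer: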

        if self.bias:
            output += self.vars['bias']

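Finally the activation function is applied and self.act(output) is returned. act is specified when the layer is instantiated: tf.nn.relu for the first layer and the identity for the second. This completes the model layers.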
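Putting the pieces together, the forward pass of the two-layer model can be sketched with dense NumPy arrays. This is an illustration under simplifying assumptions: no dropout, no bias, random weights, and an identity matrix standing in for the normalized adjacency; all names are made up for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(a_hat, x, w0, w1):
    """softmax(A_hat . relu(A_hat . X . W0) . W1), without dropout or bias."""
    h1 = np.maximum(a_hat @ (x @ w0), 0.0)   # layer 1: ReLU activation
    logits = a_hat @ (h1 @ w1)               # layer 2: identity activation
    return softmax(logits)

rng = np.random.default_rng(0)
num_nodes, num_features, hidden, num_classes = 5, 8, 4, 3
a_hat = np.eye(num_nodes)                    # stand-in for the normalized adjacency
x = rng.random((num_nodes, num_features))
w0 = rng.normal(size=(num_features, hidden))
w1 = rng.normal(size=(hidden, num_classes))

probs = gcn_forward(a_hat, x, w0, w1)
print(probs.shape, probs.sum(axis=1))
```

Computing X·W first and multiplying by the adjacency afterwards, as the layer does, keeps the intermediate result at num_nodes × hidden instead of num_nodes × num_features.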

Training the model

For training, look at this code as a whole:

# Train model
for epoch in range(FLAGS.epochs):

    t = time.time()
    # Construct feed dictionary
    # features: node feature matrix; support: symmetrically normalized A
    feed_dict = construct_feed_dict(features, support, y_train, train_mask, placeholders)
    feed_dict.update({placeholders['dropout']: FLAGS.dropout})

    # Training step
    outs = sess.run([model.opt_op, model.loss, model.accuracy], feed_dict=feed_dict)

    # Validation
    cost, acc, duration = evaluate(features, support, y_val, val_mask, placeholders)
    cost_val.append(cost)

    # Print results
    print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(outs[1]),
          "train_acc=", "{:.5f}".format(outs[2]), "val_loss=", "{:.5f}".format(cost),
          "val_acc=", "{:.5f}".format(acc), "time=", "{:.5f}".format(time.time() - t))

    # the latest validation loss is larger than the mean of the previous 10 epochs
    if epoch > FLAGS.early_stopping and cost_val[-1] > np.mean(cost_val[-(FLAGS.early_stopping+1):-1]):
        print("Early stopping...")
        break

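The model trains for epochs=200 by default and feeds the entire training data in every epoch. The line outs = sess.run([model.opt_op, model.loss, model.accuracy], feed_dict=feed_dict) returns the training loss and accuracy. After each training step the model is also validated with cost, acc, duration = evaluate(features, support, y_val, val_mask, placeholders). The validation set has 500 nodes, with indices 140 to 639, and the validation loss of every epoch is recorded: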

cost, acc, duration = evaluate(features, support, y_val, val_mask, placeholders)
cost_val.append(cost)

The following code prints the training and validation loss and accuracy of each epoch, along with the time taken:

# Print results
    print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(outs[1]),
          "train_acc=", "{:.5f}".format(outs[2]), "val_loss=", "{:.5f}".format(cost),
          "val_acc=", "{:.5f}".format(acc), "time=", "{:.5f}".format(time.time() - t))

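Finally there is early stopping: after more than 10 epochs, training stops as soon as the latest validation loss is larger than the mean of the previous 10: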

    if epoch > FLAGS.early_stopping and cost_val[-1] > np.mean(cost_val[-(FLAGS.early_stopping+1):-1]):
        print("Early stopping...")
        break
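The stopping rule can be isolated into a small function to see how it behaves. should_stop is a hypothetical helper that mirrors the condition above, with patience playing the role of FLAGS.early_stopping; the loss values are made up.

```python
def should_stop(epoch, cost_val, patience=10):
    """True when the latest validation loss exceeds the mean of the previous `patience` losses."""
    if epoch <= patience:
        return False
    previous = cost_val[-(patience + 1):-1]
    return cost_val[-1] > sum(previous) / len(previous)

# Hypothetical validation losses: decreasing for 12 epochs, then a jump.
cost_val = [1.0 - 0.05 * i for i in range(12)]
print(should_stop(epoch=11, cost_val=cost_val))          # False: still improving
print(should_stop(epoch=12, cost_val=cost_val + [0.9]))  # True: 0.9 is above the recent mean
```

Comparing against a window mean rather than the single previous value makes the rule tolerant of small fluctuations in the validation loss.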

Testing the model

# Testing
test_cost, test_acc, test_duration = evaluate(features, support, y_test, test_mask, placeholders)
print("Test set results:", "cost=", "{:.5f}".format(test_cost),
      "accuracy=", "{:.5f}".format(test_acc), "time=", "{:.5f}".format(test_duration))

The code has the same form as training and validation. Look at the evaluate function:

# Define model evaluation function
def evaluate(features, support, labels, mask, placeholders):
    t_test = time.time()
    feed_dict_val = construct_feed_dict(features, support, labels, mask, placeholders)
    outs_val = sess.run([model.loss, model.accuracy], feed_dict=feed_dict_val)
    return outs_val[0], outs_val[1], (time.time() - t_test)

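The key line is outs_val = sess.run([model.loss, model.accuracy], feed_dict=feed_dict_val): the session does not run the optimizer and only evaluates loss and accuracy. This completes the walkthrough of the GCN code.

Model prediction

The author does not include this part in train.py, but the model exposes a predict method. It takes no arguments and applies softmax to the model's outputs. With a little extra code we can look at the confusion matrix on the test set:

# Append the following code at the end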
feed_dict_val = construct_feed_dict(features, support, y_test, test_mask, placeholders)
outs_val = sess.run(model.predict(), feed_dict=feed_dict_val)
print("-----------test set predictions")
print(outs_val[1708:])
print("-----------test set labels")
print(y_test[1708:])
outs_val_index = np.argmax(outs_val[1708:], 1)
y_test_index = np.argmax(y_test[1708:], 1)

from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(y_test_index, outs_val_index))
sr = confusion_matrix(y_test_index, outs_val_index)
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
plt.matshow(sr, cmap=plt.cm.Greens)
plt.colorbar()
for i in range(len(sr)):
    for j in range(len(sr)):
        plt.annotate(sr[i, j], xy=(j, i), horizontalalignment='center', verticalalignment='center')
plt.ylabel('True')
plt.xlabel('Predict')
plt.show()

The classification report:

              precision    recall  f1-score   support

           0       0.66      0.77      0.71       130
           1       0.84      0.87      0.85        91
           2       0.88      0.90      0.89       144
           3       0.91      0.78      0.84       319
           4       0.79      0.86      0.82       149
           5       0.82      0.76      0.79       103
           6       0.68      0.81      0.74        64

    accuracy                           0.82      1000
   macro avg       0.80      0.82      0.81      1000
weighted avg       0.83      0.82      0.82      1000

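The confusion matrix is shown below. Overall accuracy is about 82%.

Figure: test set confusion matrix

Reflections on the code design

(1) Why does the data loading step reorder the test data

Even after reading all the code I was still puzzled by this: why does the author reorder the test rows separately (it looks like a shuffle, but the indices are actually sorted)? The mask indices can be in any order without affecting the test loss and accuracy, and masking does not need sorting. I commented out the two statements in load_data() that reorder the test rows of features and labels. The code still ran, and training and validation results were about the same, but the test results were very poor.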

Epoch: 0200 train_loss= 0.67370 train_acc= 0.96429 val_loss= 1.28556 val_acc= 0.73400 time= 0.01114
Test set results: cost= 2.18207 accuracy= 0.28500 time= 0.00725

I looked through the issues and found at least 3 people asking the same question: why shuffle the test set?

issue

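I did not fully understand it at first, but a later comment led me to the following guess. The problem is that the adjacency matrix and the node feature matrix are misaligned in the test portion. The reordering does not affect the loss and accuracy logic, but it does affect A*X: the adjacency matrix follows node index order exactly, while the test feature vectors are stored in the shuffled order recorded in ind.cora.test.index. The two must be brought into the same order, otherwise the matrix product pairs each node with the wrong features. Look at the nodes of the networkx graph built in load_data: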

nx.from_dict_of_lists(graph)
Out[55]: NodeView((0, 1, 2, 3, 4...2706, 2707))

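The nodes of the adjacency matrix are in strict index order, while the file ind.dataset_str.test.index separately records the node index of each test row and is shuffled. After stacking, the last 1000 rows of features therefore do not line up with the adjacency matrix.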
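A toy version of the realignment makes the effect visible. The sketch assumes the pattern of assigning the rows at the sorted positions to the positions given by the stored node ids; the 6-node data is made up, and each feature row simply holds the id of the node it describes.

```python
import numpy as np

# Toy setup: nodes 0..2 are stored in order, test nodes 3..5 in shuffled order.
test_idx_reorder = np.array([5, 3, 4])       # node id of each stored test row (file order)
test_idx_range = np.sort(test_idx_reorder)   # [3, 4, 5]

# Each row holds the id of the node it describes, so misplacement is easy to see.
features = np.array([[0], [1], [2], [5], [3], [4]])
print(features[:, 0])   # [0 1 2 5 3 4]: rows 3..5 do not match the adjacency order

# Move every stored test row to the row position given by its node id.
features[test_idx_reorder, :] = features[test_idx_range, :]
print(features[:, 0])   # [0 1 2 3 4 5]: row i now describes node i
```

Without this step, row 3 of the feature matrix would be multiplied against the neighbors of node 3 while actually describing node 5, which is consistent with the collapse in test accuracy observed above.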

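(2) Why mask y

The mask appears in the loss and accuracy computations, and the loss directly determines the direction of optimization. The mask is a consequence of the GCN model itself: training needs the adjacency matrix and the feature vectors of all nodes, and the graph convolution multiplies the adjacency matrix with the features over all nodes. Whether training, validating or testing, every node goes through the model, so the nodes outside the current split have to be masked when computing the loss; the same holds for validation and test. Put plainly, the training, validation and test sets cannot be decoupled, and decoupling them would make the model untrainable. This is a weakness of GCN. Compared with a conventional DNN, GCN training has the following drawbacks:

Transductive learning: the model does not extend to new graphs. Node representations and downstream applications are limited to the graph it was trained on, so every node to be predicted must already be present in the graph at training time. This greatly restricts engineering use cases.
Full-graph training: GCN cannot be trained in small mini-batches the way a DNN can. Every iteration multiplies the full adjacency matrix with the full feature matrix, so gradient updates are very inefficient.
Poor fit for large data: GCN needs the full adjacency matrix and feature matrix, and hardware limits may make it impossible to load the whole graph. The graph then has to be reduced by sampling, training on a graph of manageable size and extending to others.

(3) Why no fully connected layer after the convolutions

Other GCN classification diagrams I have seen also apply softmax directly to the output of the last GCN layer, so I will not dwell on it. I think a fully connected layer could be added.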