Abstrct/摘要
Graph embedding methods produce unsupervised node features from graphs that can then be used for a variety of machine learning tasks. Modern graphs, particularly in industrial applications, contain billions of nodes and trillions of edges, which exceeds the capability of existing embedding systems. We present PyTorch-BigGraph (PBG), an embedding system that incorporates several modifications to traditional multi-relation embedding systems that allow it to scale to graphs with billions of nodes and trillions of edges. PBG uses graph partitioning to train arbitrarily large embeddings on either a single machine or in a distributed environment. We demonstrate comparable performance with existing embedding systems on common benchmarks, while allowing for scaling to arbitrarily large graphs and parallelization on multiple machines. We train and evaluate embeddings on several large social network graphs as well as the full Freebase dataset, which contains over 100 million nodes and 2 billion edges.
图嵌入是一种从图中生成无监督节点特征(node features)的方法,生成的特征可以应用在各类机器学习任务上。现代的图网络,尤其是在工业应用中,通常会包含数十亿的节点(node)和数万亿的边(edge)。这已经超出了已知嵌入系统的处理能力。我们介绍了一种嵌入系统,PyTorch-BigGraph(PBG),系统对传统的多关系嵌入系统做了几处修改让系统能扩展到能处理数十亿节点和数万亿条边的图形。PBG使用了图形分区来支持任意大小的嵌入在单机或者分布式环境中训练。我们展示了在功能的基准上与现有嵌入系统的性能比较,同时PBG允许在多台机器上允许缩放到任意大小并且支持并行。我们在使用几个大型社交图网络作为完整Freebase数据集来训练和评估嵌入系统,数据集包含超过1亿个节点和20亿条边。
1 Introduction / 1 简介
Graph structured data is a common input to a variety of machine learning tasks (Wu et al., 2017; Cook & Holder,2006; Nickel et al., 2016a; Hamilton et al., 2017b). Working with graph data directly is difficult, so a common technique is to use graph embedding methods to create vector representations for each node so that distances between these vectors predict the occurrence of edges in the graph. Graph embeddings have been have been shown to serve as useful features for downstream tasks such as recommender systems in e-commerce (Wang et al., 2018), link prediction in social media (Perozzi et al., 2014), predicting drug interactions and characterizing protein-protein networks (Zitnik & Leskovec, 2017).
图结构数据是各种机器学习任务的一种常见输入,直接处理图结构数据是比较困难的,常用的技术是通过图嵌入方法为图中的每个节点创建向量化表示使得这些向量间的距离能预测图形中是否存在边。图嵌入已经被证明是下游任务中有意义的特征,如:电子商务中的推荐系统,社交媒体中的链接预测,药物相互作用、表征蛋白网络。
Graph data is common at modern web companies and poses an extra challenge to standard embedding methods: scale. For example, the Facebook graph includes over two billion user nodes and over a trillion edges representing friendships, likes, posts and other connections (Ching et al., 2015). The graph of users and products at Alibaba also consists of more than one billion users and two billion items (Wang et al.,2018). At Pinterest, the user to item graph includes over 2 billion entities and over 17 billion edges (Ying et al., 2018).There are two main challenges for embedding graphs of this size. First, an embedding system must be fast enough to embed graphs with 1011 − 1012 edges in a reasonable time. Second, a model with two billion nodes and even 32 embedding parameters per node expressed as floats) would require 800GB of memory just to store its parameters, thus many standard methods exceed the memory capacity of typical commodity servers.
图结构数据在现代网络公司非常常见,这给标准的嵌入方法提出了额外的调整:规模。例如:Facebook图中包含20亿个用户节点和超过1万亿条边,这些边代表朋友关系、喜好、帖子和其他链接。阿里巴巴的用户和产品图也包含10亿以上的用户和20亿以上的商品。在Pinterest,用户到项目的图包含20亿的实体和超过170亿边。对这样大小的图做嵌入主要有两个挑战,一是这个嵌入系统必须足够快,需要在一个合理的时间内完成10^11 - 10^12条边的嵌入;二是拥有20亿个节点,每个节点32个嵌入参数(浮点表示),大概需要800GB大小的内存来存储这些参数,因此许多标准方法超出了典型商用服务器的内存容量。
We present PyTorch-BigGraph (PBG), an embedding system that incorporates several modifications to standard models. The contribution of PBG is to scale to graphs with billions of nodes and trillions of edges.
我们介绍了Pytorch(译注:facebook开源的深度学习框架)Biggraph,一种基于标准模型做了若干改进的嵌入系统。PBG带来的改进是讲标准方案扩展到具有数十亿个节点和数万亿个边的图。
Important components of PBG are:
PBG的重要组成部分:
A block decomposition of the adjacency matrix intoN buckets, training on the edges from one bucket at a time. PBG then either swaps embeddings from each partition to disk to reduce memory usage, or performs distributed execution across multiple machines.
1、将邻接矩阵分块,分解为n个桶,每次从一个桶的边缘开始进行训练,然后PBG要么交换每个分区到磁盘的嵌入来减少内存的使用,要么跨机器进行分布式训练。
A distributed execution model that leverages the block decomposition for the large parameter matrices, as well as a parameter server architecture for global parameters and feature embeddings for featurized nodes.
2、分布式执行模型:对大参数矩阵进行快分解,以及用于特征化节点的全局参数和特征嵌入的参数服务架构
Efficient negative sampling for nodes that samples negative nodes both uniformly and from the data, and reuses negatives within a batch to reduce memory bandwidth.
3、高效的节点负采样:在数据中对负向节点进行均匀采样 并在批处理中重用负节点以减少内存用量
Support for multi-entity, multi-relation graphs with per-relation configuration options such as edge weight and choice of relation operator.
4、支持多实体、多关系图,支持边缘权重、关系运算符选择等关系配置选项
We evaluate PBG on the Freebase, LiveJournal and YouTube graphs and show that it matches the performance of existing embedding systems.
我们在Freebase、LiveJournal和YouTube图数据集上评估了PBG,表明了PBG能和当前已有的陷入系统性能能匹配。
We also report results on larger graphs. We construct an embedding of the full Freebase knowledge graph (121 million entities, 2.4 billion edges), which we release publicly with this paper. Partitioning of the Freebase graph reduces memory consumption by 88% with only a small degradation in the embedding quality, and distributed execution on 8 machines decreases training time by a factor of 4. We also perform experiments on a large Twitter graph showing similar results with near-linear scaling.
我们也给出了更大图数据上的结果,我们构造了一个完整的Freebase图(1.21亿个实体,24亿条边)的嵌入,并将其公开发布。Freebase图的划分减少了88%的内存消耗,嵌入向量的质量只有略微的下降,在8台机器上分布式执行,时间减少了4倍。另外我们再一个Twitter的大图数据上进行了试验,通过近似线性缩放得到了相似的结果。
PBG 作为一个开源项目发布在:https://github.com/facebookresearch/PyTorch-BigGraph.PBG是通过Pytorch实现的,没有其他外部依赖或者自动以运算符。