Facebook's Open-Source Graph Embedding System: PyTorch-BigGraph

Abstract

Graph embedding methods produce unsupervised node features from graphs that can then be used for a variety of machine learning tasks. Modern graphs, particularly in industrial applications, contain billions of nodes and trillions of edges, which exceeds the capability of existing embedding systems. We present PyTorch-BigGraph (PBG), an embedding system that incorporates several modifications to traditional multi-relation embedding systems that allow it to scale to graphs with billions of nodes and trillions of edges. PBG uses graph partitioning to train arbitrarily large embeddings on either a single machine or in a distributed environment. We demonstrate comparable performance with existing embedding systems on common benchmarks, while allowing for scaling to arbitrarily large graphs and parallelization on multiple machines. We train and evaluate embeddings on several large social network graphs as well as the full Freebase dataset, which contains over 100 million nodes and 2 billion edges.

1 Introduction

Graph structured data is a common input to a variety of machine learning tasks (Wu et al., 2017; Cook & Holder, 2006; Nickel et al., 2016a; Hamilton et al., 2017b). Working with graph data directly is difficult, so a common technique is to use graph embedding methods to create vector representations for each node so that distances between these vectors predict the occurrence of edges in the graph. Graph embeddings have been shown to serve as useful features for downstream tasks such as recommender systems in e-commerce (Wang et al., 2018), link prediction in social media (Perozzi et al., 2014), and predicting drug interactions and characterizing protein-protein networks (Zitnik & Leskovec, 2017).
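To make the idea of "distances between vectors predict edges" concrete, here is a minimal sketch. The node names, the hand-picked 2-dimensional embeddings, and the distance threshold are all assumptions made for illustration; real systems learn the embeddings from data and usually score edges with a learned function rather than a fixed cutoff.

```python
import math

# Hypothetical 2-dimensional embeddings for four nodes (values chosen by hand).
embeddings = {
    "alice": (0.9, 0.1),
    "bob": (0.8, 0.2),
    "carol": (0.1, 0.9),
    "dave": (0.2, 0.8),
}


def distance(u, v):
    """Euclidean distance between the embeddings of nodes u and v."""
    return math.dist(embeddings[u], embeddings[v])


def predict_edge(u, v, threshold=0.5):
    """Predict an edge when two nodes lie closer than the assumed threshold."""
    return distance(u, v) < threshold


print(predict_edge("alice", "bob"))    # nearby vectors: an edge is predicted
print(predict_edge("alice", "carol"))  # distant vectors: no edge is predicted
```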

Graph data is common at modern web companies and poses an extra challenge to standard embedding methods: scale. For example, the Facebook graph includes over two billion user nodes and over a trillion edges representing friendships, likes, posts and other connections (Ching et al., 2015). The graph of users and products at Alibaba also consists of more than one billion users and two billion items (Wang et al., 2018). At Pinterest, the user to item graph includes over 2 billion entities and over 17 billion edges (Ying et al., 2018). There are two main challenges for embedding graphs of this size. First, an embedding system must be fast enough to embed graphs with 10^11 to 10^12 edges in a reasonable time. Second, a model with two billion nodes and even 32 embedding parameters per node (expressed as floats) would require 800GB of memory just to store its parameters, thus many standard methods exceed the memory capacity of typical commodity servers.
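The memory estimate is plain arithmetic: number of nodes times parameters per node times bytes per parameter. The sketch below assumes 4-byte (32-bit) floats and decimal gigabytes; the function name is made up for this example. Under those assumptions, 32 parameters per node comes to 256 GB, and the 800 GB figure corresponds to about 100 parameters per node.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored as 32-bit floats


def embedding_table_bytes(num_nodes, dim, bytes_per_param=BYTES_PER_FLOAT32):
    """Bytes needed to hold one dim-sized embedding vector per node."""
    return num_nodes * dim * bytes_per_param


nodes = 2_000_000_000
gb_at_dim_32 = embedding_table_bytes(nodes, 32) / 1e9    # 256.0
gb_at_dim_100 = embedding_table_bytes(nodes, 100) / 1e9  # 800.0
print(gb_at_dim_32, gb_at_dim_100)
```

Either way, the table alone is far larger than the RAM of a typical commodity server, which is the point the paragraph makes.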

We present PyTorch-BigGraph (PBG), an embedding system that incorporates several modifications to standard models. (Translator's note: PyTorch is Facebook's open-source deep learning framework.) The contribution of PBG is to scale to graphs with billions of nodes and trillions of edges.

Important components of PBG are:

1. A block decomposition of the adjacency matrix into N buckets, training on the edges from one bucket at a time. PBG then either swaps embeddings from each partition to disk to reduce memory usage, or performs distributed execution across multiple machines.

2. A distributed execution model that leverages the block decomposition for the large parameter matrices, as well as a parameter server architecture for global parameters and feature embeddings for featurized nodes.

3. Efficient negative sampling for nodes that samples negative nodes both uniformly and from the data, and reuses negatives within a batch to reduce memory bandwidth.

4. Support for multi-entity, multi-relation graphs with per-relation configuration options such as edge weight and choice of relation operator.
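The first and third components can be sketched in a few lines. This is an illustration under stated assumptions, not PBG's implementation: the function names are hypothetical, nodes are assigned to partitions by a simple modulo, and "negatives from the data" is approximated by reusing the destination nodes already present in the batch.

```python
import random
from collections import defaultdict


def partition_of(node_id, num_partitions):
    """Assign a node to a partition; modulo is an arbitrary choice for this sketch."""
    return node_id % num_partitions


def bucket_edges(edges, num_partitions):
    """Group (source, destination) edges into buckets keyed by
    (source partition, destination partition)."""
    buckets = defaultdict(list)
    for src, dst in edges:
        key = (partition_of(src, num_partitions), partition_of(dst, num_partitions))
        buckets[key].append((src, dst))
    return dict(buckets)


def corrupt_batch(batch, num_nodes, num_uniform, rng):
    """Build one shared pool of negative destinations for a whole batch:
    the batch's own destinations (drawn "from the data") plus nodes drawn
    uniformly at random. Every edge in the batch reuses the same pool."""
    from_data = [dst for _, dst in batch]
    uniform = [rng.randrange(num_nodes) for _ in range(num_uniform)]
    pool = from_data + uniform
    # Drop candidates that would reproduce the positive edge itself.
    return {(src, dst): [n for n in pool if n != dst] for src, dst in batch}


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
buckets = bucket_edges(edges, num_partitions=2)
for key in sorted(buckets):
    # While training on one bucket, only the embeddings of the two
    # partitions named by its key need to be held in memory.
    negatives = corrupt_batch(buckets[key], num_nodes=4, num_uniform=2,
                              rng=random.Random(0))
    print(key, buckets[key], negatives)
```

Because the negative pool is shared, a batch of B edges touches on the order of B plus the pool size embedding rows rather than a fresh set of negatives per edge, which is where the memory-bandwidth saving comes from.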

We evaluate PBG on the Freebase, LiveJournal and YouTube graphs and show that it matches the performance of existing embedding systems.

We also report results on larger graphs. We construct an embedding of the full Freebase knowledge graph (121 million entities, 2.4 billion edges), which we release publicly with this paper. Partitioning of the Freebase graph reduces memory consumption by 88% with only a small degradation in the embedding quality, and distributed execution on 8 machines decreases training time by a factor of 4. We also perform experiments on a large Twitter graph showing similar results with near-linear scaling.

PBG is released as an open-source project at https://github.com/facebookresearch/PyTorch-BigGraph. PBG is implemented in PyTorch, with no other external dependencies or custom operators.
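As a rough picture of what per-relation options such as edge weight and relation operator could look like in practice, here is a sketch of a configuration written as a Python function returning a dict. The key names are assumptions modeled on the example configurations in the repository and should be checked against the project's documentation; the entity types, relation names, paths, and values are placeholders invented for this example.

```python
def get_torchbiggraph_config():
    """Return a configuration dict; all paths and values are placeholders."""
    return dict(
        # Input and output locations (hypothetical paths).
        entity_path="data/example",
        edge_paths=["data/example/edges"],
        checkpoint_path="model/example",
        # Two entity types; only the large one is split into partitions.
        entities={
            "user": {"num_partitions": 4},
            "item": {"num_partitions": 1},
        },
        # Each relation picks its own operator and edge weight.
        relations=[
            {"name": "follows", "lhs": "user", "rhs": "user",
             "operator": "translation", "weight": 1.0},
            {"name": "likes", "lhs": "user", "rhs": "item",
             "operator": "diagonal", "weight": 0.5},
        ],
        # Embedding size and training settings (illustrative values).
        dimension=100,
        num_epochs=10,
        num_uniform_negs=50,
        num_batch_negs=50,
    )


config = get_torchbiggraph_config()
print([r["name"] for r in config["relations"]])
```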

Last edited: 2019.07.23 14:08:54
