Taskonomy: Disentangling Task Transfer Learning
Abstract
Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a structure among visual tasks. Knowing this structure has notable values; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity.
We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure, including a solver that users can employ to devise efficient supervision policies for their use cases.
1. Introduction
Object recognition, depth estimation, edge detection, pose estimation, etc. are examples of common vision tasks deemed useful and tackled by the research community. Some of them have rather clear relationships: we understand that surface normals and depth are related (one is a derivative of the other), or that vanishing points in a room are useful for orientation. Other relationships are less clear: for example, how keypoint detection and the shading in a room can, together, be used to perform pose estimation.
Figure 1: A sample task structure discovered by the computational task taxonomy (taskonomy). It found that, for instance, by combining the learned features of a surface normal estimator and occlusion edge detector, good networks for reshading and point matching can be rapidly trained with little labeled data.
The field of computer vision has indeed gone far without explicitly using these relationships. We have made remarkable progress by developing advanced learning machinery (e.g. ConvNets) capable of finding complex mappings from X to Y when many pairs of (x, y) with x ∈ X, y ∈ Y are given as training data, i.e. fully supervised learning. But solving each task in isolation this way ignores their quantifiably useful relationships and leads to a massive labeled-data requirement. Alternatively, a model aware of the relationships among tasks demands less supervision, uses less computation, and behaves in more predictable ways. Incorporating such a structure is the first stepping stone towards developing provably efficient comprehensive/universal perception models [34, 4], i.e. ones that can solve a large set of tasks before becoming intractable in supervision or computation demands. However, this task space structure and its effects are still largely unknown. The relationships are non-trivial, and finding them is complicated by the fact that we have imperfect learning models and optimizers.

In this paper, we attempt to shed light on this underlying structure and present a framework for mapping the space of visual tasks. Here what we mean by “structure” is a collection of computationally found relations specifying which tasks supply useful information to another, and by how much (see Fig. 1).

We employ a fully computational approach for this purpose, with neural networks as the adopted computational function class. In a feedforward network, each layer successively forms more abstract representations of the input containing the information needed for mapping the input to the output. These representations, however, can transmit statistics useful for solving other outputs (tasks), presumably if the tasks are related in some form [83, 19, 58, 46]. This is the basis of our approach: we compute an affinity matrix among tasks based on whether the solution for one task can be sufficiently easily read out of the representation trained for another task. Such transfers are exhaustively sampled, and a Binary Integer Programming formulation extracts a globally efficient transfer policy from them. We show this model leads to solving tasks with far less data than learning them independently, and that the resulting structure holds on common datasets (ImageNet [78] and Places [104]).
Being fully computational and representation-based, the proposed approach avoids imposing prior (possibly incorrect) assumptions on the task space. This is crucial because the priors about task relations are often derived from either human intuition or analytical knowledge, while neural networks need not operate on the same principles [63, 33, 40, 45, 102, 88]. For instance, although we might expect depth to transfer to surface normals better (derivatives are easy), the opposite turns out to be the better direction in a computational framework (i.e. it suits neural networks better).
An interactive taxonomy solver which uses our model to suggest data-efficient curricula, a live demo, dataset, and code are available at http://taskonomy.vision/.
2. Related Work
Assertions of existence of a structure among tasks date back to the early years of modern computer science, e.g. with Turing arguing for using learning elements [95, 98] rather than the final outcome, or Jean Piaget’s works on developmental stages using previously learned stages as sources [74, 39, 38], and have extended to recent works [76, 73, 50, 18, 97, 61, 11, 66]. Here we make an attempt to actually find this structure. We acknowledge that this is related to a breadth of topics, e.g. compositional modeling [35, 10, 13, 23, 55, 92, 90], homomorphic cryptography [42], lifelong learning [93, 15, 85, 84], functional maps [71], certain aspects of Bayesian inference and Dirichlet processes [54, 91, 90, 89, 37, 39], few-shot learning [81, 25, 24, 70, 86], transfer learning [75, 84, 29, 64, 67, 59], and un/semi/self-supervised learning [22, 8, 17, 103, 19, 83], which are studied across various fields [73, 94, 12]. We review the topics most pertinent to vision within the constraints of space:

Self-supervised learning methods leverage the inherent relationships between tasks to learn a desired expensive one (e.g. object detection) via a cheap surrogate (e.g. colorization) [68, 72, 17, 103, 100, 69]. Specifically, they use a manually-entered local part of the structure in the task space (as the surrogate task is manually defined). In contrast, our approach models this large space of tasks in a computational manner and can discover obscure relationships.
Unsupervised learning is concerned with the redundancies in the input domain and leveraging them for forming compact representations, which are usually agnostic to the downstream task [8, 49, 20, 9, 32, 77]. Our approach is not unsupervised by definition as it is not agnostic to the tasks. Instead, it models the space tasks belong to and in a way utilizes the functional redundancies among tasks.
Meta-learning generally seeks performing the learning at a level higher than where conventional learning occurs, e.g. as employed in reinforcement learning [21, 31, 28], optimization [2, 82, 48], or certain architectural mechanisms [27, 30, 87, 65]. The motivation behind meta learning has similarities to ours and our outcome can be seen as a computational meta-structure of the space of tasks.
Multi-task learning targets developing systems that can provide multiple outputs for an input in one run [50, 18]. Multi-task learning has experienced recent progress and the reported advantages are another support for existence of a useful structure among tasks [93, 100, 50, 76, 73, 50, 18, 97, 61, 11, 66]. Unlike multi-task learning, we explicitly model the relations among tasks and extract a meta-structure. The large number of tasks we consider also makes developing one multi-task network for all infeasible.
Domain adaptation seeks to render a function that is developed on a certain domain applicable to another [44, 99, 5, 80, 52, 26, 36]. It often addresses a shift in the input domain, e.g. webcam images to D-SLR [47], while the task is kept the same. In contrast, our framework is concerned with the output (task) space, and hence can be viewed as task/output adaptation. We also perform the adaptation in a larger space among many elements, rather than two or a few.
In the context of our approach to modeling transfer learning across tasks: Learning Theoretic approaches may overlap with any of the above topics and usually focus on providing generalization guarantees. They vary in their approach: e.g. by modeling transferability with the transfer family required to map a hypothesis for one task onto a hypothesis for another [7], through information-based approaches [60], or through modeling inductive bias [6]. For these guarantees, learning theoretic approaches usually rely on intractable computations, or avoid such computations by restricting the model or task. Our method draws inspiration from theoretical approaches but eschews (for now) theoretical guarantees in order to use modern neural machinery.
Figure 2: Computational modeling of task relations and creating the taxonomy. From left to right: I. Train task-specific networks. II. Train (first order and higher) transfer functions among tasks in a latent space. III. Get normalized transfer affinities using AHP (Analytic Hierarchy Process). IV. Find global transfer taxonomy using BIP (Binary Integer Program).
3. Method
We define the problem as follows: we want to maximize the collective performance on a set of target tasks T = {t1, . . . , tn}, supported by a set of source tasks S, subject to a supervision budget γ (the maximum number of tasks we are willing to label and train from scratch). The taxonomy is built using a four-step process depicted in Fig. 2. In stage I, a task-specific network for each task in S is trained. In stage II, all feasible transfers between sources and targets are trained; we include higher-order transfers, which use multiple input tasks to transfer to one target. In stage III, the task affinities acquired from transfer function performances are normalized, and in stage IV, we synthesize a hypergraph which can predict the performance of any transfer policy and optimize for the optimal one.
Figure 3: Task Dictionary. Outputs of 24 (of 26) task-specific networks for a query image (top left). Results of applying the networks frame-wise on a video can be viewed on the project website.
A vision task is an abstraction read from a raw image. We denote a task t more formally as a function f_t that maps an image I to f_t(I).
Figure 4: Transfer Function. A small readout function is trained to map representations from the source task’s frozen encoder to the target task’s labels. If the order is > 1, the transfer function receives representations from multiple sources.
Task Dictionary: Our mapping of the task space is done via the 26 tasks included in the dictionary, so we ensure they cover common themes in computer vision (2D, 3D, semantics, etc.) in order to elucidate fine-grained structures of the task space. See Fig. 3 for some of the tasks; a detailed definition of each task is provided in the supplementary material. We include tasks with various levels of abstraction, ranging from ones solvable by a simple kernel convolved over the image (e.g. edge detection) to tasks requiring a basic understanding of scene geometry (e.g. vanishing points) and more abstract ones involving semantics (e.g. scene classification).
It is critical to note the task dictionary is meant to be a sampled set, not an exhaustive list, from a denser space of all conceivable visual tasks/abstractions. Sampling gives us a tractable way to sparsely model a dense space, and the hypothesis is that (subject to a proper sampling) the derived model should generalize to out-of-dictionary tasks. The more regular / better sampled the space, the better the generalization. We evaluate this in Sec. 4.2 with supportive results. For an evaluation of the robustness of the results w.r.t. the choice of dictionary, see the supplementary material.
Dataset: We need a dataset that has annotations for every task on every image. Training all of our tasks on exactly the same pixels eliminates the possibility that the observed transferabilities are affected by different input data peculiarities rather than only task intrinsics. There has not been such a dataset of scale made of real images, so we created a dataset of 4 million images of indoor scenes from about 600 buildings; every image has an annotation for every task. The images are registered on and aligned with building-wide meshes similar to [3, 101, 14], enabling us to programmatically compute the ground truth for many tasks without human labeling. For the tasks that still require labels (e.g. scene classes), we generate them using Knowledge Distillation [43] from known methods [104, 57, 56, 78]. See the supplementary material for full details of the process and a user study on the final quality of labels generated using Knowledge Distillation (showing < 7% error).
3.1. Step I: Task-Specific Modeling
We train a fully supervised task-specific network for each task in S. Task-specific networks have an encoder-decoder architecture homogeneous across all tasks, where the encoder is large enough to extract powerful representations, and the decoder is large enough to achieve a good performance but is much smaller than the encoder.
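To make the encoder-decoder split concrete, below is a minimal PyTorch sketch of what such a homogeneous task-specific network could look like. The layer sizes, depths, and the 3-channel output are illustrative assumptions for this sketch, not the architecture actually used in the paper.

    import torch.nn as nn

    def make_task_network(out_channels):
        # Encoder: structurally identical across all tasks; it does the heavy
        # lifting of producing the powerful latent representation E_s(I).
        encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: task-specific head with far fewer parameters than the encoder.
        decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
        )
        return nn.Sequential(encoder, decoder)

    normals_net = make_task_network(out_channels=3)  # e.g. surface normals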
Figure 5: Transfer results to normals (upper) and 2.5D Segmentation (lower) from 5 different source tasks. The spread in transferability among different sources is apparent, with reshading among top-performing ones in this case. Task-specific networks were trained on 60x more data. “Scratch” was trained from scratch without transfer learning.
3.2. Step II: Transfer Modeling
Given a source task s and a target task t, where s ∈ S and t ∈ T, a transfer network learns a small readout function for t given a statistic computed for s (see Fig. 4). The statistic is the representation of image I from the encoder of s, E_s(I). The readout function D_{s→t} is parameterized by θ and minimizes the target loss L_t:

D_{s→t} = argmin_θ 𝔼_I [ L_t( D_θ(E_s(I)), f_t(I) ) ],

where f_t(I) is the ground truth of t for image I.

Accessibility: For a transfer to be successful, the latent representation of the source should both be inclusive of sufficient information for solving the target and have the information accessible, i.e. easily extractable (otherwise, the raw image or its compression-based representations would be optimal). Thus, it is crucial for us to adopt a low-capacity (small) architecture as the transfer function and train it with a small amount of data, in order to measure transferability conditioned on the representation being highly accessible. We use a shallow fully convolutional network and train it with little data (8x to 120x less than task-specific networks).
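As a concrete illustration, here is a minimal PyTorch sketch of such a low-capacity transfer function: a shallow readout trained on features from a frozen source encoder. The tensor shapes, the two-layer readout, and the L1 loss are assumptions made for the example, not the paper's exact readout architecture or losses.

    import torch
    import torch.nn as nn

    # Stand-in for a trained source encoder E_s; in practice this is the frozen
    # encoder of the task-specific network from Step I.
    source_encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
    for p in source_encoder.parameters():
        p.requires_grad = False          # the source representation stays frozen

    # Shallow, fully convolutional readout D_{s->t}: low capacity by design, so
    # that it measures how *accessible* the target is from the representation.
    readout = nn.Sequential(
        nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1),  # 3 output channels, e.g. surface normals
    )
    opt = torch.optim.Adam(readout.parameters(), lr=1e-4)

    images = torch.randn(4, 3, 256, 256)          # toy batch of images I
    labels = torch.randn(4, 3, 256, 256)          # toy ground truth f_t(I)
    with torch.no_grad():
        feats = source_encoder(images)            # E_s(I)
    loss = nn.functional.l1_loss(readout(feats), labels)  # L_t
    opt.zero_grad(); loss.backward(); opt.step()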
Higher-Order Transfers: Multiple source tasks can contain complementary information for solving a target task (see examples in Fig. 6). We include higher-order transfers, which are the same as first order but receive multiple representations in the input. Thus, our transfer functions generalize to sets of sources: for a set of source tasks S′ ∈ ℘(S), the transfer is D_{S′→t}, where ℘ is the powerset operator.
As there is a combinatorial explosion in the number of feasible higher-order transfers, we employ a sampling procedure that filters out, without training them, the higher-order transfers that are unlikely to yield good results (e.g. by combining only each target's best-performing first-order sources). A sketch of this idea follows the figure caption below.

Figure 6: Higher-Order Transfers. Representations can contain complementary information. E.g., by transferring simultaneously from 3D Edges and Curvature, individual stairs were brought out. See our publicly available interactive transfer visualization page for more examples.
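One simple way to implement the filtering described above, sketched here, is to enumerate only the order-k combinations of each target's best first-order sources. The cutoff of five sources and the toy scores are assumptions for illustration, not the paper's exact sampling procedure.

    from itertools import combinations

    # Hypothetical first-order transfer scores to one target (higher is better).
    first_order = {"normals": 0.80, "curvature": 0.74, "edges3d": 0.71,
                   "depth": 0.65, "keypoints": 0.52, "autoencoder": 0.31}
    top_k, order = 5, 2

    # Keep only combinations drawn from the top-k first-order sources, instead
    # of training the full powerset of higher-order transfers.
    top_sources = sorted(first_order, key=first_order.get, reverse=True)[:top_k]
    candidates = list(combinations(top_sources, order))
    print(len(candidates), "order-2 candidates, e.g.:", candidates[:3])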
3.3. Step III: Ordinal Normalization using Analytic Hierarchy Process (AHP)
We want to have an affinity matrix of transferabilities across tasks. Aggregating the raw losses/evaluations of the transfer functions is problematic, as they live in different output spaces with incomparable scales. We therefore use an ordinal normalization: for each target, pairs of sources are compared by how often one transfers better than the other on test images, and the resulting pairwise comparison matrix is reduced to per-source affinities via its principal eigenvector. This approach is derived from Analytic Hierarchy Process [79], a method widely used in operations research to create a total order based on multiple pairwise comparisons.
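A minimal sketch of this kind of ordinal normalization is below: for one target, sources are compared pairwise by win rates, assembled into an AHP-style reciprocal comparison matrix, and ranked by its principal eigenvector. The random win rates and this particular matrix construction are illustrative assumptions, not the paper's exact procedure.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5                                # number of candidate sources (toy)
    # p[i, j]: fraction of test images on which source i transfers to the
    # target better than source j (random stand-in values for the example).
    p = np.full((n, n), 0.5)
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.uniform(0.1, 0.9)
            p[i, j], p[j, i] = r, 1.0 - r

    # AHP-style reciprocal comparison matrix and its principal eigenvector.
    W = p / p.T                          # W[i, j] > 1 iff i usually beats j
    vals, vecs = np.linalg.eig(W)
    principal = np.abs(np.real(vecs[:, np.argmax(vals.real)]))
    affinities = principal / principal.sum()
    print("normalized source affinities:", affinities.round(3))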
3.4. Step IV: Computing the Global Taxonomy
Given the normalized task affinity matrix, we need to devise a global transfer policy which maximizes collective performance across all tasks, while minimizing the used supervision. This problem can be formulated as subgraph selection where tasks are nodes and transfers are edges. The optimal subgraph picks the ideal source nodes and the best edges from these sources to targets while satisfying that the number of source nodes does not exceed the supervision budget. We solve this subgraph selection problem using Boolean Integer Programming (BIP), described below, which can be solved optimally and efficiently [41, 16].
Now we add three types of constraints via the matrix A to enforce that each feasible solution of the BIP instance corresponds to a valid subgraph for our transfer learning problem. Constraint I: if a transfer is included in the subgraph, all of its source nodes/tasks must be included too. Constraint II: each target task has exactly one transfer in. Constraint III: the supervision budget is not exceeded.
The rows of the matrix A encode these constraints, and its elements not defined above are set to 0. The problem is now a valid BIP and can be solved optimally in a fraction of a second [41]. The BIP solution corresponds to the optimal subgraph, which is our taxonomy.
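To make the formulation concrete, here is a minimal sketch of the subgraph-selection BIP solved with SciPy's MILP interface. The toy task set, affinities, budget, and the simple "maximize summed affinity" objective are assumptions for illustration, not the paper's exact objective or variable layout.

    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    n_tasks = 3      # toy dictionary: 0=normals, 1=reshading, 2=depth
    # Candidate transfers: (source task ids, target id, affinity from Step III).
    # A task transferring to itself stands for training it from scratch.
    transfers = [
        ((0,), 0, 1.00), ((1,), 1, 1.00), ((2,), 2, 1.00),   # from-scratch
        ((0,), 1, 0.80),                       # normals -> reshading
        ((0, 2), 1, 0.92),                     # {normals, depth} -> reshading
        ((0,), 2, 0.65),                       # normals -> depth
    ]
    budget = 2                   # max number of tasks trained from scratch
    n_edges = len(transfers)
    n_vars = n_tasks + n_edges   # x = [node flags | edge flags], all binary

    c = np.zeros(n_vars)
    for j, (_, _, affinity) in enumerate(transfers):
        c[n_tasks + j] = -affinity             # milp minimizes, so negate

    rows, lb, ub = [], [], []
    # Constraint I: selecting an edge forces all of its source nodes in.
    for j, (sources, _, _) in enumerate(transfers):
        row = np.zeros(n_vars)
        row[n_tasks + j] = len(sources)
        for s in sources:
            row[s] -= 1.0
        rows.append(row); lb.append(-np.inf); ub.append(0.0)
    # Constraint II: each target receives exactly one incoming transfer.
    for t in range(n_tasks):
        row = np.zeros(n_vars)
        for j, (_, target, _) in enumerate(transfers):
            if target == t:
                row[n_tasks + j] = 1.0
        rows.append(row); lb.append(1.0); ub.append(1.0)
    # Constraint III: the supervision budget caps the number of source nodes.
    row = np.zeros(n_vars); row[:n_tasks] = 1.0
    rows.append(row); lb.append(0.0); ub.append(budget)

    res = milp(c, integrality=np.ones(n_vars), bounds=Bounds(0, 1),
               constraints=LinearConstraint(np.array(rows), lb, ub))
    chosen = [transfers[j] for j in range(n_edges) if res.x[n_tasks + j] > 0.5]
    print("selected transfers:", chosen)

With this toy instance the solver trains normals and depth from scratch and serves reshading via the higher-order transfer, which is exactly the kind of policy the taxonomy extracts.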
Source: http://tongtianta.site/paper/1750
Edited by Lornatang
Proofread by Lornatang