Multi-label Classification with Partial Annotations using Class-aware Selective Loss
Abstract
Large-scale multi-label classification datasets are commonly, and perhaps inevitably, partially annotated. That is, only a small subset of labels is annotated per sample. Different methods for handling the missing labels induce different properties on the model and impact its accuracy. In this work, we analyze the partial labeling problem, then propose a solution based on two key ideas. First, un-annotated labels should be treated selectively according to two probability quantities: the class distribution in the overall dataset and the specific label likelihood for a given data sample. We propose to estimate the class distribution using a dedicated temporary model, and we show its improved efficiency over a naïve estimation computed using the dataset's partial annotations. Second, during the training of the target model, we emphasize the contribution of annotated labels over originally un-annotated labels by using a dedicated asymmetric loss. With our novel approach, we achieve state-of-the-art results on the OpenImages dataset (e.g. reaching 87.3 mAP on V6). In addition, experiments conducted on LVIS and simulated-COCO demonstrate the effectiveness of our approach. Code is available at https://github.com/Alibaba-MIIL/PartialLabelingCSL.
1. Introduction
Recently, remarkable progress has been made in multi-label classification [4, 7, 16, 29]. Dedicated loss functions were proposed in [2, 27], as well as transformer-based approaches [5, 16, 19]. In many common cases, such as [6, 8, 11, 14, 15], as the number of samples and labels in the data increases, it becomes impractical to fully annotate each image. For example, the OpenImages dataset [15] consists of 9 million training images and 9,600 classes; an exhaustive annotation process would require annotating more than 86 billion labels. As a result, partially labeled data is inevitable in realistic large-scale multi-label classification tasks. A partially labeled image is annotated with a subset of positive labels and a subset of negative labels, while the remaining un-annotated labels are considered unknown (Figure 1). Typically, the majority of the labels are un-annotated. For example, on average, a picture in OpenImages is annotated with only 7 labels. Thus, the question of how to treat the numerous un-annotated labels may have a considerable impact on the learning process.
The basic training mode for handling the un-annotated labels is simply to ignore their contribution in the loss function, as proposed in [6]. We denote this mode as Ignore. While ignoring the un-annotated labels is a reasonable choice, it may lead to a poor decision boundary as it exploits only a fraction of the data, see Figure 2(b). Moreover, in a typical multi-label dataset, the probability of a label being negative is very high. Consequently, treating the un-annotated labels as negative may improve the discriminative power, as it enables the exploitation of the entire dataset [14]. However, this training mode, denoted as Negative, has two main drawbacks: adding label noise to the training and inducing a high imbalance between negative and positive samples [2]. This mode is illustrated in Figure 2(c).
While treating the un-annotated labels as negative can be useful for many classes, it may significantly harm the learning of labels that tend to appear frequently in the images despite not being sufficiently annotated. For example, color classes are labeled in only a small number of samples in OpenImages [15]: the class "Black" is annotated in 1,688 samples, only 0.02% of the data, while it is probably present in most of the images (see an example in Figure 1). Consequently, such classes are trained with many wrong negative samples. Thus, it would be worthwhile to first identify the frequent classes in the data and treat them accordingly. While in fully annotated multi-label datasets (e.g. MS-COCO [18]) the class frequencies can be directly inferred by counting the number of annotations, in partially annotated datasets it is not straightforward. Counting the number of positive annotations per class is misleading, as the numbers are usually not proportional to the true class frequencies. In OpenImages, presumably infrequent classes such as "Boat" and "Snow" are annotated in more than 100,000 samples, while frequent classes such as colors are annotated in only ∼1,500 images. Therefore, the class distribution must be estimated from the data.
In this paper, we propose a Selective approach that aims at mitigating the weaknesses raised by the primary training modes (Figure 2). In particular, we select one of the primary modes (Ignore or Negative) for each label individually by utilizing two probabilistic quantities, termed the label likelihood and the label prior. The label likelihood quantifies the probability of a label being present in a specific image. The label prior represents the probability of a label being present in the data. To acquire a reliable label prior, we propose a method for estimating the class distribution: we train a classification model using the Ignore mode and evaluate it on a representative dataset. Then, when training the final model, to handle the high negative-positive imbalance, we adopt the asymmetric loss [2], which enables focusing on the hard samples while at the same time controlling the impact from the positive and negative samples. We further suggest decoupling the focusing levels of the annotated and un-annotated terms in the loss to emphasize the contributions of the annotated negative samples.
Extensive experiments were conducted on three datasets: OpenImages [15] (V3 and V6) and LVIS [8], which are partially annotated datasets with 9,600 and 1,203 classes, respectively. In addition, we simulated partially annotated versions of MS-COCO [18] for exploring and verifying our approach. Results and comparisons demonstrate the effectiveness of our proposed scheme. Specifically, on OpenImages (V6) we achieve a state-of-the-art mAP score of 87.34%. The contributions of the paper can be summarized as follows:
• We introduce a novel selective scheme for handling partially labeled data that treats each un-annotated label separately based on two probabilistic quantities: the label likelihood and the label prior. Our approach outperforms previous methods on several partially labeled benchmarks.
• We identify a key challenge in partially labeled data, regarding the inaccuracy of computing the class distribution from the annotations, and offer an effective approach for estimating the class distribution from the data.
• We propose a partial asymmetric loss that dynamically controls the impact of the annotated and un-annotated negative samples.
2. Related Work
Several methods have been proposed to tackle the partial labeling challenge. [6] offered a partial binary cross-entropy (CE) loss that weights each sample according to the proportion of known labels, where the un-annotated labels are simply ignored in the loss computation. In [14], the un-annotated labels are also involved in the loss: they are treated as negative while their contribution is smoothed by incorporating a temperature parameter into the sigmoid function. An interactive approach was presented in [11], whose loss is composed of cross-entropy for the annotated labels and a smoothness term as regularization. A curriculum learning strategy was also used in [6] to complete the missing labels. Instead of using the same training mode for all classes, in this paper we propose adjusting the training mode, either Ignore or Negative, for each class individually, relying on probability-based conditions. We also introduce a key challenge in partial labeling, concerning the inability to infer the class distribution directly from the number of annotations, and suggest an estimation procedure to handle it.
Other methods were proposed in [26, 28, 30] to cope with partial labels, for example by low-rank empirical risk minimization [30] or by learning structured semantic correlations [28]. However, they are not scalable to large datasets, and their optimization procedures are not well adapted to deep neural networks.
Positive-Unlabeled (PU) learning is also related to partial labeling [1, 9, 12]. The difference is that PU learning approaches use only positive and un-annotated labels, without any negative annotations.
3. Learning from Partial Annotations
3.1. Problem Formulation
Given a partially annotated multi-label dataset with $C$ classes, each sample $x_i$, corresponding to a specific image, is annotated by a label vector $y_i \in \{1, -1, 0\}^C$, where $y_{i,c}$ denotes whether class $c$ is present in the image ('1'), absent ('−1') or unknown ('0'). For a given image, we denote the sets of positive and negative labels by $P_i$ and $N_i$, respectively. The set of un-annotated labels is denoted by $U_i$. Note that typically $|U_i| \gg |P_i| + |N_i|$. A general form of the partially annotated multi-label classification loss can be defined as follows,

$$\mathcal{L}_i = \sum_{c \in P_i} L^{p}_{c} + \sum_{c \in N_i} L^{n}_{c} + \sum_{c \in U_i} L^{u}_{c} \qquad (1)$$

where $L^{p}_{c}$, $L^{n}_{c}$ and $L^{u}_{c}$ are the loss terms of the positive, negative and un-annotated labels for sample $x_i$, respectively. Given a set of $N$ labeled samples $\{(x_i, y_i)\}_{i=1}^{N}$, our goal is to train a neural-network model $f(\cdot;\theta)$, parametrized by $\theta$, to predict the presence or absence of each class given an input image. We denote by $p = (p_1, \ldots, p_C)$ the class prediction vector computed by the model, $p_c = \sigma(z_c)$, where $\sigma$ is the sigmoid function and $z_c$ is the output logit corresponding to class $c$.
For example, applying the binary CE loss while considering only the annotated labels is defined by setting the loss terms as $L^{p}_{c} = -\log(p_c)$, $L^{n}_{c} = -\log(1 - p_c)$ and $L^{u}_{c} = 0$.
3.2. How to Treat the Un-annotated Labels?
Typically, the number of un-annotated labels is much higher than that of the annotated ones. Therefore, the question of how to treat the un-annotated labels may have a considerable impact on the learning process. Herein, we first define the two primary training modes and detail their strengths and limitations. Then, in light of these insights, we propose a class-aware mechanism that may better handle the un-annotated labels.
Mode Ignore. The basic scheme for handling the un-annotated labels is simply to ignore them, as suggested in [6]. In this mode we set $L^{u}_{c} = 0$. This way the training data is not contaminated with wrong annotations. However, its drawback is that it uses only a subset of the data. For example, in the OpenImages dataset, the number of samples with either positive or negative annotations for the class "Cat" is only ∼0.9% of the training data. This may lead to a sub-optimal classification boundary when the annotated negative labels do not sufficiently cover the space of the negative class. See the illustration in Figure 2(b).
Mode Negative. In typical multi-label datasets, the chance of a specific label appearing in an image is very low. For example, in the fully-annotated MS-COCO dataset [18], a label is annotated as negative with a probability of ∼0.96. Based on this prior assumption, a reasonable choice would be to treat the un-annotated labels as negative, i.e. setting $L^{u}_{c} = L^{n}_{c}$. This working mode was also suggested in [14]. While this mode enables the utilization of the entire dataset, it suffers from two main limitations. First, it may wrongly turn positive labels into negative annotations, adding label noise to the training. Second, this mode inherently triggers a high imbalance between negative and positive samples. Balancing them, for example by down-weighting the contribution of the negative samples, may diminish the impact of the valuable annotated negative samples. These weaknesses are illustrated in Figure 2(c).
The question of which mode to choose has no unequivocal answer. It depends on various conditions and may have its origin in the annotation scheme used. In section 5.1, we will show that different partial annotation procedures can favor different loss modes (see Figure 6). Moreover, as discussed in the next section, the chosen mode can influence each class differently, depending on the class presence frequency in the data and the number of available annotations.
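To make the two modes concrete, the following is a minimal PyTorch sketch, not the authors' released code, of the general loss in equation (1) instantiated with binary cross-entropy; the `mode` flag switches between the Ignore and Negative treatments of the un-annotated labels, and the function name and tensor layout are our own assumptions.

```python
import torch

def partial_bce(logits, targets, mode="ignore"):
    """Binary cross-entropy over a partially annotated label matrix (equation (1)).

    logits:  (B, C) raw scores z_c produced by the model.
    targets: (B, C) with 1 = positive, -1 = negative, 0 = un-annotated.
    mode:    "ignore"   -> un-annotated labels contribute no loss (L_u = 0),
             "negative" -> un-annotated labels are treated as negatives (L_u = L_n).
    """
    p = torch.sigmoid(logits)                              # p_c = sigma(z_c)
    pos = (targets == 1).float()
    neg = (targets == -1).float()
    unk = (targets == 0).float()

    loss_pos = -pos * torch.log(p.clamp(min=1e-8))
    loss_neg = -neg * torch.log((1 - p).clamp(min=1e-8))
    loss_unk = torch.zeros_like(p)
    if mode == "negative":
        loss_unk = -unk * torch.log((1 - p).clamp(min=1e-8))

    return (loss_pos + loss_neg + loss_unk).sum(dim=1).mean()
```

The Selective approach described in section 4 replaces this global, binary choice with a per-class decision.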
3.3. Class Distribution in Partial Annotation
As aforementioned, in multi-label datasets the majority of labels are present in only a small fraction of the data. For example, in MS-COCO, 89% of the classes appear in less than 5% of the data. Thus, treating all un-annotated labels as negative may improve the discriminative power for many classes, as more real negative samples are involved in the training while the added label noise is negligible. However, this may significantly harm the learning of classes whose number of positive annotations in the dataset is much lower than the actual number of samples they appear in. Consider the case of the class "Person" in MS-COCO. It is present in 55% of the data (45,200 samples). Now, suppose that only a subset of 1,000 positive annotations is available and the rest are switched to negative. It means that during training, most of the prediction errors are due to wrong annotations. In this case, the optimization will be degraded and the network confidence will decay considerably. Hence, it is beneficial to first identify the frequent labels and handle them differently in the loss.
3.3.1 Positive Annotations Deficiency
To identify the frequent labels, we need to reliably acquire their distribution in the data. While in fully annotated datasets it can be easily obtained by counting the number of annotations per class and normalizing by the total number of samples, in partially annotated datasets it is not straightforward. While one may suggest counting the number of positive annotations for each class, the resulting numbers are misleading and are usually not proportional to the true class frequencies. For example, in OpenImages (V6), we found that many common and general classes which are frequently present in images are labeled with very few positive annotations. General classes such as "Daytime", "Event" or "Design" are labeled in only 1,709, 1,517 and 1,394 images (out of 9M), respectively. Color classes, which massively appear in images, are also rarely annotated: the "Black" and "White" classes are labeled in only 1,688 and 1,497 images, respectively. We may assume that classes such as "Daytime" or "White" are present in far more than 0.02% of the samples. Similarly, in the LVIS dataset, the classes "Person" and "Shirt" are annotated in only 1,928 and 1,942 samples, respectively, while they practically appear in many more images (note that in MS-COCO, which shares the same images with LVIS, the class "Person" appears in 55% of the samples).
Note that labels are not necessarily annotated according to their dominance in the image. In Figure 1, we show examples of three images and the corresponding annotations of the classes "Lip" and "Yellow". As can be seen, the left image was annotated with neither "Lip" nor "Yellow", although these labels are present and dominant in it. Also, "Lip" is annotated in only 1,121 images, which is highly deficient given that the class "Human face" is annotated in 327,899 images.
According to the above-mentioned observations, the number of positive annotations cannot be used to measure the class frequencies in partially labeled datasets. In section 4.2, we will propose a simple yet effective approach for estimating the class distribution from the data.
4. Proposed Approach
In this section, we present our method, which aims at mitigating the issues that arise when training on partially annotated data. An overview of the proposed approach is given in Figure 3.
To mitigate the high negative-positive imbalance problem, we adopt the asymmetric loss (ASL) proposed in [2] as the base loss for the multi-label classification task. It enables dynamically focusing on the hard samples while at the same time controlling the contribution propagated from the positive and negative samples. First, let us denote the basic term of the focal loss [17] for a given class by:

$$L(p, \gamma) = -(1 - p)^{\gamma} \log(p) \qquad (2)$$

where $\gamma$ is the focusing parameter, which adjusts the decay rate of the easy samples. Then, we define the partially annotated loss as follows,

$$\mathcal{L} = \sum_{c=1}^{C} \Big[ \mathbb{1}_{[y_c = 1]}\, L(p_c, \gamma_+) + \mathbb{1}_{[y_c = -1]}\, L(1 - p_c, \gamma_-) + \mathbb{1}_{[y_c = 0]}\, \omega_c\, L(1 - p_c, \gamma_u) \Big] \qquad (3)$$

where $\gamma_+$, $\gamma_-$ and $\gamma_u$ are the focusing parameters for the positive, negative and un-annotated labels, respectively, and $\omega_c$ is the selectivity parameter introduced in section 4.1. We usually set $\gamma_+ < \gamma_-$ to decay the positive term at a lower rate than the negative one, because the positive samples are infrequent compared to the negative samples. In addition, since for a given class the annotated negative samples are verified ground truth, we are interested in preserving their contribution to the loss. Therefore, we suggest decoupling the focusing parameter of the annotated negative labels from that of the un-annotated ones, allowing us to set a lower decay rate for the annotated negative labels: $\gamma_- < \gamma_u$. This way, the impact of the annotated negative samples on establishing the classification boundary of each class is higher (see Figure 2(d)). We term this form of asymmetric loss Partial-ASL (P-ASL).
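A minimal sketch of how equation (3) could be implemented in PyTorch follows. This is not the released implementation; the default focusing values, the tensor layout and the way the selectivity weights $\omega_c$ are passed in are illustrative assumptions (the official code is linked in the abstract).

```python
import torch

def partial_asl(logits, targets, omega, gamma_pos=0.0, gamma_neg=2.0, gamma_unk=4.0):
    """Sketch of the partial asymmetric loss (P-ASL) of equation (3).

    logits:  (B, C) raw model outputs.
    targets: (B, C) with 1 = positive, -1 = annotated negative, 0 = un-annotated.
    omega:   (B, C) selectivity weights of equation (9): 0 for ignored labels, 1 otherwise.
    The three focusing parameters decouple the decay rates of the positive,
    annotated-negative and un-annotated terms (typically gamma_pos <= gamma_neg <= gamma_unk).
    """
    p = torch.sigmoid(logits)
    pos = (targets == 1).float()
    neg = (targets == -1).float()
    unk = (targets == 0).float()

    # focal-style terms: easy samples are decayed by (1 - p)^gamma or p^gamma
    l_pos = pos * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=1e-8))
    l_neg = neg * p.pow(gamma_neg) * torch.log((1 - p).clamp(min=1e-8))
    l_unk = unk * omega * p.pow(gamma_unk) * torch.log((1 - p).clamp(min=1e-8))

    return -(l_pos + l_neg + l_unk).sum(dim=1).mean()
```

In this sketch, setting `gamma_neg == gamma_unk` and `omega` to all ones recovers a single, undecoupled treatment of all negative terms, i.e. the standard ASL handling.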
4.1. Class-aware Selective Loss
As described in section 3.2, both the Ignore and Negative modes rely on assumptions that are inadequate for the partial annotation problem. In this section, we propose a selective approach that adjusts the mode per individual class. The core idea is to examine the probability of each un-annotated label being present in a given sample $x$. Un-annotated labels that are suspected to be positive will be ignored; the others will be treated as negative.
For that purpose, we define two probabilistic quantities, the label likelihood and the label prior, and detail their usage below. The two quantities are complementary. The label likelihood enables dynamically ignoring the loss contribution of a label in a given image by inspecting its visual content. The label prior extracts useful information about the estimated class frequencies in the data and uses it regardless of the specific image content.
Label likelihood. Defined as the conditional probability of an un-annotated label $c$ being positive given the image $x$ and the model parameters $\theta$, i.e.

$$\Pr(y_c = 1 \mid x; \theta), \quad c \in U \qquad (4)$$

It can be simply estimated by the network prediction $p_c$ throughout the training. A high $p_c$ may imply that the un-annotated label appears in the image, and treating it as negative may lead to an error. Accordingly, such a label should be ignored. In practice, we allow the $K$ un-annotated labels with the top prediction values to be ignored, i.e.

$$S_{\text{likelihood}} = \{\, c \in U : c \in \mathrm{top}_K(p) \,\} \qquad (5)$$

where the $\mathrm{top}_K$ operator returns the indices of the $K$ largest elements of the input vector. The algorithm scheme is illustrated in Figure 4. Note that this implementation enables us to "walk" on a continuous scale between the Negative and Ignore modes: setting $K = 0$ corresponds to the Negative mode, as no un-annotated label is ignored, while setting $K$ large enough is equivalent to the Ignore mode, as all un-annotated labels are ignored.
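The top-$K$ selection of equation (5) can be expressed with a single `topk` call; below is a hedged PyTorch sketch. The helper name is ours, and we assume the ranking is taken only over the un-annotated labels of each image.

```python
import torch

def likelihood_ignore_mask(probs, targets, K):
    """Boolean (B, C) mask that is True for un-annotated labels whose predicted
    probability is among the top-K of the image (equation (5)).

    probs:   (B, C) sigmoid outputs p_c of the model being trained.
    targets: (B, C) with 0 marking un-annotated labels.
    """
    unk = targets == 0
    # rank only among the un-annotated labels: annotated entries are pushed to -inf
    masked = probs.masked_fill(~unk, float("-inf"))
    k = min(K, probs.size(1))
    topk_idx = masked.topk(k, dim=1).indices              # (B, k)
    mask = torch.zeros_like(unk)
    mask.scatter_(1, topk_idx, True)
    return mask & unk                                     # K = 0 -> Negative, large K -> Ignore
```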
Label prior. Defined as the probability of a label being present in an image. It can also be viewed as the actual label presence frequency in the data. We are interested in the label prior of the un-annotated labels,

$$\Pr(y_c = 1), \quad c \in U \qquad (6)$$

According to section 3.3, the label prior should be estimated from the data, as the class distribution is hidden in partially annotated datasets. In the next section (4.2), we introduce the scheme for estimating the label prior. Meanwhile, let us denote by $\hat{\pi}_c$ the label prior estimator for class $c$. We are interested in disabling the loss contribution of labels with high prior values. These labels are formally defined by the following set,

$$S_{\text{prior}} = \{\, c \in U : \hat{\pi}_c > \eta \,\} \qquad (7)$$

where $\eta$ represents the minimum fraction of the data determining a label to be ignored.
Finally, we denote the set of labels whose loss contributions are ignored as the union of the two previously computed sets,

$$S_{\text{ignore}} = S_{\text{likelihood}} \cup S_{\text{prior}} \qquad (8)$$

Accordingly, we set the parameter $\omega_c$ in equation (3) as follows,

$$\omega_c = \begin{cases} 0, & c \in S_{\text{ignore}} \\ 1, & \text{otherwise} \end{cases} \qquad (9)$$

Note that we have explored other alternatives for implementing the label prior in the loss function. In particular, in appendix B we compare a soft method that integrates the label prior directly into the un-annotated weights, and show that using a hard decision mechanism, as proposed in equation (9), produces better results.
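Combining the two ignore sets into the weights of equation (9) is a simple masking step. The sketch below assumes the `likelihood_ignore_mask` helper from the previous sketch and a pre-computed label prior vector; names and signatures are illustrative, not the released API.

```python
import torch

def selective_omega(targets, prior, likelihood_ignore, eta):
    """Per-label weights of equation (9) from the union in equation (8).

    targets:           (B, C) with 0 marking un-annotated labels.
    prior:             (C,) estimated label priors (see section 4.2).
    likelihood_ignore: (B, C) boolean mask from likelihood_ignore_mask (equation (5)).
    eta:               prior threshold of equation (7).
    """
    unk = targets == 0
    prior_ignore = (prior > eta).unsqueeze(0) & unk       # S_prior, broadcast over the batch
    ignore = prior_ignore | (likelihood_ignore & unk)     # S_ignore = S_likelihood U S_prior
    omega = torch.ones(targets.shape, dtype=torch.float32, device=targets.device)
    omega[ignore] = 0.0
    return omega                                          # feeds the omega argument of partial_asl
```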
4.2. Estimating the Class Distribution
We aim at estimating the class distribution over a representative dataset $D$. For that, we first need to assess the presence of each class in every image in the data, i.e. we would like to approximate the probability of a class $c$ being present in an image $x$. To that end, we propose training a model, parametrized by $\tilde{\theta}$, for predicting each class in a given image, i.e. producing $p_c(x; \tilde{\theta})$. Afterwards, the model is applied to the sample set $D$ (e.g. the training data). The label prior can then be estimated by calculating the expectation,

$$\hat{\Pr}(y_c = 1) = \mathbb{E}_{x \in D}\big[\, p_c(x; \tilde{\theta}) \,\big] \qquad (10)$$
For estimating the label priors, we train the model in Ignore mode. While the discriminative power of the Negative mode may be stronger for the majority of the labels, it may fail to provide reliable prediction values for frequent classes with a small number of positive annotations: propagating an abundance of gradient errors from wrong negative annotations decays the expected predictions for those classes and fails to approximate the true prior. Consequently, our suggested estimator for the class distribution is given by,

$$\hat{\pi}_c = \frac{1}{|D|} \sum_{x \in D} p_c(x; \theta_{\text{Ignore}}) \qquad (11)$$

where $\theta_{\text{Ignore}}$ denotes the model parameters trained in Ignore mode. In section 5.2, we will empirically show the effectiveness of the Ignore mode in ranking the class frequencies and the inapplicability of the Negative mode for doing so. To qualitatively show the estimation effectiveness, we present in Figure 5 the top 20 frequent classes in OpenImages (V6) as estimated by our proposed procedure. Note that all the top classes are commonly present in images, such as colors ("White", "Black", "Blue", etc.) or general classes such as "Photograph", "Light", "Daytime" or "Line". In appendix D, we show the next top 60 estimated classes. Also, in appendix E, we provide the top 20 estimated frequent classes for the LVIS dataset.
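Equation (11) amounts to averaging the Ignore-mode model's sigmoid outputs over a representative set. A possible sketch, assuming a standard PyTorch model and DataLoader (the temporary Ignore-mode classifier plays the role of `model`):

```python
import torch

@torch.no_grad()
def estimate_label_prior(model, loader, num_classes, device="cuda"):
    """Approximate equation (11): average the predictions of a model trained in
    Ignore mode over a representative dataset D to estimate each class frequency."""
    model.eval()
    prior_sum = torch.zeros(num_classes, device=device)
    n_samples = 0
    for images, _ in loader:                   # annotations are not needed for the estimate
        probs = torch.sigmoid(model(images.to(device)))
        prior_sum += probs.sum(dim=0)
        n_samples += images.size(0)
    return prior_sum / n_samples               # estimated prior for every class c
```

The resulting vector is what the `prior` argument of the selective-weight sketch above expects.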
5. Experimental Study
In this section, we experimentally demonstrate the insights discussed in the previous sections. We mainly utilize the fully annotated MS-COCO dataset [18] to validate and demonstrate the effectiveness of our approach by simulating partial annotation under specific case studies. The evaluation metric used in the experiments is the mean average precision (mAP). Training details are provided in appendix A.
5.1. Impact of Annotation Schemes
As aforementioned in section 3.2, the scheme used for annotating the dataset can substantially influence the learning process. Specifically, the choice of how to treat the un-annotated labels is highly influenced by the annotation scheme. To demonstrate this, we simulate two partial annotation schemes on the originally fully annotated MS-COCO dataset [18]. MS-COCO includes 80 classes, 82,081 training samples, and 40,137 validation samples, following the 2014 split. The two simulated annotation schemes are detailed as follows:
Fixed per class (FPC). For each class, we randomly sample a fixed number of positive annotations, denoted by $N_s$, and the same number of negative annotations. The rest of the annotations are dropped.
Random per annotation (RPA). We omit each annotation with probability $q$. Note that this simulation preserves the true class distribution of the data. A sketch of both schemes in code follows these definitions.
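A simple NumPy sketch of the two simulation schemes as we understand them; the function names, the label-matrix convention (+1/−1 for a fully annotated matrix, 0 for dropped annotations) and the seeding are illustrative assumptions.

```python
import numpy as np

def simulate_fpc(y, n_s, rng=None):
    """Fixed per class: keep n_s random positive and n_s random negative annotations
    per class and drop the rest (0 marks an un-annotated label)."""
    rng = rng or np.random.default_rng(0)
    out = np.zeros_like(y)
    for c in range(y.shape[1]):
        for sign in (1, -1):
            idx = np.flatnonzero(y[:, c] == sign)
            keep = rng.choice(idx, size=min(n_s, len(idx)), replace=False)
            out[keep, c] = sign
    return out

def simulate_rpa(y, q, rng=None):
    """Random per annotation: drop each annotation independently with probability q,
    which preserves the true class distribution in expectation."""
    rng = rng or np.random.default_rng(0)
    drop = rng.random(y.shape) < q
    return np.where(drop, 0, y)
```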
In Figure 6, we show the results obtained with each simulation scheme for each primary mode (Ignore and Negative) while varying $N_s$ and $q$. As can be seen, in RPA (Figure 6(a)) the Ignore mode consistently shows better results, while in FPC (Figure 6(b)) the Negative mode is superior. Note that as we keep more of the annotated labels (by either increasing $N_s$ or decreasing $q$), the gap between the two training modes is reduced, approaching the maximal result. The phenomena observed in the two simulated case studies are also relevant to real, practical procedures for partially annotating multi-label datasets. While in the FPC simulation the class distribution completely vanishes and cannot be inferred from the number of positive annotations (which equals $N_s$ for every class), the RPA scheme preserves the class distribution.
5.2. Estimating the Label Prior
To demonstrate the estimation quality of the class distribution obtained by the approach proposed in section 4.2, we follow the FPC simulation scheme applied to the MS-COCO dataset (as described in section 5.1), where a constant number of 1,000 annotations remains for each class. Because MS-COCO is a fully annotated dataset, we can compare the estimated class distribution (i.e. the label prior) to the true class distribution inferred from the original number of annotations. In particular, we measure the similarity between the original class frequencies and the estimated ones using the Spearman correlation test. In Figure , we show the Spearman correlation scores while varying the number of top-ranked classes, together with the results obtained with the Negative mode as a reference. Specifically, the Spearman correlation computed over all 80 classes with the estimator obtained using the Ignore mode is 0.81, demonstrating the estimator's effectiveness. In the next section, we will show how it benefits the overall classification results. Also, in appendix C we present the top frequent classes measured by our estimator and compare them to those obtained from the original class frequencies in MS-COCO.
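The rank-agreement evaluation is a one-liner with SciPy; here is a hedged sketch, where the helper name and the optional restriction to the top-ranked classes are our own framing of the evaluation described above.

```python
import numpy as np
from scipy.stats import spearmanr

def prior_rank_agreement(true_freq, est_prior, top_n=None):
    """Spearman rank correlation between the true class frequencies (available in a
    fully annotated dataset such as MS-COCO) and the estimated label priors.
    Optionally restricts the test to the top_n classes ranked by the true frequency."""
    true_freq = np.asarray(true_freq, dtype=float)
    est_prior = np.asarray(est_prior, dtype=float)
    if top_n is not None:
        top = np.argsort(-true_freq)[:top_n]
        true_freq, est_prior = true_freq[top], est_prior[top]
    rho, _ = spearmanr(true_freq, est_prior)
    return rho
```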
6. Benchmark Results
In this section, we report our main results on the partially annotated multi-label datasets OpenImages [15] and LVIS [8]; the results on the MS-COCO dataset are presented in appendix C. We compare against previous methods that handle partial annotations, as well as other baseline approaches in multi-label classification. The evaluation metric used in the experiments is the mean average precision (mAP). In particular, we report the standard per-class mAP, denoted mAP(C), and the overall mAP, denoted mAP(O), which takes into account the number of samples in each class. The training details and the loss hyper-parameters used are provided in appendix A.
6.1. OpenImages V6
OpenImages V6 is a large-scale multi-label dataset [15] consisting of 9 million training images, 41,620 validation samples, and 125,456 test samples. It is a partially annotated dataset with 9,600 trainable classes. In Table , we present the mAP results obtained by our proposed Selective method and compare them to other approaches. Interestingly, the Ignore mode produces better results than the Negative mode, as OpenImages contains many under-annotated frequent classes such as colors and other general classes (see Figure 5). Using the Negative mode adds massive label noise and harms the learning of many common classes. In Table , we present results for different network architectures. Specifically, using TResNet-L [23], we achieve a state-of-the-art result of 87.34 mAP.
To show the impact of decoupling the focusing parameters of the annotated and un-annotated negative loss terms in P-ASL, as proposed in equation (3), we varied the negative focusing parameter $\gamma_-$ while fixing $\gamma_u$. The results are presented in Figure . The case of $\gamma_- = \gamma_u$ represents the standard ASL [2]. As can be seen, the mAP score increases as we lower $\gamma_-$, down to $\gamma_- = 2$. This indicates that lowering the decay rate of the annotated negative term boosts its contribution to the loss.
In Figure , we show the mAP scores while varying the number $K$ of top-likelihood classes, as defined in equation (5). Note that setting $K = 0$ is equivalent to using the Negative mode, while training with a high enough $K$ becomes similar to training in Ignore mode. The highest mAP results are obtained when using both the likelihood and the prior conditions.
6.2. OpenImages V3
To be comparable with previously published results, we used OpenImages V3, which contains 5,000 trainable classes. We follow the comparison setting described in [11]. Also, for a fair comparison, we used a ResNet-101 [10] backbone pre-trained on the ImageNet dataset. In Table , we show the mAP scores obtained by previous approaches and compare them to our Selective method. As shown, our method significantly outperforms previous approaches that deal with partial annotation in a multi-label setting.
6.3. LVIS
LVIS is a partially labeled dataset, originally annotated for object detection and instance segmentation, that we adopt as a multi-label classification benchmark. It consists of 100,170 images for training and 19,822 images for testing, and contains 1,203 classes. In Table , we present a comparison between different approaches on the LVIS dataset. As can be seen, in this case the Negative mode performs better than the Ignore mode. This can be related to the fact that most of the labels correspond to specific objects that do not appear frequently in the images. The most frequent class is "Person"; therefore, we also report its average precision in the same table. Note that the model trained in Ignore mode learns the class "Person" better than the one trained in Negative mode. Using P-ASL with the Selective mode, we obtain superior mAP results as well as the top average precision even for the most frequent class, "Person".
7. Conclusion
In this paper, we presented a novel technique for handling partially labeled data in multi-label classification. We observed that whether to ignore the un-annotated labels in the loss or treat them as negative should be determined individually for each class. We proposed a selective mechanism that uses the label likelihood, computed throughout the training, and the label prior, obtained by estimating the class distribution from the data. The un-annotated labels are further softened via a partial asymmetric loss. Extensive experimental analysis shows that our proposed approach outperforms previous methods on partially labeled datasets, including OpenImages, LVIS, and simulated-COCO.
References
[1] Jessa Bekker and Jesse Davis. Learning from positive and unlabeled data: A survey. CoRR, abs/1811.04820, 2018.
[2] Emanuel Ben-Baruch, Tal Ridnik, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. Asymmetric loss for multi-label classification. arXiv preprint arXiv:2009.14119, 2020.
[3] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2020.
[4] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5177–5186, 2019.
[5] Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen, Zhongyuan Wang, Nian Shi, and Honglin Liu. MlTr: Multi-label classification with transformer, 2021.
[6] Thibaut Durand, Nazanin Mehrasa, and Greg Mori. Learning a deep convnet for multi-label classification with partial labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 647–657, 2019.
[7] Bin-Bin Gao and Hong-Yu Zhou. Multi-label image recognition with multi-class attentional regions. arXiv preprint arXiv:2007.01755, 2020.
[8] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation, 2019.
[9] Zayd Hammoudeh and Daniel Lowd. Learning from positive and unlabeled data with arbitrary positive shift, 2020.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] D. Huynh and E. Elhamifar. Interactive multi-label CNN learning with partial labels. IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[12] Liwei Jiang, Dan Li, Qisheng Wang, Shuai Wang, and Songtao Wang. Improving positive unlabeled learning: Practical AUL estimation and new training method for extremely imbalanced data sets, 2020.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[14] Kaustav Kundu and Joseph Tighe. Exploiting weakly supervised visual patterns to learn from partial annotations. In NeurIPS, 2020.
[15] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images dataset V4. International Journal of Computer Vision, 128(7):1956–1981, Mar 2020.
[16] Jack Lanchantin, Tianlu Wang, Vicente Ordonez, and Yanjun Qi. General multi-label image classification with transformers, 2020.
[17] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. CoRR, abs/1708.02002, 2017.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context, 2014.
[19] Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, and Jun Zhu. Query2Label: A simple transformer way to multi-label classification, 2021.
[20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[21] Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, and Ross Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels, 2016.
[22] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. ImageNet-21K pretraining for the masses, 2021.
[23] Tal Ridnik, Hussam Lawen, Asaf Noy, Emanuel Ben Baruch, Gilad Sharir, and Itamar Friedman. TResNet: High performance GPU-dedicated architecture. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1400–1409, 2021.
[24] Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay, 2018.
[25] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. CNN-RNN: A unified framework for multi-label image classification, 2016.
[26] Baoyuan Wu, Fan Jia, Wei Liu, Bernard Ghanem, and Siwei Lyu. Multi-label learning with missing labels using mixed dependency graphs, 2018.
[27] Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. Distribution-balanced loss for multi-label classification in long-tailed datasets, 2020.
[28] Hao Yang, Joey Tianyi Zhou, and Jianfei Cai. Improving multi-label learning with missing labels by structured semantic correlations, 2016.
[29] Renchun You, Zhiyao Guo, Lei Cui, Xiang Long, Yingze Bao, and Shilei Wen. Cross-modality attention with semantic graph embedding for multi-label classification. In AAAI, pages 12709–12716, 2020.
[30] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. Large-scale multi-label learning with missing labels, 2013.
Appendices
A. Training Details
Unless stated otherwise, all experiments were conducted with the following training configuration. By default, we used the TResNet-M model [23], pre-trained on the ImageNet-21K dataset [22]. The model was fine-tuned using the Adam optimizer [13] and a 1-cycle cosine annealing policy [24], with a maximal learning rate of 2e-4 for training on OpenImages and MS-COCO, and 6e-4 for training on LVIS. We used true weight decay [20] of 3e-4 and standard ImageNet augmentations. For a fair comparison to previously published results on OpenImages V3, we also trained a ResNet-101 model pre-trained on ImageNet.
In the OpenImages experiments and in the LVIS experiments we used dedicated, per-dataset values for the loss hyper-parameters introduced in sections 4 and 4.1.
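As a rough illustration of the reported configuration, the sketch below sets up an optimizer and schedule in PyTorch. It is an assumption-laden sketch: AdamW stands in for Adam with decoupled ("true") weight decay, OneCycleLR for the 1-cycle cosine policy, and the epoch count and steps per epoch are placeholders not specified in this section.

```python
import torch

def build_optimizer(model, max_lr=2e-4, weight_decay=3e-4, steps_per_epoch=1000, epochs=30):
    """Adam-style optimizer with decoupled weight decay and a 1-cycle cosine schedule.
    max_lr is 2e-4 for OpenImages/MS-COCO and 6e-4 for LVIS, as reported above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=max_lr, steps_per_epoch=steps_per_epoch,
        epochs=epochs, anneal_strategy="cos")
    return optimizer, scheduler
```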
B. Soft Label Prior
Herein, we explore a soft alternative for integrating the label prior into the loss. We follow equation (3) and define the un-annotated weights $\omega_c$ by
(12)
with a decay factor that controls how strongly a high label prior down-weights the un-annotated term. In Table , we compare the soft label prior to the hard configuration used in section 4.1.
As the soft label prior provided lower mAP(C) results, we did not use it in our experiments.
C. Results on MS-COCO
In this section, we present the results obtained on a partially annotated version of MS-COCO, based on the fixed per class (FPC) simulation scheme. Note that in this experiment, the class distribution measured by the number of annotations is no longer meaningful, as all classes have the same number of annotations. The mAP results, as well as the average precision (AP) scores for the class "Person", are presented in Figure . The Negative mode produces a higher mAP (computed over all the classes) compared to the Ignore mode. However, as the frequent class "Person" is present in most of the images, the Negative mode is inferior for this class, especially when only a small number of annotations is available. Using the Selective approach, top results are achieved for both the overall mAP and the "Person" AP. In Figure , we show the top 10 frequent classes obtained using our procedure for estimating the class distribution (described in section 4.2) and compare them to those obtained using the original class frequencies in MS-COCO. As can be seen, most of the frequent classes measured by the original distribution are also ranked highly by our estimator.
D. Frequent classes in OpenImages
We provide additional results of the class distribution estimated by our approach (detailed in section 4.2) for the OpenImages dataset. See Figure .
E. Frequent classes in LVIS
In Figure , we plot the top frequent classes in LVIS, obtained by our estimator detailed in section 4.2. Also in LVIS, it can be seen that the most frequent estimated classes are related to common objects such as "Person", "Shirt", "Trousers", "Shoe", etc.