Paper: https://arxiv.org/pdf/1905.02822.pdf
Code: https://github.com/Guanghan/lighttrack
MPII dataset: http://human-pose.mpi-inf.mpg.de/#download
PoseTrack dataset: https://posetrack.net/users/download.php
Abstract
In this paper, we propose a novel effective light-weight framework, called LightTrack, for online human pose tracking. The proposed framework is designed to be generic for top-down pose tracking and is faster than existing online and offline methods. Single-person Pose Tracking (SPT) and Visual Object Tracking (VOT) are incorporated into one unified functioning entity, easily implemented by a replaceable single-person pose estimation module. Our framework unifies single-person pose tracking with multi-person identity association and sheds first light upon bridging keypoint tracking with object tracking. We also propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a Re-ID module in our pose tracking system. In contrast to other Re-ID modules, we use a graphical representation of human joints for matching. The skeleton-based representation effectively captures human pose similarity and is computationally inexpensive. It is robust to sudden camera shifts that introduce human drifting. To the best of our knowledge, this is the first paper to propose an online human pose tracking framework in a top-down fashion. The proposed framework is general enough to fit other pose estimators and candidate matching mechanisms. Our method outperforms other online methods while maintaining a much higher frame rate, and is very competitive with offline state-of-the-art methods. We make the code publicly available at: https://github.com/Guanghan/lighttrack.
1. Introduction
Pose tracking is the task of estimating multi-person human poses in videos and assigning unique instance IDs for each keypoint across frames. Accurate estimation of human keypoint trajectories is useful for human action recognition, human interaction understanding, motion capture and animation, etc. Recently, the publicly available PoseTrack dataset [18, 3] and MPII Video Pose dataset [17] have pushed the research on human motion analysis one step further towards real-world scenarios. Two PoseTrack challenges have been held. However, most existing methods are offline and hence lack the potential to be real-time. More emphasis has been put on the Multi-Object Tracking Accuracy (MOTA) criterion than on the Frames Per Second (FPS) criterion. Existing offline methods divide the tasks of human detection, candidate pose estimation, and identity association into sequential stages. In this procedure, multi-person poses are estimated across frames within a video. Based on the pose estimation results, the pose tracking outputs are computed via solving an optimization problem. It requires the poses of future frames to be pre-computed, or at least for the frames within some range.
In this paper, we propose a novel effective light-weight framework for pose tracking. It is designed to be generic, top-down (i.e., pose estimation is performed after candidates are detected), and truly online. The proposed framework unifies single-person pose tracking with multi-person identity association. It sheds first light on bridging keypoint tracking with object tracking. To the best of our knowledge, this is the first paper to propose an online pose tracking framework in a top-down fashion. The proposed framework is general enough to fit other pose estimators and candidate matching mechanisms. Thus, if individual components are further improved in the future, our framework will be faster and/or more accurate.
In contrast to Visual Object Tracking (VOT) methods, in which the visual features are implicitly represented by kernels or CNN feature maps, we track each human pose by recursively updating the bounding box and its corresponding pose in an explicit manner. The bounding box region of a target is inferred from the explicit features, i.e., the human keypoints. Human keypoints can be considered as a series of special visual features. The advantages of using pose as explicit features include:
(1) The explicit features are human-related and interpretable, and have a very strong and stable relationship with the bounding box position. Human pose enforces direct constraints on the bounding box region.
(2) The task of pose estimation and tracking requires human keypoints to be predicted in the first place. Taking advantage of the predicted keypoints to track the ROI region is efficient and almost free. This mechanism makes online tracking possible.
(3) It naturally keeps the identity of the candidates, which greatly alleviates the burden of data association in the system. Even when data association is necessary, we can re-use the pose features for skeleton-based pose matching.
Single-person Pose Tracking (SPT) and Visual Object Tracking (VOT) are thus incorporated into one unified functioning entity, easily implemented by a replaceable single-person human pose estimation module.
Figure 1. Overview of the proposed online pose tracking framework. We detect human candidates in the first frame and then track each candidate's position and pose with a single-person pose estimator. When a target is lost, we run detection on that frame and perform data association with a graph convolutional network for skeleton-based pose matching. We use skeleton-based pose matching because visually similar candidates with different identities may confuse a visual classifier, and extracting visual features is computationally expensive in an online tracking system. Pose matching is considered because we observe that, in two adjacent frames, a person's location may drift due to sudden camera shift, but the pose stays almost the same, since people usually cannot act that fast.
Our contributions are three-fold:
(1) We propose a general online pose tracking framework that is suitable for top-down approaches of human pose estimation. Both the human pose estimator and the Re-ID module are replaceable. In contrast to Multi-Object Tracking (MOT) frameworks, our framework is specially designed for the task of pose tracking. To the best of our knowledge, this is the first paper to propose an online human pose tracking system in a top-down fashion.
(2) We propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a Re-ID module in our pose tracking system. Different from existing Re-ID modules, we use a graphical representation of human joints for matching. The skeleton-based representation effectively captures human pose similarity and is computationally inexpensive. It is robust to sudden camera shifts that introduce human drifting.
(3) We conduct extensive experiments with various settings and ablation studies. Our proposed online pose tracking approach outperforms existing online methods and is competitive with the offline state of the art, but with much higher frame rates. We make the code publicly available to facilitate future research.
2. Related Work
2.1. Human Pose Estimation and Tracking
Human Pose Estimation (HPE) has seen rapid progress with the emergence of CNN-based methods [34, 31, 39, 21]. The most widely used datasets, e.g., MPII [4] and LSP [20], are saturated with methods that achieve 90% and higher accuracy. Multi-person human pose estimation is more realistic and challenging, and has received increasing attention with the hosting of the COCO keypoints challenges [26] since 2017. Existing methods can be classified into top-down and bottom-up approaches. The top-down approaches [14, 32, 15] rely on a detection module to obtain human candidates and then apply single-person pose estimation to locate human keypoints. The bottom-up methods [6, 35, 30] detect human keypoints from all potential candidates and then assemble these keypoints into human limbs for each individual based on various data association techniques. The advantage of bottom-up approaches is their excellent trade-off between estimation accuracy and computational cost, because the cost is nearly invariant to the number of human candidates in the image. In contrast, the advantage of top-down approaches is their capability of disassembling the task into multiple comparatively easier tasks, i.e., object detection and single-person pose estimation. The object detector is expert in detecting hard (usually small) candidates, so that the pose estimator performs better with a focused regression space. Pose tracking is a new topic that is primarily introduced by the PoseTrack dataset [18, 3] and MPII Video Pose dataset [17]. The task is to estimate human keypoints and assign unique IDs to each keypoint at instance level across frames in videos. A typical top-down but offline method was introduced in [17], where pose tracking is transformed into a minimum cost multi-cut problem with a graph partitioning formulation.
[17] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. ArtTrack: articulated multi-person tracking in the wild. In CVPR, 2017.
2.2. Object Detection vs. Human Pose Estimation
Earlier works in object detection regress visual features into bounding box coordinates. HPE, on the other hand, usually regresses visual features into heatmaps, with each channel representing a human joint. Recently, research in HPE has inspired many works on object detection [40, 22, 28]. These works predict heatmaps for a set of special keypoints to infer detection results (bounding boxes). Based on this motivation, we propose to predict human keypoints to infer bounding box regions. Human keypoints are a special set of keypoints that represent detection of the human class only.
[40] X. Zhou, J. Zhuo, and P. Krähenbühl. Bottom-up object detection by grouping extreme and center points. arXiv preprint arXiv:1901.08043, 2019.
[22] H. Law and J. Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, pages 734–750, 2018.
[28] K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool. Deep Extreme Cut: From extreme points to object segmentation. In CVPR, 2018.
2.3. Multi-Object Tracking
MOT aims to estimate the trajectories of multiple objects by finding target locations while maintaining their identities across frames. Offline methods use both past and future frames to generate trajectories, while online methods only exploit information that is available up to the current frame. An online MOT pipeline [41] applies a single-object tracker to keep tracking each target, given target detections in each frame. The target state is set as tracked until the tracking result becomes unreliable. The target is then regarded as lost, and data association is performed to compute the similarity between the tracklet and the detections. Our proposed online pose tracking framework also tracks each target (with corresponding keypoints) individually while keeping their identities, and performs data association when a target is lost. However, our framework is distinct in several aspects: (a) the detections are generated by the object detector only at key frames, and therefore are not necessarily provided at each frame; they can be provided sparsely; (b) the single-object tracker is actually a pose estimator that predicts keypoints based on an enlarged region.
2.4. Graphical Representation for Human Pose
It was recently studied in [38] how to effectively model dynamic skeletons with a specially tailored graph convolution operation. The graph convolution operation turns human skeletons into spatio-temporal representations of human actions. Inspired by this work, we propose to employ a GCN to encode the spatial relationship among human joints into a latent representation of the human pose. The representation aims to robustly encode the pose, invariant to human location or view angle. We measure similarities of such encodings for the matching of human poses.
GCN: https://tkipf.github.io/graph-convolutional-networks/
GCN tutorial (in Chinese): https://www.cnblogs.com/SivilTaram/p/graph_neural_network_1.html
3. Proposed Method
3.1. Top-Down Pose Tracking Framework
We propose a novel top-down pose tracking framework. It has been proved that human pose can be employed for better inference of human locations [27]. We observe that, in a top-down approach, accurate human locations also ease the estimation of human poses. We further study the relationships between these two levels of information: (1) Coarse person location can be distilled into body keypoints by a single-person pose estimator. (2) The position of human joints can be straightforwardly used to indicate rough locations of human candidates. (3) Thus, recurrently estimating one from the other is a feasible strategy for Single-person Pose Tracking (SPT).
However, it is not a good idea to merely consider the Multi-target Pose Tracking (MPT) problem as a repeated SPT problem for multiple individuals, because certain constraints need to be met: e.g., in a certain frame, two different IDs should not belong to the same person, nor should two candidates share the same identity. A better way is to track multiple individuals simultaneously and preserve/update their identities with an additional Re-ID module. The Re-ID module is essential because it is usually hard to maintain correct identities all the way; it is unlikely that individual poses can be tracked effectively across frames of the entire video. For instance, identities have to be updated under the following scenarios: (1) some people disappear from the camera view or get occluded; (2) new candidates come in or previous candidates re-appear; (3) people walk across each other (two identities may merge into one if not treated carefully); (4) tracking fails due to fast camera shifting or zooming.
In our method, we first treat each human candidate separately such that their corresponding identity is kept across the frames. In this way, we circumvent the time-consuming offline optimization procedure. In case the tracked candidate is lost due to occlusion or camera shift, we then call the detection module to revive candidates and associate them to the tracked targets from the previous frame via pose matching. In this way, we accomplish multi-target pose tracking with an SPT module and a pose matching module.
Specifically, the bounding box of the person in the upcoming frame is inferred from the joints estimated by the pose module for the current frame. We find the minimum and maximum coordinates and enlarge this ROI region by 20% on each side. The enlarged bounding box is treated as the localized region for this person in the next frame. If the average confidence score s̄ of the estimated joints is lower than the threshold τs, it reflects that the target is lost, since the joints are not likely to appear in the bounding box region. The state of the target is defined as:
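In other words, the target is considered tracked when the average joint confidence s̄ is at least τs, and lost otherwise:

\[
\text{state} =
\begin{cases}
\text{tracked}, & \text{if } \bar{s} \geq \tau_s \\
\text{lost}, & \text{otherwise}
\end{cases}
\]

A minimal NumPy sketch of this bounding-box inference and state check follows; the function name and the default threshold value are illustrative, not the project's actual API.

```python
import numpy as np

def infer_bbox_and_state(keypoints, scores, tau_s=0.4, margin=0.2):
    """Infer the next-frame search region and target state from the current pose.

    keypoints: (N, 2) array of joint (x, y) coordinates.
    scores:    (N,) array of per-joint confidence scores.
    Returns the keypoint bounding box enlarged by `margin` (20%) on each side,
    and "tracked" if the mean confidence is at least tau_s, else "lost".
    """
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    w, h = x_max - x_min, y_max - y_min
    bbox = (x_min - margin * w, y_min - margin * h,
            x_max + margin * w, y_max + margin * h)
    state = "tracked" if float(np.mean(scores)) >= tau_s else "lost"
    return bbox, state
```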
If the target is lost, we have two modes: (1) Fixed Keyframe Interval (FKI) mode: neglect this target until the next scheduled key frame, where the detection module re-generates the candidates and then associates their IDs with the tracking history. (2) Adaptive Keyframe Interval (AKI) mode: immediately revive the missing target by candidate detection and identity association. The advantage of FKI mode is that the frame rate of pose tracking is stable due to the fixed interval of keyframes. The advantage of AKI mode is that the average frame rate can be higher for non-complex videos. In our experiments, we incorporate both by taking keyframes with fixed intervals while also calling the detection module once a target is lost before the arrival of the next scheduled keyframe. The tracking accuracy is higher because when a target is lost, it is handled immediately.
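As a rough sketch of how the two modes can be combined, the loop below runs detection at fixed keyframes and additionally whenever a target is lost. The replaceable detector, pose estimator, and Re-ID matcher are passed in as callables, infer_bbox_and_state is the helper sketched above, and all names are illustrative rather than the project's actual code.

```python
def track_sequence(frames, detector, pose_estimator, matcher,
                   keyframe_interval=10, tau_s=0.4):
    """Hybrid FKI/AKI tracking loop (illustrative sketch)."""
    targets = []  # each target: {"id": int, "bbox": tuple, "pose": array, "state": str}
    for idx, frame in enumerate(frames):
        any_lost = any(t["state"] == "lost" for t in targets)
        if idx % keyframe_interval == 0 or any_lost:
            detections = detector(frame)             # human candidates for this key frame
            targets = matcher(targets, detections)   # spatial consistency + pose matching
        for t in targets:
            keypoints, scores = pose_estimator(frame, t["bbox"])
            t["pose"] = keypoints
            # The enlarged keypoint bounding box becomes the search region
            # for this person in the next frame.
            t["bbox"], t["state"] = infer_bbox_and_state(keypoints, scores, tau_s=tau_s)
        yield [(t["id"], t["pose"]) for t in targets]
```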
For identity association, we propose to consider two complementary cues: spatial consistency and pose consistency. We first rely on spatial consistency: if two bounding boxes from the current and the previous frames are adjacent, or their Intersection Over Union (IOU) is above a certain threshold, we consider them to belong to the same target. Specifically, we set the matching flag m(tk, dk) to 1 if the maximum IOU overlap ratio o(tk, di,k) between the tracked target tk ∈ Tk and the detections di,k ∈ Dk of key frame k is higher than the threshold τo. Otherwise, m(tk, dk) is set to 0:
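Written out, the matching rule described above takes the following form (reconstructed from the text; the paper's exact notation may differ):

\[
m(t_k, d_k) =
\begin{cases}
1, & \text{if } \max_{d_{i,k} \in D_k} o(t_k, d_{i,k}) > \tau_o \\
0, & \text{otherwise}
\end{cases}
\]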
The above criterion is based on the assumption that the tracked target from the previous frame and the actual location of the target in the current frame have significant overlap, which is true for most cases. However, such assumption is not always reliable, especially when the camera shifts swiftly. In such cases, we need to match the new observation to the tracked candidates. In Re-ID problems, this is usually accomplished by a visual feature classifier. However, visually similar candidates with different identities may confuse such classifiers. Extracting visual features can also be computationally expensive in an online tracking system. Therefore, we design a Graph Convolution Network (GCN) to leverage the graphical representation of the human joints. We observe that in two adjacent frames, the location of a person may drift away due to sudden camera shift, but the human pose will stay almost the same as people usually cannot act that fast, as illustrated in Fig. 2. Consequently, the graph representation of human skeletons can be a strong cue for candidate matching, which we refer to as pose matching in the following text.
3.2. Siamese Graph Convolutional Networks
Siamese Network: Given the sequences of body joints in the form of 2D coordinates, we construct a spatial graph with the joints as graph nodes and the connectivities in human body structures as graph edges. The input to our graph convolutional network is the joint coordinate vectors on the graph nodes. It is analogous to image-based CNNs, where the input is formed by pixel intensity vectors residing on the 2D image grid [38]. Multiple graph convolutions are performed on the input to generate a feature representation vector as a conceptual summary of the human pose. It inherently encodes the spatial relationship among the human joints. The input to the Siamese network is therefore a pair of inputs to the GCN. The distance between the two output features represents how similar two poses are to each other. Two poses are called a match if they are conceptually similar. The network is illustrated in Fig. 3. The Siamese network consists of 2 GCN layers and 1 convolutional layer trained with contrastive loss. We take normalized keypoint coordinates as input; the output is a 128-dimensional feature vector. The network is optimized with a contrastive loss L because we want the network to generate feature representations that are close enough for positive pairs, whereas they are at least a margin apart for negative pairs. We employ the margin contrastive loss:
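The standard form of the margin contrastive loss matching the description above is (the paper's exact notation may vary):

\[
L = \tfrac{1}{2}\, y\, D^2 + \tfrac{1}{2}\,(1-y)\,\max(0,\; m - D)^2,
\qquad D = \lVert f_1 - f_2 \rVert_2,
\]

where f1 and f2 are the 128-dimensional embeddings of the two poses, y = 1 for a positive (same-identity) pair, y = 0 for a negative pair, and m is the margin.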
Graph Convolution for Skeleton: For standard 2D convolution on natural images, the output feature maps can have the same size as the input feature maps with stride 1 and appropriate padding. Similarly, the graph convolution operation is designed to output graphs with the same number of nodes. The dimensionality of attributes of these nodes, which is analogous to the number of feature map channels in standard convolution, may change after the graph convolution operation.
The standard convolution operation is defined as follows: given a convolution kernel of size K × K and an input feature map f_in with c channels, the output value of a single channel at spatial location x can be written as:
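In the formulation of [38], this standard convolution is written as:

\[
f_{out}(\mathbf{x}) = \sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}\big(\mathbf{p}(\mathbf{x}, h, w)\big) \cdot \mathbf{w}(h, w),
\]

where the sampling function p enumerates the K × K neighborhood of location x and the weight function w provides a c-dimensional weight vector for each sampled offset.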
The convolution operation on graphs is defined by extending the above formulation to the case where the input feature map resides on a spatial graph Vt, i.e., the feature map f_in: Vt -> R^c has a vector on each node of the graph. The next step of the extension is to re-define the sampling function p and the weight function w. We follow the method proposed in [38]. For each node, only its adjacent nodes are sampled. The neighbor set for node vi is B(vi) = {vj | d(vj, vi) ≤ 1}. The sampling function p: B(vi) -> V can be written as p(vi, vj) = vj. In this way, the number of adjacent nodes is not fixed, nor is the weighting order. In order to have a fixed number of samples and a fixed order of weighting them, we label the neighbor nodes around the root node with a fixed number of partitions, and then weight these nodes based on their partition class. The specific partitioning method is illustrated in Fig. 4.
Therefore, the standard convolution above is re-written for graph convolution as follows.
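A reconstruction based on the description here and the formulation in [38]:

\[
f_{out}(v_i) = \sum_{v_j \in B(v_i)} \frac{1}{Z_i(v_j)}\, f_{in}\big(\mathbf{p}(v_i, v_j)\big) \cdot \mathbf{w}\big(l_i(v_j)\big),
\]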
where the normalization term Zi(vj) is to balance the contributions of different subsets to the output. According to the partition method mentioned above, we have:
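With the spatial-configuration partitioning of [38], the label function li can be reconstructed as:

\[
l_i(v_j) =
\begin{cases}
0, & \text{if } r_j = r_i \\
1, & \text{if } r_j < r_i \\
2, & \text{if } r_j > r_i
\end{cases}
\]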
where ri is the average distance from the gravity center to joint i over all frames in the training set.
4. Experiments
In this section, we present quantitative results of our experiments. Some qualitative results are shown in Fig. 5.
4.1. Dataset
PoseTrack [3] is a large-scale benchmark for human pose estimation and articulated tracking in videos. It provides publicly available training and validation sets as well as an evaluation server for benchmarking on a held-out test set. The benchmark is the basis for the challenge competitions at the ICCV'17 [1] and ECCV'18 [2] workshops. The dataset consisted of over 68,000 frames for the ICCV'17 challenge and was extended to twice as many frames for the ECCV'18 challenge. It now includes 593 training videos, 74 validation videos and 375 testing videos. For the held-out test set, at most four submissions per task can be made for the same approach. Evaluation on the validation set has no submission limit. Therefore, the ablation studies in Section 4.4 are performed on the validation set. Since the PoseTrack'18 test set is not open yet, we compare our results with other approaches in Sec. 4.5 on the PoseTrack'17 test set.
4.2. Evaluation Metrics
The evaluation includes pose estimation accuracy and pose tracking accuracy. Pose estimation accuracy is evaluated using the standard mAP metric, whereas pose tracking is evaluated according to the CLEAR MOT [5] metrics, the standard for evaluating multi-target tracking.
4.3. Implementation Details
We adopt state-of-the-art key-frame object detectors trained with the ImageNet and COCO datasets. Specifically, we use pre-trained models from Deformable ConvNets [9]. We conduct experiments on validation sets to choose the object detector with better recall rates. For the object detectors, we compare the deformable convolution versions of the R-FCN network [8] and of the FPN network [25], both with a ResNet101 backbone [16]. The FPN feature extractor is attached to the Fast R-CNN [13] head for detection. We compare the detection results with the ground truth based on the precision and recall rate on the PoseTrack'17 validation set. In order to eliminate redundant candidates, we drop candidates with low likelihood. As shown in Table 2, precision and recall of the detectors are given for various drop thresholds. Since the FPN network performs better, we choose it as our human candidate detector. During training, we infer ground truth bounding boxes of candidates from the annotated keypoints, because in the PoseTrack'17 dataset, bounding box positions are not provided in the annotations. Specifically, we locate a bounding box from the minimum and maximum coordinates of the 15 keypoints, and then enlarge this box by 20% both horizontally and vertically.
For the single-person human pose estimator, we adopt CPN101 [7] and MSRA152 [36] with slight modifications. We first train the networks with the merged dataset of PoseTrack'17 and COCO for 260 epochs. Then we finetune the network solely on PoseTrack'17 for 40 epochs in order to mitigate the inaccurate regression on head and neck. For COCO, bottom-head and top-head positions are not given; we infer these keypoints by interpolation on the annotated keypoints. We find that by finetuning on the PoseTrack dataset, the prediction of head keypoints is refined. During finetuning, we use the technique of online hard keypoint mining, only focusing on the losses from the 7 hardest keypoints out of the total 15 keypoints. Pose inference is performed online with a single thread.
For the pose matching module, we train a Siamese graph convolutional network with 2 GCN layers and 1 convolutional layer using contrastive loss. We take normalized keypoint coordinates as input; the output is a 128-dimensional feature vector. Following [38], we use spatial configuration partitioning as the sampling method for graph convolution and use learnable edge importance weighting. To train the Siamese network, we generate training data from the PoseTrack dataset. Specifically, we extract people with the same IDs within adjacent frames as positive pairs, and extract people with different IDs within the same frame and across frames as negative pairs. Hard negative pairs only include spatially overlapped poses. The number of collected pairs is listed in Table 1. We train the model with a batch size of 32 for a total of 200 epochs with the SGD optimizer. The initial learning rate is set to 0.001 and is decayed by 0.1 at epochs 40, 60, 80, and 100. Weight decay is 10^-4.
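A minimal sketch of this training setup, assuming PyTorch; the Siamese GCN model and the data loader (built with batch size 32) are passed in as arguments, and all names are illustrative rather than the project's actual code.

```python
import torch

def train_siamese_gcn(model, loader, margin=1.0, epochs=200):
    """Train the Siamese GCN with margin contrastive loss (illustrative sketch).

    model:  maps a batch of normalized keypoint coordinates to 128-d embeddings.
    loader: yields (pose_a, pose_b, label) with label = 1 for positive pairs.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[40, 60, 80, 100], gamma=0.1)
    for _ in range(epochs):
        for pose_a, pose_b, label in loader:
            label = label.float()
            dist = torch.nn.functional.pairwise_distance(model(pose_a), model(pose_b))
            # Margin contrastive loss: pull positive pairs together,
            # push negative pairs apart by at least `margin`.
            loss = (label * dist.pow(2)
                    + (1 - label) * torch.clamp(margin - dist, min=0).pow(2)).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```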
4.4. Ablation Study
We conducted a series of ablation studies to analyze the contribution of each component to the overall performance.
Detectors: We experimented with several detectors and decided to use Deformable ConvNets with ResNet101 as the backbone, Feature Pyramid Networks (FPN) for feature extraction, and the Fast R-CNN scheme as the detection head. As shown in Table 2, this detector performs better than Deformable R-FCN with the same backbone. It is no surprise that the better detector results in better performance on both pose estimation and pose tracking, as shown in Table 3.
Deformable convolution was proposed in DCN (Deformable ConvNets).
Offline vs. Online: We studied the effect of keyframe intervals in our online method and compared it with the offline method. For a fair comparison, we use an identical human candidate detector and pose estimator for both methods. For the offline method, we pre-compute human candidate detections and estimate the pose for each candidate, then we adopt a flow-based pose tracker [37], where pose flows are built by associating poses that indicate the same person across frames. For the online method, we perform truly online pose tracking. Since human candidate detection is performed only at key frames, the online performance varies with different intervals. In Table 4, we illustrate the performance of the offline method, compared with the online method given various keyframe intervals. Offline methods performed better than online methods, but we can see the great potential of online methods when the detections (DET) at keyframes are more accurate; the upper limit is achieved with ground-truth (GT) detections. As expected, more frequent keyframes help performance. Note that the online methods only use spatial consistency for data association at key frames. We report ablation experiments on the pose matching module below.
GCN vs. Spatial Consistency (SC): Next, we report results when pose matching is performed during the data association stage, compared with only employing spatial consistency. Table 5 shows that the tracking performance increases with GCN-based pose matching. However, in some situations, different people may have near-duplicate poses, as shown in Fig. 6. To mitigate such ambiguities, spatial consistency is considered prior to pose similarity.