Drone Vision Challenge | ICCV 2019 Workshop: VisDrone2019

May 5, 2019 · PaperWeekly

VisDrone 2019

The VisDrone 2019 Challenge will be held at the ICCV 2019 workshop "Vision Meets Drone: A Challenge" (VisDrone2019 for short) in October 2019 in Seoul, Korea, addressing object detection and tracking in visual data captured by drones. The VisDrone2019 dataset was collected by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China. We invite researchers to participate in the challenge, to evaluate and discuss their research at the workshop, and to submit papers describing research, experiments, or applications based on the VisDrone2019 dataset.

  • 14 different cities spanning thousands of kilometers
  • 272,117 video frames/images
  • 2.6 million bounding boxes


Four Tasks

Task 1: object detection in images

The task aims to detect objects of predefined categories (e.g., cars and pedestrians) from individual images taken from drones. 

Task 2: object detection in videos 

The task is similar to Task 1, except that objects are required to be detected from videos. 

Task 3: single-object tracking

 The task aims to estimate the state of a target, indicated in the first frame, in the subsequent video frames. 

Task 4: multi-object tracking

The task aims to recover the trajectories of objects in each video frame.
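Detection performance in the tasks above is typically scored by matching predicted boxes to ground truth via Intersection-over-Union (IoU). A minimal sketch of the standard computation; the function name and the (x1, y1, x2, y2) box convention are illustrative, not taken from the VisDrone toolkit:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A prediction is usually counted as a true positive when its IoU with an unmatched ground-truth box exceeds a threshold such as 0.5.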


Important Dates 

 Website open: April 25, 2019
 Data available: April 25, 2019
 Submission deadline: TBD
 Author notification: TBD
 Workshop date: TBD
 Camera-ready due: TBD


Organizers

Pengfei Zhu

Tianjin University 

Longyin Wen

JD Digits

Dawei Du

University at Albany, SUNY

Xiao Bian

GE Global Research

Qinghua Hu

Tianjin University

Haibin Ling

Temple University

Advisory Committee

  • Liefeng Bo (JD Digits, USA)

  • Hamilton Scott Clouse (US Air Force Research)

  • Liyi Dai (US Army Research Office)

  • Riad I. Hammoud (BAE Systems, USA)

  • David Jacobs (Univ. Maryland College Park, USA)

  • Siwei Lyu (Univ. at Albany, SUNY, USA)

  • Stan Z. Li (Institute of Automation, Chinese Academy of Sciences, China)

  • Fuxin Li (Oregon State Univ., USA)

  • Anton Milan (Amazon Research and Development Center, Germany)

  • Hailin Shi (JD AI Research)

  • Siyu Tang (Max Planck Institute for Intelligent Systems, Germany)

Technical Committee

Hailin Shi

JD AI Research

Tao Peng

Tianjin University

Jiayu Zheng

Tianjin University

Yue Si

JD AI Research

Xiaolu Li

Tianjin University

Wenya Ma

Tianjin University



ICCV is the IEEE International Conference on Computer Vision. Organized by the IEEE, it is, together with the Conference on Computer Vision and Pattern Recognition (CVPR) and the European Conference on Computer Vision (ECCV), one of the three top conferences in computer vision. It is rated as a top-tier venue by the Australian ICT conference ranking and by bodies such as the China Computer Federation, and is held in very high regard in the field. Unlike CVPR, which takes place annually in the United States, and ECCV, which is held only in Europe, ICCV is held every two years at venues around the world. Its acceptance rate is very low, and it is widely regarded as the most selective of the three. The conference usually runs four to five days, during which experts in related areas present their latest research results.

We present FAST NAVIGATOR, a general framework for action decoding, which yields state-of-the-art results on the recent Room-to-Room (R2R) Vision-and-Language navigation challenge of Anderson et al. (2018). Given a natural language instruction and photo-realistic image views of a previously unseen environment, the agent must navigate from a source to a target location as quickly as possible. While all current approaches make local action decisions or score entire trajectories with beam search, our framework seamlessly balances local and global signals when exploring the environment. Importantly, this allows us to act greedily but use global signals to backtrack when necessary. Our FAST framework, applied to existing models, yields a 17% relative gain over the previous state of the art, an absolute 6% gain in success rate weighted by path length (SPL).
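The SPL metric mentioned above (Anderson et al., 2018) weights each episode's success by the ratio of the shortest-path length to the length of the path the agent actually took. A minimal sketch; the function and argument names are illustrative:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    where S_i is 1 on success, l_i the shortest-path length,
    and p_i the agent's actual path length."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

An agent that succeeds but wanders twice the shortest distance earns only half credit for that episode, which is why SPL is stricter than raw success rate.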


Object detectors tend to perform poorly in new or open domains and require exhaustive yet costly annotations from fully labeled datasets. We aim to benefit from several datasets with different categories without additional labeling, not only to increase the number of detected categories but also to take advantage of transfer learning and to enhance domain independence. Our dataset merging procedure starts by training several initial Faster R-CNN models on the different datasets while considering the complementary datasets' images for domain adaptation. As in self-training methods, the predictions of these initial detectors mitigate the missing annotations on the complementary datasets. The final OMNIA Faster R-CNN is trained with all categories on the union of the datasets enriched by predictions. The joint training handles unsafe targets with a new classification loss called SoftSig in a softly supervised way. Experimental results show that in the case of fashion detection in the wild, merging Modanet with COCO increases the final performance from 45.5% to 57.4%. Applying our soft distillation to detection with domain shift on Cityscapes beats the state of the art by 5.3 points. We hope that our methodology can unlock object detection for real-world applications without immense datasets.
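The merging step can be pictured as enriching each dataset's ground truth with confident predictions from a detector trained on the complementary dataset. A hedged sketch of that self-training idea; the dictionary fields and threshold are hypothetical and do not reproduce the paper's actual SoftSig mechanism:

```python
def merge_annotations(gt_boxes, pseudo_boxes, score_threshold=0.7):
    """Keep the dataset's own ground-truth boxes and add confident
    predictions from a detector trained on the complementary dataset,
    marking them as pseudo-labels."""
    merged = list(gt_boxes)
    for box in pseudo_boxes:
        # Only high-confidence predictions become pseudo-labels.
        if box["score"] >= score_threshold:
            merged.append({**box, "pseudo": True})
    return merged
```

The final detector is then trained on the union of all categories, with the pseudo-labeled targets handled by a softer loss than the fully trusted ones.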


This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon the latest advancements in human detection and video understanding. Our method operates in two stages: keypoint estimation in frames or short clips, followed by lightweight tracking to generate keypoint predictions linked over the entire video. For frame-level pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D extension of this model, which leverages temporal information over small clips to generate more robust frame predictions. We conduct extensive ablative experiments on the newly released multi-person video pose estimation benchmark, PoseTrack, to validate various design choices of our model. Our approach achieves an accuracy of 55.2% on the validation set and 51.8% on the test set using the Multi-Object Tracking Accuracy (MOTA) metric, and achieves state-of-the-art performance on the ICCV 2017 PoseTrack keypoint tracking challenge.
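MOTA, the metric used above, penalizes false negatives, false positives, and identity switches relative to the total number of ground-truth objects across all frames. A minimal sketch of the standard CLEAR MOT formula (the function name is illustrative):

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with the error counts and the
    ground-truth object count summed over all frames. Can be negative
    when a tracker makes more errors than there are objects."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt
```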


In this paper we present a large-scale visual object detection and tracking benchmark, named VisDrone2018, aimed at advancing visual understanding tasks on the drone platform. The images and video sequences in the benchmark were captured over various urban and suburban areas of 14 different cities across China, from north to south. Specifically, VisDrone2018 consists of 263 video clips and 10,209 images (with no overlap between the two) with rich annotations, including object bounding boxes, object categories, occlusion flags, truncation ratios, etc. Through an intensive annotation effort, our benchmark contains more than 2.5 million annotated instances in 179,264 images/video frames. Being the largest such dataset ever published, the benchmark enables extensive evaluation and investigation of visual analysis algorithms on the drone platform. In particular, we design four popular tasks with the benchmark: object detection in images, object detection in videos, single-object tracking, and multi-object tracking. All these tasks are extremely challenging in the proposed dataset due to factors such as occlusion, large scale and pose variation, and fast motion. We hope the benchmark will greatly boost research and development in visual analysis on drone platforms.
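For readers who download the dataset, each line of an annotation file describes one object as comma-separated integers. A hedged parsing sketch; the field order here is an assumption based on the released devkit and should be verified against the official toolkit:

```python
def parse_visdrone_line(line):
    """Parse one object annotation from a VisDrone detection ground-truth
    file. Assumed field order (verify against the official devkit):
    bbox_left, bbox_top, bbox_width, bbox_height, score, category,
    truncation, occlusion."""
    fields = [int(v) for v in line.strip().split(",")[:8]]
    keys = ("bbox_left", "bbox_top", "bbox_width", "bbox_height",
            "score", "category", "truncation", "occlusion")
    return dict(zip(keys, fields))
```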


Tracking humans that are interacting with other subjects or the environment remains unsolved in visual tracking, because the visibility of the humans of interest in videos is unknown and may vary over time. In particular, it is still difficult for state-of-the-art human trackers to recover complete human trajectories in crowded scenes with frequent human interactions. In this work, we consider the visibility status of a subject as a fluent variable, whose change is mostly attributed to the subject's interaction with the surroundings, e.g., crossing behind another object, entering a building, or getting into a vehicle. We introduce a Causal And-Or Graph (C-AOG) to represent the causal-effect relations between an object's visibility fluent and its activities, and develop a probabilistic graph model to jointly reason about visibility fluent changes (e.g., from visible to invisible) and track humans in videos. We formulate this joint task as an iterative search over feasible causal graph structures, which enables fast search algorithms, e.g., dynamic programming. We apply the proposed method to challenging video sequences to evaluate its capability to estimate visibility fluent changes of subjects and to track subjects of interest over time. Results with comparisons demonstrate that our method outperforms alternative trackers and can recover complete trajectories of humans in complicated scenarios with frequent human interactions.


In the same vein as discriminative one-shot learning, Siamese networks can recognize an object from a single exemplar with the same class label. However, they do not take advantage of the underlying structure of the data and the relationships among the multitude of samples, as they rely only on pairs of instances for training. In this paper, we propose a new quadruplet deep network to examine the potential connections among the training instances, aiming to achieve a more powerful representation. We design four shared networks that receive multi-tuples of instances as inputs and are connected by a novel loss function consisting of a pair loss and a triplet loss. According to the similarity metric, we select the most similar and the most dissimilar instances from each multi-tuple as the positive and negative inputs of the triplet loss. We show that this scheme improves training performance. Furthermore, we introduce a new weight layer to automatically select suitable combination weights, avoiding the conflict between the triplet and pair losses that would otherwise degrade performance. We evaluate our quadruplet framework by model-free tracking-by-detection of objects from a single initial exemplar on several Visual Object Tracking benchmarks. Our extensive experimental analysis demonstrates that our tracker achieves superior performance at a real-time processing speed of 78 frames per second (fps).
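The combined pair + triplet objective described above can be sketched for a single (anchor, positive, negative) multi-tuple as follows; the margin, the fixed weighting, and the function names are illustrative, not the paper's exact formulation (the paper learns the combination weights with a dedicated layer):

```python
import numpy as np

def combined_loss(anchor, positive, negative, margin=0.5, alpha=0.5):
    """Toy pair + triplet loss on embedding vectors.
    The pair term pulls anchor and positive together; the triplet term
    enforces d(anchor, positive) + margin < d(anchor, negative)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    pair = d_ap ** 2                              # contrastive pull
    triplet = max(0.0, d_ap - d_an + margin)      # hinge on the margin
    return alpha * pair + (1 - alpha) * triplet
```

With a fixed alpha the two terms can pull in conflicting directions, which motivates the paper's learned weight layer.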


We explore object discovery and detector adaptation based on unlabeled video sequences captured from a mobile platform. We propose a fully automatic approach for object mining from video which builds upon a generic object tracking approach. By applying this method to three large video datasets from autonomous driving and mobile robotics scenarios, we demonstrate its robustness and generality. Based on the object mining results, we propose a novel approach for unsupervised object discovery by appearance-based clustering. We show that this approach successfully discovers interesting objects relevant to driving scenarios. In addition, we perform self-supervised detector adaptation in order to improve detection performance on the KITTI dataset for existing categories. Our approach has direct relevance for enabling large-scale object learning for autonomous driving.


We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
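The standard VQA accuracy referenced above credits an answer in proportion to how many of the ten human annotators gave it, capped at three matches. A minimal sketch (the official evaluation additionally averages over annotator subsets and normalizes answer strings, both of which this omits):

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: an answer counts as fully correct if at
    least 3 of the ~10 human annotators gave it: min(#matches / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)
```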

Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation
Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, Siddhartha Srinivasa

OMNIA Faster R-CNN: Detection in the wild through dataset merging and soft distillation
Alexandre Rame, Emilien Garreau, Hedi Ben-Younes, Charles Ollion

Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran

Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, Qinghua Hu

Yuanlu Xu, Lei Qin, Xiaobai Liu, Jianwen Xie, Song-Chun Zhu

Xingping Dong, Jianbing Shen, Yu Liu, Wenguan Wang, Fatih Porikli

Aljoša Ošep, Paul Voigtlaender, Jonathon Luiten, Stefan Breuers, Bastian Leibe

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh