A Year in Computer Vision

25 November 2017 · Machine Learning Research Group

Introduction

Computer Vision typically refers to the scientific discipline of giving machines the ability of sight, or perhaps more colourfully, enabling machines to visually analyse their environments and the stimuli within them. This process typically involves the evaluation of an image, images or video. The British Machine Vision Association (BMVA) defines Computer Vision as the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images.[1]

The term understanding provides an interesting counterpoint to an otherwise mechanical definition of vision, one which serves to demonstrate both the significance and complexity of the Computer Vision field. True understanding of our environment is not achieved through visual representations alone. Rather, visual cues travel through the optic nerve to the primary visual cortex and are interpreted by the brain, in a highly stylised sense. The interpretations drawn from this sensory information encompass the near-totality of our natural programming and subjective experiences, i.e. how evolution has wired us to survive and what we learn about the world throughout our lives.

In this respect, vision only relates to the transmission of images for interpretation, while computing said images is more analogous to thought or cognition, drawing on a multitude of the brain's faculties. Hence, many believe that Computer Vision, a true understanding of visual environments and their contexts, paves the way for future iterations of Strong Artificial Intelligence, due to its cross-domain mastery.

However, put down the pitchforks, as we're still very much in the embryonic stages of this fascinating field. This piece simply aims to shed some light on 2016's biggest Computer Vision advancements, and hopefully to ground some of these advancements in a healthy mix of expected near-term societal interactions and, where applicable, tongue-in-cheek prognostications of the end of life as we know it.

While our work is always written to be as accessible as possible, sections within this particular piece may be oblique at times due to the subject matter. We do provide rudimentary definitions throughout; however, these only convey a facile understanding of key concepts. In keeping our focus on work produced in 2016, omissions are often made in the interest of brevity.

One such glaring omission relates to the functionality of Convolutional Neural Networks (hereafter CNNs or ConvNets), which are ubiquitous within the field of Computer Vision. The success of AlexNet[2] in 2012, a CNN architecture which blindsided ImageNet competitors, proved the instigator of a de facto revolution within the field, with numerous researchers adopting neural network-based approaches as part of Computer Vision's new period of 'normal science'.[3]

Over four years later, CNN variants still make up the bulk of new neural network architectures for vision tasks, with researchers reassembling them like Lego bricks; a working testament to the power of both open source information and Deep Learning. However, an explanation of CNNs could easily span several postings and is best left to those with a deeper expertise on the subject and an affinity for making the complex understandable.

For casual readers who wish to gain a quick grounding before proceeding we recommend the first two resources below. For those who wish to go further still, we have ordered the resources below to facilitate that:

  • What a Deep Neural Network thinks about your #selfie from Andrej Karpathy is one of our favourites for helping people understand the applications and functionalities behind CNNs.[4] 

  • Quora: “what is a convolutional neural network?” - Has no shortage of great links and explanations. Particularly suited to those with no prior understanding.[5]

  • CS231n: Convolutional Neural Networks for Visual Recognition from Stanford University is an excellent resource for more depth.[6]

  • Deep Learning (Goodfellow, Bengio & Courville, 2016) provides detailed explanations of CNN features and functionality in Chapter 9. The textbook has been kindly made available for free in HTML format by the authors.[7]

For those wishing to understand more about Neural Networks and Deep Learning in general we suggest:

  • Neural Networks and Deep Learning (Nielsen, 2017) is a free online textbook which provides the reader with a really intuitive understanding of the complexities of Neural Networks and Deep Learning. Even just completing chapter one should greatly illuminate the subject matter of this piece for first-timers.[8]

As a whole this piece is disjointed and spasmodic, a reflection of the authors’ excitement and the spirit in which it was intended to be utilised, section by section. Information is partitioned using our own heuristics and judgements, a necessary compromise due to the cross-domain influence of much of the work presented.

We hope that readers benefit from our aggregation of the information here to further their own knowledge, regardless of previous experience.

From all our contributors,

The M Tank

Part One: Classification/Localisation, Object Detection, Object Tracking

Classification/Localisation

The task of classification, when it relates to images, generally refers to assigning a label to the whole image, e.g. ‘cat’. Assuming this, Localisation may then refer to finding where the object is in said image, usually denoted by the output of some form of bounding box around the object. Current classification/localisation techniques on ImageNet[9] have likely surpassed an ensemble of trained humans.[10] For this reason, we place greater emphasis on subsequent sections of the blog.
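The split between the two tasks can be pictured as two output heads on a shared backbone: one head produces class scores, the other regresses four bounding-box values. A minimal NumPy sketch (all names, shapes and weights here are illustrative, not taken from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pooled backbone features of one image (e.g. CNN activations).
features = rng.standard_normal(512)

# Classification head: features -> scores over 1000 ImageNet classes.
W_cls = rng.standard_normal((1000, 512)) * 0.01
class_scores = W_cls @ features
predicted_class = int(np.argmax(class_scores))  # the image-level label

# Localisation head: features -> 4 bounding-box values (x, y, w, h).
W_box = rng.standard_normal((4, 512)) * 0.01
box = W_box @ features
```

In practice both heads are trained jointly, with a classification loss on the scores and a regression loss on the box coordinates.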

Figure 1: Computer Vision Tasks

Source: Fei-Fei Li, Andrej Karpathy & Justin Johnson (2016) cs231n, Lecture 8 - Slide 8, Spatial Localization and Detection (01/02/2016). Available: http://cs231n.stanford.edu/slides/2016/winter1516_lecture8.pdf

However, the introduction of larger datasets with an increased number of classes[11] will likely provide new metrics for progress in the near future. On that point, François Chollet, the creator of Keras,[12] has applied new techniques, including the popular architecture Xception, to an internal Google dataset with over 350 million multi-label images containing 17,000 classes.[13],[14]
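Moving from single-label to multi-label classification changes the output layer: a softmax forces exactly one winner, whereas an independent sigmoid per class lets several labels fire at once. A small illustrative sketch of the difference (the numbers are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.5, -1.0, 0.5])

# Single-label: probabilities compete and sum to 1; one class wins.
p_single = softmax(logits)

# Multi-label: each class gets an independent probability, so several
# labels can be active above a chosen threshold at the same time.
p_multi = sigmoid(logits)
active = p_multi > 0.5
```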


Figure 2: Classification/Localisation results from ILSVRC (2010-2016)

Note: ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The jump in results from 2011 to 2012 resulted from the AlexNet submission. For a review of the challenge requirements relating to Classification and Localization see: http://www.image-net.org/challenges/LSVRC/2016/index#comp

Source: Jia Deng (2016). ILSVRC2016 object localisation: introduction, results. Slide 2. Available: http://image-net.org/challenges/talks/2016/ILSVRC2016_10_09_clsloc.pdf

Interesting takeaways from the ImageNet LSVRC (2016):

  • Scene Classification refers to the task of labelling an image with a certain scene class like ‘greenhouse’, ‘stadium’, ‘cathedral’, etc. ImageNet held a Scene Classification challenge last year with a subset of the Places2[15] dataset: 8 million images for training with 365 scene categories. Hikvision[16] won with a 9% top-5 error using an ensemble of deep Inception-style networks and not-so-deep residual networks.

  • Trimps-Soushen won the ImageNet Classification task with 2.99% top-5 classification error and 7.71% localisation error. The team employed an ensemble for classification (averaging the results of Inception, Inception-ResNet, ResNet and Wide Residual Networks models[17]) and Faster R-CNN for localisation based on the labels.[18] The dataset was distributed across 1000 image classes, with 1.2 million images provided as training data. The partitioned test data comprised a further 100,000 unseen images.

  • ResNeXt by Facebook came a close second in top-5 classification error with 3.03% by using a new architecture that extends the original ResNet architecture.[19] 
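Two ideas recur throughout these results: averaging the predictions of several models into an ensemble, and scoring with top-5 error (the fraction of images whose true label is absent from the five highest-scoring classes). Both can be sketched in a few lines; the "models" below are randomly simulated stand-ins, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_images, n_classes = 100, 1000
true_labels = rng.integers(0, n_classes, size=n_images)

# Simulate per-model softmax outputs; a real ensemble would average the
# predictions of Inception, Inception-ResNet, ResNet, etc.
def fake_model_probs():
    logits = rng.standard_normal((n_images, n_classes))
    logits[np.arange(n_images), true_labels] += 5.0  # models are mostly right
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Ensemble: average the class probabilities of four models.
ensemble = np.mean([fake_model_probs() for _ in range(4)], axis=0)

# Top-5 error: true label missing from the 5 highest-scoring classes.
top5 = np.argsort(ensemble, axis=1)[:, -5:]
top5_error = 1.0 - np.mean([true_labels[i] in top5[i]
                            for i in range(n_images)])
```

Averaging probabilities tends to cancel out the idiosyncratic mistakes of individual models, which is why the winning entries above are ensembles rather than single networks.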


Link:

http://www.themtank.org/a-year-in-computer-vision


Original link:

https://m.weibo.cn/1402400261/4177997059278344
