Annals of Mathematical Sciences and Applications

Volume 3 (2018)

Number 1

Special issue in honor of Professor David Mumford, dedicated to the memory of Jennifer Mumford

Guest Editors: Stuart Geman, David Gu, Stanley Osher, Chi-Wang Shu, Yang Wang, and Shing-Tung Yau

Visual concepts and compositional voting

Pages: 151–188

DOI: https://dx.doi.org/10.4310/AMSA.2018.v3.n1.a5

Authors

Jianyu Wang (Baidu Research USA, Sunnyvale, California, U.S.A.)

Zhishuai Zhang (Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, U.S.A.)

Cihang Xie (Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, U.S.A.)

Yuyin Zhou (Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, U.S.A.)

Vittal Premachandran (Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, U.S.A.)

Jun Zhu (Intelligent Driving Group, Baidu USA, Sunnyvale, California, U.S.A.)

Lingxi Xie (Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, U.S.A.)

Alan Yuille (Department of Cognitive Science and Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, U.S.A.)

Abstract

It is very attractive to formulate vision in terms of pattern theory, where patterns are defined hierarchically by compositions of elementary building blocks. But applying pattern theory to real-world images is very challenging and is currently less successful than discriminative methods such as deep networks. Deep networks, however, are black boxes which are hard to interpret and, as we will show, can easily be fooled by adding occluding objects. It is natural to wonder whether, by better understanding deep networks, we can extract building blocks which can be used to develop pattern-theoretic models. This motivates us to study the internal feature vectors of a deep network using images of vehicles from the PASCAL3D+ dataset with the scale of objects fixed. We use clustering algorithms, such as $K$-means, to study the population activity of the features and extract a set of visual concepts which we show are visually tight and correspond to semantic parts of the vehicles. To analyze this in more detail, we annotate these vehicles by their semantic parts to create a new dataset which we call VehicleSemanticParts, and evaluate visual concepts as unsupervised semantic part detectors. Our results show that visual concepts perform fairly well but are outperformed by supervised discriminative methods such as Support Vector Machines. We next give a more detailed analysis of visual concepts and how they relate to semantic parts. Following this analysis, we use the visual concepts as building blocks for a simple pattern-theoretic model, which we call compositional voting. In this model several visual concepts combine to detect semantic parts. We show that this approach is significantly better than discriminative methods like Support Vector Machines and deep networks trained specifically for semantic part detection. Finally, we return to studying occlusion by creating an annotated dataset with occlusion, called Vehicle Occlusion, and show that compositional voting outperforms even deep networks when the amount of occlusion becomes large.
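To make the two stages of the abstract concrete, the following is a minimal sketch in Python: visual concepts are obtained by $K$-means clustering of (here randomly generated stand-in) deep-network feature vectors, and a semantic part is then detected by pooling weighted evidence, i.e. "votes", from several concepts. All dimensions, the value of $K$, the exponential response function, and the vote weights are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# --- Step 1: extract visual concepts by clustering deep features ---
# 'features' stands in for L2-normalized feature vectors taken from an
# intermediate layer of a deep network at many image positions.
rng = np.random.default_rng(0)
features = rng.standard_normal((2000, 64)).astype(np.float32)
features /= np.linalg.norm(features, axis=1, keepdims=True)

K = 50  # number of visual concepts (clusters); illustrative choice
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
concepts = kmeans.cluster_centers_  # row k = center of visual concept k

# --- Step 2: compositional voting for one semantic part ---
def concept_responses(patch_features: np.ndarray) -> np.ndarray:
    """Distance of each patch feature to each concept center, turned
    into soft evidence: nearer center -> stronger response."""
    d = np.linalg.norm(patch_features[:, None, :] - concepts[None, :, :], axis=2)
    return np.exp(-d)

def part_score(responses: np.ndarray, vote_weights: np.ndarray) -> np.ndarray:
    """Several concepts pool weighted evidence (votes) for one part,
    giving one score per image position."""
    return responses @ vote_weights

# Toy usage: 100 positions; 5 concepts assumed to support the part.
patches = rng.standard_normal((100, 64)).astype(np.float32)
patches /= np.linalg.norm(patches, axis=1, keepdims=True)
weights = np.zeros(K)
weights[[3, 17, 21, 34, 42]] = 1.0 / 5  # hypothetical supporting concepts
scores = part_score(concept_responses(patches), weights)
detections = scores > scores.mean() + scores.std()  # crude threshold
```

Because each concept contributes only a partial vote, the pooled score degrades gracefully when some supporting concepts are silenced, which is the intuition behind the robustness to occlusion reported in the abstract.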

Keywords

pattern theory, deep networks, visual concepts

We gratefully acknowledge support from the National Science Foundation through NSF STC award CCF-1231216 (Center for Brains, Minds, and Machines) and NSF Expeditions in Computing award CCF-1317376. We would also like to acknowledge support from the Office of Naval Research (ONR) under award N00014-15-1-2356.

Received 27 July 2017

Published 27 March 2018