Reference
Bansal A, Chen X, Russell B, Ramanan D, Gupta A. PixelNet: Towards a General Pixel-Level Architecture. arXiv preprint arXiv:1609.06694, 2016.
Bansal A, Russell B, Gupta A. Marr Revisited: 2D-3D Alignment via Surface Normal Prediction. arXiv preprint arXiv:1604.01347, 2016.
Summary
At first glance, the paper appears to propose new ideas only at the level of "implementation details". But these details matter so much that they greatly improve performance.
The objective of this paper is to propose a general CNN that makes dense, pixel-level predictions in different applications without any post-processing such as a CRF. The network is simple: a few alternating 3x3 convolution and pooling layers (call it the Conv part), followed by three 1x1 convolution layers that act like fully connected layers (call it the MLP part). The core of the model is the hypercolumn: a set of multiscale features collected from the Conv part for each pixel at the original image resolution, which then serves as the input to the MLP part for dense prediction.
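To make the architecture concrete, here is a minimal PyTorch sketch of the Conv part and MLP part described above; the number of blocks, channel sizes, and class count are illustrative placeholders, not the paper's exact (VGG-based) configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelNetSketch(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # "Conv part": alternating 3x3 convolutions and pooling.
        self.block1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block3 = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # "MLP part": 1x1 convolutions acting as per-pixel fully connected
        # layers on the concatenated hypercolumn (64 + 128 + 256 channels).
        self.mlp = nn.Sequential(
            nn.Conv2d(64 + 128 + 256, 512, 1), nn.ReLU(),
            nn.Conv2d(512, 512, 1), nn.ReLU(),
            nn.Conv2d(512, num_classes, 1))

    def forward(self, x):
        h, w = x.shape[-2:]
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        # Test-time path: upsample every feature map to the input resolution
        # and stack them into a dense hypercolumn for each pixel.
        up = lambda f: F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False)
        hypercolumn = torch.cat([up(f1), up(f2), up(f3)], dim=1)
        return self.mlp(hypercolumn)
```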
The main contribution of the paper is how to compute the hypercolumn efficiently. At test time it is straightforward: upsample each convolution layer to the original resolution and collect the corresponding value from every layer. At training time, however, the authors only want to make predictions on a small subset of pixels (~2%) in each image, so upsampling the whole feature maps would waste a huge amount of computation. The proposed solution is a sparse, on-demand sampling strategy: to compute the hypercolumn value at layer c_i for a pixel p, the four positions in c_i closest to the corresponding position of p are extracted and bilinearly interpolated into a single value for p.
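Here is a sketch of that sparse sampling step, assuming feature maps are given as a list of (C_i, H_i, W_i) tensors and pixel coordinates at image resolution; the exact coordinate alignment used in the paper's implementation may differ.

```python
import torch

def sample_hypercolumn(feature_maps, pixels, image_size):
    """Bilinearly sample hypercolumn features for a sparse set of pixels.

    feature_maps: list of (C_i, H_i, W_i) tensors from the Conv part.
    pixels:       (N, 2) long tensor of (row, col) image coordinates.
    image_size:   (H, W) of the input image.
    Returns an (N, sum_i C_i) tensor: one hypercolumn per sampled pixel.
    """
    H, W = image_size
    columns = []
    for fmap in feature_maps:
        _, h_i, w_i = fmap.shape
        # Map image coordinates to this layer's (fractional) grid coordinates.
        r = pixels[:, 0].float() * (h_i - 1) / (H - 1)
        c = pixels[:, 1].float() * (w_i - 1) / (W - 1)
        r0, c0 = r.floor().long(), c.floor().long()
        r1, c1 = (r0 + 1).clamp(max=h_i - 1), (c0 + 1).clamp(max=w_i - 1)
        wr, wc = r - r0.float(), c - c0.float()
        # Interpolate the four positions closest to p in this feature map.
        top = fmap[:, r0, c0] * (1 - wc) + fmap[:, r0, c1] * wc
        bot = fmap[:, r1, c0] * (1 - wc) + fmap[:, r1, c1] * wc
        columns.append((top * (1 - wr) + bot * wr).t())  # (N, C_i)
    return torch.cat(columns, dim=1)
```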
The philosophy behind this design is as follows. Suppose each training step can process 10,000 pixels and the image size is 100x100. Normally, one would train on a single whole 100x100 image per step. A more effective way is to run the same iteration with 50 images while selecting only 200 pixels per image: pixels within one image are highly correlated, so spreading the budget across many images yields more diverse and informative gradients. This motivates the on-demand sampling strategy for computing hypercolumns during training, which achieves a good trade-off between amortized computation and reduced storage.
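As a tiny illustration of this batch construction (in NumPy; the numbers simply follow the example figures above, and the dataset size is a hypothetical parameter):

```python
import numpy as np

def sample_batch(dataset_size, num_images=50, pixels_per_image=200, image_shape=(100, 100)):
    """Draw a mini-batch of (image_index, row, col) triples.

    50 images x 200 pixels = 10,000 pixels -- the same budget as one
    full 100x100 image, but spread over many images.
    """
    h, w = image_shape
    batch = []
    for i in np.random.choice(dataset_size, size=num_images, replace=False):
        flat = np.random.choice(h * w, size=pixels_per_image, replace=False)
        batch.extend((i, p // w, p % w) for p in flat)
    return batch
```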
Related Papers
Kokkinos I. UberNet: Training a 'Universal' Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory. arXiv preprint arXiv:1609.02132, 2016.