Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task
Mateusz Malinowski, Ashkan Mokarian and Mario Fritz
Abstract
We present a method that pools over CNN representations of a large number of highly overlapping object proposals for the Visual Madlibs task. We show that such a representation, together with nCCA, a successful multimodal embedding technique, achieves state-of-the-art performance on this task. Moreover, inspired by nCCA's objective function, we extend the classical CNN+LSTM approach to train the network by maximizing the similarity between the internal representation of the deep learning architecture and the candidate answers. This approach also achieves a significant improvement over prior work that uses the CNN+LSTM approach on Visual Madlibs.
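The pooling and answer-selection ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that per-proposal CNN features and candidate-answer vectors are already available as plain lists of floats in a shared embedding space (in the paper this space would come from nCCA or the trained network), and the function names `mean_box_pool`, `cosine` and `pick_answer` are hypothetical.

```python
import math


def mean_box_pool(box_features):
    """Average per-proposal feature vectors into a single image vector.

    box_features: non-empty list of equal-length lists of floats,
    one per object proposal (assumed to be precomputed CNN features).
    """
    if not box_features:
        raise ValueError("expected at least one proposal feature vector")
    n = len(box_features)
    dim = len(box_features[0])
    return [sum(box[i] for box in box_features) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def pick_answer(image_vec, answer_vecs):
    """Return the index of the candidate answer most similar to the image.

    Assumes image_vec and every answer vector live in the same space.
    """
    scores = [cosine(image_vec, a) for a in answer_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: two proposals pooled into one vector, two candidate answers.
pooled = mean_box_pool([[1.0, 0.0], [3.0, 0.0]])
best = pick_answer(pooled, [[0.0, 1.0], [1.0, 0.0]])
```

Cosine similarity is used here only as one plausible similarity measure; the abstract does not specify which measure the paper uses.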
Session
Posters 2
Files
Extended Abstract (PDF, 392K)
Paper (PDF, 3M)
DOI
10.5244/C.30.111
https://dx.doi.org/10.5244/C.30.111
Citation
Mateusz Malinowski, Ashkan Mokarian and Mario Fritz. Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task. In Richard C. Wilson, Edwin R. Hancock and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 111.1-111.12. BMVA Press, September 2016.
Bibtex
@inproceedings{BMVC2016_111,
title={Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task},
author={Mateusz Malinowski and Ashkan Mokarian and Mario Fritz},
year={2016},
month={September},
pages={111.1-111.12},
articleno={111},
numpages={12},
booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
publisher={BMVA Press},
editor={Richard C. Wilson and Edwin R. Hancock and William A. P. Smith},
doi={10.5244/C.30.111},
isbn={1-901725-59-6},
url={https://dx.doi.org/10.5244/C.30.111}
}