Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task

Mateusz Malinowski, Ashkan Mokarian and Mario Fritz

Abstract

We present a method that pools over CNN representations of a large number of highly overlapping object proposals for the Visual Madlibs task. We show that such a representation, together with nCCA, a successful multimodal embedding technique, achieves state-of-the-art performance on this task. Moreover, inspired by the nCCA objective function, we extend the classical CNN+LSTM approach to train the network by maximizing the similarity between the internal representation of the deep learning architecture and the candidate answers. This approach also achieves a significant improvement over prior work that uses a CNN+LSTM approach on Visual Madlibs.
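
The pooling step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-proposal CNN features are assumed to be precomputed, the function names `mean_box_pooling` and `rank_answers` are hypothetical, and plain cosine similarity stands in for the learned nCCA embedding.

```python
import numpy as np


def mean_box_pooling(box_features):
    """Average per-proposal CNN features into a single image vector.

    box_features: array of shape (num_boxes, feature_dim), assumed to hold
    precomputed CNN activations, one row per object proposal.
    """
    box_features = np.asarray(box_features, dtype=np.float64)
    if box_features.ndim != 2 or box_features.shape[0] == 0:
        raise ValueError("expected a non-empty (num_boxes, feature_dim) array")
    return box_features.mean(axis=0)


def rank_answers(image_vec, answer_vecs):
    """Rank candidate answers by cosine similarity to the image vector.

    Assumes the image and the answers already live in a shared embedding
    space; cosine similarity is only a stand-in for a learned embedding.
    Returns (indices sorted best-first, raw similarity scores).
    """
    image_vec = np.asarray(image_vec, dtype=np.float64)
    answer_vecs = np.asarray(answer_vecs, dtype=np.float64)
    img = image_vec / (np.linalg.norm(image_vec) + 1e-12)
    norms = np.linalg.norm(answer_vecs, axis=1, keepdims=True) + 1e-12
    scores = (answer_vecs / norms) @ img
    return np.argsort(-scores), scores
```

In a multiple-choice setting, the pooled image vector would be compared against each candidate answer and the highest-scoring candidate selected.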

Session

Posters 2

Files

Extended Abstract (PDF, 392K)

Paper (PDF, 3M)

DOI

10.5244/C.30.111
https://dx.doi.org/10.5244/C.30.111

Citation

Mateusz Malinowski, Ashkan Mokarian and Mario Fritz. Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task. In Richard C. Wilson, Edwin R. Hancock and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 111.1-111.12. BMVA Press, September 2016.

Bibtex

        @inproceedings{BMVC2016_111,
        	title={Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task},
        	author={Mateusz Malinowski and Ashkan Mokarian and Mario Fritz},
        	year={2016},
        	month={September},
        	pages={111.1--111.12},
        	articleno={111},
        	numpages={12},
        	booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
        	publisher={BMVA Press},
        	editor={Richard C. Wilson and Edwin R. Hancock and William A. P. Smith},
        	doi={10.5244/C.30.111},
        	isbn={1-901725-59-6},
        	url={https://dx.doi.org/10.5244/C.30.111}
        }