Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval
In Proceedings British Machine Vision Conference 2014
http://dx.doi.org/10.5244/C.28.97
Abstract
Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given an image ("Im2Text"), and (ii) predicting image(s) given a piece of text ("Text2Im"). We make no assumption on the specific form of text; i.e., it could be a set of labels, phrases, or even captions. We pose both these tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an "independent" text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of "unannotated" images (i.e., images without any associated textual meta-data). We propose a novel Structural SVM-based unified formulation for these two tasks. For both visual and textual data, two types of representations are investigated. These are based on: (1) unimodal probability distributions over topics learned using latent Dirichlet allocation, and (2) explicitly learned multi-modal correlations using canonical correlation analysis. Extensive experiments on three popular datasets (two medium and one web-scale) demonstrate that our framework gives promising results compared to existing models under various settings, thus confirming its efficacy for both tasks.
Session
Poster Session
Files
Extended Abstract (PDF, 1 page, 799K)
Paper (PDF, 13 pages, 1.0M)
Supplemental Materials (ZIP, 89K)
Bibtex File
Citation
Yashaswi Verma and C. V. Jawahar. Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval. Proceedings of the British Machine Vision Conference. BMVA Press, September 2014.
BibTeX
@inproceedings{BMVC.28.97,
  title = {Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval},
  author = {Verma, Yashaswi and Jawahar, C. V.},
  year = {2014},
  booktitle = {Proceedings of the British Machine Vision Conference},
  publisher = {BMVA Press},
  editor = {Valstar, Michel and French, Andrew and Pridmore, Tony},
  doi = {http://dx.doi.org/10.5244/C.28.97}
}