Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval

Yashaswi Verma and C. V. Jawahar

In Proceedings of the British Machine Vision Conference 2014
http://dx.doi.org/10.5244/C.28.97

Abstract

Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given an image ("Im2Text"), and (ii) predicting image(s) given a piece of text ("Text2Im"). We make no assumption on the specific form of text; i.e., it could be a set of labels, phrases, or even captions. We pose both these tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an "independent" text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of "unannotated" images (i.e., images without any associated textual metadata). We propose a novel Structural SVM-based unified formulation for these two tasks. For both visual and textual data, two types of representations are investigated. These are based on: (1) unimodal probability distributions over topics learned using latent Dirichlet allocation, and (2) explicitly learned multi-modal correlations using canonical correlation analysis. Extensive experiments on three popular datasets (two medium-scale and one web-scale) demonstrate that our framework gives promising results compared to existing models under various settings, thus confirming its efficacy for both tasks.
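The retrieval setting described in the abstract can be illustrated with a minimal sketch. It is not the paper's Structural SVM formulation: it assumes the query image and the candidate texts have already been mapped to vectors in a common space (for example, by one of the two representations the abstract mentions) and swaps in plain cosine similarity as the scoring function. The function names, the candidate identifiers, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query.

    `candidates` maps an identifier to its vector. Assumption: the query and
    the candidates already live in the same vector space.
    """
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy Im2Text example: one query image and three texts from an independent corpus.
# The 3-dimensional vectors are made up for illustration only.
image_query = [0.9, 0.1, 0.0]
text_corpus = {
    "caption_a": [0.8, 0.2, 0.0],
    "caption_b": [0.0, 0.1, 0.9],
    "caption_c": [0.4, 0.4, 0.2],
}

ranking = rank_candidates(image_query, text_corpus)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Text2Im follows the same pattern with the roles reversed: the query is a text vector and the candidates are vectors of unannotated images. A learned scoring function, such as the one the paper proposes, would take the place of `cosine` here.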

Session

Poster Session

Files

Extended Abstract (PDF, 1 page, 799K)
Paper (PDF, 13 pages, 1.0M)
Supplemental Materials (ZIP, 89K)
BibTeX File

Citation

Yashaswi Verma and C. V. Jawahar. Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval. Proceedings of the British Machine Vision Conference. BMVA Press, September 2014.

BibTeX

@inproceedings{BMVC.28.97,
	title = {Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval},
	author = {Verma, Yashaswi and Jawahar, C. V.},
	year = {2014},
	booktitle = {Proceedings of the British Machine Vision Conference},
	publisher = {BMVA Press},
	editor = {Valstar, Michel and French, Andrew and Pridmore, Tony},
	doi = {10.5244/C.28.97}
}