Cross-modal Retrieval via Memory Network
Ge Song and Xiaoyang Tan
Abstract
With the explosive growth of multimedia data on the Internet, cross-modal retrieval has attracted a great deal of attention in the computer vision and multimedia communities. However, this task is very challenging due to the heterogeneity gap between different modalities. Current approaches typically involve a common representation learning process that maps different data into a common space by linear or nonlinear functions. Yet most of them 1) only handle the dual-modal situation and generalize poorly to more complex cases; 2) require example-level alignment of training data, which is often prohibitively expensive in practical applications; and 3) do not fully exploit prior knowledge about different modalities during the mapping process. In this paper, we address the above issues by casting common representation learning as a Question Answering problem via a cross-modal memory neural network (CMMN). Specifically, the raw features of all modalities are regarded as 'Questions', and an extra discriminator is exploited to select high-quality ones as 'Statements' for storage, whereby the common features are the desired 'Answers'.
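The abstract only sketches the Question/Statement/Answer analogy and gives no code. Below is a minimal, hypothetical PyTorch sketch of what a memory-style common-representation module along these lines could look like: modality-specific encoders produce the 'Question', a soft-attention read over stored memory slots plays the role of the 'Statements', and the read-out is the common-space 'Answer'. The class name, dimensions, and read mechanism are illustrative assumptions, not the authors' actual CMMN.

# Hypothetical sketch of a memory-network-style common representation module.
# Names and shapes are assumptions for illustration, not the released CMMN code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMemorySketch(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=300, common_dim=128, memory_size=64):
        super().__init__()
        # Modality-specific encoders map raw features ("Questions") into a shared space.
        self.img_encoder = nn.Linear(img_dim, common_dim)
        self.txt_encoder = nn.Linear(txt_dim, common_dim)
        # Learnable memory slots stand in for the stored "Statements".
        self.memory_keys = nn.Parameter(torch.randn(memory_size, common_dim))
        self.memory_values = nn.Parameter(torch.randn(memory_size, common_dim))

    def forward(self, features, modality):
        # Encode the query ("Question") with its own modality's encoder.
        encoder = self.img_encoder if modality == "image" else self.txt_encoder
        query = encoder(features)                                # (batch, common_dim)
        # Soft addressing over memory, as in end-to-end memory networks.
        attn = F.softmax(query @ self.memory_keys.t(), dim=-1)   # (batch, memory_size)
        read = attn @ self.memory_values                         # (batch, common_dim)
        # Query plus memory read-out forms the common representation ("Answer").
        return F.normalize(query + read, dim=-1)

# Usage: embed an image and a caption, then compare with cosine similarity.
model = CrossModalMemorySketch()
img = torch.randn(2, 4096)
txt = torch.randn(2, 300)
sim = model(img, "image") @ model(txt, "text").t()               # (2, 2) similarity matrix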
Session
Posters
Files
Paper (PDF)
DOI
10.5244/C.31.178
https://dx.doi.org/10.5244/C.31.178
Citation
Ge Song and Xiaoyang Tan. Cross-modal Retrieval via Memory Network. In T.K. Kim, S. Zafeiriou, G. Brostow and K. Mikolajczyk, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 178.1-178.12. BMVA Press, September 2017.
Bibtex
@inproceedings{BMVC2017_178,
title={Cross-modal Retrieval via Memory Network},
author={Ge Song and Xiaoyang Tan},
year={2017},
month={September},
pages={178.1-178.12},
articleno={178},
numpages={12},
booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
publisher={BMVA Press},
editor={Tae-Kyun Kim and Stefanos Zafeiriou and Gabriel Brostow and Krystian Mikolajczyk},
doi={10.5244/C.31.178},
isbn={1-901725-60-X},
url={https://dx.doi.org/10.5244/C.31.178}
}