Fine-Grained Image Retrieval: the Text/Sketch Input Dilemma
Jifei Song, Yi-Zhe Song, Tao Xiang and Timothy Hospedales
Abstract
Fine-grained image retrieval (FGIR) enables a user to search for a photo of a specific object
instance based on a mental picture. Depending on how the user describes the object,
two general approaches exist: sketch-based FGIR and text-based FGIR, each of which has
its own pros and cons. However, no attempt has been made to systematically investigate
how informative each of these two input modalities is, and, more importantly, whether
they are complementary to each other and thus should be modelled jointly. In this work,
we introduce for the first time a multi-modal FGIR dataset with both sketches and sentence
descriptions provided as query modalities. A multi-modal quadruplet deep network is
formulated to jointly model the sketch and text input modalities as well as the photo
output modality.
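
To make the quadruplet formulation concrete, below is a minimal PyTorch sketch of how an objective over a sketch query, a text query, a matching photo and a non-matching photo might look. The decomposition into two triplet terms sharing the photo pair, the QuadrupletLoss name, and the margin value are illustrative assumptions, not the paper's exact loss.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QuadrupletLoss(nn.Module):
        """Ranking loss over (sketch, text, positive photo, negative photo).

        Both query modalities share the photo embedding space, so each
        contributes one triplet margin term against the same photo pair.
        Note: an illustrative sketch, not the authors' exact formulation.
        """
        def __init__(self, margin=0.2):  # margin value is an assumption
            super().__init__()
            self.margin = margin

        def forward(self, sketch_emb, text_emb, photo_pos_emb, photo_neg_emb):
            # L2-normalise so distances are comparable across branches.
            s = F.normalize(sketch_emb, dim=1)
            t = F.normalize(text_emb, dim=1)
            p = F.normalize(photo_pos_emb, dim=1)
            n = F.normalize(photo_neg_emb, dim=1)
            # Pull each query towards its photo, push it from the distractor.
            loss_sketch = F.triplet_margin_loss(s, p, n, margin=self.margin)
            loss_text = F.triplet_margin_loss(t, p, n, margin=self.margin)
            return loss_sketch + loss_text

    # Example with random 128-d embeddings for a batch of 16 quadruplets.
    loss_fn = QuadrupletLoss(margin=0.2)
    loss = loss_fn(torch.randn(16, 128), torch.randn(16, 128),
                   torch.randn(16, 128), torch.randn(16, 128))

Under this sketch, retrieval would rank photos by distance to the embedded query from either modality, or to a fusion of the two, in the shared embedding space.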
Session
Posters
Files
Paper (PDF)
Supplementary (PDF)
DOI
10.5244/C.31.45
https://dx.doi.org/10.5244/C.31.45
Citation
Jifei Song, Yi-Zhe Song, Tao Xiang and Timothy Hospedales. Fine-Grained Image Retrieval: the Text/Sketch Input Dilemma. In T.K. Kim, S. Zafeiriou, G. Brostow and K. Mikolajczyk, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 45.1-45.12. BMVA Press, September 2017.
Bibtex
@inproceedings{BMVC2017_45,
title={Fine-Grained Image Retrieval: the Text/Sketch Input Dilemma},
author={Jifei Song and Yi-Zhe Song and Tao Xiang and Timothy Hospedales},
year={2017},
month={September},
pages={45.1--45.12},
articleno={45},
numpages={12},
booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
publisher={BMVA Press},
editor={Tae-Kyun Kim and Stefanos Zafeiriou and Gabriel Brostow and Krystian Mikolajczyk},
doi={10.5244/C.31.45},
isbn={1-901725-60-X},
url={https://dx.doi.org/10.5244/C.31.45}
}