Generating Multi-sentence Natural Language Descriptions of Indoor Scenes
Dahua Lin, Sanja Fidler, Chen Kong and Raquel Urtasun
Abstract
This paper proposes a novel framework for generating lingual descriptions of indoor scenes. Although substantial effort has been devoted to this problem, previous approaches have focused primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. Our approach is distinguished from conventional ones in several aspects: (1) a 3D visual parsing system that jointly infers objects, attributes, and relations; (2) a generative grammar learned automatically from training text; and (3) a text generation algorithm that takes into account coherence among sentences. Experiments on the NYU-v2 dataset show that our framework is able to generate natural multi-sentence descriptions, outperforming those produced by a baseline.
Dahua Lin, Sanja Fidler, Chen Kong and Raquel Urtasun. Generating Multi-sentence Natural Language Descriptions of Indoor Scenes. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 93.1-93.13. BMVA Press, September 2015.
Bibtex
@inproceedings{BMVC2015_93,
title={Generating Multi-sentence Natural Language Descriptions of Indoor Scenes},
author={Dahua Lin and Sanja Fidler and Chen Kong and Raquel Urtasun},
year={2015},
month={September},
pages={93.1--93.13},
articleno={93},
numpages={13},
booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
publisher={BMVA Press},
editor={Xianghua Xie and Mark W. Jones and Gary K. L. Tam},
doi={10.5244/C.29.93},
isbn={1-901725-53-7},
url={https://dx.doi.org/10.5244/C.29.93}
}