Visual Textbook Network: Watch Carefully before Answering Visual Questions
Difei Gao, Ruiping Wang, Shiguang Shan and Xilin Chen
Abstract
Recent deep neural networks have achieved promising results on Visual Question
Answering (VQA) tasks. However, many works have shown that high accuracy does
not always guarantee that a VQA system correctly understands the contents of images
and questions, which is what we really care about. Attention-based models can locate
the regions related to answers and may demonstrate a promising understanding of the image
and question. However, the key components for generating correct locations, i.e. visual-semantic
alignment and semantic reasoning, remain obscure and invisible. To deal
with this problem, we introduce a two-stage model, the Visual Textbook Network (VTN),
which comprises two modules that together produce more reasonable answers. Specifically, in
the first stage, a textbook module watches the image carefully by performing a novel task
named sentence reconstruction, which encodes each word into a visual region feature and then
decodes the visual feature back into the input word. This procedure forces VTN to learn
visual-semantic alignments without much concern for question answering. This stage is
like studying from a textbook, where people concentrate mainly on the knowledge in the
book and pay little attention to the test. In the second stage, we propose a simple network
as the exam module, which uses both the visual features generated by the first module
and the question to predict the answer. To validate the effectiveness of our method, we
conduct evaluations on the Visual7W dataset and show quantitative and qualitative results
on answering questions.
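The sentence-reconstruction idea in the first stage can be sketched as a toy encode/decode loop: a word attends over image region features to produce a visual feature, and the word is then recovered from that feature. This is an illustrative sketch only, not the authors' implementation; the vocabulary embeddings, region features, and dot-product attention below are all made-up placeholders.

```python
import math

# Toy word embeddings and image region features (made-up values).
WORD_EMB = {
    "dog":  [1.0, 0.1],
    "ball": [0.1, 1.0],
}
REGIONS = [
    [0.9, 0.2],   # region 0: roughly aligned with "dog"
    [0.2, 0.8],   # region 1: roughly aligned with "ball"
]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def encode(word):
    """Textbook module, encoder step: attend over regions with the word
    embedding and return the attended visual feature (word -> region)."""
    q = WORD_EMB[word]
    scores = [sum(a * b for a, b in zip(q, r)) for r in REGIONS]
    attn = softmax(scores)
    dim = len(REGIONS[0])
    return [sum(attn[i] * REGIONS[i][d] for i in range(len(REGIONS)))
            for d in range(dim)]

def decode(visual):
    """Textbook module, decoder step: reconstruct the word as the
    vocabulary entry whose embedding best matches the visual feature."""
    return max(WORD_EMB,
               key=lambda w: sum(a * b for a, b in zip(WORD_EMB[w], visual)))

# Sentence reconstruction: encode each word to a visual feature, then
# decode it back; a good visual-semantic alignment recovers the sentence.
sentence = ["dog", "ball"]
reconstructed = [decode(encode(w)) for w in sentence]
print(reconstructed)  # ['dog', 'ball']
```

In the paper's setting this objective trains the alignment without any question supervision; the exam module then reuses the learned visual features for answer prediction.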
Session
Posters
DOI
10.5244/C.31.131
https://dx.doi.org/10.5244/C.31.131
Citation
Difei Gao, Ruiping Wang, Shiguang Shan and Xilin Chen. Visual Textbook Network: Watch Carefully before Answering Visual Questions. In T.K. Kim, S. Zafeiriou, G. Brostow and K. Mikolajczyk, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 131.1-131.12. BMVA Press, September 2017.
Bibtex
@inproceedings{BMVC2017_131,
title={Visual Textbook Network: Watch Carefully before Answering Visual Questions},
author={Difei Gao and Ruiping Wang and Shiguang Shan and Xilin Chen},
year={2017},
month={September},
pages={131.1--131.12},
articleno={131},
numpages={12},
booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
publisher={BMVA Press},
editor={Tae-Kyun Kim and Stefanos Zafeiriou and Gabriel Brostow and Krystian Mikolajczyk},
doi={10.5244/C.31.131},
isbn={1-901725-60-X},
url={https://dx.doi.org/10.5244/C.31.131}
}