Visual Textbook Network: Watch Carefully before Answering Visual Questions

Difei Gao, Ruiping Wang, Shiguang Shan and Xilin Chen

Abstract

Recent deep neural networks have achieved promising results on Visual Question Answering (VQA) tasks. However, many works have shown that high accuracy does not guarantee that a VQA system correctly understands the contents of images and questions, which is what we really care about. Attention-based models can locate the regions related to answers and may thereby demonstrate a promising understanding of the image and question. However, the key components of generating the correct location, i.e., visual-semantic alignment and semantic reasoning, remain obscure and invisible. To deal with this problem, we introduce a two-stage model, Visual Textbook Network (VTN), composed of two modules that together produce more reasonable answers. Specifically, in the first stage, a textbook module watches the image carefully by performing a novel task named sentence reconstruction, which encodes a word into a visual region feature and then decodes that visual feature back to the input word. This procedure forces VTN to learn visual-semantic alignments without much concern for question answering. This stage is just like studying from a textbook, where people concentrate mainly on the knowledge in the book and pay little attention to the test. In the second stage, we propose a simple network as the exam module, which uses both the visual features generated by the first module and the question to predict the answer. To validate the effectiveness of our method, we conduct evaluations on the Visual7W dataset and show quantitative and qualitative results on answering questions.
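The sentence-reconstruction idea can be illustrated with a minimal sketch: a word embedding acts as a query that attends over image region features (encoding the word into an attended visual feature), and that visual feature is then decoded back into a distribution over the vocabulary. All dimensions, the random stand-in features, and the dot-product scoring are illustrative assumptions, not the paper's actual architecture or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative only, not from the paper)
vocab_size, embed_dim, num_regions = 50, 16, 5

# Random stand-ins for learned word embeddings and CNN region features
word_embed = rng.normal(size=(vocab_size, embed_dim))
regions = rng.normal(size=(num_regions, embed_dim))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reconstruct(word_id):
    """Encode a word into an attended visual feature, then decode
    that feature back into a distribution over the vocabulary."""
    q = word_embed[word_id]           # word used as attention query
    attn = softmax(regions @ q)       # alignment weights over regions
    visual = attn @ regions           # attended visual region feature
    logits = word_embed @ visual      # decode visual feature to words
    return softmax(logits)

probs = reconstruct(3)
assert probs.shape == (vocab_size,)
assert np.isclose(probs.sum(), 1.0)
```

Training such a module to make `probs` peak at the input word would push word embeddings and region features into a shared space, which is the visual-semantic alignment the first stage aims to learn.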

Session

Posters

Files

Paper (PDF)

DOI

10.5244/C.31.131
https://dx.doi.org/10.5244/C.31.131

Citation

Difei Gao, Ruiping Wang, Shiguang Shan and Xilin Chen. Visual Textbook Network: Watch Carefully before Answering Visual Questions. In T.K. Kim, S. Zafeiriou, G. Brostow and K. Mikolajczyk, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 131.1-131.12. BMVA Press, September 2017.

Bibtex

            @inproceedings{BMVC2017_131,
                title={Visual Textbook Network: Watch Carefully before Answering Visual Questions},
                author={Gao, Difei and Wang, Ruiping and Shan, Shiguang and Chen, Xilin},
                year={2017},
                month={September},
                pages={131.1-131.12},
                articleno={131},
                numpages={12},
                booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
                publisher={BMVA Press},
                editor={Kim, Tae-Kyun and Zafeiriou, Stefanos and Brostow, Gabriel and Mikolajczyk, Krystian},
                doi={10.5244/C.31.131},
                isbn={1-901725-60-X},
                url={https://dx.doi.org/10.5244/C.31.131}
            }