Is a Picture Really Worth a Thousand Words?
How Alibaba’s new approach to text and image matching is improving the accuracy of image searches
This article is part of the Academic Alibaba series and is taken from the paper entitled “Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models” by Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, and Gang Wang, accepted by CVPR 2018. The full paper can be read here.
Retrieving relevant images from a text search query is a trending topic in both the computer vision and natural language processing communities, especially in the age of Big Data and the rapid growth of the text, video, and image modalities. The challenge stems from the sheer volume of this data and the many properties it can exhibit, which prevent it from being easily searched with, for example, a simple text query.
Researchers from the Alibaba AI Labs and Nanyang Technological University have carried out experiments using a new framework that matches images and sentences with complex content, achieving state-of-the-art cross-modal retrieval results on the MSCOCO dataset.
Current Frameworks
Currently, the most common approach is to first encode each modality (e.g., images and sentences) into its own feature representation and then map those representations into a common semantic space. The mapping is optimized with a ranking loss that encourages the similarity of the mapped features of a ground-truth image-text pair to be greater than that of any negative pair. Once this common space is learned, the similarity between the two modalities can easily be measured by computing the distance between their representations in that space.
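To make the ranking objective concrete, here is a minimal sketch of a bidirectional hinge-based ranking loss of the kind described above, written in PyTorch. The function name, tensor shapes, and margin value are illustrative assumptions rather than details taken from the paper.

```python
import torch

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based ranking loss over a batch of matched image-text pairs.

    img_emb, txt_emb: (batch, dim) embeddings in the common semantic space,
    where row i of each tensor forms a ground-truth image-text pair.
    """
    # Similarity of every image with every caption in the batch.
    scores = img_emb @ txt_emb.t()                # (batch, batch)
    positives = scores.diag().view(-1, 1)         # similarity of true pairs

    # Penalize any negative pair whose similarity comes within `margin`
    # of the corresponding ground-truth pair, in both retrieval directions.
    cost_i2t = (margin + scores - positives).clamp(min=0)      # image -> wrong caption
    cost_t2i = (margin + scores - positives.t()).clamp(min=0)  # caption -> wrong image

    # The diagonal holds the true pairs, so exclude it from the penalty.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0.0)
    cost_t2i = cost_t2i.masked_fill(mask, 0.0)
    return cost_i2t.sum() + cost_t2i.sum()
```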
Although this approach has been used successfully to capture high-level semantic concepts in multi-modal data, it is not sufficient for retrieval that depends on detailed local similarity between images (e.g., spatial layout) or word-level similarity between sentences. Humans, by contrast, pick up on the finer details of an image, which lets them connect a textual query to relevant images more accurately. So, if the representation of one modality can be grounded in the objects of the other modality, a better mapping can be learned.
Using Generative Models
To improve cross-modal retrieval, the Alibaba tech team proposed integrating generative models into textual-visual feature embedding. In addition to the conventional cross-modal feature embedding at the global semantic level, an additional cross-modal feature embedding is introduced at the local level, grounded by two generative models: image-to-text and text-to-image. At a high level, the framework involves three distinct learning steps: look, imagine, and match. First, an abstract representation is extracted from the query. Then, the target item (text or image) in the other modality is imagined, forming a more concrete grounded representation; this is done by using the representation of one modality to generate the item in the other modality and then making a comparison. Finally, the right image-text pairs are matched using a relevance score calculated from a combination of the grounded and abstract representations, as sketched below. The experiments were carried out on the benchmark MSCOCO dataset.
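As a rough illustration of the final matching step, the sketch below combines a global-level (abstract) similarity with a local-level (grounded) similarity into a single relevance score. The function name, inputs, and weighting factor are hypothetical and only show the general idea of fusing the two representations, not the paper's exact formulation.

```python
import torch.nn.functional as F

def relevance_score(img_abs, txt_abs, img_grd, txt_grd, alpha=0.5):
    """Fuse abstract (global) and grounded (local) similarities into one score.

    img_abs, txt_abs: abstract embeddings from the global semantic space.
    img_grd, txt_grd: grounded embeddings obtained with the generative models.
    alpha: illustrative weight balancing the two levels (not from the paper).
    """
    sim_abstract = F.cosine_similarity(img_abs, txt_abs, dim=0)
    sim_grounded = F.cosine_similarity(img_grd, txt_grd, dim=0)
    return alpha * sim_abstract + (1 - alpha) * sim_grounded
```

At retrieval time, a text query would be scored against every candidate image with such a function, and the candidates ranked by the resulting scores.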
Results
Combining the two generative models with conventional textual-visual feature embedding enabled the researchers to use concrete grounded representations to capture the detailed similarity between the two modalities. They found that a combination of grounded and abstract representations can significantly improve performance on cross-modal image-caption retrieval, and that the framework significantly outperforms other advanced textual-visual cross-modal retrieval methods on the MSCOCO dataset.
Alibaba Tech
First-hand, detailed, and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook