Is a Picture Really Worth a Thousand Words?

How Alibaba’s new approach to text and image matching is improving the accuracy of image searches

This article is part of the Academic Alibaba series and is taken from the paper entitled “Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models” by Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, and Gang Wang, accepted by CVPR 2018. The full paper can be read here.

Retrieving relevant images from a text query is an active research topic in both the computer vision and natural language processing communities, especially given the rapid growth of text, video, and image data in the age of Big Data. The challenge stems from the sheer volume of data and the many properties it can exhibit, which make it hard to search with, for example, a text query.

Researchers from Alibaba AI Labs and Nanyang Technological University have carried out experiments with a new framework that matches images and sentences with complex content, achieving strong cross-modal retrieval results on the MSCOCO dataset.

Current Frameworks

Although current frameworks have been used successfully to capture high-level semantic concepts in multi-modal data, they are not sufficient for retrieving images with detailed local similarity (e.g., spatial layout) or sentences with word-level similarity. Humans, on the other hand, can pick out the finer details of an image, which lets us match a textual query to relevant images more accurately. So, if we can ground the representation of one modality in the objects of the other modality, we can learn a better mapping.
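To make the idea of a mapping between modalities concrete, here is a minimal sketch of text-to-image retrieval. It assumes both the query sentence and the images have already been encoded as vectors in a shared space, which is a common setup for cross-modal retrieval but is not taken from the paper. The function names, file names, and vector values are all hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_images(query_vec, image_vecs):
    """Return image names ordered from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in image_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical, hand-made embeddings (a real system would learn these).
image_vecs = {
    "dog_on_beach.jpg": [0.9, 0.1, 0.2],
    "city_at_night.jpg": [0.1, 0.8, 0.3],
    "cat_on_sofa.jpg": [0.6, 0.2, 0.7],
}

# Pretend embedding of the query "a dog running on the sand".
query_vec = [0.8, 0.0, 0.3]

ranking = rank_images(query_vec, image_vecs)
print(ranking)
```

Because each image is reduced to a single vector here, the ranking reflects overall semantic closeness only. Details such as spatial layout are not represented, which is the limitation the paragraph above describes.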

Using Generative Models

Results

The full paper can be read here.

Written by Alibaba Tech