Is a Picture Really Worth a Thousand Words?

How Alibaba’s new approach to text and image matching is improving the accuracy of image searches


Current Frameworks

Currently, the most common approach is to first encode each modality (e.g. images) into its own feature representation and then map those features into a common semantic space. Training is driven by a ranking loss that encourages the similarity of mapped features for ground-truth image-text pairs to be greater than that of any negative pair. Once this common space is learned, the similarity between the two modalities can be measured simply by computing the distance between their representations in that space.
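
To make this concrete, here is a minimal sketch of a bidirectional hinge-based ranking loss of the kind described above, written in PyTorch. The embedding dimension, margin value, and batch setup are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based ranking loss over a batch of matched image-text pairs.

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings in the shared
    semantic space; row i of each tensor forms a ground-truth pair.
    """
    # Cosine similarity matrix: scores[i, j] = sim(image_i, text_j)
    scores = img_emb @ txt_emb.t()
    diagonal = scores.diag().view(-1, 1)

    # Compare each positive pair against all in-batch negatives.
    cost_txt = (margin + scores - diagonal).clamp(min=0)      # image as query
    cost_img = (margin + scores - diagonal.t()).clamp(min=0)  # text as query

    # Zero out the diagonal (the positive pairs themselves).
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)

    return cost_txt.sum() + cost_img.sum()

# Example usage with randomly initialized embeddings:
img_emb = F.normalize(torch.randn(32, 512), dim=1)
txt_emb = F.normalize(torch.randn(32, 512), dim=1)
loss = bidirectional_ranking_loss(img_emb, txt_emb)
```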


Using Generative Models

To improve cross-modal retrieval, the Alibaba tech team proposed integrating generative models into textual-visual feature embedding. In addition to the conventional cross-modal feature embedding at the global semantic level, a second cross-modal feature embedding is introduced at the local level, grounded by two generative models: image-to-text and text-to-image. At a high level, the framework follows three distinct learning steps: look, imagine, and match. First, an abstract representation is extracted from the query. Then, the target item in the other modality (text or image) is imagined, forming a more concrete, grounded representation: the representation of one modality is asked to generate the item in the other modality, and the result is compared with the target. Finally, image-text pairs are matched using a relevance score computed from a combination of the grounded and abstract representations, as sketched in the example below. The experiments were conducted on the benchmark MSCOCO dataset.
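
To illustrate the matching step, the sketch below combines an abstract (global) similarity with a grounded (local) similarity into a single relevance score. The cosine similarities, the weighting factor alpha, and the randomly generated feature vectors are assumptions made for illustration; the paper's exact scoring function may differ.

```python
import torch
import torch.nn.functional as F

def relevance_score(img_abstract, txt_abstract,
                    img_grounded, txt_grounded, alpha=0.5):
    """Match step: combine abstract (global) and grounded (local) similarities.

    The grounded vectors stand in for features produced during the "imagine"
    step by the image-to-text and text-to-image generative models; alpha and
    the cosine-similarity combination are illustrative assumptions.
    """
    sim_abstract = F.cosine_similarity(img_abstract, txt_abstract, dim=-1)
    sim_grounded = F.cosine_similarity(img_grounded, txt_grounded, dim=-1)
    return alpha * sim_abstract + (1 - alpha) * sim_grounded

# Look: abstract representations from the (hypothetical) encoders.
img_abstract, txt_abstract = torch.randn(512), torch.randn(512)
# Imagine: grounded representations from the (hypothetical) generators.
img_grounded, txt_grounded = torch.randn(512), torch.randn(512)
# Match: rank candidate pairs by the combined relevance score.
score = relevance_score(img_abstract, txt_abstract, img_grounded, txt_grounded)
```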


Results

Using the two generative models alongside conventional textual-visual feature embedding enabled the researchers to exploit concrete, grounded representations that capture detailed similarity between the two modalities. They found that combining grounded and abstract representations significantly improves performance on cross-modal image-caption retrieval, and the framework significantly outperforms other state-of-the-art textual-visual cross-modal retrieval methods on the MSCOCO dataset.


Alibaba Tech

First-hand, detailed, and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook
