Turning Words into Images: Bilinear Representation for Language-based Image Editing

This article is part of the Academic Alibaba series and is taken from the ICASSP 2019 paper entitled “Bilinear Representation for Language-based Image Editing Using Conditional Generative Adversarial Networks” by Xiaofeng Mao, Yuefeng Chen, Yuhong Li, Tao Xiong, Yuan He, and Hui Xue. The full paper can be read here.

It’s often said that a picture is worth a thousand words, but it’s easy to see how an image description might spiral to well over that amount if you were to include everything of relevance, from the basic colors and shapes it is composed of, to the people, objects, and environments it depicts, to all the layers of meaning it produces.

It might come as a surprise, then, to learn that in the field of language-based image editing (LBIE), researchers are training machine learning models to edit an image based on a one-sentence description of what the new image should look like:

LBIE applied in fashion generation

LBIE is a relatively new field, but it is already being applied in fashion generation, where the system “redresses” a model using a description of the new outfit, as well as in VR and computer-aided design (CAD). However, existing models still struggle to represent high-level features accurately. When dealing with more complex images and edits, the system may, for example, identify the correct color but fail to identify the part of the image to apply it to. In other cases, the system may be distracted by a complex background and output a meaningless image.

Now, Alibaba’s tech team has developed an improved method that builds on what has become the mainstream approach. Specifically, it strengthens the ability of the conditional generative adversarial network (cGAN) model to learn fine-grained representations of multimodal features (i.e., both image features and text features).

From Linear to Bilinear cGAN

Existing approaches to LBIE fall broadly into two categories:

1. Two-stage GAN: This approach divides LBIE into two subtasks. The first, language-based image segmentation, outputs a segmentation map as an intermediary step; the second, image generation, outputs the final image.

2. Conditional GAN (cGAN): The cGAN approach edits the image based on fused visual-text representations using one of two conditioning methods. The first is concatenation, which simply joins the image features and the text features together. The second is Feature-wise Linear Modulation (FiLM), which scales and shifts the image features according to the text and seeks to mimic the human attention mechanism.
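To make the two conditioning methods concrete, here is a minimal NumPy sketch. It is an illustration only, not the code used in the paper: the function names, the feature sizes, and the random weight matrices `w_gamma` and `w_beta` are all assumptions, and in a real model those weights would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def concat_condition(img_feat, txt_emb):
    """Concatenation: copy the text embedding to every spatial position
    and stack it onto the image feature map along the channel axis."""
    _, h, w = img_feat.shape
    tiled = np.broadcast_to(txt_emb[:, None, None], (txt_emb.shape[0], h, w))
    return np.concatenate([img_feat, tiled], axis=0)

def film_condition(img_feat, txt_emb, w_gamma, w_beta):
    """FiLM: predict a per-channel scale (gamma) and shift (beta) from the
    text embedding, then apply them to every spatial position."""
    gamma = w_gamma @ txt_emb          # shape (C,)
    beta = w_beta @ txt_emb            # shape (C,)
    return gamma[:, None, None] * img_feat + beta[:, None, None]

# Hypothetical sizes: 8 image channels on a 4x4 grid, 16-dim text embedding.
img_feat = rng.standard_normal((8, 4, 4))
txt_emb = rng.standard_normal(16)
w_gamma = rng.standard_normal((8, 16))   # stand-in for learned weights
w_beta = rng.standard_normal((8, 16))    # stand-in for learned weights

fused_concat = concat_condition(img_feat, txt_emb)
fused_film = film_condition(img_feat, txt_emb, w_gamma, w_beta)
print(fused_concat.shape, fused_film.shape)
```

In both cases the text enters the result through an operation that is linear once the scale and shift have been computed, which is the limitation the team set out to address.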

Opting for the cGAN approach, the team identified that the conditioning methods currently used lack representational power. This is because they employ linear transformations, which cannot learn second-order correlations between the two conditioning embeddings. They reasoned that a natural progression of the conditional models would be to generalize from the existing linear methods to a more powerful bilinear approach. Accordingly, they added a bilinear residual layer (BRL) into the network architecture, which is shown below.

Network architecture of Alibaba’s bilinear cGAN (bilinear residual layer shown in the dashed box)
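As a rough illustration of what a bilinear layer with a residual connection might look like, the sketch below fuses image and text features through a low-rank bilinear product and adds the result back onto the image features. This is an assumption-laden sketch rather than the paper's exact layer: the low-rank factorization, the projection matrices `U`, `V`, and `P`, the rank, and all sizes are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_residual(img_feat, txt_emb, U, V, P):
    """Low-rank bilinear fusion with a shortcut connection (illustrative).

    img_feat: (C, H, W) image feature map
    txt_emb:  (T,) text embedding
    U: (R, C), V: (R, T), P: (C, R) hypothetical projection matrices
    """
    c, h, w = img_feat.shape
    x = img_feat.reshape(c, h * w)          # (C, H*W)
    img_proj = U @ x                        # (R, H*W)
    txt_proj = V @ txt_emb                  # (R,)
    # The element-wise product multiplies image terms by text terms,
    # which is what introduces second-order interactions.
    joint = img_proj * txt_proj[:, None]    # (R, H*W)
    out = x + P @ joint                     # shortcut plus bilinear branch
    return out.reshape(c, h, w)

# Hypothetical sizes: 8 channels, 4x4 grid, 16-dim text, rank 5.
img_feat = rng.standard_normal((8, 4, 4))
txt_emb = rng.standard_normal(16)
U = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 16))
P = rng.standard_normal((8, 5))

edited_feat = bilinear_residual(img_feat, txt_emb, U, V, P)
print(edited_feat.shape)
```

The bilinear branch is linear in the image features and linear in the text embedding separately, so every output channel can depend on products of an image feature and a text feature, something neither concatenation nor a fixed scale-and-shift can express on its own.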

A Litmus Test of Birds, Flowers and Fashion

The team compared its cGAN with BRL against FiLM on images of birds, flowers, and fashion models. The edits used were:

· “This little bird is mostly white with a black superciliary and primary.”

· “This flower has petals that are yellow at the edges and spotted orange near the center.”

· “The lady was wearing a blue short-sleeved blouse.”

Counterclockwise from top: Original images; edits using FiLM; edits using cGAN with BRL

As can be observed from the figure above, Alibaba’s cGAN with BRL carried out the edits more precisely in each case. The difference is most easily observed in the flower images, where the bilinear model is much more effective at identifying not just the colors to use in the new image, but the precise location in which to use them.

In addition to the qualitative analysis shown above, the team also conducted a quantitative analysis using the Inception Score (IS), which likewise indicated that the learned bilinear representation is more powerful than linear approaches and generates higher-quality images.
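For readers unfamiliar with the metric, the Inception Score is commonly defined as the exponential of the average KL divergence between each image's predicted class distribution and the marginal class distribution over all generated images. The sketch below computes it from an array of class probabilities. It assumes those probabilities have already been produced by a pretrained classifier; the function name and the toy inputs are made up for illustration and are not the team's evaluation code.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from per-image class probabilities.

    probs: (N, K) array, each row a predicted class distribution p(y|x).
    Returns exp(mean over images of KL(p(y|x) || p(y))).
    """
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)        # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Toy inputs (hypothetical): confident, varied predictions score high;
# identical uninformative predictions score the minimum value of 1.
confident_and_diverse = np.eye(4)
uninformative = np.full((4, 4), 0.25)

print(inception_score(confident_and_diverse))
print(inception_score(uninformative))
```

A higher score means that individual images are classified confidently while the set as a whole covers many classes, which is why it is used as a proxy for image quality and diversity.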


Alibaba Tech