This article is part of the Academic Alibaba series and is taken from the paper entitled “A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection” by Qiangpeng Yang, Mengli Cheng, Wenmeng Zhou, Yan Chen, Minghui Qiu, Wei Lin, and Wei Chu, accepted by IJCAI 2018. The full paper can be read here.
Scene text detection, which means deciphering text that appears in the environment directly from camera footage, is one of the biggest challenges facing computer vision applications. It is also one of the most tantalizing areas for researchers, because the potential of a powerful OCR technology is huge. It has major implications for multilingual translation, image retrieval, and autonomous driving (imagine a car that can read road signs and license plates).
Scene text detection is challenging for two reasons. First, scene text covers a huge range of contexts, such as street views, posters, menus, and indoor scenes. Second, scene text varies greatly in foreground and background content, lighting, blurring, and orientation.
Now, Alibaba’s tech team has developed IncepText, a new scene text detection tool that achieves state-of-the-art performance by departing from previous trends and opting instead for an instance segmentation approach. Given its strong test performance, the team has since incorporated it as an API into its OCR tool for the general public.
Progressing from Regression to Instance Segmentation
Previous approaches to scene text detection have generally relied on either indirect or direct regression. Indirect regression methods predict offsets from box proposals, while direct regression methods predict the text boundary as offsets from a given point.
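To make the distinction concrete, here is a minimal sketch of how each kind of prediction might be decoded into a box. The function names and the offset parameterization (center shifts scaled by proposal size, log-scale width and height) are assumptions borrowed from common detection practice, not details taken from the paper.

```python
import math

def decode_indirect(proposal, offsets):
    """Indirect regression: refine a box proposal (cx, cy, w, h)
    with predicted offsets (dx, dy, dw, dh). Parameterization is assumed."""
    cx, cy, w, h = proposal
    dx, dy, dw, dh = offsets
    return (cx + dx * w,        # shift the center by a fraction of the width
            cy + dy * h,        # shift the center by a fraction of the height
            w * math.exp(dw),   # rescale the width
            h * math.exp(dh))   # rescale the height

def decode_direct(point, corner_offsets):
    """Direct regression: predict the four corners of a quadrilateral
    as (dx, dy) offsets from a single point, with no proposal involved."""
    px, py = point
    return [(px + dx, py + dy) for dx, dy in corner_offsets]

box = decode_indirect((50.0, 40.0, 20.0, 10.0), (0.25, -0.5, 0.0, 0.0))
quad = decode_direct((10, 10), [(-5, -2), (5, -2), (5, 2), (-5, 2)])
print(box)   # refined proposal as (cx, cy, w, h)
print(quad)  # four corner points of the quadrilateral
```

In both cases the network outputs offsets rather than a mask, which is the point of contrast with the segmentation approach described next.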
Alibaba decided instead to use an instance-aware segmentation approach, drawing on the example of FCIS. Given that text detection, unlike standard object detection, is limited by the huge variation in scale, aspect ratio, and orientation of text, the team designed a purpose-built Inception-Text module to target these challenges. This module was inspired by GoogLeNet’s Inception module. The other major innovation was replacing the PSROI pooling layer in FCIS with deformable PSROI pooling. Standard PSROI pooling can only handle horizontal text, while scene text almost always exists in arbitrary orientations.
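The idea behind position-sensitive ROI pooling, and what a deformable variant changes, can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: the function name `psroi_pool` is hypothetical, the offsets here are hand-set integer pixel shifts, and a real deformable PSROI pooling layer learns fractional offsets and samples with interpolation.

```python
import math

def psroi_pool(score_maps, roi, k, offsets=None):
    """Pool a region of interest into a k x k grid of scores.
    score_maps: k*k 2D maps; bin (i, j) reads only from map i*k + j.
    roi: (x0, y0, x1, y1) in pixels, end-exclusive.
    offsets: optional k x k grid of integer (dx, dy) shifts, one per bin
             (a simplification of learned, fractional offsets)."""
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / k
    bin_h = (y1 - y0) / k
    height, width = len(score_maps[0]), len(score_maps[0][0])
    pooled = [[0.0] * k for _ in range(k)]
    for i in range(k):          # bin row
        for j in range(k):      # bin column
            dx, dy = offsets[i][j] if offsets else (0, 0)
            xs = math.floor(x0 + j * bin_w) + dx
            xe = math.ceil(x0 + (j + 1) * bin_w) + dx
            ys = math.floor(y0 + i * bin_h) + dy
            ye = math.ceil(y0 + (i + 1) * bin_h) + dy
            channel = score_maps[i * k + j]
            values = [channel[y][x]
                      for y in range(ys, ye) for x in range(xs, xe)
                      if 0 <= y < height and 0 <= x < width]
            pooled[i][j] = sum(values) / len(values) if values else 0.0
    return pooled

# Four 4x4 maps whose values grow with x, so a horizontal shift is visible.
maps = [[[c * 10 + x for x in range(4)] for _ in range(4)] for c in range(4)]
plain = psroi_pool(maps, (0, 0, 4, 4), k=2)
shifted = psroi_pool(maps, (0, 0, 4, 4), k=2,
                     offsets=[[(1, 0), (0, 0)], [(0, 0), (0, 0)]])
print(plain[0][0], shifted[0][0])
```

With no offsets, every bin sits on a fixed axis-aligned grid. Letting each bin move independently is what allows the pooled region to follow text that is not horizontal.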
Branching Out in All Directions
To deal with the problem of different aspect ratios and scales, the Inception-Text module uses convolutional kernels with multiple branches.
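As a rough illustration of why multiple branches help, the sketch below runs the same input through kernels of different shapes and compares their responses. The kernel shapes and the fixed averaging weights are assumptions chosen for clarity; in a real network the weights are learned and the branch outputs are concatenated as channels.

```python
def conv2d_same(image, kernel):
    """Zero-padded 2D correlation that keeps the input size (odd kernel sizes)."""
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < rows and 0 <= xx < cols:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out

def box_kernel(kh, kw):
    """Averaging kernel, standing in for learned weights."""
    return [[1.0 / (kh * kw)] * kw for _ in range(kh)]

def multi_branch(image, kernel_shapes=((1, 1), (3, 3), (1, 5), (5, 1))):
    """Apply one branch per kernel shape and return the stacked feature maps."""
    return [conv2d_same(image, box_kernel(kh, kw)) for kh, kw in kernel_shapes]

# A 5x5 image containing one wide, thin horizontal stroke.
image = [[0.0] * 5 for _ in range(5)]
image[2] = [1.0] * 5
branches = multi_branch(image)
centers = [round(b[2][2], 3) for b in branches]
print(centers)  # response of each branch at the image center
```

The wide 1x5 branch responds far more strongly to the horizontal stroke than the tall 5x1 branch, which is the intuition behind covering several scales and aspect ratios at once.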
At the end of each branch, a deformable convolutional layer is then added to tackle the issue of multiple orientations. This works alongside the deformable PSROI pooling that replaces the standard pooling layer in FCIS.
The deformable convolution layer uses an adaptive receptive field to capture regions with different offsets. Unlike the regular sampling grid in standard convolution, deformable convolution allows free-form deformation of the sampling grid. This deformation is conditioned on the input features, meaning the receptive field adjusts when the input text is rotated.
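The effect of deforming the sampling grid can be shown with a small sketch. Here the offsets are set by hand to follow a diagonal stroke; in an actual deformable convolution layer they are predicted from the input features by an additional convolution. The helper names `bilinear` and `deform_conv_at` are hypothetical.

```python
import math

def bilinear(image, y, x):
    """Sample the image at a fractional (y, x); pixels outside count as zero."""
    rows, cols = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    fy, fx = y - y0, x - x0
    value = 0.0
    for yy, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xx, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= yy < rows and 0 <= xx < cols:
                value += wy * wx * image[yy][xx]
    return value

def deform_conv_at(image, kernel, cy, cx, offsets=None):
    """Kernel response at (cy, cx). offsets[i][j] = (dy, dx) moves each
    sampling point away from its position on the regular grid."""
    kh, kw = len(kernel), len(kernel[0])
    acc = 0.0
    for i in range(kh):
        for j in range(kw):
            dy, dx = offsets[i][j] if offsets else (0.0, 0.0)
            acc += kernel[i][j] * bilinear(image,
                                           cy + i - kh // 2 + dy,
                                           cx + j - kw // 2 + dx)
    return acc

# A 5x5 image with a diagonal stroke, standing in for rotated text.
image = [[1.0 if x == y else 0.0 for x in range(5)] for y in range(5)]
kernel = [[1.0, 1.0, 1.0]]  # a 1x3 horizontal kernel

regular = deform_conv_at(image, kernel, 2, 2)
# Shift the left sample up and the right sample down to follow the diagonal.
deformed = deform_conv_at(image, kernel, 2, 2,
                          offsets=[[(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]])
print(regular, deformed)
```

On the regular grid the horizontal kernel only touches the stroke once, while the deformed grid lines up all three sampling points with it.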
In tests against other industry-leading models using three publicly available benchmark datasets, variations of IncepText routinely displayed state-of-the-art performance. Significantly, no extra data was used to enhance the IncepText approaches, unlike the other methods tested.