A SPARC of Genius: How Alibaba is Revolutionizing Rare Category Analysis
This article is part of the Academic Alibaba series and is taken from the paper entitled “SPARC: Self-Paced Network Representation for Few-Shot Rare Category Characterization” by Dawei Zhou, Jingrui He, Hongxia Yang, and Wei Fan, accepted by KDD 2018. The full paper can be read here.
On web platforms, rare data is frequently among the most valuable. Rare category analysis is essential for protecting against computer network intrusion, discovering trending topics on social media, and detecting fraudulent online transactions. Locating this data, however, can prove difficult. Like the proverbial needle in a haystack, rare category examples are most often hidden among, and inseparable from, normal data points, and labeling them is extremely expensive. To analyze a rare category in a data set effectively, an algorithm must be a fast learner.
Alibaba’s tech team, in collaboration with Arizona State University and the Tencent Medical AI Lab, has created SPARC, a self-paced framework that gradually learns the rare-category-oriented network representation and the characterization model in a way that is mutually beneficial.
Learning at Its Own Pace
SPARC takes its inspiration from curriculum learning. The curriculum learning paradigm imitates the cognitive process of humans: the underlying model is trained from easy aspects of a task to more difficult ones based on a predetermined curriculum.
While this concept has been applied in many settings, its trial-and-error curriculum design can be difficult to use in the real world. A newer learning paradigm instead learns the curriculum automatically by minimizing the loss function with a self-paced regularizer. This is referred to as self-paced learning.
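The idea can be made concrete with the classic "hard" self-paced regularizer, under which the example weights have a closed form: an example is included only if its current loss falls below a threshold that grows each round, so the model works from easy examples to hard ones. Below is a minimal sketch using a linear least-squares model for brevity (the model choice, threshold, and growth rate are illustrative, not from the paper):

```python
import numpy as np

def self_paced_learning(X, y, n_rounds=5, lam=1.0, growth=1.5):
    """Self-paced learning with the hard regularizer: examples whose
    loss is below the threshold `lam` are treated as 'easy' and kept;
    `lam` grows each round so harder examples are admitted gradually."""
    w = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        losses = (X @ w - y) ** 2              # per-example loss
        v = losses < lam                       # closed-form weights: v_i = 1 iff loss_i < lam
        if v.sum() > 0:
            # refit the model on the currently selected 'easy' subset
            w, *_ = np.linalg.lstsq(X[v], y[v], rcond=None)
        lam *= growth                          # raise the bar: admit harder examples next round
    return w
```

In the full self-paced objective, this alternation corresponds to minimizing the weighted loss jointly over the model parameters and the example weights, with the regularizer term penalizing excluded examples.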
The research team built upon this idea of self-paced learning, applying it to the scenario of rare category analysis with a scarcity of labeled examples, in order to gradually and simultaneously learn the rare category embedding representation and the characterization model.
SPARC simultaneously learns the graph embedding and predicts rare category examples in a mutually beneficial way. The framework can model imbalanced class memberships in networks of varying sizes, and it learns from only a small number of labeled rare category examples, minimizing labeling cost. Additionally, the rare-category-oriented representation learned by SPARC widely separates the majority and minority classes in the embedding space, even when those classes are inseparable in terms of network topology and features.
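The "mutually beneficial" coupling can be illustrated generically: a representation and a classifier are updated in turn from a shared error signal, so the embedding adapts to the rare-category task rather than being learned in isolation. The sketch below uses a linear embedding and a logistic classifier purely for illustration; it is not the paper's actual objective:

```python
import numpy as np

def joint_train(X, y, d_emb=3, lr=0.05, steps=200, seed=0):
    """Generic sketch of jointly learning a representation and a
    classifier: the embedding map P and the characterization model w
    are updated from the same classification error, so each improves
    the other. Illustrative only, not SPARC's actual formulation."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    P = rng.normal(scale=0.1, size=(X.shape[1], d_emb))  # embedding map
    w = np.zeros(d_emb)                                  # characterization model
    losses = []
    for _ in range(steps):
        Z = X @ P                                        # current embeddings
        p = 1.0 / (1.0 + np.exp(-(Z @ w)))               # classifier probabilities
        losses.append(-np.mean(y * np.log(p + 1e-12)
                               + (1 - y) * np.log(1 - p + 1e-12)))
        grad = (p - y) / n                               # shared error signal
        w -= lr * (Z.T @ grad)                           # classifier step
        P -= lr * np.outer(X.T @ grad, w)                # embedding step, guided by the classifier
    return P, w, losses
```

Because the embedding step is driven by the classifier's error, the representation is pulled toward directions that separate the rare class, which is the intuition behind learning the two components together.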
Putting SPARC Into Action
To test SPARC’s effectiveness in the field, researchers collected data sets from bibliographic collaboration networks, NLP networks, and social networks. The algorithm was compared to two unsupervised network embedding algorithms, DeepWalk and LINE, as well as a semi-supervised framework, PLANETOID, on the following criteria:
· Accuracy of classification
· Percentage of discovered rare category examples
· Ratio of true rare examples being retrieved
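The last two criteria can be read as recall and precision over the rare class: the fraction of true rare examples that get discovered, and the fraction of retrieved examples that are truly rare. A minimal sketch, assuming binary labels where 1 marks the rare category (the label convention is an assumption, not from the paper):

```python
import numpy as np

def rare_category_metrics(y_true, y_pred, rare_label=1):
    """Compute the three evaluation criteria, interpreted as standard
    accuracy / recall / precision with respect to the rare class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = (y_true == y_pred).mean()
    true_rare = y_true == rare_label
    pred_rare = y_pred == rare_label
    hits = (true_rare & pred_rare).sum()
    recall = hits / max(true_rare.sum(), 1)     # fraction of rare examples discovered
    precision = hits / max(pred_rare.sum(), 1)  # fraction of retrieved examples truly rare
    return accuracy, recall, precision
```

Accuracy alone is misleading under heavy class imbalance (predicting "not rare" everywhere scores high), which is why the rare-class recall and precision matter for this evaluation.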
SPARC outperformed the competing state-of-the-art methods across all data sets and evaluation metrics in most cases. While the semi-supervised embedding methods separated classes better than the unsupervised ones, SPARC was superior at clustering rare examples and could be trained with only one labeled rare category example. Moreover, SPARC was more robust, exhibiting smaller error bars than the comparison methods.