Does Your Mobile Say a Lot About You? It Should — and Here’s Why.
This article is part of the Academic Alibaba series and is taken from the paper entitled “Mobile Access Record Resolution on Large-Scale Identifier-Linkage Graphs” by Shen Xin, Hongxia Yang, Weizhao Xian, Martin Ester, Jiajun Bu, Zhongyao Wang, and Can Wang, accepted by KDD 2018. The full paper can be read here.
The e-commerce era is witnessing a rapid increase in mobile internet users. Major e-commerce companies now see billions of instances of mobile access every day, and hidden in these records are valuable user behavioral characteristics such as shopping preferences and browsing patterns. However, to extract this information from such a huge dataset, the records need to be linked to the corresponding mobile devices, a process known as Mobile Access Records Resolution (MARR).
There are two major challenges confronting MARR:
· Device identifiers and other attributes in access records might be missing or unreliable.
· The dataset contains billions of access records from millions of devices.
So far, no existing method resolves access records to devices from mobile device identifiers on such a massive scale.
Enter the Alibaba tech team.
What Device Are You Using Right Now? We Probably Don’t Know.
According to a new report by the International Telecommunication Union (ITU), the number of global mobile internet subscriptions (not users) reached 7.74 billion in 2017. As mobile phones have overtaken desktop computers as the most widely used digital platform, characterizing mobile user preferences and behavioral patterns from their access records has become incredibly important. Compared with traditional weblogs, which mostly depend on cookies to track user behavior, mobile access records provide a clearer picture of internet users because each record carries various IDs.
These IDs include:
· International Mobile Equipment Identity (IMEI) — a unique identifier designed to identify a device.
· International Mobile Subscriber Identity (IMSI) — stored in the SIM card and designed to identify a user in a cellular network.
· UserTrack Device IDentity (UTDID).
IMEI and IMSI are identifiers for one’s smartphone and mobile number respectively. UTDID, on the other hand, is quite different from these two hardware-based identifiers in that it is generated and used by Alibaba (China’s multinational e-commerce company) for device identification. With these IDs, access records can be mapped to the corresponding mobile phones or apps, which in turn generates higher-quality user profiles.
Mapping an access record to a mobile phone or app therefore seems a simple matter, since IDs such as IMEI, IMSI and UTDID can be used to uniquely identify the device and app. However, data collected from practical applications is far from perfect: it contains missing attribute values, noisy (problematic and misleading) IDs, and ID shift problems. One-way ID shift, for example, happens when a device gets a new IMSI after a new SIM card is installed.
Giving Our Devices Their Own Voice
The Alibaba technical team have observed that ID shift in one or two IDs in an access record might occur from time to time, but it rarely happens in all three IDs at once. Inspired by this observation, they use the combination of the three IDs (IMEI, IMSI, UTDID), which they call “IDSET”, to reliably tie an access record to a specific mobile device. An example of an IDSET is given above, where each record is identified by its IDSET, i.e. a combination of IMEI, IMSI and UTDID.
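To make the IDSET idea concrete, here is a minimal Python sketch. It is an illustration only, not code from the paper: the `IdSet` type, the `shared_ids` helper and the identifier values are all hypothetical. It shows why comparing all three IDs together tolerates a shift in one of them.

```python
from typing import NamedTuple, Optional


class IdSet(NamedTuple):
    """Identifiers of one access record; None marks a missing value."""
    imei: Optional[str]
    imsi: Optional[str]
    utdid: Optional[str]


def shared_ids(a: IdSet, b: IdSet) -> int:
    """Count identifiers that are present in both IDSETs and equal."""
    return sum(1 for x, y in zip(a, b) if x is not None and x == y)


# Hypothetical records: the same phone before and after a SIM swap,
# plus an unrelated phone whose IMSI was not logged.
before_swap = IdSet(imei="imei-A", imsi="imsi-1", utdid="utdid-X")
after_swap = IdSet(imei="imei-A", imsi="imsi-2", utdid="utdid-X")
other_phone = IdSet(imei="imei-B", imsi=None, utdid="utdid-Y")

print(shared_ids(before_swap, after_swap))   # 2
print(shared_ids(before_swap, other_phone))  # 0
```

Even though the IMSI shifted, the two records from the same phone still agree on two of the three IDs, while the unrelated phone agrees on none.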
Based on the concept of IDSET, they have introduced the Mobile Access Records Resolution (MARR) problem. Since each access record is generated by one specific mobile device, the objective of MARR is to identify the physical device behind each access record. The team aim to group access records by device, and the resulting groups can be used to generate profiles for the device users.
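The grouping objective can be sketched with a naive baseline: treat two records as coming from the same device whenever they share at least one identifier value, and merge them with a union-find structure. This is an assumption made for illustration and is not the algorithm from the paper; the function name `group_records` and the sample data are hypothetical. A single noisy ID would wrongly merge unrelated devices under this rule, which hints at why the real problem is hard.

```python
from collections import defaultdict


def group_records(records):
    """Group record indices whose IDSETs share at least one identifier.

    `records` is a list of (imei, imsi, utdid) tuples, None for missing.
    Naive baseline for illustration only, not the paper's algorithm.
    """
    parent = list(range(len(records)))

    def find(i):
        # Walk to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (id_type, value) -> index of first record carrying it
    for idx, record in enumerate(records):
        for id_type, value in zip(("imei", "imsi", "utdid"), record):
            if value is None:
                continue  # a missing ID links nothing
            key = (id_type, value)
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


records = [
    ("imei-A", "imsi-1", "utdid-X"),
    ("imei-A", "imsi-2", "utdid-X"),  # SIM swapped
    ("imei-B", None, "utdid-Y"),
    (None, "imsi-2", None),           # only the IMSI was logged
]
print(group_records(records))  # [[0, 1, 3], [2]]
```

Records 0, 1 and 3 end up in one group because each shares an identifier with another member, even though no single ID is common to all three.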
Given the sheer size of the dataset and the poor data quality, mainly due to the ID shift problems, MARR is a highly challenging problem. The team therefore propose a SParse Identifier-linkage Graph (SPI-Graph), accompanied by abundant mobile device profiling data, to accurately match mobile access records to devices. (Data is considered ‘sparse’ when certain expected values in a dataset are missing, which is a common phenomenon in large-scale data analysis.)
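The article does not describe how the SPI-Graph is constructed, so the following Python sketch only illustrates the general idea of an identifier-linkage graph stored sparsely. The node and edge definitions are assumptions: each node is an (ID type, value) pair, and an edge counts how often two identifiers appear in the same access record. The name `build_linkage_graph` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_linkage_graph(records):
    """Return {node: {neighbor: co-occurrence count}} over identifier values.

    Nodes are (id_type, value) pairs. Only links that were actually
    observed are stored, so missing IDs simply add no edges.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for record in records:
        nodes = [
            (id_type, value)
            for id_type, value in zip(("imei", "imsi", "utdid"), record)
            if value is not None
        ]
        for a, b in combinations(nodes, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


records = [
    ("imei-A", "imsi-1", "utdid-X"),
    ("imei-A", "imsi-2", "utdid-X"),  # SIM swapped
    ("imei-A", None, "utdid-X"),      # IMSI missing
]
graph = build_linkage_graph(records)
print(graph[("imei", "imei-A")][("utdid", "utdid-X")])  # 3
print(graph[("imei", "imei-A")][("imsi", "imsi-1")])    # 1
```

In this toy data the IMEI and UTDID link is seen three times while each IMSI link is seen once, so edge weights of this kind could help separate stable identifier pairs from shifted ones.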
Extensive experimental results on large-scale real-world datasets have so far validated the effectiveness and efficiency of the team’s algorithm. Building on these results, the team now hope to investigate how to further group the mobile access records of a specific device into access sessions and thus better characterize user profiles.