User identity linkage : data collection, dataset biases, method, control and application

Kaushal, Rishabh; Kumaraguru, Ponnurangam (Advisor)



URI: http://repository.iiitd.edu.in/xmlui/handle/123456789/831

Date: 2020-10

Abstract:

Online Social Networks (OSNs) have become popular platforms for online users. Users typically register and maintain accounts (user identities) across different OSNs to share a variety of content and remain connected with their friends. Consequently, linking user identities across OSN platforms, referred to as user identity linkage, becomes a critical problem. Solving this problem enables us to build a more comprehensive view of a user’s activities across OSNs, which is highly beneficial for targeted advertisements, recommendations, and many other applications. In this thesis, we define the core research statement as follows. Computational approaches can be proposed for the analysis of data collection methods, investigation of biases in identity linkage datasets, linkage of user identities across social networks, controllability of user identity linkage, and application of user identity linkage solutions to solve extraneous problems. To that end, we make contributions to the problem of user identity linkage at the data collection stage, the methodology stage, and finally the implication (privacy and security) stage, as outlined below. The collection of ground truth data comprising user identity pairs belonging to the same individual is a crucial first step. Specifically, we detail five data collection methods, namely Advanced Search Operator (ASO), Social Aggregator (SA), Cross-Platform Sharing (CPS), Self-Disclosure (SD), and Friend Finding Feature (FFF). Taken together, we collect linked identities of 208,120 individuals distributed across 43 different OSNs. Subsequently, we compare these methods, both qualitatively and quantitatively. Furthermore, we find that user identity datasets obtained from different data collection methods have inherent biases driven by user behaviors.
For instance, we find that user identities collected through the SD method have more similar usernames and display names than those collected through the CPS method. We detect, quantify, and mitigate these dataset biases. We study these biases on more than 1 million user identity pairs obtained by leveraging two user behaviors, namely cross-posting and self-disclosure. We find that biases manifest in the form of lexical differences in user-generated content, particularly in usernames and display names configured by users. These behavioral biases lower the performance (precision and recall) of learning models by 5-20%. Inspired by discrimination measurement metrics, we propose and implement a framework to quantify the extent of these biases and find that 15-20% of test data is affected. Lastly, we propose an approach to mitigate these biases in the dataset. At the level of methodology, we propose a node embedding based framework, referred to as NeXLink, that leverages state-of-the-art node embedding algorithms to learn projections of cross-network linkages (CNLs). A CNL is a pair of user identities across two different social networks that belong to the same individual. The NeXLink framework’s goal is to project CNLs into an embedding space such that user pairs across OSNs that belong to the same individual are closer than other pairs. NeXLink is modular and flexible, and comprises three steps. First, we obtain local node embeddings by preserving the local structure of nodes within the same social network. Second, we learn the global node embeddings by preserving the global structure, which is present in the form of common friendship exhibited by nodes involved in CNLs across social networks. Third, we combine the local and global node embeddings, which preserve local and global structures, to facilitate the detection of CNLs across social networks.
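The three steps above can be illustrated with a minimal sketch. It is not the NeXLink implementation: the abstract does not say how the local and global embeddings are combined or which distance is used, so the concatenation in `combine`, the Euclidean distance, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math

def combine(local_vec, global_vec):
    """Combine local and global embeddings by concatenation (assumed, not from the thesis)."""
    return list(local_vec) + list(global_vec)

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_candidates(source_vec, candidates):
    """Return candidate identity ids ordered from nearest to farthest from source_vec."""
    return sorted(candidates, key=lambda cid: euclidean(source_vec, candidates[cid]))

# Toy, made-up embeddings: one identity on network A, three candidates on network B.
source = combine([0.9, 0.1], [0.8, 0.2])
candidates = {
    "b1": combine([0.85, 0.15], [0.75, 0.25]),  # close to the source in both parts
    "b2": combine([0.1, 0.9], [0.2, 0.8]),      # far from the source
    "b3": combine([0.5, 0.5], [0.5, 0.5]),      # in between
}

ranking = rank_candidates(source, candidates)
print(ranking)
```

In this sketch the nearest candidate is treated as the most likely cross-network linkage for the source identity; a real system would learn the embeddings from the network structure rather than hard-code them.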
We evaluate our proposed framework on an augmented (synthetically generated) dataset of 63,713 nodes and 817,090 edges and a real-world dataset of 3,338 Twitter-Foursquare node pairs. Our approach achieves an average hit rate of 98% and 88% on the augmented and real-world datasets, respectively, for detecting CNLs across social networks and significantly outperforms previous state-of-the-art methods. From a privacy perspective, linking user identities across OSNs could potentially result in information leakage, particularly for privacy-conscious users. Therefore, we develop a system, which we refer to as Nudging Nemo, to help users understand the factors leading to the linkage of their identities across OSNs. In addition, our system helps users control the linkability of their identities across OSN platforms. We evaluate the nudge’s effectiveness by conducting a controlled user study on privacy-conscious users who maintain accounts on Facebook, Twitter, and Instagram. Outcomes of the user study confirmed that the proposed framework helped most of the participants make informed decisions, thereby preventing inadvertent exposure of their personal information across social network services. Lastly, we apply the methods for detecting identities belonging to the same person across social networks to the single social network scenario to find identity clones, that is, users who create online identities impersonating a real user (the victim). We investigate the behaviors of clones of celebrities and find them engaging in misbehaviors such as spreading indecency and misinformation, among others.
