Abstract:
Online professional platforms such as Indeed, LinkedIn, Naukri, Stack Overflow, and Blind serve as digital ecosystems connecting professionals, employers, and job seekers. These platforms host user activities such as job search, candidate matching, and job posting, which generate a voluminous amount of User Generated Content (UGC). UGC primarily includes rich information such as job postings, CVs, candidate posts, and recruiter profiles. The quality of this content ranges from meaningful to misleading, depending on the expertise, reliability, and intention of the users. Even though such information helps job seekers find the right jobs, the unmonitored nature of the content (including ambiguous, redundant, missing, off-topic, scam-related, misleading, or irrelevant information) makes it difficult to assess content quality, thereby affecting the platform’s trustworthiness, reducing the value delivered to its customers, and, in turn, hampering the user experience. For instance, the content often contains multiple variations of the same entity name (e.g., ‘economictimes.com’; ‘eco. times’; ‘the economic times’; ‘economic times’; ‘ET’). These non-standardized variations (noisy, redundant, and ambiguous), when directly incorporated into downstream applications such as semantic search, question answering, and recommender systems, result in poor system performance. Similarly, statistics from a Singaporean recruitment platform show that 65% of job descriptions (JDs) do not include relevant and popular skills, while 40% of JDs fail to list 20% or more of the skills explicitly stated in the prose description. This reduces the number of relevant applications for a job posting and affects the performance of key recruitment tasks such as job-to-resume matching. With millions of job seekers visiting these platforms every month, candidates often come across dishonest, money-seeking job postings containing intentionally and verifiably false information, such as inflated wages, flexible working hours, and appealing career-growth opportunities. The proliferation of such postings not only hampers the candidate experience but also damages an enterprise’s reputation. Since these platforms are open and anyone, from novices to experts, can upload content, low-quality questions (lacking clarity, off-topic, primarily opinion-based, or too broad) also appear. Therefore, it is crucial to maintain the quality of the content posted on these platforms.

The thesis adopts a four-fold approach to address content quality issues on online professional platforms. The first phase of this work centres on normalizing content on online professional platforms. The second phase aims to predict missing skills to enhance job-posting quality on these platforms. The third phase involves modeling a framework to detect misleading content on recruitment platforms, which requires mining unstructured recruitment data from various sources to obtain structured information and creating domain-specific knowledge graphs. We also delve into employment scam complaints to help platforms continuously refine their advisories based on user complaints and feedback, ensuring they keep pace with the dynamically evolving tactics used by scammers. The fourth phase focuses on identifying low-quality information for question-answering services.
In conclusion, we contribute automated solutions that improve content quality for online professional activities using domain-specific learning and knowledge.