Decoding the DNA of code: an AI-infused approach to detect code cloning in software systems

Mehrotra, Nikita; Purandare, Rahul (Advisor)

dc.contributor.author	Mehrotra, Nikita
dc.contributor.author	Purandare, Rahul (Advisor)
dc.date.accessioned	2024-07-12T05:25:20Z
dc.date.available	2024-07-12T05:25:20Z
dc.date.issued	2024-06
dc.identifier.uri	http://repository.iiitd.edu.in/xmlui/handle/123456789/1642
dc.description.abstract	Code clones, duplicate code fragments sharing similar syntax or semantics, have become increasingly prevalent due to the success of software management tools like GitHub and advancements in Open Source Software (OSS). Previous research has shown that an astonishing 70% of the code hosted on GitHub consists of clones derived from previously existing files. Furthermore, research has also found that between 9% and 31% of software projects on GitHub contain a substantial portion, sometimes up to 80%, of files that have identical counterparts elsewhere. While clones facilitate code reuse and refactoring, they simultaneously complicate software evolution, necessitating effective clone detection techniques. Although a substantial amount of research has been conducted on code clone detection, most traditional approaches focus on syntactic clones by leveraging lexical and syntactic information, and only a few target semantic clones. Furthermore, the evolution of software engineering has led to the development of modern multilingual software from traditional mono-language systems, where functionality replication across multiple programming languages is common. This results in clones having similar functionality but belonging to different languages. Since such code snippets are syntactically unrelated, traditional single-language clone detection approaches are not feasible for their detection. Motivated by the success of deep learning models in various domains, researchers have explored deep learning techniques for code clone detection. These techniques leverage the power of machine learning to learn the underlying patterns and features of code to measure code similarity. However, the majority of these techniques rely on supervised learning, which necessitates a substantial volume of labeled data to achieve optimal performance. 
The acquisition and creation of such labeled datasets present considerable challenges, as they involve not only the scarcity of accurately labeled examples but also the laborious and time-consuming process of manual annotation. In the face of inadequate labeled data, the supervised techniques often encounter significant limitations when applied to new benchmarks or datasets, as they struggle to adapt to the issue of domain shift. This limits the generalizability of supervised techniques, impeding their practical applicability and effectiveness in diverse and evolving software systems. To address the challenges and enhance semantic code clone detection in modern software systems, this thesis presents the following contributions through the investigation of three innovative approaches: 1) We propose a novel method to model semantic similarity between code snippets by utilizing customized graph neural networks for code combined with program dependency graphs. This approach effectively leverages the structured syntactic and semantic information present within the code snippets. We have developed a prototype tool, called “HOLMES”, based on this approach and rigorously evaluated its performance on popular code clone benchmarks. 2) To model cross-language code similarity, we introduce a semi-supervised deep learning tool, “RUBHUS”, which leverages control and data flow-enriched abstract syntax trees (ASTs). We demonstrate the effectiveness of “RUBHUS” through experiments conducted on datasets consisting of Java, C, and Python programs, showcasing its ability to detect cross-language clones compared to other state-of-the-art cross-language and single-language clone detection tools. 3) We propose an adversarial unsupervised domain adaptation approach and tool, “CLODIA”, which employs multiple latent spaces for domain adaptation. 
This approach enhances the performance and generalization capabilities of learning-based clone detection techniques on unseen domains, reducing the need for human annotation. We conduct extensive evaluations of “CLODIA” on various datasets, including programming competition and open-source datasets, as well as a handcrafted dataset of clones curated from real-world software systems. In summary, our research addresses a critical gap in the realm of semantic and cross-language code clone detection, offering innovative solutions and prototype tools as proof-of-concept. Rigorously tested against popular code clone benchmarks, these tools showcase their effectiveness by outperforming state-of-the-art counterparts. The pivotal findings and insights gleaned from this thesis not only advance our comprehension of clone detection in contemporary software systems but also tackle the challenges stemming from labeled data scarcity and semantic clone detection, paving the way for future research in this field.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IIIT-Delhi	en_US
dc.subject	Graph-Based Siamese Networks	en_US
dc.subject	Clone Detection	en_US
dc.subject	Graph Neural Networks	en_US
dc.title	Decoding the DNA of code: an AI-infused approach to detect code cloning in software systems	en_US
dc.type	Thesis	en_US