Code variants and their retrieval using knowledge discovery based approaches

Vinayakarao, Venkatesh; Purandare, Rahul (Advisor)

dc.contributor.author	Vinayakarao, Venkatesh
dc.contributor.author	Purandare, Rahul (Advisor)
dc.date.accessioned	2018-09-04T09:09:23Z
dc.date.available	2018-09-04T09:09:23Z
dc.date.issued	2018-04
dc.identifier.uri	http://repository.iiitd.edu.in/xmlui/handle/123456789/625
dc.description.abstract	Code variants are alternative implementations of a code snippet: each alternative provides the same functionality but has different properties, which make some alternatives better suited to the overall project requirements. Developers routinely need to analyze existing code, find better reuse alternatives, and develop high-quality code that meets desired properties. However, searching for such code variants on the web poses several challenges. This dissertation addresses that problem by presenting new techniques to search for code variants. Classical program analysis techniques do not scale well to analyzing partial programs at web scale. Hence, we apply search techniques to mine code variants using the human-annotated natural language descriptions found in posts on Stack Overflow (SO), a popular discussion forum. We make four major contributions. Unlike clones and examples, code variants lack a rigorous characterization in the existing literature. So, as our first contribution, we present a characterization of code variants in which we discuss the code context, desired properties, and types of variants, along with implications for tool builders. With this knowledge about variants, we propose techniques to search for variants in SO as our second major contribution. We propose a novel structural model for source code that is based on developers’ perspective of similarity. To leverage the text and code components that we index from SO, we adapt an existing state-of-the-art term-weighting method to propose a Multi-Component Multi-Aspect Term Frequency - Inverse Document Frequency (MCMATF-IDF) model to retrieve code variants. Existing text-retrieval models do not work well on source code. Expressing natural language queries on source code is an open problem: many query terms in natural language have multiple surface forms in source code. 
We address this problem by perceiving source code as a collection of entities, which is our third major contribution. Further, our approaches depend on parsing code snippets in SO, and this is a bottleneck to their success: we observe that only 31.3% of code snippets in SO parse. Hence, in our fourth contribution, we apply a grounded theory approach to study these parsing problems. Based on this study, we develop a tool that can generate Abstract Syntax Trees for 63% of the code snippets in SO. Overall, the ability to perform semantic search over source code snippets, assisted by developer knowledge in the form of discussion forum data, opens up a new way to solve several important problems. It can lead to improvements in a variety of software engineering tasks and tools, such as semantic clone detection, code comprehension, and defect detection. Apart from supporting software engineering applications, as future work we plan to explore enhancing static analysis over source code snippets using the data from discussion forums.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IIIT-Delhi	en_US
dc.subject	Code Variants	en_US
dc.title	Code variants and their retrieval using knowledge discovery based approaches	en_US
dc.type	Thesis	en_US