Abstract:
With the rapid growth of the World Wide Web and the Internet of Things, huge amounts of digital data are being produced every day. Digital forensics investigators face an uphill battle when they have to manually screen and examine such enormous volumes of data during an investigation. A major requirement of modern forensic investigations is to perform automatic filtering of correlated data, thereby reducing and focusing the manual effort of the investigator. There are two types of filtering: blacklisting and whitelisting. Blacklisting is the process of filtering data by matching them against a set of known-to-be-bad files (as determined by the investigator); the files that match are the ones an investigator needs to examine closely. Whitelisting, on the other hand, is the process of filtering by matching files against a set of known-to-be-good files; the files that match need not be examined by the investigator. Approximate matching, also known as similarity hashing, is a generic term for the techniques used to perform this filtering by measuring the similarity between two digital objects, typically by assigning a "similarity score". Over the years, several approximate matching algorithms have been proposed and are used in practice; some of the prominent schemes are ssdeep, sdhash, and mvHash-B.

This dissertation presents security analyses of existing approximate matching tools and techniques. We show that most of the existing schemes are prone to active adversary attacks: an attacker, by making feasible changes to the content of a file, can intelligently alter the final similarity score to evade detection. Thus, an alternative hashing scheme is required that can resist this attack. As a core contribution of this dissertation, we develop a new approximate matching algorithm, FbHash.
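To make the notion of a similarity score concrete, the following is a minimal illustrative sketch of approximate matching (not the actual ssdeep, sdhash, or FbHash constructions): two byte streams are compared by the Jaccard similarity of their overlapping byte n-gram sets, scaled to a 0-100 score. The function names and the n-gram size are illustrative choices, not part of any scheme discussed above.

```python
# Illustrative sketch of a similarity score in the spirit of approximate
# matching; this is NOT the ssdeep/sdhash/FbHash algorithm, only a toy
# Jaccard-over-n-grams comparison to show how a 0-100 score can arise.

def ngrams(data: bytes, n: int = 7) -> set:
    """Return the set of overlapping n-byte substrings of `data`."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def similarity_score(a: bytes, b: bytes, n: int = 7) -> int:
    """Jaccard similarity of the two n-gram sets, scaled to 0-100."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 100
    return round(100 * len(ga & gb) / len(ga | gb))

doc1 = b"The quick brown fox jumps over the lazy dog. " * 20
doc2 = doc1.replace(b"lazy", b"busy")   # near-duplicate: small edit
doc3 = bytes(range(256)) * 4            # unrelated content

print(similarity_score(doc1, doc2))  # high score for the near-duplicate
print(similarity_score(doc1, doc3))  # low score for unrelated data
```

Real schemes differ substantially in how they select features and compress them into a digest, but the output contract is the same: a score that an investigator (or an attacker) can interpret as a degree of similarity.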
We show that our algorithm is secure against active attacks and can detect similarity with 98% accuracy in common use-cases. We also provide a detailed comparative analysis of our construction against other existing schemes and show that our scheme has 28% higher accuracy than other schemes for uncompressed file formats (e.g., text files) and 50% higher accuracy for compressed file formats (e.g., docx). Our proposed algorithm is able to correlate a file fragment as small as 1% of the source file with an observed 100% detection rate, and is able to detect commonality as small as 1% between two documents with an appropriate similarity score. Further, we show that our scheme also produces the fewest false negatives among all such schemes.

In order to identify the capabilities of similarity matching schemes, it is important to have a systematic method to evaluate existing and future algorithms. This dissertation therefore also presents a general, platform-independent evaluation tool for approximate matching algorithms that assesses existing and future algorithms on four pragmatic test cases using the following metrics: true negative rate, false positive rate, precision, recall, F-score, and Matthews Correlation Coefficient (MCC). To understand the true capabilities of an algorithm, it is important to evaluate it on a real-world dataset. Our tool provides a real-world dataset for each of the four test cases, and also provides an automated way to generate real-world datasets for other cases, which will help support future research in this domain.
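The evaluation metrics listed above all derive from a binary confusion matrix. The sketch below shows their standard definitions; the confusion-matrix counts (tp, fp, tn, fn) are illustrative numbers, not results from the dissertation's experiments.

```python
# Standard definitions of the evaluation metrics named in the abstract,
# computed from a binary confusion matrix. The counts below are
# illustrative only, not experimental results.
import math

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)               # a.k.a. true positive rate
    return {
        "tnr": tn / (tn + fp),            # true negative rate
        "fpr": fp / (fp + tn),            # false positive rate
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
        # Matthews Correlation Coefficient: balanced even when the
        # positive and negative classes have very different sizes.
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

m = metrics(tp=90, fp=10, tn=80, fn=20)
print({k: round(v, 3) for k, v in m.items()})
```

MCC is worth including alongside the F-score because, unlike the F-score, it accounts for true negatives and therefore stays informative on imbalanced test sets, which are common in filtering scenarios where known-good files vastly outnumber matches.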