Abstract:
Healthcare datasets are not readily available to researchers and innovators because of patient privacy concerns and government regulations. We propose a novel pipeline that generates bespoke synthetic datasets matching original clinical datasets: the synthetic data closely represent the real-world data, yet no individual sample can be traced back to a patient. In this project we work with publicly available clinical datasets, because clinical data are central to Patient-Centered Outcomes Research (PCOR), which focuses on effective prevention and treatment measures for the individual patient. Hospitals, clinical data researchers, and innovators all need low-cost clinical datasets with large sample sizes. Our prototype addresses this need by providing several synthetic versions of each publicly available dataset, each created with a different machine learning algorithm. The prototype will include a thorough analysis of the synthetic datasets with respect to metrics such as classification accuracy, precision, recall, and F1 score, so that researchers who analyze the datasets have sufficient information to develop their theories or research. We can also help hospitals by providing an analysis of how much augmentation a particular task requires, obtained by evaluating the efficacy of the various algorithms used to create synthetic clinical data. We expect our work to reveal patterns that indicate the best-performing synthetic data augmentation model for given characteristics of the clinical data, such as column/feature types and sample count, among others.
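As an illustration of the evaluation metrics named above, the following is a minimal sketch, not the project's actual implementation. It assumes a binary classification task, and it assumes one way of comparing datasets: train one classifier on real data and one on synthetic data, then score both on the same held-out real labels. The function names `classification_metrics` and `metric_gap` and all label values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Return accuracy, precision, recall, and F1 for a binary task."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Fall back to 0.0 when a denominator is zero (no positive predictions
    # or no positive labels); this convention is an assumption of the sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def metric_gap(
    real_scores: Dict[str, float], synthetic_scores: Dict[str, float]
) -> Dict[str, float]:
    """Per-metric difference: (trained on real) minus (trained on synthetic)."""
    return {name: real_scores[name] - synthetic_scores[name] for name in real_scores}


# Hypothetical held-out real labels and predictions from two classifiers,
# one trained on real data and one trained on synthetic data.
y_real_test = [1, 0, 1, 1, 0, 0, 1, 0]
pred_trained_on_real = [1, 0, 1, 1, 0, 1, 1, 0]
pred_trained_on_synthetic = [1, 0, 0, 1, 0, 1, 1, 0]

real_scores = classification_metrics(y_real_test, pred_trained_on_real)
synthetic_scores = classification_metrics(y_real_test, pred_trained_on_synthetic)
gap = metric_gap(real_scores, synthetic_scores)

for name in ("accuracy", "precision", "recall", "f1"):
    print(f"{name}: real={real_scores[name]:.3f} "
          f"synthetic={synthetic_scores[name]:.3f} gap={gap[name]:.3f}")
```

A small gap on every metric would suggest that, for that task, the synthetic dataset preserves the signal a classifier needs. The same comparison could be repeated for each generation algorithm and dataset.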