Abstract:
Data is expanding at an unprecedented rate, and with this growth comes the responsibility of ensuring its quality. Data quality refers to the relevance of the information present and supports operations such as decision making and planning within an organization. Data quality is mostly measured on an ad hoc basis, and none of the existing concepts offers a concrete practical application. The current investigation was undertaken to formulate a concrete platform on which one can assess the quality of data and obtain a nutrition label for it. The proposed system quantifies and qualifies the provided data and assesses it at both subjective and objective levels. In our research, we propose a metric that generates a Data Quality Label, a Data Quality Score, and a Comprehensive Report for judging data quality. In this empirical study, the Demographic and Health Surveys (DHS) Program dataset is used to judge the quality of data and assign a nutrition label using statistical modeling approaches. The nutrition label is intended to give users confidence when deploying the data in their own applications. The results of the study reveal a steady increase in the nutrition label score of the DHS dataset over the years, which we attribute to ongoing technological upgrades in data collection and processing. The nutrition label describes the quality of a dataset using nine "ingredients", namely provenance, dataset characteristics, uniformity, metadata coupling, statistics encompassing the percentage of missing cells and duplicate rows, skewness of the data, the number of continuous and categorical columns, the correlation between columns of the dataset, and inconsistencies between highly correlated columns.
The output of this research is a data quality metric that enables our model to produce a comprehensive report giving an overview of the "ingredients" of the dataset and to predict a data quality score that helps end users judge the overall quality of the data.
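
As a rough illustration of the statistical "ingredients" listed above, the following Python sketch computes the percentage of missing cells, the percentage of duplicate rows, the counts of continuous and categorical columns, per-column skewness, and the highly correlated column pairs for a small table. It is not the authors' implementation: the function name quality_stats, the list-of-dicts input format, the use of None for missing cells, the rule that numeric columns count as continuous, and the 0.8 correlation threshold are all assumptions made for this example.

```python
from math import sqrt


def quality_stats(rows, corr_threshold=0.8):
    """Compute a few label statistics for a table given as a list of dicts.

    Assumptions (not from the paper): every row has the same keys, None marks
    a missing cell, and a column counts as continuous when all of its
    non-missing values are int or float.
    """
    columns = list(rows[0].keys()) if rows else []
    n_cells = len(rows) * len(columns)

    # Missing cells: any cell holding None.
    missing = sum(1 for row in rows for c in columns if row[c] is None)

    # Duplicate rows: a row identical to one seen earlier.
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)

    def present(col):
        return [row[col] for row in rows if row[col] is not None]

    def is_continuous(col):
        values = present(col)
        return bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )

    continuous = [c for c in columns if is_continuous(c)]
    categorical = [c for c in columns if c not in continuous]

    def skewness(values):
        # Population skewness: third central moment over variance^(3/2).
        n = len(values)
        mean = sum(values) / n
        m2 = sum((v - mean) ** 2 for v in values) / n
        m3 = sum((v - mean) ** 3 for v in values) / n
        return m3 / m2 ** 1.5 if m2 > 0 else 0.0

    def pearson(a, b):
        # Pearson correlation over rows where both cells are present.
        pairs = [(row[a], row[b]) for row in rows
                 if row[a] is not None and row[b] is not None]
        if len(pairs) < 2:
            return 0.0
        mx = sum(x for x, _ in pairs) / len(pairs)
        my = sum(y for _, y in pairs) / len(pairs)
        sxy = sum((x - mx) * (y - my) for x, y in pairs)
        sxx = sum((x - mx) ** 2 for x, _ in pairs)
        syy = sum((y - my) ** 2 for _, y in pairs)
        return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

    high = []
    for i, a in enumerate(continuous):
        for b in continuous[i + 1:]:
            rho = pearson(a, b)
            if abs(rho) >= corr_threshold:
                high.append((a, b, rho))

    return {
        "pct_missing_cells": 100.0 * missing / n_cells if n_cells else 0.0,
        "pct_duplicate_rows": 100.0 * duplicates / len(rows) if rows else 0.0,
        "n_continuous": len(continuous),
        "n_categorical": len(categorical),
        "skewness": {c: skewness(present(c)) for c in continuous},
        "high_correlations": high,
    }
```

Calling quality_stats on a list of row dictionaries returns a plain dictionary of these statistics. How such values would be weighted into a single score is not specified in the abstract and is left out of the sketch.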