Operationalizing the data quality framework : tindering datasets

Chug, Sezal; Kaushal, Priya; Kumaraguru, Ponnurangam (Advisor); Sethi, Tavpritesh (Advisor)

dc.contributor.author	Chug, Sezal
dc.contributor.author	Kaushal, Priya
dc.contributor.author	Kumaraguru, Ponnurangam (Advisor)
dc.contributor.author	Sethi, Tavpritesh (Advisor)
dc.date.accessioned	2022-04-01T06:31:19Z
dc.date.available	2022-04-01T06:31:19Z
dc.date.issued	2021-05
dc.identifier.uri	http://repository.iiitd.edu.in/xmlui/handle/123456789/1002
dc.description.abstract	Data is growing at an unprecedented rate, and with this growth comes responsibility for its quality. Data quality refers to the relevance of the information present and supports operations such as decision making and planning within an organization. Data quality is mostly measured on an ad hoc basis, and hence none of the existing concepts offers a specific practical application. The current investigation was undertaken to formulate a concrete platform where one can assess the quality of data and obtain a nutrition label for it. The proposed system quantifies and qualifies the provided data and assesses it at subjective as well as objective levels. In our research, we have proposed a metric that generates a Data Quality Label Approach, a Data Quality Score and a Comprehensive Report for judging its quality. In this empirical study, the Demographic and Health Surveys (DHS) Program dataset is used to judge the quality of data and assign a nutrition label using statistical modeling approaches. The nutrition label is intended to give users confidence in deploying the data for their respective applications. The results revealed that, owing to ongoing technological upgrades in data collection and processing, the nutrition label score of the DHS dataset has increased steadily over the years. The nutrition label defines the quality of the dataset using nine "ingredients": provenance; dataset characteristics; uniformity; metadata coupling; statistics covering the percentage of missing cells and duplicate rows; skewness of the data; the number of continuous and categorical columns; the correlation between columns of the dataset; and inconsistencies between highly correlated columns.
The output of this research is a data quality metric that helps our model formulate a comprehensive report, which gives an overview of the dataset's "ingredients", and predict a data quality score that helps end users judge the overall quality of the data.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IIIT-Delhi	en_US
dc.subject	Data Quality	en_US
dc.subject	Demographic and Health Surveys (DHS) Program	en_US
dc.subject	Dataset Nutrition Label	en_US
dc.subject	MetaData Matching	en_US
dc.subject	Pearson Correlation	en_US
dc.subject	Data Quality Metric	en_US
dc.title	Operationalizing the data quality framework : tindering datasets	en_US
dc.type	Other	en_US