IIIT-Delhi Institutional Repository

Operationalizing the data quality framework : tindering datasets

dc.contributor.author Chug, Sezal
dc.contributor.author Kaushal, Priya
dc.contributor.author Kumaraguru, Ponnurangam (Advisor)
dc.contributor.author Sethi, Tavpritesh (Advisor)
dc.date.accessioned 2022-04-01T06:31:19Z
dc.date.available 2022-04-01T06:31:19Z
dc.date.issued 2021-05
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1002
dc.description.abstract Data is growing at an unprecedented rate, and with this growth comes responsibility for its quality. Data quality refers to the relevance of the information present and supports operations such as decision making and planning within an organization. Data quality is mostly measured on an ad hoc basis, so none of the existing concepts offers a specific practical application. The current investigation was undertaken to build a concrete platform on which one can assess the quality of data and obtain a nutrition label for it. The proposed system quantifies and qualifies the provided data and assesses it at both subjective and objective levels. In our research, we have proposed a metric that generates a Data Quality Label, a Data Quality Score and a Comprehensive Report for judging data quality. In this empirical study, the Demographics and Health Surveys (DHS) Program dataset is used to judge the quality of data and assign a nutrition label using statistical modeling approaches. The value of the nutrition label is intended to give users confidence in deploying the data for their respective applications. The results of the study revealed that, owing to ongoing technological upgrades in data collection and processing, the nutrition label score in the DHS dataset has increased steadily over the years. The nutrition label defines the quality of the dataset using nine "ingredients", namely provenance, dataset characteristics, uniformity, metadata coupling, statistics encompassing the percentage of missing cells and duplicate rows, skewness of data, number of continuous and categorical columns, the correlation between columns of a dataset, and inconsistencies between highly correlated columns.
The output of this research is a data quality metric that helps our model produce a comprehensive report, which gives an overview of the "ingredients" of the dataset, and predict a data quality score that helps end users judge the overall quality of the data. en_US
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Data Quality en_US
dc.subject Demographics and Health Surveys (DHS) Program en_US
dc.subject Dataset Nutrition Label en_US
dc.subject MetaData Matching en_US
dc.subject Pearson Correlation en_US
dc.subject Data Quality Metric en_US
dc.title Operationalizing the data quality framework : tindering datasets en_US
dc.type Other en_US
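
The abstract names several statistical "ingredients" of the nutrition label: the percentage of missing cells, duplicate rows, skewness, and the correlation between columns (the subject terms mention Pearson correlation). The record does not say how the authors compute them, so the following is only a minimal sketch of what such measures might look like, using the Python standard library. All function names, the row-of-tuples layout, and the use of `None` for a missing cell are assumptions made for illustration, not the authors' implementation.

```python
import math

# Hypothetical helpers: names and data layout are assumptions,
# not taken from the thesis described in this record.

def missing_cell_percentage(rows):
    """Percentage of cells that are None across all rows."""
    total = sum(len(row) for row in rows)
    missing = sum(1 for row in rows for value in row if value is None)
    return 100.0 * missing / total if total else 0.0

def duplicate_row_count(rows):
    """Number of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

def skewness(values):
    """Population skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0

# Toy table with one missing cell and one repeated row.
rows = [(1, 2.0), (2, 4.0), (2, 4.0), (3, None)]
print(missing_cell_percentage(rows))
print(duplicate_row_count(rows))
print(skewness([1, 1, 4]))
print(pearson([1, 2, 3], [2, 4, 6]))
```

How these raw measures would be weighted and combined into a single data quality score is not stated in the record, so the sketch stops at the individual statistics.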

