Abstract:
AI systems have achieved domain-expert-level performance on a number of patient-facing healthcare tasks. However, these systems can also incorporate and amplify human biases present in the datasets they are trained on. Such biases render a system unsuitable for historically under-served populations, such as female patients, infants, and senior citizens, because a person with a disease may be misclassified as healthy, delaying access to healthcare services and raising serious ethical concerns. In this project, we explore language models for healthcare applications and expose this bias across gender and age groups by performing phenotyping on benchmark datasets and segregating the data by these categories. We then show how results differ across groups in terms of the evaluation metrics used by phenotyping benchmark papers, namely accuracy, precision, recall, and F1-score.
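As a minimal sketch of the group-wise evaluation described above (assuming model predictions and demographic attributes are available in tabular form; the dataframe and the column names `gender`, `label`, and `prediction` are illustrative placeholders, not taken from the benchmark), the per-group metrics could be computed as follows:

```python
# Illustrative sketch: compute accuracy, precision, recall, and F1
# separately for each demographic group. Column names are hypothetical.
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def metrics_by_group(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Evaluate binary phenotyping predictions per demographic group."""
    rows = []
    for group, sub in df.groupby(group_col):
        y_true, y_pred = sub["label"], sub["prediction"]
        rows.append({
            group_col: group,
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0),
        })
    return pd.DataFrame(rows)

# Example usage: compare phenotyping performance across gender groups.
# results = metrics_by_group(predictions_df, group_col="gender")
```

Comparing the resulting rows side by side makes any disparity between groups directly visible in each metric.

Keywords: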