Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study is a research paper published in The Lancet Digital Health (2024). On theSindex it has a DataRank of 0.909. It has been cited 427 times.
Background: Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in health care, ranging from automating administrative tasks to augmenting clinical decision making. However, these models also pose a danger of perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care. We aimed to assess whether GPT-4 encodes racial and gender biases that impact its use in health care.
Methods: Using the Azure OpenAI application interface, this model evaluation study tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain, namely medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in health care. GPT-4 estimates of the demographic distribution of medical conditions were compared with true US prevalence estimates. Differential diagnosis and treatment planning were evaluated across demographic groups using standard statistical tests for significance between groups.
Findings: We found that GPT-4 did not appropriately model the demographic diversity of medical conditions, consistently producing clinical vignettes that stereotype demographic presentations. The differential diagnoses created by GPT-4 for standardised clinical vignettes were more likely to include diagnoses that stereotype certain races, ethnicities, and genders. Assessment and plans created by the model showed significant association between demographic attributes and recommendations for more expensive procedures as well as differences in patient perception.
Interpretation: Our findings highlight the urgent need for comprehensive and transparent bias assessments of LLM tools such as GPT-4 for intended use cases before they are integrated into clinical care. We discuss the potential sources of these biases and potential mitigation strategies before clinical implementation.
Funding: Priscilla Chan and Mark Zuckerberg.
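The abstract says that GPT-4's estimates of demographic distributions were compared with US prevalence estimates, but it does not name the test used. As a rough illustration only, the sketch below assumes a Pearson chi-square goodness-of-fit test, which is one common choice for this kind of comparison. The function names and all numbers are hypothetical placeholders, not data or code from the paper.

```python
from math import erfc, sqrt


def chi_square_gof(observed_counts, reference_props):
    """Pearson chi-square statistic for observed counts vs. reference proportions.

    Returns (statistic, degrees_of_freedom). Assumes the reference
    proportions are positive and sum to 1.
    """
    if len(observed_counts) != len(reference_props):
        raise ValueError("observed_counts and reference_props must align")
    total = sum(observed_counts)
    statistic = 0.0
    for observed, prop in zip(observed_counts, reference_props):
        expected = total * prop  # count expected under the reference distribution
        statistic += (observed - expected) ** 2 / expected
    return statistic, len(observed_counts) - 1


def p_value_one_df(statistic):
    """Upper-tail p-value of a chi-square statistic with exactly 1 degree of freedom."""
    return erfc(sqrt(statistic / 2.0))


# Hypothetical placeholder numbers (NOT from the study): out of 100 generated
# vignettes, 70 describe group A and 30 group B, while the reference
# prevalence is assumed to be split evenly between the two groups.
observed = [70, 30]
reference = [0.5, 0.5]

statistic, dof = chi_square_gof(observed, reference)
p_value = p_value_one_df(statistic)  # valid here because dof == 1
```

With more than two demographic groups the statistic is computed the same way, but the p-value needs a general chi-square distribution (for example from a statistics library) rather than the one-degree-of-freedom shortcut used here.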
FAIR checklist signals are shown for context only and do not affect DataRank scoring.
Base Score Contribution
0.909
From this paper's citation signal
Citation Network Contribution
0
Citation network not refreshed for this result
This paper's DataRank is currently driven only by its base citation score. Citation network data was not refreshed for this result.
DataRank blends this paper's own citation count with the influence of the papers that cite it. Here, roughly 100% comes from its base citations and 0% from the citation network.
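The percentage split can be reproduced from the two contributions shown on this page (0.909 and 0). The sketch below assumes the two contributions are simply added to form the DataRank, which matches the numbers displayed here but is not confirmed as the actual methodology; `contribution_shares` is a hypothetical helper name.

```python
def contribution_shares(base, network):
    """Return (base_share, network_share) as fractions of the combined score.

    Assumes an additive blend: score = base + network.
    """
    total = base + network
    if total == 0:
        return 0.0, 0.0  # no signal at all, so no meaningful split
    return base / total, network / total


# Values as displayed on this page.
base_share, network_share = contribution_shares(0.909, 0.0)
summary = f"{base_share:.0%} base, {network_share:.0%} network"
print(summary)  # prints "100% base, 0% network"
```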
Citers are pulled from OpenAlex sorted by cited_by_count:desc and capped per paper, so when the cap binds we keep the highest-signal references and the score is reproducible across reruns.
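A request of this shape might look like the sketch below, which only builds the URL and makes no network call. It uses the public OpenAlex works endpoint with its `cites:` filter, `sort`, and `per-page` parameters. The work ID `W0000000000` and the cap of 25 are hypothetical placeholders, since the page does not state the actual cap, and `citers_url` is an illustrative name rather than the site's own code.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def citers_url(work_id, cap=25):
    """Build an OpenAlex URL listing works that cite `work_id`,
    most-cited first, limited to `cap` results on one page."""
    if not 1 <= cap <= 200:  # assumed per-page limit for OpenAlex
        raise ValueError("cap must be between 1 and 200")
    params = {
        "filter": f"cites:{work_id}",   # works whose references include work_id
        "sort": "cited_by_count:desc",  # highest-cited citers first
        "per-page": cap,                # the per-paper cap
    }
    # Keep ':' unescaped so the filter and sort values stay readable.
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':')}"


url = citers_url("W0000000000", cap=25)  # placeholder ID, not this paper's
```

Because the sort key and cap are fixed, repeated runs request the same top slice of citers, apart from citation counts changing upstream or ties being ordered differently.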