Motivated by the strong performance of CLIP-based models on natural image-text tasks, recent efforts have adapted these architectures to medical applications, particularly radiology, where large paired datasets of images and reports, such as chest X-rays, are available. While these models have shown encouraging accuracy and discriminative performance, their fairness and robustness across clinical tasks remain largely underexplored.
In this study, we extensively evaluate six widely used CLIP-based models on chest X-ray classification using three publicly available datasets: MIMIC-CXR, NIH-CXR14, and NEATX. We assess the models' fairness across six conditions and patient subgroups defined by age, sex, and race. Additionally, we assess their robustness to shortcut learning by comparing performance on pneumothorax cases with and without chest drains.
Our results indicate performance gaps between patients of different ages, but more equitable results for the other attributes. Moreover, all models perform worse on images without chest drains, suggesting a reliance on spurious correlations. We complement this performance analysis with a study of the embeddings generated by the models: although sensitive attributes can be classified from the embeddings, no such patterns are visible in PCA projections, highlighting the limitations of these visualisation techniques when assessing models.
Zero-shot classification with CLIP-based models
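To make the evaluation protocol concrete, below is a minimal sketch of CLIP-style zero-shot classification using a generic CLIP checkpoint from Hugging Face transformers. The model name, prompt wording, and image path are illustrative assumptions rather than the paper's exact pipeline, but the evaluated medical variants (e.g. MedCLIP) expose analogous image and text encoders.

```python
# Minimal zero-shot classification sketch with a generic CLIP model.
# Model name, prompts, and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One positive/negative prompt pair per condition (hypothetical wording).
prompts = ["a chest x-ray with pneumothorax",
           "a chest x-ray with no pneumothorax"]
image = Image.open("example_cxr.png").convert("RGB")  # hypothetical path

inputs = processor(text=prompts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; a softmax
# over the prompt pair gives a probability for the finding.
probs = outputs.logits_per_image.softmax(dim=-1)
print(f"P(pneumothorax) = {probs[0, 0].item():.3f}")
```

Repeating this over every image, with one prompt pair per condition, yields the per-condition scores on which the subgroup analyses below are based.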
Performance of the MedCLIP model across different conditions and subgroups
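Given per-image scores and subgroup metadata, the fairness analysis reduces to computing a metric per subgroup and comparing the gaps. A minimal sketch, assuming AUROC as the metric and hypothetical variable names and toy data:

```python
# Hypothetical per-subgroup evaluation: AUROC of one condition's
# scores, stratified by a sensitive attribute such as age group.
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc(y_true, y_score, groups):
    """AUROC per subgroup plus the largest pairwise gap."""
    aurocs = {
        g: roc_auc_score(y_true[groups == g], y_score[groups == g])
        for g in np.unique(groups)
    }
    gap = max(aurocs.values()) - min(aurocs.values())
    return aurocs, gap

# Toy data standing in for real labels, model scores, and age bins.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
y_score = y_true * 0.5 + rng.random(200)
groups = rng.choice(["18-40", "40-65", "65+"], 200)
print(subgroup_auroc(y_true, y_score, groups))
```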
We compute embeddings for the images and the "FINDINGS" sections of the radiology reports from MIMIC and project them into two dimensions using PCA. When grouping by the different sensitive attributes we do not observe any patterns; however, for most models we see a clear gap between the image and text embeddings.
PCA of MedImageInsight image and text embeddings grouped on different attributes
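The projection step itself is straightforward; the following is a minimal sketch, assuming the image and text embeddings have already been extracted and saved (file names are hypothetical). Fitting a single PCA on both modalities together is what makes the image-text gap visible in one plot.

```python
# Sketch of the 2-D PCA projection of image and "FINDINGS" text
# embeddings; the .npy files are hypothetical precomputed (N x D) arrays.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

img_emb = np.load("image_embeddings.npy")  # hypothetical file
txt_emb = np.load("text_embeddings.npy")   # hypothetical file

# Fit one PCA on both modalities so any gap between them is preserved.
pca = PCA(n_components=2)
proj = pca.fit_transform(np.vstack([img_emb, txt_emb]))
img_2d, txt_2d = proj[: len(img_emb)], proj[len(img_emb):]

plt.scatter(img_2d[:, 0], img_2d[:, 1], s=2, label="image")
plt.scatter(txt_2d[:, 0], txt_2d[:, 1], s=2, label="text")
plt.legend()
plt.show()
```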
However, using simple models such as a linear probe, a k-NN classifier, and a single-hidden-layer MLP, we could still classify the sensitive attributes from the image embeddings. It is important to note that while such results show that sensitive attributes are encoded in the embeddings, this alone is not enough to conclude that they are actually used as shortcuts for downstream tasks.
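A minimal sketch of this probing setup with scikit-learn, using the three classifier families named above; the toy embeddings and labels stand in for the real data, and all hyperparameters are illustrative assumptions:

```python
# Probing image embeddings for a sensitive attribute with simple
# classifiers; X and y are toy stand-ins for embeddings and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))   # stand-in for image embeddings
y = rng.choice(["F", "M"], 1000)   # stand-in for the sensitive attribute

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

probes = {
    "linear probe": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "MLP (1 hidden layer)": MLPClassifier(hidden_layer_sizes=(256,),
                                          max_iter=500),
}
for name, clf in probes.items():
    clf.fit(X_train, y_train)
    print(f"{name}: accuracy = {clf.score(X_test, y_test):.3f}")
```

Above-chance accuracy on held-out data indicates the attribute is linearly or locally recoverable from the embeddings, but, as noted above, it does not by itself establish that the model relies on it downstream.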
@article{sourget2025fairnessclip,
title={Fairness and Robustness of CLIP-Based Models for Chest X-rays},
author={Théo Sourget and David Restrepo and Céline Hudelot and Enzo Ferrante and Stergios Christodoulidis and Maria Vakalopoulou},
journal={arXiv preprint arXiv:2507.21291},
year={2025},
}