Fairness and Robustness of CLIP-Based Models for Chest X-rays

1MICS, CentraleSupélec - Université Paris-Saclay, 2CONICET, Universidad de Buenos Aires

Abstract

Motivated by the strong performance of CLIP-based models in natural image-text domains, recent efforts have adapted these architectures to medical tasks, particularly in radiology, where large paired datasets of images and reports, such as chest X-rays, are available. While these models have shown encouraging results in terms of accuracy and discriminative performance, their fairness and robustness across different clinical tasks remain largely underexplored.

In this study, we extensively evaluate six widely used CLIP-based models on chest X-ray classification using three publicly available datasets: MIMIC-CXR, NIH-CXR14, and NEATX. We assess the models' fairness across six conditions and across patient subgroups based on age, sex, and race. We also assess robustness to shortcut learning by evaluating performance on pneumothorax cases with and without chest drains.

Our results reveal performance gaps between patients of different ages, but more equitable results across the other attributes. Moreover, all models perform worse on images without chest drains, suggesting reliance on spurious correlations. We further complement the performance analysis with a study of the embeddings generated by the models. While the sensitive attributes can be classified from the embeddings, no such patterns appear in PCA projections, showing the limitations of these visualisation techniques when assessing models.

Zero-shot evaluation of CLIP-based models

We apply the models in a zero-shot setting using six labels from the MIMIC-CXR dataset: atelectasis, cardiomegaly, consolidation, pleural effusion, pneumonia, and pneumothorax. To classify an image, we compute, for each label, the similarities between the image embedding and the embeddings of two templates, "Chest {CLASS}" and "Chest No Findings". We then apply a softmax over the two similarities to obtain the probability of the disease.
Our setup to perform zero-shot classification with two templates.
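The two-template scheme above can be sketched as follows. This is a minimal illustration using NumPy only: the random vectors stand in for real CLIP encoder outputs, and the temperature value is an illustrative assumption, not the one used in the paper.

```python
import numpy as np

def zero_shot_probability(image_emb, pos_text_emb, neg_text_emb, temperature=100.0):
    """P(disease) from cosine similarities between an image embedding and the
    embeddings of two prompts: "Chest {CLASS}" (pos) and "Chest No Findings" (neg)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    sims = np.array([cosine(image_emb, pos_text_emb),
                     cosine(image_emb, neg_text_emb)]) * temperature
    # Softmax over the two similarities; the first entry is P(disease).
    exps = np.exp(sims - sims.max())
    return exps[0] / exps.sum()

# Toy example with random embeddings (real ones come from a CLIP image/text encoder)
rng = np.random.default_rng(0)
img, pos, neg = rng.normal(size=(3, 512))
p = zero_shot_probability(img, pos, neg)
```

When the two text prompts are identical, the softmax yields exactly 0.5, which is a handy sanity check for the setup.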

Zero-shot classification with CLIP-based models


All models except CXR-CLIP obtain results above random, especially MedCLIP, MedImageInsight, and CheXzero, supporting their use in zero-shot settings.

We evaluate the performance across different patient subgroups based on age, sex, and race. We find performance gaps across patients' ages, while the results seem closer for the other sensitive attributes. See an example for MedCLIP:

Performances of the MedCLIP model across different conditions and subgroups
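Subgroup comparisons like the one above reduce to computing the classification metric separately per group. A minimal sketch with scikit-learn, using synthetic labels and scores (the age bins are illustrative, not the paper's exact binning):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(y_true, y_score, groups):
    """Per-subgroup AUROC, e.g. groups = patient age bins, sex, or race."""
    return {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}

# Synthetic example: scores carry signal about the label, so AUCs sit above 0.5
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
scores = y * 0.5 + rng.normal(scale=0.5, size=200)
groups = np.array(["18-40", "40-60", "60+"])[rng.integers(0, 3, size=200)]
aucs = subgroup_auc(y, scores, groups)
```

The fairness gap for an attribute is then simply the spread (e.g. max minus min) of the per-group values.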


Encoding of sensitive attributes

Not visible in PCA...

We compute the embeddings of the images and of the "FINDINGS" sections of radiology reports from MIMIC-CXR and project them into two dimensions using PCA. When grouped by the different sensitive attributes, we do not see any patterns; however, for most models we can see a gap between the image and text embeddings.

PCA of MedImageInsight image and text embeddings grouped on different attributes.

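The projection and the image-text gap can be reproduced with a few lines of scikit-learn. The embeddings below are random stand-ins (the mean offset between the two modalities is an assumption made to mimic the gap seen with real models):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Stand-ins for CLIP image embeddings and report ("FINDINGS") text embeddings
image_emb = rng.normal(size=(100, 512))
text_emb = rng.normal(loc=0.5, size=(100, 512))  # offset mimics a modality gap

# Fit one PCA on both modalities so they share the same 2-D space
pca = PCA(n_components=2)
proj = pca.fit_transform(np.vstack([image_emb, text_emb]))
img_2d, txt_2d = proj[:100], proj[100:]

# The modality gap shows up as a distance between the two centroids
gap = np.linalg.norm(img_2d.mean(axis=0) - txt_2d.mean(axis=0))
```

Colouring `img_2d` by a sensitive attribute (age, sex, race) is then a scatter plot away; the point of the section is that no visible clusters emerge there even though the attributes are recoverable by a classifier.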


...But predictable with simple models

However, using simple models such as a linear probe, a k-NN classifier, and a single-hidden-layer MLP, we could still predict the sensitive attributes from the image embeddings. It is important to note that while such results show that sensitive attributes are encoded in the embeddings, this alone is not enough to conclude that they are actually used as shortcuts in downstream tasks.
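The three probes can be sketched with scikit-learn as below. The embeddings and the sensitive attribute are synthetic (here the attribute is deliberately made linearly decodable from one coordinate, so the linear probe succeeds); hyperparameters are illustrative, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 64))          # stand-in image embeddings
attr = (X[:, 0] > 0).astype(int)        # synthetic sensitive attribute
X_tr, X_te, y_tr, y_te = train_test_split(X, attr, random_state=0)

probes = {
    "linear": LogisticRegression(max_iter=1000),           # linear probe
    "knn": KNeighborsClassifier(n_neighbors=5),            # k-NN
    "mlp": MLPClassifier(hidden_layer_sizes=(64,),         # 1-hidden-layer MLP
                         max_iter=500, random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in probes.items()}
```

Held-out accuracy well above chance for any of the probes indicates the attribute is encoded in the embeddings, which is exactly the test run in the section above.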


Robustness to shortcut learning

We use the NIH-CXR14 and NEATX datasets to assess the impact of shortcut learning. We evaluate the models on pneumothorax classification for patients with and without chest drains, the common treatment for this condition. We found that all models except CXR-CLIP performed better on images with chest drains, indicating possible reliance on this spurious correlation.
Evolution of AUC when increasing the mask size.

See more in the paper

Check out the full paper for additional results on the gaps between the embeddings, the models' calibration, and more details on the experimental setup.

BibTeX


@article{sourget2025fairnessclip,
    title={Fairness and Robustness of CLIP-Based Models for Chest X-rays}, 
    author={Théo Sourget and David Restrepo and Céline Hudelot and Enzo Ferrante and Stergios Christodoulidis and Maria Vakalopoulou},
    journal={arXiv preprint arXiv:2507.21291},
    year={2025},
}