Beyond images: ConvNeXt vs. vision-language models for automated breast density classification in screening mammography
Abstract
This study evaluates and compares the effectiveness of different deep learning approaches for automated breast density classification according to the BI-RADS system. Specifically, the research examines two distinct architectures: ConvNeXt, a CNN-based model, and BioMedCLIP, a vision-language model that integrates textual information through token-based labels. Using mammographic images from TecSalud at Tecnológico de Monterrey, the study assesses these models across three distinct learning paradigms: zero-shot classification, linear probing with token-based descriptions, and fine-tuning with numerical class labels. The experimental results demonstrate that while vision-language models offer theoretical advantages in terms of interpretability and zero-shot capabilities, CNN-based architectures with end-to-end fine-tuning currently deliver superior performance for this specialized medical imaging task. ConvNeXt achieves an accuracy of up to 0.71 and F1 scores of 0.67, compared to BioMedCLIP's best performance of 0.57 accuracy with linear probing. A comprehensive analysis of classification patterns revealed that all models encountered difficulties in distinguishing between adjacent breast density categories, particularly those involving heterogeneously dense tissue. This challenge mirrors known difficulties in clinical practice, where even experienced radiologists exhibit inter-observer variability in density assessment. The performance discrepancy between models was further examined through detailed loss-curve analysis and confusion matrices, revealing specific strengths and limitations of each approach. A key limitation in BioMedCLIP's performance stemmed from insufficient semantic richness in the textual tokens representing each density class: when category distinctions relied on subtle linguistic differences, such as "extremely" versus "heterogeneously", the model struggled to form robust alignments between visual features and textual descriptions. The research contributes to the growing body of knowledge on AI applications in breast imaging by systematically comparing traditional and multimodal approaches under consistent experimental conditions. The findings highlight both the current limitations and future potential of vision-language models in mammographic analysis, suggesting that enhanced textual descriptions and domain-specific adaptations could bridge the performance gap while preserving the interpretability benefits of multimodal approaches for clinical applications.
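To make the zero-shot paradigm described above concrete, the following minimal sketch shows prompt-based BI-RADS density classification with a publicly released BiomedCLIP checkpoint loaded through open_clip. The checkpoint identifier, the prompt wording, and the input file name are illustrative assumptions and are not taken from this work; the abstract's point about semantic richness would correspond to replacing these short prompts with more descriptive ones.

# Illustrative sketch (not the authors' code): zero-shot BI-RADS breast
# density classification with a BiomedCLIP checkpoint via open_clip.
import torch
from PIL import Image
import open_clip

# Assumed public checkpoint on the Hugging Face hub.
MODEL_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

# One textual prompt per BI-RADS density category (a-d). These single-phrase
# labels are exactly the kind of minimally descriptive tokens the abstract
# identifies as a limitation.
prompts = [
    "a mammogram of a breast that is almost entirely fatty",
    "a mammogram of a breast with scattered areas of fibroglandular density",
    "a mammogram of a heterogeneously dense breast",
    "a mammogram of an extremely dense breast",
]

# "mammogram.png" is a placeholder for a preprocessed screening image.
image = preprocess(Image.open("mammogram.png").convert("RGB")).unsqueeze(0)
text = tokenizer(prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and each prompt, converted into
    # per-category probabilities; the argmax is the predicted density class.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({c: float(p) for c, p in zip("abcd", probs[0])})

Linear probing, by contrast, would freeze model.encode_image and train a small classifier on the pooled image features, while the ConvNeXt baseline reported above is fine-tuned end to end on numerical class labels.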
Description
https://orcid.org/0000-0001-5235-7325