On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings

arXiv:2603.17246v1 Announce Type: new Abstract: Vision-Language Models (VLMs) exhibit a characteristic "cone effect" in which nonlinear encoders map embeddings into highly concentrated regions of the representation space, contributing to cross-modal separation known as the modality gap. While this phenomenon has been widely observed, its practical impact on supervised multimodal learning, particularly in medical domains, remains unclear. In this work, we introduce a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter λ. This enables systematic analysis of how the modality gap affects downstream multimodal performance without expensive retraining. We evaluate generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in a supervised multimodal setting. Results consistently show that reducing an excessive modality gap improves downstream performance, with medical datasets exhibiting stronger sensitivity to gap modulation; however, fully collapsing the gap is not always optimal, and intermediate, task-dependent separation yields the best results. These findings position the modality gap as a tunable property of multimodal representations rather than a quantity that should be universally minimized.

Executive Summary

This study examines the 'cone effect' and the resulting modality gap in medical Vision-Language Embeddings (VLEs): nonlinear encoders concentrate each modality's embeddings in a narrow region of the representation space, separating vision representations from language representations. The authors propose a lightweight post-hoc mechanism that keeps pretrained encoders frozen and controls the modality gap through a single hyperparameter, λ. They evaluate the impact of gap modulation on downstream multimodal performance across diverse medical and natural datasets. The results indicate that reducing an excessive modality gap improves performance, with medical datasets showing stronger sensitivity to gap modulation, although an intermediate, task-dependent gap outperforms full collapse. The findings position the modality gap as a tunable property rather than a quantity to be universally minimized, advancing the understanding of VLEs and their applications in medical domains.

Key Points

  • The 'cone effect' and 'modality gap' in Medical VLEs contribute to cross-modal separation.
  • A lightweight post-hoc mechanism is proposed to control the modality gap through a single hyperparameter.
  • Modality gap reduction improves downstream multimodal performance, particularly in medical datasets.
  • The modality gap is a tunable property rather than a quantity that should be universally minimized: intermediate, task-dependent separation outperforms both the original and a fully collapsed gap.
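The paper's exact mechanism is not detailed in this summary; the sketch below is a hypothetical illustration of one common post-hoc approach to gap control, in which frozen, L2-normalized embeddings are shifted along the inter-centroid direction by a factor derived from λ. All function names are illustrative, not the authors' API.

```python
import numpy as np

def modality_gap(img_emb, txt_emb):
    """Estimate the modality gap as the distance between the centroids
    of the L2-normalized image and text embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

def modulate_gap(img_emb, txt_emb, lam):
    """Shift each modality's normalized embeddings along the inter-centroid
    direction. lam = 1.0 keeps the original gap; lam = 0.0 makes the two
    modality centroids coincide. Encoders stay frozen throughout."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    delta = img.mean(axis=0) - txt.mean(axis=0)  # gap vector between centroids
    shift = (1.0 - lam) * delta / 2.0            # move both modalities halfway
    return img - shift, txt + shift
```

Because the shift acts only on centroids, the within-modality geometry is preserved; the centroid distance of the shifted embeddings scales linearly with λ.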

Merits

Strength in Methodology

The study employs a systematic and controlled approach to evaluate the impact of modality gap reduction on downstream performance, using both generalist and medically specialized models.

Strength in Contributions

The findings provide new insights into the role of modality gap in Medical VLEs and highlight the importance of tuning the modality gap for optimal performance in medical domains.

Demerits

Limitation in Generalizability

The study focuses on supervised multimodal learning, and it is unclear whether the findings can be generalized to other learning paradigms or applications.

Limitation in Transferability

The proposed lightweight post-hoc mechanism may not be transferable to other types of VLEs or datasets, requiring further adaptation and fine-tuning.

Expert Commentary

This study makes a significant contribution to the understanding of medical VLEs. The proposed post-hoc mechanism offers a practical, retraining-free way to control the modality gap, a critical property of multimodal representations. The findings underscore that the gap should be tuned rather than eliminated, particularly for datasets with strong domain-specific characteristics, and they raise open questions about the generalizability and transferability of the mechanism that merit further research. Overall, the work is a valuable addition to the field and can inform the development of more accurate and effective multimodal models for medical applications.

Recommendations

  • Further investigation into the generalizability and transferability of the proposed lightweight post-hoc mechanism is necessary to ensure its applicability to diverse VLE architectures and datasets.
  • The study's findings should be replicated and extended to other medical domains and tasks to validate the universality of the modality gap's impact on Medical VLEs.
