Abstract

With the introduction of vision-language foundation models, language guidance has proven powerful in many computer vision applications. However, aligning images and text is not straightforward due to the intrinsic differences between the two modalities. In this project, we develop new approaches to align the vision and language modalities while preserving modality-specific information where applicable.

Publications

Contrastive Difference Alignment Between Vision and Language for Composed Image Retrieval.
Y. Duan, S. Ramasinghe, A. Long, S. Gould, and T. Ajanthan.
Preprint, 2024.

Accept the Modality Gap: An Exploration in the Hyperbolic Space.
S. Ramasinghe, V. Shevchenko, G. Avraham, and T. Ajanthan.
Computer Vision and Pattern Recognition (CVPR), June 2024. (highlight)
[pdf] [talk] [code] [bib]

Modality-Aware Adaptation of Contrastive Language-Image Models.
A. Long, T. Ajanthan, and A. van den Hengel.
ICLR Workshop: Mathematical and Empirical Understanding of Foundation Models, May 2023.
[pdf] [bib]