Recent studies demonstrate that large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, a suite of 7 very simple tasks, including identifying (a) whether two circles overlap; (b) how many times two lines intersect; (c) which letter is being circled in a word; and (d) the number of circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.07% accurate on average. In this work, we investigate the potential reasons behind this failure. We find that VLMs, including slow-thinking models, consistently struggle with tasks that require precise spatial information when geometric primitives overlap or are close together. Yet, VLMs reach near-100% accuracy when more space is added to separate the shapes and letters. Linear probing experiments show that vision encoders contain sufficient visual information to solve BlindTest and that the language models fail to decode this information into correct answers.
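
The following is a minimal sketch of the kind of linear-probing setup referenced above, not the paper's exact protocol: it trains a logistic-regression probe on frozen features from a CLIP ViT encoder (assumed here as a stand-in for a VLM's vision encoder) to predict whether two synthetically rendered circles overlap, a BlindTest-style task. The model name, image generator, dataset size, and probe hyperparameters are illustrative assumptions.

```python
# Linear-probing sketch (illustrative): do frozen vision-encoder features
# suffice to tell whether two circles overlap?
# Assumptions: CLIP ViT-B/32 stands in for the VLM's vision encoder;
# the synthetic images and probe settings are not the paper's exact setup.
import random

import torch
from PIL import Image, ImageDraw
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor


def draw_two_circles(overlap: bool, size: int = 224, r: int = 30) -> Image.Image:
    """Render two circles that either overlap or are clearly separated."""
    img = Image.new("RGB", (size, size), "white")
    d = ImageDraw.Draw(img)
    cx, cy = size // 2, size // 2
    # Center-to-center distance: below 2r -> overlap, above 2r -> separated.
    gap = random.randint(5, r) if overlap else random.randint(2 * r + 10, 2 * r + 40)
    for dx in (-gap // 2, gap // 2):
        d.ellipse([cx + dx - r, cy - r, cx + dx + r, cy + r], outline="black", width=3)
    return img


def main() -> None:
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Small synthetic train/test split: 1 = overlapping, 0 = separated.
    labels = [i % 2 for i in range(400)]
    images = [draw_two_circles(bool(y)) for y in labels]

    with torch.no_grad():
        inputs = processor(images=images, return_tensors="pt")
        feats = model.get_image_features(**inputs).numpy()  # frozen features only

    # Linear probe: logistic regression over the frozen encoder features.
    n_train = 300
    probe = LogisticRegression(max_iter=2000)
    probe.fit(feats[:n_train], labels[:n_train])
    acc = probe.score(feats[n_train:], labels[n_train:])
    print(f"Linear-probe accuracy on held-out circle pairs: {acc:.2%}")


if __name__ == "__main__":
    main()
```

Because the probe sees only frozen encoder outputs, high probe accuracy combined with low end-to-end VLM accuracy localizes the failure to the decoding (language) side rather than to missing visual information, which is the argument the abstract's final sentence makes.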