Vision language models are blind

Abstract

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, they surprisingly still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks, such as identifying (a) whether two circles overlap; (b) how many times two lines intersect; (c) which letter is being circled in a word; and (d) the number of circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.07% accurate on average. Sonnet-3.5 performs best at 77.84% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information when geometric primitives overlap or are close together.
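To make the task format concrete, below is a minimal, hypothetical Python sketch (using matplotlib) of the kind of two-circle stimulus described above. The function name make_two_circle_image and all radii, distances, and rendering parameters are assumptions chosen for illustration; this is not the actual BlindTest generation code.

# Hypothetical sketch: render a "do two circles overlap?" stimulus.
# All parameters below are illustrative assumptions, not BlindTest's settings.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def make_two_circle_image(distance, radius=1.0, linewidth=2, path="two_circles.png", dpi=100):
    """Draw two circles whose centers are `distance` apart.

    The circles overlap exactly when distance < 2 * radius.
    """
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.add_patch(patches.Circle((-distance / 2, 0), radius, fill=False, linewidth=linewidth))
    ax.add_patch(patches.Circle((distance / 2, 0), radius, fill=False, linewidth=linewidth))
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=dpi)
    plt.close(fig)
    return distance < 2 * radius  # ground-truth answer for the overlap question

# Example: a non-overlapping pair (distance 2.5 > 2 * radius 1.0).
overlaps = make_two_circle_image(distance=2.5)
print("Ground truth overlap:", overlaps)

Varying the distance, radius, line width, and output resolution in such a generator yields images whose ground truth is known by construction, which is what allows exact accuracy measurement on questions like "do the two circles overlap?"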
