VLM Comparative Benchmark Visualizer

Select a dataset to load its evaluation samples. For each sample, the interface shows how four different VLMs handled the same question or task.

1. Select a Dataset
2. Select a Sample / Episode Step

GPT-4o

OpenAI o1

Gemini 2.5 Pro

Qwen 2.5 VL
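
One way to back a four-panel view like the one above is to group per-model evaluation records by sample, so that a single selection yields one response per model. The sketch below is illustrative only: the record fields (`sample_id`, `model`, `response`), the helper names `group_by_sample` and `panels_for`, and the sample data are all assumptions, not the visualizer's actual code or data format.

```python
from collections import defaultdict

# The four model names shown as panel titles in the interface.
MODELS = ["GPT-4o", "OpenAI o1", "Gemini 2.5 Pro", "Qwen 2.5 VL"]


def group_by_sample(records):
    """Map each sample id to {model name: response}.

    Assumes each record is a dict with hypothetical keys
    'sample_id', 'model', and 'response'.
    """
    grouped = defaultdict(dict)
    for rec in records:
        grouped[rec["sample_id"]][rec["model"]] = rec["response"]
    return dict(grouped)


def panels_for(grouped, sample_id, placeholder="(no response)"):
    """Return (model, response) pairs in fixed panel order for one sample.

    Models with no record for this sample get the placeholder text.
    """
    responses = grouped.get(sample_id, {})
    return [(model, responses.get(model, placeholder)) for model in MODELS]


# Hypothetical records; real field names and responses will differ.
records = [
    {"sample_id": "s1", "model": "GPT-4o", "response": "A"},
    {"sample_id": "s1", "model": "Qwen 2.5 VL", "response": "B"},
]

grouped = group_by_sample(records)
for model, response in panels_for(grouped, "s1"):
    print(f"{model}: {response}")
```

Keeping the panel order in one fixed list means every sample is rendered with the models in the same positions, even when a model has no result for that sample.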