Select a dataset to load its evaluation samples. The interface then shows how four different vision-language models (VLMs) handled the same question or task.