Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
Results
Our simple methods consistently boost performance across diverse benchmarks covering various task types: visual question answering, multiple-choice questions, yes/no questions, and image captioning.
Comparison with Other Test-Time Scaling Methods
Our methods are significantly more efficient than existing test-time scaling approaches that rely on parallel sampling, thanks to two key design choices: token-level aggregation and augmentation-based diversity. Existing test-time scaling methods instead employ answer-level aggregation, with temperature sampling to induce diversity:
① Self-Consistency aggregates candidate answers via majority voting across multiple sampled outputs. While effective for tasks whose final answers can be parsed, it struggles in creative or open-ended settings.
② Self-Selector uses the VLM itself as a verifier to select one response among the candidates, extending applicability beyond tasks suited to majority voting.
③ Sample-and-Rank. Self-Consistency ignores the model's internal signals during selection: majority voting treats all reasoning traces equally, regardless of their quality. Sample-and-Rank instead assesses response quality from next-token distribution statistics, selecting the response with the highest log probability.
④ Self-Synthesizer. Selecting only one answer, as the previous strategies do, discards information contained in the other responses. To combine potentially correct parts from different responses, this method uses the tested VLM to synthesize the candidates into one coherent final answer.
⑤ Token-Level Aggregation with Simple Averaging (Our TTAug Method). Instead of aggregating complete answers, our approach averages the final logits across augmented views at every decoding step, providing more granular control and greater computational efficiency.
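To make the contrast concrete, here is a minimal sketch of one greedy decoding step under token-level aggregation. This is our own illustration, not the paper's implementation, and the function and variable names are ours: the next-token logits from K augmented views are averaged before the argmax is taken.

```python
import numpy as np

def ttaug_decode_step(per_view_logits: np.ndarray) -> int:
    """One greedy decoding step with token-level aggregation.

    per_view_logits: shape (K, V) -- next-token logits produced by
    the model for each of K augmented views of the same input,
    where V is the vocabulary size.
    """
    # Average the final logits across views, then pick the token
    # with the highest aggregated score.
    aggregated = per_view_logits.mean(axis=0)
    return int(np.argmax(aggregated))

# Toy example (K=3 views, 4-token vocabulary): views 1 and 3 narrowly
# favor token 0, view 2 strongly favors token 1.
logits = np.array([
    [2.0, 1.5, 0.1, 0.0],
    [1.0, 2.2, 0.3, 0.1],
    [1.9, 1.8, 0.2, 0.0],
])
print(ttaug_decode_step(logits))  # -> 1
```

A majority vote over the per-view argmaxes would return token 0 here (two views out of three); token-level averaging instead weighs how confident each view is, not just which token it ranks first.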
Augmentation-based diversity beats temperature sampling.
Our augmentation strategy induces better diversity than temperature-based parallel sampling.
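As a hedged illustration of what such augmentation could look like (the paper's exact transforms may differ; `make_views` and its parameters are our own invention), the sketch below builds k input views from one image-prompt pair using simple label-preserving perturbations:

```python
import numpy as np

def make_views(image: np.ndarray, prompt: str, k: int = 4, seed: int = 0):
    """Build k (image, prompt) views via simple label-preserving
    augmentations: horizontal flips, mild brightness jitter, and
    light prompt rephrasings. Illustrative only."""
    rng = np.random.default_rng(seed)
    rephrasings = [prompt, "Question: " + prompt, prompt + " Answer briefly."]
    views = [(image, prompt)]  # always keep the unmodified view
    while len(views) < k:
        img = image.astype(float).copy()
        if rng.random() < 0.5:
            img = img[:, ::-1]  # horizontal flip
        # mild brightness jitter, clipped to the valid pixel range
        img = np.clip(img * (1.0 + rng.uniform(-0.1, 0.1)), 0, 255)
        views.append((img, rephrasings[len(views) % len(rephrasings)]))
    return views

views = make_views(np.full((2, 2, 3), 100.0), "What's in front of the window?")
print(len(views))  # -> 4
```

Each view is then fed through the model independently, and the resulting logits are aggregated at the token level as described above.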

Token-level aggregation is superior to answer-level.
By aggregating logits at the token level rather than voting over complete final answers, we achieve better performance.
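The same aggregated distribution also drives TTAdapt: the consensus over augmented views becomes a pseudolabel, and the model takes a gradient step toward it. Below is a toy numpy sketch under strong simplifying assumptions (a single linear layer stands in for the VLM; `ttadapt_step` and all names are ours, not the paper's API):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ttadapt_step(W, x_views, lr=0.1):
    """One TTAdapt-style update on a toy linear 'model' z = W @ x.

    The pseudolabel is the consensus (argmax of the view-averaged
    distribution), and we take one cross-entropy gradient step
    toward it for each view.
    """
    probs = np.mean([softmax(W @ x) for x in x_views], axis=0)
    pseudo = int(np.argmax(probs))  # consensus pseudolabel
    for x in x_views:
        p = softmax(W @ x)
        grad = np.outer(p - np.eye(len(p))[pseudo], x)  # dCE/dW
        W = W - lr * grad
    return W, pseudo

W0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
views = [np.array([2.0, 0.0]), np.array([1.8, 0.2])]
W1, pseudo = ttadapt_step(W0, views)
print(pseudo)  # -> 0  (consensus pseudolabel)
```

After the update, the model assigns higher probability to the consensus class on each view; the real method applies the same idea to the VLM's parameters during inference.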

Scaling Behavior
Average performance improves monotonically as the number of augmented samples grows, then saturates.
Cross-Model Generalization
Despite hyperparameters being optimized for SmolVLM2-2.2B, our methods provide consistent improvements across diverse models, though transferability varies by model family and size. Even with suboptimal hyperparameters, our methods yield robust improvements; dedicated tuning is still recommended for best results.
Comprehensive Analyses
We also provide comprehensive ablation studies exploring optimal aggregation strategies, layer selection for aggregation, augmentation techniques, adaptation objectives, and the balance between computational efficiency and performance gains.
Qualitative Results
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
Which country had the most visitors to Italy in 2018?
Baseline Output
Answer: France
Accuracy: 0.0%
TTAug Output
Answer: Germany
Accuracy: 100.0%
TTAdapt Output
Answer: Germany
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
what is the total amount of this receipt? Answer this question using the text in the image directly.
Baseline Output
Answer: 100.00
Accuracy: 0.0%
TTAug Output
Answer: 71.10
Accuracy: 100.0%
TTAdapt Output
Answer: 71.10
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
Who is the author of this book?
Baseline Output
Answer: Brushy.
Accuracy: 0.0%
TTAug Output
Answer: Brush Dance.
Accuracy: 100.0%
TTAdapt Output
Answer: Brush Dance.
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
What's in front of the window?
Baseline Output
Answer: Blinds.
Accuracy: 0.0%
TTAug Output
Answer: Desk.
Accuracy: 100.0%
TTAdapt Output
Answer: Desk.
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
which of these books was recently adapted by netflix?
Baseline Output
Answer: broken angels
Accuracy: 0.0%
TTAug Output
Answer: altered carbon
Accuracy: 100.0%
TTAdapt Output
Answer: altered carbon
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
What would happen if the cricket population decreased? The choices are listed below:
(A) lizards would decrease
(B) eagle would increase
(C) king brown snake would increast
(D) salt bush would decrease
Baseline Output
Answer: C
Accuracy: 0.0%
TTAug Output
Answer: A
Accuracy: 100.0%
TTAdapt Output
Answer: A
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
This image shows the front view of the ego car. What is the future state of the black pants pedestrian in the middle? The choices are listed below:
(A) Turn left.
(B) Stationary.
(C) Keep going straight.
(D) Turn right.
(E) The image does not feature the object.
Baseline Output
Answer: E
Accuracy: 0.0%
TTAug Output
Answer: B
Accuracy: 100.0%
TTAdapt Output
Answer: B
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
Does the pigeon stand in this image?
Baseline Output
Answer: No, the pigeon is perched on top of the coin return machine.
Accuracy: 0.0%
TTAug Output
Answer: Yes
Accuracy: 100.0%
TTAdapt Output
Answer: Yes
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
Please describe this image in general. Directly provide the description, do not include prefix like "This image depicts".
Baseline Output
Answer: The image features a white bird with a long, curved beak and black legs standing on a green bench. The bird appears to be a type of ibis, characterized by its long, slender neck and beak. The bench is made of wood and has a slatted design, with the bird perched on the top slat. The background includes a paved area with a few scattered leaves and some greenery, suggesting an outdoor setting, possibly a park or a zoo. The bird is facing to the right, and its body is oriented towards the left side of the image.
Accuracy: ROUGE-L: 15.59%, BLEU-1: 12.24%, BLEU-2: 7.94%, BLEU-3: 5.08%, BLEU-4: 3.43%, CIDEr: 0.00
TTAug Output
Answer: A white bird with a long curved beak is standing on a green bench. The bird is looking to the left.
Accuracy: ROUGE-L: 45.02%, BLEU-1: 52.38%, BLEU-2: 36.19%, BLEU-3: 27.45%, BLEU-4: 21.89%, CIDEr: 12.62
TTAdapt Output
Answer: A white bird with a long beak and black legs is standing on a green bench.
Accuracy: ROUGE-L: 53.20%, BLEU-1: 62.50%, BLEU-2: 40.82%, BLEU-3: 32.93%, BLEU-4: 27.23%, CIDEr: 60.98
BibTeX Citation
@article{Kaya2025EfficientTTS,
  title={Efficient Test-Time Scaling for Small Vision-Language Models},
  author={Mehmet Onurcan Kaya and Desmond Elliott and Dim P. Papadopoulos},
  journal={arXiv preprint arXiv:2510.03574},
  year={2025},
  url={https://monurcan.github.io/efficient_test_time_scaling}
}