Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
Results
Our simple methods consistently boost performance across diverse benchmarks covering various task types: visual question answering, multiple-choice questions, yes/no questions, and image captioning.
Comparison with Other Test-Time Scaling Methods
Our methods are significantly more efficient than existing test-time scaling approaches that rely on parallel sampling, thanks to two key design choices: token-level aggregation and augmentation-based diversity. Existing test-time scaling methods instead employ answer-level aggregation, with temperature sampling to induce diversity:
① Self-Consistency aggregates candidate answers via majority voting across multiple sampled outputs. While effective for tasks whose final answers can be parsed, it struggles in creative or open-ended settings.
② Self-Selector uses the VLM itself as a verifier to select one response among the candidates, extending applicability beyond tasks suited to majority voting.
③ Sample-and-Rank. Self-Consistency ignores the model's internal signals during selection: majority voting treats all reasoning traces equally, regardless of their quality. Sample-and-Rank instead assesses response quality from next-token distribution statistics, selecting the response with the highest log probability.
④ Self-Synthesizer. Selecting only one answer, as the previous strategies do, discards information contained in the other responses. To combine potentially correct parts from different responses, this method uses the tested VLM to synthesize the candidates into one coherent final answer.
⑤ Token-Level Aggregation with Simple Averaging (Our TTAug Method). Instead of aggregating complete answers, our approach averages the final logits across augmented views at every decoding step, providing more granular control and greater computational efficiency.
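To make the contrast concrete, here is a minimal sketch of one greedy decoding step under token-level aggregation. This is our own illustration, not the paper's implementation, and the function and variable names are ours: the next-token logits from K augmented views are averaged before the argmax is taken.

```python
import numpy as np

def ttaug_decode_step(per_view_logits: np.ndarray) -> int:
    """One greedy decoding step with token-level aggregation.

    per_view_logits: shape (K, V) -- next-token logits produced by
    the model for each of K augmented views of the same input,
    where V is the vocabulary size.
    """
    # Average the final logits across views, then pick the token
    # with the highest aggregated score.
    aggregated = per_view_logits.mean(axis=0)
    return int(np.argmax(aggregated))

# Toy example (K=3 views, 4-token vocabulary): views 1 and 3 narrowly
# favor token 0, view 2 strongly favors token 1.
logits = np.array([
    [2.0, 1.5, 0.1, 0.0],
    [1.0, 2.2, 0.3, 0.1],
    [1.9, 1.8, 0.2, 0.0],
])
print(ttaug_decode_step(logits))  # -> 1
```

A majority vote over the per-view argmaxes would return token 0 here (two views out of three); token-level averaging instead weighs how confident each view is, not just which token it ranks first.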
Augmentation-based diversity beats temperature sampling.
Our augmentation strategy induces better diversity than temperature-based parallel sampling.
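As a hedged illustration of what such augmentation could look like (the paper's exact transforms may differ; `make_views` and its parameters are our own invention), the sketch below builds k input views from one image-prompt pair using simple label-preserving perturbations:

```python
import numpy as np

def make_views(image: np.ndarray, prompt: str, k: int = 4, seed: int = 0):
    """Build k (image, prompt) views via simple label-preserving
    augmentations: horizontal flips, mild brightness jitter, and
    light prompt rephrasings. Illustrative only."""
    rng = np.random.default_rng(seed)
    rephrasings = [prompt, "Question: " + prompt, prompt + " Answer briefly."]
    views = [(image, prompt)]  # always keep the unmodified view
    while len(views) < k:
        img = image.astype(float).copy()
        if rng.random() < 0.5:
            img = img[:, ::-1]  # horizontal flip
        # mild brightness jitter, clipped to the valid pixel range
        img = np.clip(img * (1.0 + rng.uniform(-0.1, 0.1)), 0, 255)
        views.append((img, rephrasings[len(views) % len(rephrasings)]))
    return views

views = make_views(np.full((2, 2, 3), 100.0), "What's in front of the window?")
print(len(views))  # -> 4
```

Each view is then fed through the model independently, and the resulting logits are aggregated at the token level as described above.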

Token-level aggregation is superior to answer-level.
By aggregating logits at the token level rather than voting over complete final answers, we achieve better performance.
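The same aggregated distribution also drives TTAdapt: the consensus over augmented views becomes a pseudolabel, and the model takes a gradient step toward it. Below is a toy numpy sketch under strong simplifying assumptions (a single linear layer stands in for the VLM; `ttadapt_step` and all names are ours, not the paper's API):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ttadapt_step(W, x_views, lr=0.1):
    """One TTAdapt-style update on a toy linear 'model' z = W @ x.

    The pseudolabel is the consensus (argmax of the view-averaged
    distribution), and we take one cross-entropy gradient step
    toward it for each view.
    """
    probs = np.mean([softmax(W @ x) for x in x_views], axis=0)
    pseudo = int(np.argmax(probs))  # consensus pseudolabel
    for x in x_views:
        p = softmax(W @ x)
        grad = np.outer(p - np.eye(len(p))[pseudo], x)  # dCE/dW
        W = W - lr * grad
    return W, pseudo

W0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
views = [np.array([2.0, 0.0]), np.array([1.8, 0.2])]
W1, pseudo = ttadapt_step(W0, views)
print(pseudo)  # -> 0  (consensus pseudolabel)
```

After the update, the model assigns higher probability to the consensus class on each view; the real method applies the same idea to the VLM's parameters during inference.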

Scaling Behavior
Average performance improves monotonically as the number of augmented samples grows, then saturates.
Cross-Model Generalization
Despite hyperparameters being optimized for SmolVLM2-2.2B, our methods provide consistent improvements across diverse models, though transferability varies by model family and size. Even with suboptimal hyperparameters, our methods yield robust improvements; dedicated tuning is still recommended for best results.
Comprehensive Analyses
We also provide comprehensive ablation studies exploring optimal aggregation strategies, layer selection for aggregation, augmentation techniques, adaptation objectives, and the balance between computational efficiency and performance gains.
Qualitative Results
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
Which country had the most visitors to Italy in 2018?
Baseline Output
Answer: France
Accuracy: 0.0%
TTAug Output
Answer: Germany
Accuracy: 100.0%
TTAdapt Output
Answer: Germany
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
what is the total amount of this receipt? Answer this question using the text in the image directly.
Baseline Output
Answer: 100.00
Accuracy: 0.0%
TTAug Output
Answer: 71.10
Accuracy: 100.0%
TTAdapt Output
Answer: 71.10
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
Who is the author of this book?
Baseline Output
Answer: Brushy.
Accuracy: 0.0%
TTAug Output
Answer: Brush Dance.
Accuracy: 100.0%
TTAdapt Output
Answer: Brush Dance.
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
What's in front of the window?
Baseline Output
Answer: Blinds.
Accuracy: 0.0%
TTAug Output
Answer: Desk.
Accuracy: 100.0%
TTAdapt Output
Answer: Desk.
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
which of these books was recently adapted by netflix?
Baseline Output
Answer: broken angels
Accuracy: 0.0%
TTAug Output
Answer: altered carbon
Accuracy: 100.0%
TTAdapt Output
Answer: altered carbon
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
What would happen if the cricket population decreased? The choices are listed below:
(A) lizards would decrease
(B) eagle would increase
(C) king brown snake would increast
(D) salt bush would decrease
Baseline Output
Answer: C
Accuracy: 0.0%
TTAug Output
Answer: A
Accuracy: 100.0%
TTAdapt Output
Answer: A
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
This image shows the front view of the ego car. What is the future state of the black pants pedestrian in the middle? The choices are listed below:
(A) Turn left.
(B) Stationary.
(C) Keep going straight.
(D) Turn right.
(E) The image does not feature the object.
Baseline Output
Answer: E
Accuracy: 0.0%
TTAug Output
Answer: B
Accuracy: 100.0%
TTAdapt Output
Answer: B
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
Does the pigeon stand in this image?
Baseline Output
Answer: No, the pigeon is perched on top of the coin return machine.
Accuracy: 0.0%
TTAug Output
Answer: Yes
Accuracy: 100.0%
TTAdapt Output
Answer: Yes
Accuracy: 100.0%
[Original input image, augmented input images, and augmented prompts not shown]
Original Input Prompt
Please describe this image in general. Directly provide the description, do not include prefix like "This image depicts".
Baseline Output
Answer: The image features a white bird with a long, curved beak and black legs standing on a green bench. The bird appears to be a type of ibis, characterized by its long, slender neck and beak. The bench is made of wood and has a slatted design, with the bird perched on the top slat. The background includes a paved area with a few scattered leaves and some greenery, suggesting an outdoor setting, possibly a park or a zoo. The bird is facing to the right, and its body is oriented towards the left side of the image.
Accuracy: ROUGE-L: 15.59%, BLEU-1: 12.24%, BLEU-2: 7.94%, BLEU-3: 5.08%, BLEU-4: 3.43%, CIDEr: 0.00
TTAug Output
Answer: A white bird with a long curved beak is standing on a green bench. The bird is looking to the left.
Accuracy: ROUGE-L: 45.02%, BLEU-1: 52.38%, BLEU-2: 36.19%, BLEU-3: 27.45%, BLEU-4: 21.89%, CIDEr: 12.62
TTAdapt Output
Answer: A white bird with a long beak and black legs is standing on a green bench.
Accuracy: ROUGE-L: 53.20%, BLEU-1: 62.50%, BLEU-2: 40.82%, BLEU-3: 32.93%, BLEU-4: 27.23%, CIDEr: 60.98
BibTeX Citation
@article{Kaya2025EfficientTTS,
  title={Efficient Test-Time Scaling for Small Vision-Language Models},
  author={Mehmet Onurcan Kaya and Desmond Elliott and Dim P. Papadopoulos},
  journal={arXiv preprint arXiv:2510.03574},
  year={2025},
  url={https://monurcan.github.io/efficient_test_time_scaling}
}