Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling, which enhances the diversity of generated hypotheses, and (2) iterative refinement scaling, which stabilizes generation. Conventional TTS approaches typically perform a linear search across these two dimensions, incurring a substantial computational cost of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image-understanding capabilities to assess text-image alignment, eliminating the need for an external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (Lumina-DiMOO, MMaDA, and Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.
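To make the complexity claim concrete, the contrast between linear search and an expand-and-prune hierarchical search can be sketched as below. This is a minimal toy sketch under stated assumptions, not the authors' implementation: `score` is a hypothetical stand-in for the paper's self-verified feedback (in the real system the dMLLM itself judges text-image alignment), and each "denoising step" is simulated by appending a random number.

```python
import random

def score(traj):
    """Hypothetical stand-in for self-verified feedback: in the real system
    the dMLLM scores text-image alignment of a partial generation."""
    return sum(traj) / (len(traj) or 1)

def hierarchical_search(n_init=8, total_steps=32):
    """Toy expand-and-prune schedule: start with n_init partial trajectories,
    halve the survivor set at each checkpoint, and spend the remaining
    refinement budget only on the final survivor. Total denoising work then
    grows roughly like O(N + T), rather than the O(N * T) of running all
    N trajectories for all T steps."""
    trajectories = [[] for _ in range(n_init)]
    step = 0
    while len(trajectories) > 1:
        # advance every surviving trajectory by one refinement step
        for traj in trajectories:
            traj.append(random.random())  # stand-in for one denoising step
        step += 1
        # prune: keep the best-scoring half (self-verified feedback)
        trajectories.sort(key=score, reverse=True)
        trajectories = trajectories[: max(1, len(trajectories) // 2)]
    best = trajectories[0]
    # refine only the single surviving trajectory for the rest of the budget
    while step < total_steps:
        best.append(random.random())
        step += 1
    return best

result = hierarchical_search()
```

With these toy numbers, the pruning phase costs 8 + 4 + 2 trajectory updates and the remaining budget refines one survivor, for roughly 43 denoising steps instead of the 8 × 32 = 256 a linear best-of-N search would spend.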