The landscape of high-performance image generation models is currently dominated by proprietary systems such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0, and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, which offers both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models and thereby demonstrating that state-of-the-art quality is achievable with significantly reduced computational overhead. We publicly release our code, weights, and an online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.