GEAR: Guided End-to-End AutoRegression for Image Synthesis

Peking University  ·  Tencent Hunyuan
* Work done during internship at Tencent Hunyuan  ·  Corresponding author
GEAR converges faster and improves quality at every scale
GEAR converges up to 10× faster than LlamaGen-REPA on ImageNet and improves gFID at every model size, while the naive straight-through end-to-end variant diverges (gFID≈105).

Method: Guided End-to-End Training

Naive end-to-end VQ-AR training fails for three reasons. GEAR resolves each of them while keeping the tokenizer and generator fully decoupled.

The problem
1. Frozen tokenizer mismatch

The tokenizer is optimized for reconstruction only; it never considers whether the tokens it produces are easy to predict next-token.

2. Discrete bottleneck

Token indices come from an argmax, which is not differentiable, so gradients from the AR cannot directly reach the tokenizer.

3. STE collapse

Naive straight-through (STE) end-to-end training collapses codebook usage and severely hurts both generation and reconstruction quality.

Our solution

Two read-outs, disjoint updates

The discrete index is decided by the codebook $\mathcal{C}$, but the vector fed to the AR is read from the AR's own embedding table $\mathbf{E}$ in two ways: a hard branch ($\arg\max$, which is what the AR consumes at inference) and a differentiable soft branch. The end-to-end gradient reaches the tokenizer only through the soft branch, so the tokenizer and the AR update on disjoint paths and never through the unstable straight-through estimator. The next-token prediction loss never touches the tokenizer, which is what prevents the codebook collapse that STE induces.

Formulation

Read-out: $\mathbf{u}^{\mathrm{h}}_i=\mathbf{E}_{\arg\max_k \mathbf{A}_{ik}}\quad\;\; \mathbf{u}^{\mathrm{s}}_i=\mathrm{softmax}(\mathbf{A}_i/\tau)\,\mathbf{E}$
Decoupled updates: $\theta_{\mathrm{tok}}\!\leftarrow\!\theta_{\mathrm{tok}}-\eta\nabla(\mathcal{L}_{\mathrm{VQ}}+\lambda\mathcal{L}^{\mathrm{s}}_{\mathrm{align}})$
$\theta_{\mathrm{AR}}\!\leftarrow\!\theta_{\mathrm{AR}}-\eta\nabla(\mathcal{L}_{\mathrm{NTP}}+\lambda\mathcal{L}^{\mathrm{h}}_{\mathrm{align}})$

hard branch → NTP & hard alignment (updates AR only)
soft branch → alignment (updates tokenizer only)
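
The two read-outs in the formulation above can be sketched in plain Python. This is only an illustration of the forward computation, under assumptions the page does not state: $\mathbf{A}_i$ is taken to be a vector of assignment scores for position $i$ over the codes, $\mathbf{E}$ is a small list-of-lists embedding table, and the function names and toy values are hypothetical. Gradient routing (which branch updates which parameters) would need an autodiff framework and is not shown.

```python
import math

def softmax(scores, tau):
    """Temperature-scaled softmax over a list of scores."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_readout(scores, table):
    """u^h_i = E[argmax_k A_ik]: a single row of the embedding table."""
    k = max(range(len(scores)), key=lambda j: scores[j])
    return list(table[k])

def soft_readout(scores, table, tau):
    """u^s_i = softmax(A_i / tau) E: a convex combination of table rows."""
    weights = softmax(scores, tau)
    dim = len(table[0])
    return [sum(w * row[d] for w, row in zip(weights, table)) for d in range(dim)]

# Hypothetical toy values: 3 codes with 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
A_i = [0.2, 0.9, 0.1]

u_hard = hard_readout(A_i, E)                  # row of the best-scoring code
u_soft = soft_readout(A_i, E, tau=0.10)        # sharp: close to the hard read-out
u_soft_warm = soft_readout(A_i, E, tau=0.50)   # smoother mixture of rows
```

In this toy setting, lowering $\tau$ moves the soft read-out toward the hard one, which is the trade-off the temperature ablation probes.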

Figure: Conventional vs. naive STE vs. GEAR pipelines. (a) Conventional frozen tokenizer; (b) naive end-to-end via STE (collapses); (c) GEAR's dual hard/soft read-out.

Key Insights

GEAR's behavior is the opposite of diffusion-side recipes, and that is exactly why it works.

Finding 1

Tokenizer becomes less semantic, not more

Unlike REPA-E / VA-VAE / MAETok, which push the continuous latent to be more DINOv2-like, GEAR's tokenizer ends up less DINOv2-like, most sharply at the patch level. Reconstruction is preserved, so this is a re-organization rather than a loss of information.

| Similarity to DINOv2 (patch) | LlamaGen-REPA | GEAR |
|---|---|---|
| CKA (pre-quant) | 0.173 | 0.107 (−0.066) |
| CKA (post-quant) | 0.161 | 0.100 (−0.061) |
| CKNNA (pre-quant) | 0.119 | 0.083 (−0.036) |
| CKNNA (post-quant) | 0.105 | 0.075 (−0.031) |
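
For readers unfamiliar with the similarity score in the table, here is a minimal sketch of linear CKA between two feature matrices whose rows are the same samples (for example, the same patches). The page does not say which CKA variant was used or how patch features were gathered, so the linear form, the function names, and the toy matrices below are all assumptions; CKNNA is not covered.

```python
import math

def center_columns(X):
    """Subtract each column's mean so every feature is zero-mean."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def cross_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-aligned matrices X and Y."""
    total = 0.0
    for a in range(len(X[0])):
        for b in range(len(Y[0])):
            dot = sum(X[i][a] * Y[i][b] for i in range(len(X)))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    Xc, Yc = center_columns(X), center_columns(Y)
    num = cross_sq_norm(Xc, Yc)
    den = math.sqrt(cross_sq_norm(Xc, Xc) * cross_sq_norm(Yc, Yc))
    return num / den

# Hypothetical toy features: 4 samples each.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
X_scaled = [[2.0 * v for v in row] for row in X]   # same geometry, rescaled
Z = [[1.0], [-1.0], [1.0], [-1.0]]                 # uncorrelated with X's columns

cka_same = linear_cka(X, X_scaled)
cka_unrelated = linear_cka(X, Z)
```

Linear CKA is invariant to isotropic rescaling, so the rescaled copy scores as fully similar, while features uncorrelated with every column of `X` score zero.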
Finding 2

Semantics and locality shift to the AR

The alignment burden moves from the tokenizer to the AR. GEAR's AR hidden states track DINOv2 far more closely at the patch level, with stronger locally-coherent, spatially-causal structure that is exactly what makes next-token prediction easier.

Figure: AR per-patch alignment to DINOv2 across depth. GEAR (red) vs. LlamaGen-REPA (blue): stronger patch-level DINOv2 alignment at all depths.

Finding 3

Codebook sharpens without collapsing

End-to-end training reshapes which codes the tokenizer uses, moving toward lower entropy and a smaller effective codebook size, which yields an easier next-token target. The alignment signal prevents the few-code collapse that raw NTP pressure alone would cause.

Figure: Codebook usage sharpens but does not collapse. Usage entropy and effective codebook size fall, peak near 30k steps, then converge above the collapse level.
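
The two usage statistics named above can be computed from a batch of token indices as follows. This is a minimal sketch: the page does not define "effective size", so it is assumed here to be the exponential of the usage entropy (the perplexity of the code distribution), and `usage_stats` with its field names is a hypothetical helper.

```python
import math
from collections import Counter

def usage_stats(indices, codebook_size):
    """Entropy and effective size of the empirical code-usage distribution."""
    counts = Counter(indices)
    n = len(indices)
    # Shannon entropy in nats over the codes that actually occur.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return {
        "entropy_nats": entropy,
        "effective_size": math.exp(entropy),       # assumed definition: perplexity
        "used_fraction": len(counts) / codebook_size,
    }

# Hypothetical toy batches over an 8-code codebook.
uniform = usage_stats([0, 1, 2, 3] * 5, codebook_size=8)   # four codes used equally
collapsed = usage_stats([7] * 20, codebook_size=8)         # a single code dominates
```

Under this definition, sharpening shows up as a falling effective size, and full collapse corresponds to an effective size of one code.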

Finding 4

The "predictable grid" transfers to text-to-image

Freeze GEAR's end-to-end-trained tokenizer and train a fresh text-to-image AR on 100M images (GPIC), changing only the tokenizer relative to the baseline. The new AR reaches the baseline's NTP loss 2.5× faster and its REPA alignment loss 11.1× faster. The token structure learned on ImageNet transfers to a completely different task.

Figure: GPIC (T2I) training dynamics. A fresh AR on GEAR's frozen tokenizer (red) converges far faster on both objectives.
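
One plausible way to read an "N× faster" figure is as a ratio of training steps needed to reach the same loss. The sketch below assumes that definition, which the page does not spell out; the helper names and both loss curves are made up for illustration and are not the reported GPIC data.

```python
def first_step_at_or_below(curve, target):
    """Return the first logged step whose loss is <= target, or None."""
    for step, loss in curve:
        if loss <= target:
            return step
    return None

def speedup_to_match(baseline_curve, new_curve):
    """Steps the baseline needs for its final loss, divided by the steps
    the new run needs to reach that same loss (None if it never does)."""
    final_step, final_loss = baseline_curve[-1]
    hit = first_step_at_or_below(new_curve, final_loss)
    return None if hit is None else final_step / hit

# Hypothetical (step, loss) logs, NOT the numbers behind the figure.
baseline = [(100, 3.0), (200, 2.5), (400, 2.0)]
candidate = [(100, 2.4), (160, 2.0), (400, 1.8)]

ratio = speedup_to_match(baseline, candidate)
```

With logged curves the result is only as fine-grained as the logging interval, so interpolating between logged steps may be preferable in practice.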

Quantitative Evaluation

Controlled comparisons against LlamaGen-REPA at matched parameters and budget.

Class-conditional ImageNet 256×256

Table: Class-conditional ImageNet 256×256 generation results comparing GEAR with LlamaGen-REPA and latent-diffusion baselines.

| Method | Epochs | Params | gFID↓ | gFID↓ (CFG) | IS↑ (CFG) |
|---|---|---|---|---|---|
| Latent Diffusion (675M) | | | | | |
| SiT-REPA | 800 | 675M | 5.90 | 1.90 | 297.5 |
| REPA-E | 800 | 675M | 1.69 | 1.12 | 302.9 |
| Autoregressive (300 epochs) | | | | | |
| LlamaGen-REPA-B | 300 | 111M | 20.16 | 6.00 | 145.0 |
| GEAR-B (Ours) | 300 | 111M | 16.96 (−3.20) | 4.95 (−1.05) | 166.1 (+21.1) |
| LlamaGen-REPA-L | 300 | 343M | 12.70 | 3.15 | 208.1 |
| GEAR-L (Ours) | 300 | 343M | 8.66 (−4.04) | 2.95 (−0.20) | 239.8 (+31.7) |
| LlamaGen-REPA-XL | 300 | 775M | 8.20 | 2.68 | 232.2 |
| GEAR-XL (Ours) | 300 | 775M | 6.76 (−1.44) | 2.52 (−0.16) | 262.9 (+30.7) |
| Autoregressive (800 epochs) | | | | | |
| LlamaGen-REPA-L | 800 | 343M | 10.44 | 2.92 | 216.0 |
| GEAR-L (Ours) | 800 | 343M | 8.61 (−1.83) | 2.72 (−0.20) | 240.3 (+24.3) |
| LlamaGen-REPA-XL | 800 | 775M | 7.46 | 2.57 | 236.3 |
| GEAR-XL (Ours) | 800 | 775M | 6.28 (−1.18) | 2.45 (−0.12) | 254.3 (+18.0) |

The CFG scale is 1.5 for all CFG columns. GEAR rows differ from the baseline only in the (end-to-end-tuned) frozen tokenizer.
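
For context on what a guidance scale means in an autoregressive sampler, the sketch below shows one common form of classifier-free guidance applied to next-token logits. The page reports only the scale value, so the exact formula is an assumption; some implementations use a `1 + w` parameterization instead, under which the same number would mean something different. `cfg_logits` and the toy logits are hypothetical.

```python
def cfg_logits(cond, uncond, scale):
    """Combine conditional and unconditional logits.

    Assumed form: uncond + scale * (cond - uncond), so scale = 1.0
    returns the conditional logits and scale > 1.0 extrapolates past them.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Hypothetical logits over a 2-token vocabulary.
cond = [2.0, 0.0]
uncond = [1.0, 0.0]

guided = cfg_logits(cond, uncond, scale=1.5)
unguided = cfg_logits(cond, uncond, scale=1.0)
```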

Text-to-Image Generation (GPIC)

Table: Text-to-image generation results on GPIC comparing GEAR with LlamaGen-REPA.

Reference (outside the controlled comparison): JiT-GPIC-1.1B at 390k steps reports FDD 204.0, Prec. 0.91, Rec. 0.53.

Controlled comparison (same encoder, data, architecture):

| Method | Steps | FDD↓ | FDD↓ (CFG) | Prec.↑ | Rec.↑ |
|---|---|---|---|---|---|
| LlamaGen-REPA-1.0B | 50k | 414.8 | 279.6 | 0.87 | 0.46 |
| GEAR-1.0B (Ours) | 50k | 381.6 (−33.2) | 256.9 (−22.7) | 0.87 | 0.50 (+0.04) |
| LlamaGen-REPA-1.0B | 100k | 319.2 | 198.6 | 0.89 | 0.61 |
| GEAR-1.0B (Ours) | 100k | 281.3 (−37.9) | 177.4 (−21.2) | 0.91 (+0.02) | 0.65 (+0.04) |
| LlamaGen-REPA-1.0B | 200k | 261.4 | 153.5 | 0.90 | 0.72 |
| GEAR-1.0B (Ours) | 200k | 230.0 (−31.4) | 138.0 (−15.5) | 0.91 (+0.01) | 0.75 (+0.03) |
| LlamaGen-REPA-1.0B | 390k | 228.9 | 127.9 | 0.92 | 0.78 |
| GEAR-1.0B (Ours) | 390k | 200.9 (−28.0) | 115.3 (−12.6) | 0.92 | 0.80 (+0.02) |

Both models use the same Qwen3-1.7B encoder and the same 100M-image GPIC corpus, and differ only in the frozen tokenizer. The CFG scale is 1.75. FDD = Fréchet distance in DINOv2 feature space.
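
As a reminder of what such a metric computes, and assuming FDD follows the usual Fréchet distance between Gaussians fitted to features (the same form FID uses with Inception features), the quantity would be as below. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of DINOv2 features of real and generated images; these symbols are introduced for illustration and do not appear on the page.

$$
\mathrm{FDD} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

Lower is better, and the value is zero only when both fitted Gaussians coincide.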

Ablation Studies

Every design knob is swept. The default configuration corresponds to the rows that repeat the full GEAR result (gFID 10.63, rFID 1.640).

Components

| Setting | gFID↓ | rFID↓ |
|---|---|---|
| w/ STE | 104.93 | 59.72 |
| w/o GAN | 16.35 | 5.857 |
| GEAR | 10.63 | 1.640 |

Quantizer

| Type | gFID↓ | rFID↓ |
|---|---|---|
| VQVAE | 10.63 | 1.640 |
| LFQ | 14.78 | 2.129 |
| IBQ | 12.97 | 1.716 |

Temperature τ

| τ | gFID↓ | rFID↓ |
|---|---|---|
| 0.50 | 9.15 | 1.698 |
| 0.10 | 10.63 | 1.640 |
| 0.05 | 10.75 | 1.638 |
| 0.01 | 11.42 | 1.628 |

Coefficient λ

| λ | gFID↓ | rFID↓ |
|---|---|---|
| 0.25 | 12.01 | 1.653 |
| 0.50 | 10.63 | 1.640 |
| 0.75 | 10.82 | 1.647 |
| 1.00 | 10.89 | 1.715 |

Alignment Depth

| Layer | gFID↓ | rFID↓ |
|---|---|---|
| 6th | 16.96 | 1.714 |
| 8th | 16.84 | 1.721 |
| 10th | 17.99 | 1.714 |

Target Encoder

| Encoder | gFID↓ | rFID↓ |
|---|---|---|
| DINOv2 | 16.84 | 1.721 |
| DINOv3 | 18.97 | 1.696 |
| SigLIPv2 | 19.23 | 1.803 |
| V-JEPA 2.1 | 19.53 | 1.879 |

Model Size

| Size | gFID↓ | rFID↓ |
|---|---|---|
| B (111M) | 21.52 | 1.658 |
| L (343M) | 10.63 | 1.640 |
| XL (775M) | 7.69 | 1.624 |

Tokenizer Init.

| Setting | gFID↓ | rFID↓ |
|---|---|---|
| w/o init. | 13.44 | 2.256 |
| w/ init. | 10.63 | 1.640 |

Discussion

Where GEAR stands today, the architectural bottleneck behind it, and why the discrete route is still worth it.

🧱

Reconstruction ceiling

GEAR's reconstruction quality (rFID 1.64) caps its generation quality (gFID 2.52), and it still trails the continuous-VAE REPA-E (rFID 0.28, gFID 1.12). Closing the discrete-vs-continuous reconstruction gap is the single largest lever for further gains.

⚙️

Compression vs. compute

In VQ-AR every token is one decoding step, which ties the down-sampling rate to sequence length. Borrowing the diffusion side's decoupling, namely a milder tokenizer plus AR-side grouping (patchified or multi-token prediction), could raise the ceiling without inflating the sequence.
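
The coupling described above is simple arithmetic, sketched here under illustrative assumptions: a square image, a tokenizer with a spatial down-sampling factor, and an AR that emits a fixed group of tokens per step. The factors 16 and 8 and the group size 4 are hypothetical choices for the example, not settings reported on this page.

```python
import math

def ar_steps(image_size, downsample, group=1):
    """Number of decoding steps for a square image.

    The token grid has (image_size // downsample) tokens per side;
    `group` is how many tokens the AR emits per step (1 = plain next-token).
    """
    side = image_size // downsample
    tokens = side * side
    return math.ceil(tokens / group)

# Hypothetical settings for a 256x256 image.
plain_16x = ar_steps(256, downsample=16)             # aggressive tokenizer, 1 token/step
plain_8x = ar_steps(256, downsample=8)               # milder tokenizer, 4x more steps
grouped_8x = ar_steps(256, downsample=8, group=4)    # milder tokenizer, 4 tokens/step
```

In this example the milder tokenizer with four-token grouping needs the same number of decoding steps as the aggressive tokenizer, which is the kind of decoupling the paragraph proposes.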

🚀

Toward unified generation

The discrete next-token form is uniform across modalities, bounds per-step error over long contexts, and inherits the mature LLM alignment stack (RLHF, DPO, GRPO), making it a promising substrate for unified understanding-and-generation systems.

Citation

@article{gear2026,
  title   = {GEAR: Guided End-to-End AutoRegression for Image Synthesis},
  author  = {Lin, Bin and Liu, Zheyuan and Lin, Chenguo and Chen, Sixiang and
             Ge, Yunyang and Lin, Yunlong and Zhang, Jianwei and Yang, Miles and
             Zhong, Zhao and Bo, Liefeng and Yuan, Li},
  year    = {2026}
}
@article{ifsq_llamagenrepa,
  title   = {iFSQ: Improving FSQ for Image Generation with 1 Line of Code},
  author  = {Lin, Bin and Li, Zongjian and Niu, Yuwei and Gong, Kaixiong and
             Ge, Yunyang and Lin, Yunlong and Zheng, Mingzhe and Zhang, JianWei and
             Yang, Miles and Zhong, Zhao and others},
  journal = {arXiv preprint arXiv:2601.17124},
  year    = {2026}
}