GEAR: Guided End-to-End AutoRegression for Image Synthesis

Naive end-to-end VQ-AR training fails for three reasons. GEAR resolves each of them while keeping the tokenizer and generator fully decoupled.

Two read-outs, disjoint updates

The discrete index is decided by the codebook $\mathcal{C}$, but the vector fed to the AR model is read from the AR's own embedding table $\mathbf{E}$ in two ways: a hard branch ($\arg\max$, which is what the AR consumes at inference) and a differentiable soft branch. The end-to-end gradient flows only through the soft branch, so the tokenizer and the AR update on disjoint paths and never rely on the unstable straight-through estimator (STE). The prediction loss never touches the tokenizer, which prevents the codebook collapse that STE induces.

Formulation

Read-out: $\mathbf{u}^{\mathrm{h}}_i=\mathbf{E}_{\arg\max_k \mathbf{A}_{ik}}\quad\;\; \mathbf{u}^{\mathrm{s}}_i=\mathrm{softmax}(\mathbf{A}_i/\tau)\,\mathbf{E}$
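The two read-outs can be illustrated with a minimal standard-library sketch. It assumes that $\mathbf{A}_i$ is a row of real-valued assignment scores between token $i$ and the codebook entries (this section does not define $\mathbf{A}$), and the toy values of `E` and `A_i` are hypothetical. It is not the authors' implementation.

```python
import math

def softmax(logits, tau):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [a / tau for a in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_readout(logits, table):
    """u^h: the row of E selected by the argmax over the assignment scores."""
    k = max(range(len(logits)), key=lambda j: logits[j])
    return list(table[k])

def soft_readout(logits, table, tau):
    """u^s: softmax(A_i / tau) E, a convex combination of the rows of E."""
    weights = softmax(logits, tau)
    dim = len(table[0])
    return [sum(w * row[d] for w, row in zip(weights, table)) for d in range(dim)]

# Hypothetical toy values: 3 codes with 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A_i = [0.2, 1.5, -0.3]  # assumed assignment scores for one token

u_hard = hard_readout(A_i, E)                  # picks row 1 of E
u_soft_warm = soft_readout(A_i, E, tau=1.0)    # blends all three rows
u_soft_cold = soft_readout(A_i, E, tau=0.01)   # nearly identical to u_hard
```

Lowering `tau` concentrates the softmax weights on the winning code, so the soft read-out approaches the hard one while remaining differentiable with respect to the scores.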

Decoupled updates: $\theta_{\mathrm{tok}}\!\leftarrow\!\theta_{\mathrm{tok}}-\eta\nabla(\mathcal{L}_{\mathrm{VQ}}+\lambda\mathcal{L}^{\mathrm{s}}_{\mathrm{align}})$
$\theta_{\mathrm{AR}}\!\leftarrow\!\theta_{\mathrm{AR}}-\eta\nabla(\mathcal{L}_{\mathrm{NTP}}+\lambda\mathcal{L}^{\mathrm{h}}_{\mathrm{align}})$

hard branch → NTP & hard alignment (updates AR only)
soft branch → alignment (updates tokenizer only)
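A short derivation shows why only the soft branch can carry a gradient back to the scores. It assumes $\mathbf{A}_i$ is a row of real-valued scores and writes $p_{ik}$ for the softmax weights; $p_{ik}$ and $\delta_{jk}$ (the Kronecker delta) are notation introduced here, not taken from the source.

```latex
\begin{aligned}
p_{ik} &= \frac{\exp(\mathbf{A}_{ik}/\tau)}{\sum_{j}\exp(\mathbf{A}_{ij}/\tau)},
\qquad
\mathbf{u}^{\mathrm{s}}_i = \sum_{k} p_{ik}\,\mathbf{E}_k \\[4pt]
\frac{\partial p_{ij}}{\partial \mathbf{A}_{ik}}
&= \frac{1}{\tau}\,p_{ij}\,(\delta_{jk}-p_{ik}) \\[4pt]
\frac{\partial \mathbf{u}^{\mathrm{s}}_i}{\partial \mathbf{A}_{ik}}
&= \sum_{j}\mathbf{E}_j\,\frac{\partial p_{ij}}{\partial \mathbf{A}_{ik}}
 = \frac{1}{\tau}\,p_{ik}\,\bigl(\mathbf{E}_k-\mathbf{u}^{\mathrm{s}}_i\bigr) \\[4pt]
\frac{\partial \mathbf{u}^{\mathrm{h}}_i}{\partial \mathbf{A}_{ik}}
&= \mathbf{0}
\quad\text{wherever the } \arg\max \text{ is unique}
\end{aligned}
```

The soft read-out therefore has a well-defined gradient with respect to the scores, scaled by $1/\tau$, whereas the hard read-out is piecewise constant in $\mathbf{A}_i$ and passes nothing back. This is consistent with routing $\mathcal{L}^{\mathrm{s}}_{\mathrm{align}}$ to the tokenizer without a straight-through estimator.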

Conventional vs naive STE vs GEAR pipelines

(a) conventional frozen tokenizer
(b) naive end-to-end via STE (collapses)
(c) GEAR's dual hard/soft read-out

Similarity to DINOv2 (patch)	LlamaGen-REPA	GEAR
CKA (pre-quant)	0.173	0.107 (−0.066)
CKA (post-quant)	0.161	0.100 (−0.061)
CKNNA (pre-quant)	0.119	0.083 (−0.036)
CKNNA (post-quant)	0.105	0.075 (−0.031)

Method	Epochs	Params	gFID↓	gFID↓ (CFG)	IS↑ (CFG)
Latent Diffusion (675M)
SiT-REPA	800	675M	5.90	1.90	297.5
REPA-E	800	675M	1.69	1.12	302.9
Autoregressive (300 epochs)
LlamaGen-REPA-B	300	111M	20.16	6.00	145.0
GEAR-B (Ours)	300	111M	16.96 (−3.20)	4.95 (−1.05)	166.1 (+21.1)
LlamaGen-REPA-L	300	343M	12.70	3.15	208.1
GEAR-L (Ours)	300	343M	8.66 (−4.04)	2.95 (−0.20)	239.8 (+31.7)
LlamaGen-REPA-XL	300	775M	8.20	2.68	232.2
GEAR-XL (Ours)	300	775M	6.76 (−1.44)	2.52 (−0.16)	262.9 (+30.7)
Autoregressive (800 epochs)
LlamaGen-REPA-L	800	343M	10.44	2.92	216.0
GEAR-L (Ours)	800	343M	8.61 (−1.83)	2.72 (−0.20)	240.3 (+24.3)
LlamaGen-REPA-XL	800	775M	7.46	2.57	236.3
GEAR-XL (Ours)	800	775M	6.28 (−1.18)	2.45 (−0.12)	254.3 (+18.0)

Method	Steps	FDD↓	FDD↓ (CFG)	Prec.↑	Rec.↑
JiT-GPIC-1.1B	390k	–	204.0	0.91	0.53
Controlled comparison (same encoder, data, architecture)
LlamaGen-REPA-1.0B	50k	414.8	279.6	0.87	0.46
GEAR-1.0B (Ours)	50k	381.6 (−33.2)	256.9 (−22.7)	0.87	0.50 (+0.04)
LlamaGen-REPA-1.0B	100k	319.2	198.6	0.89	0.61
GEAR-1.0B (Ours)	100k	281.3 (−37.9)	177.4 (−21.2)	0.91 (+0.02)	0.65 (+0.04)
LlamaGen-REPA-1.0B	200k	261.4	153.5	0.90	0.72
GEAR-1.0B (Ours)	200k	230.0 (−31.4)	138.0 (−15.5)	0.91 (+0.01)	0.75 (+0.03)
LlamaGen-REPA-1.0B	390k	228.9	127.9	0.92	0.78
GEAR-1.0B (Ours)	390k	200.9 (−28.0)	115.3 (−12.6)	0.92	0.80 (+0.02)

Setting	gFID↓	rFID↓
w/ STE	104.93	59.72
w/o GAN	16.35	5.857
GEAR	10.63	1.640

Type	gFID↓	rFID↓
VQVAE	10.63	1.640
LFQ	14.78	2.129
IBQ	12.97	1.716


Guided end-to-end

Alignment flips to the AR

Faster & better

General & drop-in

Method: Guided End-to-End Training

Frozen tokenizer mismatch

Discrete bottleneck

STE collapse

Two read-outs, disjoint updates

Formulation

Key Insights

Tokenizer becomes less semantic, not more

Semantics and locality shift to the AR

Codebook sharpens without collapsing

The "predictable grid" transfers to text-to-image

Quantitative Evaluation

Class-conditional ImageNet 256×256

Text-to-Image Generation (GPIC)

Ablation Studies

Components

Quantizer

Temperature τ

Coefficient λ

Alignment Depth

Target Encoder

Model Size

Tokenizer Init.

Discussion

Reconstruction ceiling

Compression vs. compute

Toward unified generation

Citation

Encoder	gFID↓	rFID↓
DINOv2	16.84	1.721
DINOv3	18.97	1.696
SigLIPv2	19.23	1.803
V-JEPA 2.1	19.53	1.879

τ	gFID↓	rFID↓
0.50	9.15	1.698
0.10	10.63	1.640
0.05	10.75	1.638
0.01	11.42	1.628

λ	gFID↓	rFID↓
0.25	12.01	1.653
0.50	10.63	1.640
0.75	10.82	1.647
1.00	10.89	1.715

Layer	gFID↓	rFID↓
6th	16.96	1.714
8th	16.84	1.721
10th	17.99	1.714

Size	gFID↓	rFID↓
B (111M)	21.52	1.658
L (343M)	10.63	1.640
XL (775M)	7.69	1.624