Frozen tokenizer mismatch
The tokenizer is optimized for reconstruction only, not for next-token predictability.
A soft-assignment bridge lets the autoregressive (AR) model guide the tokenizer, succeeding exactly where the straight-through estimator (STE) collapses.
Opposite to diffusion-side REPA, the tokenizer becomes less DINOv2-like and lower-entropy, while the AR's per-patch features track DINOv2 far more closely.
ImageNet gFID converges about 10× faster. On GPIC text-to-image, a fresh AR reaches the baseline's next-token-prediction (NTP) loss 2.5× faster and its REPA loss 11.1× faster.
Works across VQVAE / LFQ / IBQ and across class-conditional ImageNet and text-to-image. Just freeze the tuned tokenizer and drop it into a standard pipeline.
Naive end-to-end VQ-AR training fails for three reasons. GEAR resolves all of them while keeping the tokenizer and generator fully decoupled.
The tokenizer is optimized for reconstruction only, not for next-token predictability.
Token indices come from an argmax, so gradients from the AR cannot directly reach the tokenizer.
Naive straight-through (STE) end-to-end training collapses codebook usage and severely hurts both generation and reconstruction quality.
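The second failure mode can be seen in a toy setting. The sketch below is only an illustration: the three-entry codebook, the latent `z`, and the helper name `nearest_code` are made up here, not taken from GEAR. It nudges the latent slightly and checks that the selected index does not move, which is why a loss defined on the index alone gives the tokenizer no gradient.

```python
def nearest_code(z, codebook):
    """Return the index of the codebook entry closest to z (squared L2)."""
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)

# Toy 2-D codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z = [0.9, 0.2]

eps = 1e-3
base = nearest_code(z, codebook)

# Finite-difference "gradient" of the selected index w.r.t. each coordinate of z.
finite_diffs = []
for d in range(len(z)):
    z_shift = list(z)
    z_shift[d] += eps
    finite_diffs.append((nearest_code(z_shift, codebook) - base) / eps)

print(base, finite_diffs)  # the index is piecewise constant in z
```

Away from the decision boundaries the index is locally constant, so the finite differences are zero; at a boundary the index jumps, so no useful gradient exists there either.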
The discrete index is decided by the codebook $\mathcal{C}$, but the vector fed to the AR is read from the AR's own embedding table $\mathbf{E}$ in two ways: a hard branch ($\arg\max$, which is what the AR consumes at inference) and a differentiable soft branch. The end-to-end gradient flows only through the soft branch, so the tokenizer and the AR update on disjoint paths and never through the unstable straight-through estimator. The prediction loss never touches the tokenizer, which is what prevents the codebook collapse that STE induces.
hard branch → NTP & hard alignment (updates AR only)
soft branch → alignment (updates tokenizer only)
Figure panels: (a) conventional frozen tokenizer; (b) naive end-to-end via STE (collapses); (c) GEAR's dual hard/soft read-out.
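The forward pass of the two read-outs described above can be sketched as follows. This is a minimal illustration under stated assumptions, not GEAR's implementation: the soft branch is assumed to be a softmax over negative squared distances to the codebook with temperature `tau`, and the names `dual_read_out`, `codebook`, and `embed` (standing in for $\mathcal{C}$ and $\mathbf{E}$) are hypothetical. Gradient routing is not modeled; the comments only restate which loss the text attaches to which branch.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dual_read_out(z, codebook, embed, tau=0.1):
    """Return (index, hard_vec, soft_vec, weights) for one patch latent z."""
    dists = [sq_dist(z, c) for c in codebook]
    # Hard branch: the nearest codebook entry decides the index, and the AR
    # input is that row of E. Per the text: NTP and hard alignment, AR only.
    idx = min(range(len(dists)), key=dists.__getitem__)
    hard_vec = list(embed[idx])
    # Soft branch (assumed form): softmax(-distance / tau) over all codes,
    # then a weighted average of the rows of E. Smooth in z, unlike the argmax.
    # Per the text: alignment loss, tokenizer only.
    logits = [-d / tau for d in dists]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    soft_vec = [sum(w * row[j] for w, row in zip(weights, embed))
                for j in range(len(embed[0]))]
    return idx, hard_vec, soft_vec, weights

# Toy values for illustration only.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
embed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
z = [0.9, 0.2]

idx, hard_vec, soft_vec, weights = dual_read_out(z, codebook, embed, tau=0.1)
_, _, soft_warm, weights_warm = dual_read_out(z, codebook, embed, tau=0.5)
```

With a small `tau` the soft vector stays close to the hard one; a larger `tau` spreads weight over more rows of `embed`, so the two branches drift further apart.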
GEAR's behavior is the opposite of diffusion-side recipes, and that is exactly why it works.
Unlike REPA-E / VA-VAE / MAETok, which push the continuous latent to be more DINOv2-like, GEAR's tokenizer ends up less DINOv2-like, most sharply at the patch level. Reconstruction is preserved, so this is a re-organization rather than a loss of information.
| Similarity to DINOv2 (patch) | LlamaGen-REPA | GEAR |
|---|---|---|
| CKA (pre-quant) | 0.173 | 0.107 (−0.066) |
| CKA (post-quant) | 0.161 | 0.100 (−0.061) |
| CKNNA (pre-quant) | 0.119 | 0.083 (−0.036) |
| CKNNA (post-quant) | 0.105 | 0.075 (−0.031) |
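For readers unfamiliar with the CKA rows in the table above, here is a small sketch of linear CKA between two feature matrices whose rows are the same patches. It assumes the linear-kernel variant; the source does not say which kernel or estimator was used, and the function names are illustrative.

```python
import math

def center(X):
    """Subtract the per-column mean from a row-major feature matrix."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_frob_sq(X, Y):
    """Squared Frobenius norm of X^T Y, computed column pair by column pair."""
    total = 0.0
    for xc in zip(*X):
        for yc in zip(*Y):
            total += sum(a * b for a, b in zip(xc, yc)) ** 2
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    Xc, Yc = center(X), center(Y)
    denom = math.sqrt(cross_frob_sq(Xc, Xc) * cross_frob_sq(Yc, Yc))
    return cross_frob_sq(Xc, Yc) / denom

# Toy features: 4 patches, 2 dimensions (illustrative values only).
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [2.0, 0.0]]
# Y is X rotated by 90 degrees and scaled by 2; linear CKA ignores both.
Y = [[-2.0 * b, 2.0 * a] for a, b in X]
# Z has a different feature width, which linear CKA allows.
Z = [[1.0], [0.0], [1.0], [0.0]]

cka_xy = linear_cka(X, Y)
cka_xz = linear_cka(X, Z)
```

The score lies in [0, 1] and is invariant to rotation and isotropic scaling of either feature set, which is why it can compare tokenizer latents against DINOv2 features of a different width.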
The alignment burden moves from the tokenizer to the AR. GEAR's AR hidden states track DINOv2 far more closely at the patch level, with stronger locally-coherent, spatially-causal structure that is exactly what makes next-token prediction easier.
GEAR (red) vs. LlamaGen-REPA (blue): stronger patch-level DINOv2 alignment across all depths.
End-to-end training reshapes which codes the tokenizer uses, moving toward lower entropy and a smaller effective codebook size, which is an easier next-token target. The alignment signal prevents the few-code collapse that raw NTP pressure alone would cause.
Usage entropy and effective codebook size fall, peak near 30k steps, then converge above the collapse level.
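The two quantities tracked above can be computed from a batch of token indices. The sketch assumes that "effective codebook size" means the exponential of the usage entropy (the perplexity of the usage distribution); the source does not spell out its definition, and `usage_stats` is a hypothetical helper.

```python
import math
from collections import Counter

def usage_stats(indices):
    """Return (entropy in nats, effective size = exp(entropy)) of code usage."""
    counts = Counter(indices)
    n = len(indices)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy, math.exp(entropy)

# Illustrative index streams, not measured data.
uniform = [0, 1, 2, 3] * 5          # four codes used equally often
skewed = [0] * 7 + [1]              # one code dominates
collapsed = [5] * 8                 # a single code: full collapse

ent_u, eff_u = usage_stats(uniform)
ent_s, eff_s = usage_stats(skewed)
ent_c, eff_c = usage_stats(collapsed)
```

Under this definition a uniformly used codebook of K codes has effective size K, and a fully collapsed one has effective size 1, so "lower entropy but above collapse" corresponds to a value strictly between the two.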
Freeze GEAR's end-to-end tokenizer and train a fresh text-to-image AR on 100M images (GPIC), changing only the tokenizer. It reaches the baseline's NTP loss 2.5× faster and its REPA alignment loss 11.1× faster. The token structure learned on ImageNet transfers to a completely different task.
GPIC (T2I) training dynamics. A fresh AR on GEAR's frozen tokenizer (red) converges far faster on both objectives.
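One plausible reading of "N× faster" is the ratio of training steps each run needs to reach the baseline's final loss. The sketch below assumes that definition, which the source does not state, and uses synthetic curves that are not GEAR's data.

```python
def steps_to_reach(curve, target):
    """First logged step whose loss is <= target, or None if never reached."""
    for step, loss in curve:
        if loss <= target:
            return step
    return None

# Synthetic (step, loss) logs for illustration only.
baseline = [(10_000, 8.0), (20_000, 7.4), (30_000, 7.1), (40_000, 6.9)]
candidate = [(10_000, 7.5), (20_000, 6.9), (30_000, 6.7), (40_000, 6.6)]

target = baseline[-1][1]  # the baseline's final loss
speedup = steps_to_reach(baseline, target) / steps_to_reach(candidate, target)
```

The result depends on logging granularity: with coarse checkpoints the crossing step is rounded up, so finer logging or interpolation gives a tighter estimate.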
Controlled comparisons against LlamaGen-REPA at matched parameters and budget.
| Method | Epochs | Params | gFID↓ | gFID↓ (CFG) | IS↑ (CFG) |
|---|---|---|---|---|---|
| Latent Diffusion (675M) | | | | | |
| SiT-REPA | 800 | 675M | 5.90 | 1.90 | 297.5 |
| REPA-E | 800 | 675M | 1.69 | 1.12 | 302.9 |
| Autoregressive (300 epochs) | | | | | |
| LlamaGen-REPA-B | 300 | 111M | 20.16 | 6.00 | 145.0 |
| GEAR-B (Ours) | 300 | 111M | 16.96 (−3.20) | 4.95 (−1.05) | 166.1 (+21.1) |
| LlamaGen-REPA-L | 300 | 343M | 12.70 | 3.15 | 208.1 |
| GEAR-L (Ours) | 300 | 343M | 8.66 (−4.04) | 2.95 (−0.20) | 239.8 (+31.7) |
| LlamaGen-REPA-XL | 300 | 775M | 8.20 | 2.68 | 232.2 |
| GEAR-XL (Ours) | 300 | 775M | 6.76 (−1.44) | 2.52 (−0.16) | 262.9 (+30.7) |
| Autoregressive (800 epochs) | | | | | |
| LlamaGen-REPA-L | 800 | 343M | 10.44 | 2.92 | 216.0 |
| GEAR-L (Ours) | 800 | 343M | 8.61 (−1.83) | 2.72 (−0.20) | 240.3 (+24.3) |
| LlamaGen-REPA-XL | 800 | 775M | 7.46 | 2.57 | 236.3 |
| GEAR-XL (Ours) | 800 | 775M | 6.28 (−1.18) | 2.45 (−0.12) | 254.3 (+18.0) |
CFG scale 1.5 for all columns marked (CFG). GEAR rows differ from the baseline only in the (end-to-end-tuned) frozen tokenizer.
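For context on the CFG columns, the sketch below shows one common way classifier-free guidance is applied to AR logits at sampling time: extrapolate from the unconditional logits toward the conditional ones. The source does not state which formulation its sampler uses, so treat this as an assumption; `cfg_logits` is an illustrative name.

```python
def cfg_logits(cond, uncond, scale=1.5):
    """Guided logits: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Toy two-token vocabulary (illustrative values only).
cond = [2.0, 0.0]
uncond = [1.0, 1.0]

guided = cfg_logits(cond, uncond, scale=1.5)
```

In this form a scale of 1.0 recovers the conditional logits and a scale of 0.0 recovers the unconditional ones; values above 1.0 push the distribution further toward the condition.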
| Method | Steps | FDD↓ | FDD↓ (CFG) | Prec.↑ | Rec.↑ |
|---|---|---|---|---|---|
| JiT-GPIC-1.1B | 390k | – | 204.0 | 0.91 | 0.53 |
| Controlled comparison (same encoder, data, architecture) | | | | | |
| LlamaGen-REPA-1.0B | 50k | 414.8 | 279.6 | 0.87 | 0.46 |
| GEAR-1.0B (Ours) | 50k | 381.6 (−33.2) | 256.9 (−22.7) | 0.87 | 0.50 (+0.04) |
| LlamaGen-REPA-1.0B | 100k | 319.2 | 198.6 | 0.89 | 0.61 |
| GEAR-1.0B (Ours) | 100k | 281.3 (−37.9) | 177.4 (−21.2) | 0.91 (+0.02) | 0.65 (+0.04) |
| LlamaGen-REPA-1.0B | 200k | 261.4 | 153.5 | 0.90 | 0.72 |
| GEAR-1.0B (Ours) | 200k | 230.0 (−31.4) | 138.0 (−15.5) | 0.91 (+0.01) | 0.75 (+0.03) |
| LlamaGen-REPA-1.0B | 390k | 228.9 | 127.9 | 0.92 | 0.78 |
| GEAR-1.0B (Ours) | 390k | 200.9 (−28.0) | 115.3 (−12.6) | 0.92 | 0.80 (+0.02) |
Same Qwen3-1.7B encoder and 100M-image GPIC corpus, differing only in the frozen tokenizer. CFG scale 1.75. FDD = Fréchet distance in DINOv2 feature space.
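As a reminder of what a Fréchet distance measures, the standard closed form between two Gaussians fitted to real and generated features is shown below. The subscripts $r$ and $g$ are notation introduced here, and the assumption is that FDD uses this usual form with DINOv2 features in place of Inception features:

$$
\mathrm{FD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

Here $\mu$ and $\Sigma$ are the mean and covariance of the features of each image set; lower is better, and the value is zero only when both moments match.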
Every design knob swept. Defaults highlighted.
| Setting | gFID↓ | rFID↓ |
|---|---|---|
| w/ STE | 104.93 | 59.72 |
| w/o GAN | 16.35 | 5.857 |
| GEAR | 10.63 | 1.640 |

| Type | gFID↓ | rFID↓ |
|---|---|---|
| VQVAE | 10.63 | 1.640 |
| LFQ | 14.78 | 2.129 |
| IBQ | 12.97 | 1.716 |

| τ | gFID↓ | rFID↓ |
|---|---|---|
| 0.50 | 9.15 | 1.698 |
| 0.10 | 10.63 | 1.640 |
| 0.05 | 10.75 | 1.638 |
| 0.01 | 11.42 | 1.628 |

| λ | gFID↓ | rFID↓ |
|---|---|---|
| 0.25 | 12.01 | 1.653 |
| 0.50 | 10.63 | 1.640 |
| 0.75 | 10.82 | 1.647 |
| 1.00 | 10.89 | 1.715 |

| Layer | gFID↓ | rFID↓ |
|---|---|---|
| 6th | 16.96 | 1.714 |
| 8th | 16.84 | 1.721 |
| 10th | 17.99 | 1.714 |

| Encoder | gFID↓ | rFID↓ |
|---|---|---|
| DINOv2 | 16.84 | 1.721 |
| DINOv3 | 18.97 | 1.696 |
| SigLIPv2 | 19.23 | 1.803 |
| V-JEPA 2.1 | 19.53 | 1.879 |

| Size | gFID↓ | rFID↓ |
|---|---|---|
| B (111M) | 21.52 | 1.658 |
| L (343M) | 10.63 | 1.640 |
| XL (775M) | 7.69 | 1.624 |

| Setting | gFID↓ | rFID↓ |
|---|---|---|
| w/o init. | 13.44 | 2.256 |
| w/ init. | 10.63 | 1.640 |
Where GEAR stands today, the architectural bottleneck behind it, and why the discrete route is still worth it.
GEAR's reconstruction (rFID 1.64) upper-bounds its generation (gFID 2.52), and it still trails the continuous-VAE REPA-E (rFID 0.28, gFID 1.12). Closing the discrete-vs-continuous reconstruction gap is the single largest lever for further gains.
In VQ-AR every token is one decoding step, which ties the down-sampling rate to sequence length. Borrowing the diffusion side's decoupling, namely a milder tokenizer plus AR-side grouping (patchified or multi-token prediction), could raise the ceiling without inflating the sequence.
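The coupling between down-sampling rate and sequence length is simple arithmetic. The sketch below uses an assumed 256×256 input and down-sampling factors of 16 and 8; these numbers and the helper `ar_steps` are illustrative, not settings reported by the source.

```python
import math

def ar_steps(image_size, downsample, group=1):
    """Return (token count, decoding steps) for a square image.

    group is the number of tokens emitted per AR step (1 = plain next-token).
    """
    side = image_size // downsample
    tokens = side * side
    return tokens, math.ceil(tokens / group)

# Aggressive tokenizer, one token per step.
coarse = ar_steps(256, 16)
# Milder tokenizer, one token per step: 4x the sequence length.
fine = ar_steps(256, 8)
# Milder tokenizer with 4 tokens grouped per step: back to the coarse step count.
fine_grouped = ar_steps(256, 8, group=4)
```

Halving the down-sampling factor quadruples the token count, so grouping four tokens per step is what it takes to keep the number of decoding steps unchanged.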
The discrete next-token form is uniform across modalities, bounds per-step error over long contexts, and inherits the mature LLM alignment stack (RLHF, DPO, GRPO), making it a promising substrate for unified understanding-and-generation systems.
@article{gear2026,
title = {GEAR: Guided End-to-End AutoRegression for Image Synthesis},
author = {Lin, Bin and Liu, Zheyuan and Lin, Chenguo and Chen, Sixiang and
Ge, Yunyang and Lin, Yunlong and Zhang, Jianwei and Yang, Miles and
Zhong, Zhao and Bo, Liefeng and Yuan, Li},
year = {2026}
}
@article{ifsq_llamagenrepa,
title = {iFSQ: Improving FSQ for Image Generation with 1 Line of Code},
author = {Lin, Bin and Li, Zongjian and Niu, Yuwei and Gong, Kaixiong and
Ge, Yunyang and Lin, Yunlong and Zheng, Mingzhe and Zhang, JianWei and
Yang, Miles and Zhong, Zhao and others},
journal = {arXiv preprint arXiv:2601.17124},
year = {2026}
}