Bin Lin (林彬)

Ph.D. Candidate in Computer Science

Peking University · Shenzhen, China

linbin.ece@stu.pku.edu.cn · linbin203279@gmail.com

I work on multimodal large models — building open, scalable systems that understand and generate across vision and language, and increasingly unify the two. Advised by Prof. Li Yuan (袁粒) at Peking University.

I'm a strong believer in open research: I lead and contribute to widely-used projects such as Open-Sora Plan, UniWorld, and Video-LLaVA, releasing the code, models, and data behind every paper.

Multimodal Understanding · Multimodal Generation · Unified Models · Video Generation

4,700+

Citations

1 paper 1,700+ · 7 papers 100+

20,000+

GitHub Stars

14,000,000+

Model & Data Downloads

Top 100

Hugging Face Influencer

2024.02–2025.01

🔥

News

2026.06 Released OSP-Next and GEAR, our latest work on efficient video generation and autoregressive visual generation.
2026.01 Selected for the Tencent Project UP Scholarship (腾讯青云奖学金), one of only 15 awardees nationwide in AI.
2025.07 MoE-LLaVA accepted to IEEE Transactions on Multimedia (TMM).
2025.06 Released Open-Sora Plan v1.5 (12K+ ⭐) and launched UniWorld.
2025.03 Video-LLaVA ranked #1 among the Most Influential EMNLP 2025 Papers.
2025.02 WF-VAE accepted to CVPR 2025.
2024.09 Video-LLaVA accepted to EMNLP 2025.
2024.01 LanguageBind accepted to ICLR 2024.

🚀

Selected Work

GEAR: up to 10x faster autoregressive image generation

Preprint · First Author

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Bin Lin, Zheyuan Liu, et al., Li Yuan

Jointly trains a VQ tokenizer and an autoregressive generator end-to-end, guided by representation alignment. AR training is up to 10× faster than with LlamaGen-REPA, and the learned spatial features are better and generalize across quantizers and to text-to-image.

Project Page · arXiv · Code (0 ⭐) · 🤗 Models

Preprint · Co-First Author · #1 GitHub Trending

Open-Sora Plan: Open-Source Large Video Generation Model

Bin Lin*, Yunyang Ge*, Xinhua Cheng*, et al., Li Yuan

A fully open recipe for large-scale video generation: causal video VAE, 3D / sparse attention, and complete training pipelines. The first open model trained from scratch natively on NPUs (v1.5).

arXiv · Code (12.2k ⭐) · 🤗 Models · Dataset

Preprint · First Author · #9 GitHub Trending

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, et al., Li Yuan

One unified model for visual understanding, generation, and manipulation: 2.7M curated samples power 20+ tasks within a single framework.

arXiv · Code (875 ⭐) · 🤗 Models · Dataset

EMNLP 2025 · First Author · #6 GitHub Trending

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, et al., Li Yuan

#1 Most Influential EMNLP 2025 Paper with 1,000+ citations. Reached video-QA SOTA in just two days on a single node.

arXiv · Code (3.5k ⭐) · 🤗 Models · Dataset

📚

More Publications

For the complete list, see my Google Scholar. * denotes equal contribution.

IEEE TMM · First Author · #6 GitHub Trending
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models. The first sparse multimodal LLM trainable on a single node.
arXiv · Code (2.3k ⭐) · 🤗 Models · Dataset

CVPR 2025 · Co-First Author
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion. A wavelet-driven video VAE with higher compression efficiency and reconstruction quality.
arXiv · Code (203 ⭐) · 🤗 Model

ICLR 2024 · Co-First Author
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. Binds N modalities through language, enabling unified cross-modal pretraining and retrieval.
arXiv · Code (881 ⭐) · 🤗 Models

NeurIPS 2025 · Co-First Author
ImgEdit: A Unified Image Editing Dataset and Benchmark. A 1.2M-pair image-editing dataset and benchmark, with the ImgEdit-E1 editing model.
arXiv · Code (316 ⭐) · 🤗 Dataset

ICML 2026 · Co-Author
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation. A world-knowledge-informed benchmark for evaluating text-to-image generation.
arXiv · Code (205 ⭐) · 🤗 Dataset

NeurIPS 2024 · Co-Author
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions. Boosts video understanding and generation with dense, high-quality captions.
arXiv · Code (1.1k ⭐) · 🤗 Dataset

⭐

Open-Source Projects

Open source is at the core of my research: collectively, my projects have 20K+ GitHub stars and 14M+ model & data downloads.

Open-Sora Plan 12.2k

Open large-scale video generation model.

Video-LLaVA 3.5k

Unified image & video understanding.

MoE-LLaVA 2.3k

Sparse Mixture-of-Experts multimodal LLM.

LanguageBind 881

N-modality alignment via language.

UniWorld 875

Unified visual understanding & generation.

WF-VAE 203

Efficient wavelet-driven video VAE.

iFSQ 102

Improves FSQ with one line of code; speeds up training of LlamaGen-REPA AR image generation.

OSP-Next 60

Efficient, high-quality video generation.

GEAR 0

End-to-end VQ tokenizer + AR generator, with up to 10× faster AR training.

🏆

Honors & Awards

Tencent Project UP Scholarship (腾讯青云奖学金)

One of only 15 AI scholars selected nationwide from ~400 applicants across 70+ universities.

2026

PKU Hongqiao Scholarship (北大宏桥奖学金)

Peking University Shenzhen Graduate School honor for outstanding research students.

2025

2023.06 Outstanding Graduate of Sichuan Province, China.
2022.11 Outstanding Student, Sichuan Agricultural University (top 10 university-wide).
2022.10 National Scholarship, the highest scholarship from the Ministry of Education, China.
2021.11 National First Prize, National Undergraduate Mathematical Modeling Contest.
2021.10 National Scholarship, the highest scholarship from the Ministry of Education, China.

🎓

Education

2023.09 – now (exp. 2028)

Ph.D. in Computer Science
Peking University

2019.09 – 2023.06

B.E., Outstanding Graduate, ranked 1st / 263
Sichuan Agricultural University

🛠️

Academic Service

Conference Reviewer

ICLR 2026 · NeurIPS 2026 · ICML 2026 · CVPR 2026 · ECCV 2026

Journal Reviewer

IEEE TPAMI

Workshop Organizer

Working Committee, CVM @ AAAI 2026 Workshop (Consistency in Video Generative Models: From Clip to Wild), which features a video-generation competition.