Zerui Cheng (程泽瑞)

Ph.D. Candidate at Princeton Univ., LLM and AI Researcher

Princeton University

Evaluating AI. Generating Data. Scaling Intelligence.

I am Zerui Cheng (程泽瑞), a Ph.D. candidate at Princeton University advised by Prof. Pramod Viswanath. My research focuses on Evaluation of LLMs and Agents and Synthetic Data, two pillars for building self-evolving agents for long-horizon tasks.

I’m a Quant Research Intern at Citadel Securities, and previously a Student Researcher at ByteDance Seed and Tencent Hy, contributing to Seed 1.8, Seed 2.0 Pro, Hy3 Preview and Hy3. Before Princeton, I received my B.Eng. in Computer Science from the Yao Class at Tsinghua University, graduating summa cum laude and receiving the Yao Award.

My research has been published in Nature and leading venues including NeurIPS, ICLR, ICML, COLM, AAAI, ACM CCS, EuroSys, and IEEE Transactions on Networking, with over 1,200 citations to date. It has been adopted as the technical core by high-profile startups including Sentient, Kite AI, and Polyhedra, which have together raised over $180M from major VCs including Founders Fund, IDG, and PayPal Ventures.

My work has been covered by MIT Technology Review (on the AI evaluation crisis) and Sciences et Avenir (on the philosophy of AI evaluation). Beyond research, I am a member of the Competitive Programming Hall of Fame, was a contestant on Season 10 of the TV show Super Brain, and previously served as President of the Yao Class Students' Congress.

Google Scholar profile | Curriculum Vitae

Interests

Evaluation of LLMs and Agents
Synthetic Data
Decentralized AI Systems
Blockchain & Cryptography

Education

Ph.D. student (2023 - now)
Electrical and Computer Engineering, Princeton University

B.Eng. in Computer Science (2019 - 2023)
Yao Class, the Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University

Recent Highlights

[Jun 2026] (paper acceptance)

Three papers (Multi-agent Debate, VeRA, TabularMath) accepted to ICML 2026 AI4Math!

[Jun 2026] (media coverage)

I was honored to be interviewed by Sciences et Avenir, the leading popular science magazine in France, which featured PeerBench and my broader views on the philosophy of LLM evaluation in its June 2026 article Que valent les comparateurs d’IA ? ("What are AI comparators worth?").

[May 2026] (paper acceptance)

Two papers (FrontierCS, the Generalization Spectrum) accepted to ICML 2026!

One paper (ValueMine) accepted to the journal IEEE Transactions on Networking!

[Feb 2026] (new papers)

Two first-authored papers done at ByteDance Seed are online now!

[Jan 2026] (paper acceptance)

Three papers were accepted at different venues this month!

One paper (HLE) accepted to Nature!
One paper (TAO) accepted to EuroSys 2026!
One paper (AutoCode) accepted to ICLR 2026!

[Dec 2025] (talk)

Dec 4: Gave a talk on “Open-Source AI for Competitive Programming” at the OpenAGI Symposium at NeurIPS! Ticket here

[Dec 2025] (paper acceptance)

CAIA was accepted to AAAI 2026 AI4Finance and selected for oral presentation (top 10%)!

[Sep 2025] (paper acceptance)

Two papers (LiveCodeBench Pro, PeerBench) accepted to NeurIPS 2025!

[Jun 2025] (media coverage)

LiveCodeBench Pro was covered by MIT Technology Review in their article Can we fix AI’s evaluation crisis?

Papers

For the most recent updates, please refer to my Google Scholar profile. Here are some selected publications.

High-Real-Value Technical Whitepapers for Superstar Startups

OML: Open, Monetizable, Loyal AI (2024, NeurIPS 2025 Lock-LLM)
- Technical whitepaper for the AI startup Sentient, which raised $85M in seed funding led by Peter Thiel’s Founders Fund.
- Featured at Citadel Securities PhD Summit 2025
- Invited talk at University of Tübingen
- Invited talk at the Decentralized AI Institute (link to recording)
- Pioneering AI-native cryptography for IP protection
zkBridge (ACM CCS 2022)
- Trustless cross-chain bridges using zero-knowledge proofs
- Foundation for the blockchain startup Polyhedra Network (valued at $1 billion by the end of 2024)
Kite AI Whitepaper
- Revolutionary infrastructure design for a stablecoin payment network dedicated to AI agents
- The technical whitepaper of Kite AI, a blockchain payment startup that secured $33M in funding across its seed ($15M) and Series A ($18M) rounds, led by PayPal Ventures.

(Selected) Research in Industry Grounded in Real Practice

VeRA: Verified Reasoning Data Augmentation at Scale
- Done at the ByteDance Seed team; the research question originated from the real practice of building a frontier large language model (Seed 2.0 Pro).
- It demonstrates a new way of generating high-quality reasoning data without relying on human expertise, which is usually scarce and expensive.
CAIA: Crypto AI Agent Benchmark
- Advisory role for the Surf AI team, Cybertino Labs, which secured $15M in seed funding.
- The paper builds the first-ever benchmark dedicated to AI agents for crypto, and lays the foundation for the entire Surf AI agentic ecosystem.

(Selected) Publications in Academia with High Impact

LiveCodeBench Pro (NeurIPS 2025) - Comprehensive, hard, and contamination-free code generation benchmark
- Featured in MIT Technology Review on Jun 24, 2025;
- Accumulated over 1 million views on X;
- Invited talks at OpenAGI Symposium at UC Berkeley, AI4Science Community at AlphaXiv, OpenAGI Symposium at NeurIPS, 3rd Universal Cup Finals;
- Cited by the Google DeepMind team in their latest Gemini series model releases;
Humanity’s Last Exam (2025) - Ultimate test for AI capabilities
- Published in Nature, one of the world's best-known scientific journals;
- Adopted by nearly all frontier AI labs for performance reporting;

Other Publications with One-sentence Description

LLM and Agent Evaluation
- FrontierCS (ICML 2026): An evolving benchmark for evolving intelligence on open problems in computer science;
- FutureX Pro: Done at ByteDance Seed; An agent benchmark for real-life future prediction in various high-value domains;
- PeerBench (also part of Decentralized AI, NeurIPS 2025): A new paradigm for evaluating LLMs and agents fairly, robustly, and reliably;
- SPIN-Bench (COLM 2025): A benchmark for LLMs' long-horizon reasoning and planning abilities.
Synthetic Data Generation
- AutoCode (ICLR 2026): An agentic framework for generating tests on competitive programming problems to scale training and evaluation in coding;
- TabularMath: Done at ByteDance Seed; A framework for generating high-quality tabular datasets for tabular foundation models.
Decentralized AI
- TAO (EuroSys 2026): Verifiable and reproducible LLM inference results to ensure accountability in MLaaS.
- Sakshi: A roadmap for an ideal decentralized AI platform in which every step is transparent and auditable, ensuring that AI ultimately benefits humanity.
- PoCW (IEEE Transactions on Networking): A paradigm for making Proof-of-Work in blockchains useful (e.g. for model training or inference), avoiding the large waste of computing power caused by cryptocurrencies.