
ICLR Flawed-Proof Question Bank · Data Report

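Starting from 54,851 ICLR papers (2016–2026), we used peer-review signals to mine flawed proofs that the authors themselves acknowledged. The 96 resulting hard-flaw candidates were then refined further, leaving a final bank of 49 questions.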
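Total ICLR papers: 54,851
Substantive theory papers: 9,363
Papers whose authors acknowledged an error: 255
Final question bank (accepted into the bank): 49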

I. Filtering funnel

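All ICLR papers (2016–2026): 54,851
  11 yearly files, with complete peer reviews
Regex prefilter, theory signal present: 19,218 (keeps 35.0% of the previous stage)
  Keyword/regex scoring into three tiers: STRONG / MEDIUM / WEAK
Topic classification, substantive theory: 9,363 (keeps 48.7%)
  gpt-5.4-nano labels the paper pure_theory or theory_core
First review pass, reviewer questions proof correctness: 5,868 (keeps 62.7%)
  explicit_error 1,421 + soft_concern 4,447
Second review pass, author acknowledged an error: 255 (keeps 4.3%)
  Rebuttal thread read; the author concedes the error
Eligible for the bank, error confirmed or recoverable: 131 (keeps 51.4%)
  confirmed 58 + recoverable 73
Hard flaw, error that takes reasoning to find: 96 (keeps 73.3%)
  Codex triages each item and removes 35 obvious typos
Final question bank: 49 (keeps 51.0%)
  Errors obvious at a glance and noise removed, then strict re-review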
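Branch notes: the 5,868 papers from the first review pass are the union of all cases where a reviewer questioned correctness (explicit_error plus soft_concern). The second pass narrows these to the 255 papers whose authors conceded the error. The 131 eligible papers are the 58 where the error is confirmed plus the 73 that can be fully reconstructed from the rebuttal thread. Codex then removed 35 obvious typos item by item, leaving 96 hard flaws. A final round removed errors that are obvious at a glance, along with noise, and strict re-review settled on 49 questions.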
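The retention percentages in the funnel can be recomputed from the stage counts alone. The following is a minimal sketch, not the report's own tooling: the stage counts are copied from the funnel above, while the names `FUNNEL`, `retention` and `RETENTION` are introduced here purely for illustration.

```python
# Stage counts copied from the funnel in this report.
# FUNNEL, retention and RETENTION are illustrative names, not part of the pipeline.
FUNNEL = [
    ("All ICLR papers (2016-2026)", 54_851),
    ("Regex prefilter: theory signal", 19_218),
    ("Topic classification: substantive theory", 9_363),
    ("First pass: correctness questioned", 5_868),
    ("Second pass: author acknowledged error", 255),
    ("Eligible: confirmed or recoverable", 131),
    ("Hard flaw: reasoning required", 96),
    ("Final question bank", 49),
]


def retention(stages):
    """Return (name, count, percent kept relative to the previous stage)."""
    rows = []
    for (_, prev_count), (name, count) in zip(stages, stages[1:]):
        rows.append((name, count, round(100 * count / prev_count, 1)))
    return rows


RETENTION = retention(FUNNEL)
for name, count, pct in RETENTION:
    print(f"{name:<44}{count:>7,}  kept {pct}%")
```

Each percentage is relative to the stage immediately before it, so the values do not multiply out to a share of the original 54,851 unless they are chained.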

II. Analysis of the final question bank (N = 49)

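Error type distribution

Invalid bounding / inequality step: 15 (30.6%)
Missing or too-weak assumption: 14 (28.6%)
Circular argument / unproven claim: 8 (16.3%)
Wrong constant / coefficient / rate: 7 (14.3%)
Ill-posed or pathological definition: 2 (4.1%)
Quantifier / scope error: 1 (2.0%)
Non-commuting order of operations: 1 (2.0%)
Missed boundary / degenerate case: 1 (2.0%)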
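The most common types (invalid bounding, missing assumptions, circular arguments, wrong rates) are structural defects that can only be found by actually following the proof, not typographical slips. After refinement, invalid bounding and missing assumptions together account for nearly 60% of the bank (29 of 49, 59.2%).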
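As a quick check on the shares quoted above, the percentages and the combined share of the two largest categories follow directly from the counts. This is a small sketch under the assumption that the counts are exactly those listed in this section; the English labels and the names `ERROR_TYPES`, `shares` and `TOP_TWO_SHARE` are illustrative.

```python
# Error-type counts copied from the distribution in this section (N = 49).
# Labels are English renderings; variable names are illustrative only.
ERROR_TYPES = {
    "Invalid bounding / inequality step": 15,
    "Missing or too-weak assumption": 14,
    "Circular argument / unproven claim": 8,
    "Wrong constant / coefficient / rate": 7,
    "Ill-posed or pathological definition": 2,
    "Quantifier / scope error": 1,
    "Non-commuting order of operations": 1,
    "Missed boundary / degenerate case": 1,
}


def shares(counts):
    """Map each label to its percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


TOTAL = sum(ERROR_TYPES.values())
SHARES = shares(ERROR_TYPES)

# Combined share of the two largest categories.
top_two = sorted(ERROR_TYPES.values(), reverse=True)[:2]
TOP_TWO_SHARE = round(100 * sum(top_two) / TOTAL, 1)

print(SHARES)
print(f"top two categories: {TOP_TWO_SHARE}%")
```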

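Topic distribution (6 categories)

Optimization theory: 19 (38.8%)
Reinforcement learning theory: 11 (22.4%)
Generative models / sampling: 6 (12.2%)
Probability / concentration inequalities: 6 (12.2%)
Learning theory / generalization: 5 (10.2%)
Linear algebra / spectral: 2 (4.1%)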
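Optimization theory and reinforcement learning theory have the largest shares (30 of 49 combined), consistent with the overall makeup of ICLR theory papers.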

Year distribution

2017: 1 (2.0%)
2018: 4 (8.2%)
2019: 1 (2.0%)
2020: 5 (10.2%)
2021: 5 (10.2%)
2022: 7 (14.3%)
2023: 6 (12.2%)
2024: 14 (28.6%)
2025: 6 (12.2%)

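Source & impact

Source
Error confirmed (PDF still contains the error): 22 (44.9%)
Fixed but recoverable (reconstructed from the thread): 27 (55.1%)
Does it affect the main result? (Codex re-review)
Affects the main result: 42 (85.7%)
Affects only an auxiliary lemma: 7 (14.3%)
The Codex re-review has a mean confidence of 0.90 (range 0.84–0.98); more than 80% of the flaws directly affect the paper's main result.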
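Since every breakdown in this section describes the same 49 questions, each set of counts should sum to N. A minimal consistency check might look like the sketch below; the counts are copied from this section, and `BREAKDOWNS` and `mismatches` are hypothetical names used only here.

```python
N = 49  # size of the final question bank

# Counts copied from the breakdowns in this section; names are illustrative.
BREAKDOWNS = {
    "error_type": [15, 14, 8, 7, 2, 1, 1, 1],
    "topic": [19, 11, 6, 6, 5, 2],
    "year": [1, 4, 1, 5, 5, 7, 6, 14, 6],
    "source": [22, 27],
    "affects_main_result": [42, 7],
}


def mismatches(breakdowns, expected_total):
    """Return {name: actual_sum} for breakdowns that do not sum to expected_total."""
    return {
        name: sum(counts)
        for name, counts in breakdowns.items()
        if sum(counts) != expected_total
    }


MISMATCHES = mismatches(BREAKDOWNS, N)
print(MISMATCHES or "all breakdowns sum to N")
```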

III. High-confidence examples (affects_main_result = yes)

Year | Paper | Topic | Error type | Author acknowledgment (excerpt)
2022 | A Novel Convergence Analysis for the Stochastic Proximal Point Alg… | Optimization theory | Circular argument / unproven claim | No, Lemma 1 can't be fixed. The result simply does not hold…
2024 | Reward Adaptation Via Q-Manipulation | Reinforcement learning theory | Wrong constant / coefficient / rate | “we thank the reviewers for pointing out the error in Lemma 5.”…
2024 | Farzi Data: Autoregressive Data Distillation | Generative models / sampling | Missing or too-weak assumption | Thank you for catching our honest mistake at the end of Theorem 3.1…
2018 | From Information Bottleneck To Activation Norm Penalty | Optimization theory | Circular argument / unproven claim | As you rightly pointed out, the lack of treatment for the second term (log-determinant) is unsatisfying…
2020 | Policy Optimization with Stochastic Mirror Descent | Reinforcement learning theory | Circular argument / unproven claim | The reviewer's comment of Eq.(40) is correct…
2020 | An Information Theoretic Perspective on Disentangled Representatio… | Probability / concentration inequalities | Missing or too-weak assumption | As you point out, now we realize that the factorized noise is sufficient for conditional independence…
2018 | Residual Loss Prediction: Reinforcement Learning With No Incremen… | Reinforcement learning theory | Missing or too-weak assumption | indeed, you're right, there's a missing term. In going from Eq 7 to Eq 8…
2018 | Three factors influencing minima in SGD | Optimization theory | Missing or too-weak assumption | We agree there is a mathematical mistake in allowing sigma to vary with theta…
2023 | Shuffle Gaussian Mechanism for Differential Privacy | Probability / concentration inequalities | Quantifier / scope error | Thanks for the insightful comment! We have indeed overlooked the fact that normalization and shuffling…
2023 | Improving Out-of-distribution Generalization with Indirection Repr… | Linear algebra / spectral | Invalid bounding / inequality step | In the revision, we provided the definition for the ‖·‖∞ norm in Definition A.3…
2024 | Learning multi-modal generative models with permutation-invariant… | Generative models / sampling | Missing or too-weak assumption | We agree that Proposition 1 relies on a strong assumption about the encoding distribution…
2024 | Are Transformers with One Layer Self-Attention Using Low-Rank Weig… | Learning theory / generalization | Invalid bounding / inequality step | As for the bugs in equation (10), you are absolutely right…
Data pipeline: regex prefilter → gpt-5.4-nano topic classification → two review passes (nano) → gpt-5.4-mini reconstruction → Codex difficulty triage · total cost ≈ $12.9
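The first pipeline stage, keyword/regex scoring into STRONG / MEDIUM / WEAK tiers, could be sketched roughly as follows. The report does not list the actual patterns, weights or thresholds, so everything in `PATTERNS`, `theory_score` and `tier` below is a hypothetical stand-in meant only to illustrate the idea of tiered regex scoring.

```python
import re

# Hypothetical (pattern, weight) pairs; the real prefilter's patterns are not
# given in the report.
PATTERNS = [
    (re.compile(r"\b(?:theorem|lemma|proposition|corollary)\b", re.I), 3),
    (re.compile(r"\b(?:proof|convergence rate|regret bound|sample complexity)\b", re.I), 2),
    (re.compile(r"\b(?:bound|assumption|guarantee)s?\b", re.I), 1),
]


def theory_score(text):
    """Sum weight * match count over all patterns."""
    return sum(weight * len(pattern.findall(text)) for pattern, weight in PATTERNS)


def tier(score, strong=8, medium=4, weak=1):
    """Map a score to a tier name, or None if no theory signal is found.

    The thresholds are assumptions chosen for this example.
    """
    if score >= strong:
        return "STRONG"
    if score >= medium:
        return "MEDIUM"
    if score >= weak:
        return "WEAK"
    return None


sample = "We prove Theorem 1 and Lemma 2; the proof gives a regret bound."
print(theory_score(sample), tier(theory_score(sample)))
```

A cheap pass like this only narrows the pool; in the pipeline above, the decision about whether a paper is substantive theory is left to the later model-based classification stage.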