AI detection startup GPTZero scanned all 4,841 papers accepted by the prestigious Conference on Neural Information Processing Systems, or NeurIPS, which took place last month in San Diego. The company found 100 citations, spread across 51 papers, that it confirmed were hallucinated.
Having a paper accepted by NeurIPS is a resume-worthy achievement in the world of AI. Given that these are the leading minds of AI research, one might assume they would be exactly the people to hand the catastrophically boring task of writing citations to a large language model. Several caveats apply to this finding, though. One hundred confirmed hallucinated citations across 51 papers is a vanishingly small fraction of the total: each paper contains dozens of citations, so out of the tens of thousands of references in the accepted papers, the fabrications amount to well under one percent.
It is also important to note that an inaccurate citation does not negate a paper’s research. As NeurIPS stated, even if a small percentage of papers contain one or more incorrect references due to the use of large language models, that does not necessarily invalidate the content of the papers themselves.
But having said all that, a faked citation is not nothing, either. NeurIPS prides itself on rigorous scholarly publishing in machine learning and artificial intelligence, and each paper is peer-reviewed by multiple reviewers who are instructed to flag hallucinations. Citations are also a kind of currency for researchers: they serve as a career metric of how influential a researcher’s work is among peers, and when AI makes them up, it waters down their value.
GPTZero is quick to point out that no one can fault the peer reviewers for missing a few AI-fabricated citations, given the sheer volume involved. The goal of the exercise was to offer specific data on how AI-generated errors sneak in via what the startup calls a submission tsunami, one that has strained these conferences’ review pipelines to the breaking point. GPTZero even points to a recent paper that discussed the problem of peer review at premier conferences, including NeurIPS.
Still, why couldn’t the researchers themselves fact-check the large language model’s work for accuracy? Surely they know the actual papers they drew on for their work. What the whole thing really points to is one big, ironic takeaway: if the world’s leading AI experts, with their reputations at stake, cannot ensure that their large language model usage is accurate in the details, what does that mean for the rest of us?

