When AI Errors Become Everyone Else’s Job

The mistakes are no longer surprising. What is changing, researchers, judges and software maintainers say, is the amount of labor now required to catch them.

Across medicine, law and software, a series of new studies and firsthand accounts is pointing to the same unsettling pattern: artificial intelligence systems are generating polished, plausible material that often contains false citations, invented evidence trails or confident but unreliable analysis — and those errors are slipping into places where accuracy is not optional.

The result is a quieter, less glamorous phase of the generative AI boom. The question is no longer simply whether chatbots hallucinate. It is who, exactly, must now spend time verifying everything they touch.

A growing problem in scientific literature

In biomedical research, that burden is becoming increasingly visible in the footnotes.

A Columbia-led audit of 2.5 million scholarly papers found that fabricated references have risen sharply since 2023, with researchers estimating that 146,932 hallucinated citations appeared in papers published in 2025 alone. A related analysis in *The Lancet* identified nearly 3,000 biomedical papers containing fake references.

The concern is not just academic embarrassment. Biomedical papers feed review articles, meta-analyses and, in some cases, clinical guidelines that influence how patients are treated. When a citation does not exist, or appears to support a claim it never made, it can distort the evidence base in ways that are difficult to unwind.

Researchers say the newer generation of errors is particularly hard to detect. The fake references often look authentic: they match the subject of the paper, follow standard formatting conventions and are inserted into otherwise credible prose. According to the audit, publishers have largely been slow to respond; most affected papers had received no corrective action.

That reflects a broader shift in how AI-generated mistakes manifest. The problem is no longer only made-up facts. It is also made-up sourcing.

A new benchmark developed by researchers at Peking University found that leading AI systems frequently produce what the researchers call “attribution hallucinations”: answers that may be correct, but that point to the wrong passage, wrong quotation or wrong supporting text in the underlying document. In regulated fields like medicine and law, that distinction matters enormously. A statement without a verifiable evidentiary chain is, functionally, not reliable enough to act on.

The hazards of that gap have been underscored by practical testing outside academia. In a recent essay, a professional fact-checker described mainstream AI systems as wrong far more often than casual users tend to assume, recounting instances in which tools invented passages, mischaracterized sources and cited support that did not exist.

Courts confront an AI paperwork surge

The same dynamic is now straining the federal courts.

A study by researchers at M.I.T. and the University of Southern California found that lawsuits filed by people without lawyers have surged since ChatGPT entered the mainstream, with self-represented federal complaints nearly doubling over that period. In a sample of 2026 complaints, more than 18 percent showed signs of AI-generated text.

For some advocates, that rise reflects something hopeful at first glance: AI tools can help people who cannot afford lawyers draft documents that look formal and complete. But judges are increasingly confronting the downside of that accessibility — filings that are longer, more polished and more difficult to dismiss at a glance, yet may contain fabricated citations, nonexistent cases or muddled legal reasoning.

Federal courts have already been dealing with headline-making episodes in which lawyers submitted briefs citing invented cases produced by chatbots. What appears to be emerging now is a broader administrative burden. Clerks and judges must spend more time checking authorities, sorting through repetitive or machine-inflated filings and deciding when to sanction parties, require disclosure of AI use or rely on existing rules of professional responsibility.

The response remains uneven. Some courts and judges have adopted AI certification or disclosure requirements, asking attorneys to affirm that filings were checked by a human. Others have resisted AI-specific rules, arguing that existing sanctions authority is sufficient. But either way, the labor of verification remains.

Open-source maintainers feel the pressure

In software, the burden is landing on a group that already works with limited time and little slack: maintainers.

Daniel Stenberg, the founder and lead maintainer of curl, said recently that his project is now receiving AI-assisted security reports at four to five times the 2024 rate and at roughly double the pace of 2025. The reports, he said, are often long, detailed and credible-sounding, creating an avalanche of high-priority work for a team that feels obliged to investigate.

That does not necessarily mean the findings are all worthless. In fact, Stenberg said many incoming reports are serious enough to require attention, even if the vulnerabilities discovered in recent years have generally been of low or medium severity. The problem is that volume itself becomes a form of pressure. Every plausible report demands expert review.

Armin Ronacher, a prominent software developer, has described a similar problem in bug reporting. What frustrates him, he wrote, is not merely that users are employing AI, but that many reports no longer reflect direct observation. Instead, a real problem is fed into a model and returned as an overconfident and often inaccurate narrative: guessed-at root causes, fake minimal reproductions, irrelevant analogies and sprawling lists of speculative failure modes.

For maintainers, that can be worse than a short, messy human report. It is harder to parse, harder to trust and often more labor-intensive to debunk.

The ideal issue report, Ronacher argued, is now almost aggressively simple: what was run, what was expected, what happened instead and the exact error log. In other words, less synthetic analysis, more primary evidence.

Why this matters now

Taken together, the reports suggest that the operational cost of generative AI is migrating downstream.

The tools make it easier to produce competent-looking text at scale. That can be genuinely useful. It can also flood systems that depend on careful review with material that is superficially convincing but epistemically weak — output that sounds researched, sounds reasoned and sounds sourced, while quietly breaking the chain of evidence on which professional work depends.

That is especially consequential in domains where trust is procedural as much as substantive. In science, claims must be traceable to real studies. In law, arguments must rest on actual authorities. In security and software, bug reports and vulnerability claims must be tied to reproducible observations. If AI can produce the appearance of those things without the underlying substance, then institutions must compensate by adding more human scrutiny.

And that scrutiny is expensive. It takes expert labor, slows workflows and often falls on people — journal editors, court staff, volunteer maintainers, fact-checkers — who were already overextended.

There are open questions about whether better retrieval systems, stronger grounding techniques and automated citation checking can meaningfully reduce the problem. It is also unclear how responsibility will be divided among AI companies, publishers, law firms, courts and software platforms.

For now, though, the pattern is becoming harder to ignore. The challenge posed by generative AI is not only that it can be wrong. It is that it can be wrong in ways that look ready for publication, filing or deployment — leaving someone else to prove that they are not.
