This study is the first to systematically evaluate whether prominent LLMs, including ChatGPT, DeepSeek, and Claude, faithfully summarize scientific claims or exaggerate their scope. Our analysis of nearly 5,000 LLM-generated science summaries revealed that most models produced broader generalizations of scientific results than the original texts, even when explicitly prompted for accuracy, and did so across multiple tests. Notably, newer models exhibited significantly greater inaccuracies in generalization than earlier versions. These findings suggest a persistent generalization bias in many LLMs, i.e., a tendency to extrapolate scientific results beyond the claims found in the material that the models summarize.
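For a concrete sense of what "broader generalization" means here, a minimal sketch follows: it flags a summary that restates a past-tense, sample-specific finding as a generic present-tense claim. The verb list and regex heuristic are illustrative assumptions, not the coding scheme used in the study.

```python
import re

# Toy illustration of generalization bias (an assumption for this sketch,
# not the paper's actual method): a summary overgeneralizes when it states
# a finding as a generic present-tense claim ("the drug reduces symptoms")
# that the source only reported as a past-tense, sample-specific result
# ("the drug reduced symptoms in this trial").
GENERIC_PRESENT = re.compile(
    r"\b(improves|reduces|increases|causes|prevents)\b", re.IGNORECASE
)

def looks_overgeneralized(source: str, summary: str) -> bool:
    """Flag summaries that use generic present-tense claim verbs
    the source text never uses."""
    return bool(GENERIC_PRESENT.search(summary)) and not GENERIC_PRESENT.search(source)

source = "In this randomized trial, the drug reduced symptoms in 60% of participants."
summary = "The drug reduces symptoms."
print(looks_overgeneralized(source, summary))  # True: a sample-specific finding became a generic claim
```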
Paper showing LLMs have a “tendency to extrapolate scientific results beyond the claims found in the material”

Ben Harris-Roxas
@ben_hr