AI Search Engines Are Failing at News Citations—New Study Exposes Shocking Accuracy Issues

AI search engines like ChatGPT, Gemini, Grok, and Perplexity struggle with accurate news citations. A new study reveals they get sources wrong over 60% of the time—Grok 3 fails a shocking 94% of queries! Find out why.


Key Points

  • Research suggests AI search engines like Grok, Gemini, ChatGPT, and Perplexity often struggle with accurate news citations.
  • The evidence leans toward these tools providing incorrect answers in over 60% of cases, with some, like Grok 3, erring in 94% of queries.
  • It seems likely that this issue stems from training data limitations and the probabilistic nature of AI, leading to misinformation risks.

Introduction

AI search engines are becoming popular for finding news, but recent studies highlight significant challenges in their ability to cite sources accurately. This article explores findings from the Columbia Journalism Review (CJR) and analyticsindiamag.com, revealing why these tools may not be reliable for news citations and what this means for users and publishers.

Study Findings

A CJR study compared eight AI search engines, testing their accuracy in citing news sources. It found that collectively, these tools gave incorrect answers over 60% of the time. Specific error rates included:

  • Perplexity answered 37% of queries incorrectly, the lowest error rate in the test.
  • Grok 3 erred on 94% of queries, making it the least reliable tool tested.

The analyticsindiamag.com article echoes this, warning against trusting these AI models for citations due to their struggle with providing accurate publisher information.

Implications and Context

These inaccuracies can lead to misinformation, eroding trust in both AI tools and news sources. For publishers, this means potential loss of traffic and revenue, as AI repackages content without proper attribution. An unexpected detail is that some AI models, like Grok 3, confidently provide wrong information, making it harder for users to spot errors.


Detailed Analysis of AI Search Engine Citation Accuracy

Overview and Background

As of March 18, 2025, reliance on AI search engines for news consumption is growing, with approximately one in four Americans using these tools as alternatives to traditional search engines. However, recent research has uncovered significant issues with their accuracy, particularly in citing news sources. This analysis synthesizes findings from two key articles: a study by the Columbia Journalism Review's Tow Center for Digital Journalism and an article from analyticsindiamag.com, both published in early 2025. These sources highlight the pervasive problem of citation inaccuracies in AI search engines like Grok, Gemini, ChatGPT, and Perplexity, raising concerns for users, publishers, and the broader information ecosystem.

Methodology and Findings from the CJR Study

The CJR study, conducted by researchers Klaudia Jaźwińska and Aisvarya Chandrasekar, involved a rigorous evaluation of eight AI search engines: ChatGPT Search, Perplexity, Perplexity Pro, Gemini, DeepSeek Search, Grok-2 Search, Grok-3 Search, and Copilot. The methodology included selecting 200 news article excerpts, ensuring each was within the top three results of a Google search, and then querying each AI tool to assess whether it correctly identified the article, its publisher, and the URL (a code sketch of this scoring procedure appears after the list). The results, published in early March 2025, revealed:

  • Collectively, the AI search engines provided incorrect answers to more than 60% of the queries, indicating a systemic issue in citation accuracy.
  • Error rates varied significantly across platforms: Perplexity had the lowest error rate at 37%, while Grok 3 was the worst performer, answering 94% of queries incorrectly.
  • Additional details from secondary sources, such as Ars Technica, noted that more than half of the citations from Gemini and Grok 3 led to fabricated or broken URLs, with 154 out of 200 citations from Grok 3 resulting in error pages.
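
To make the scoring concrete, here is a minimal sketch, in Python, of the kind of harness the study describes. The `ask_engine` client is a hypothetical stand-in for each tool's interface, and the exact-match comparison is a simplification: the CJR researchers queried the tools manually and graded answers on a finer scale than simply right or wrong.

```python
# Minimal sketch of a CJR-style citation-accuracy harness.
# `ask_engine` is a hypothetical stand-in for each AI search tool's
# interface; the real study queried the tools' own UIs directly.

from dataclasses import dataclass

@dataclass
class Excerpt:
    text: str        # the quoted passage shown to the engine
    headline: str    # ground-truth article headline
    publisher: str   # ground-truth publisher name
    url: str         # ground-truth canonical URL

def ask_engine(engine: str, excerpt_text: str) -> dict:
    """Hypothetical client: returns {'headline', 'publisher', 'url'}."""
    raise NotImplementedError("wire up the engine's own interface here")

def error_rate(engine: str, excerpts: list[Excerpt]) -> float:
    """Count an answer as wrong if any of the three fields is wrong."""
    wrong = 0
    for ex in excerpts:
        ans = ask_engine(engine, ex.text)
        if (ans.get("headline") != ex.headline
                or ans.get("publisher") != ex.publisher
                or ans.get("url") != ex.url):
            wrong += 1
    return wrong / len(excerpts)

# A tool wrong on 120 of 200 excerpts scores 0.60, i.e. a 60% error rate.
```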

This study underscores the challenge of AI search engines in handling news citations, with some models, like Grok 3, being particularly unreliable. The high error rates suggest that these tools are not yet suitable for tasks requiring precise source attribution, especially in journalism.

Insights from Analyticsindiamag.com

The article from analyticsindiamag.com, published on March 17, 2025, complements the CJR study by focusing on specific AI models: Grok, Gemini, ChatGPT, and Perplexity. It argues that trusting these tools for citations is a bad idea, highlighting that their web search features struggle to provide accurate information about original publishers. Its conclusions align with the CJR findings, emphasizing the risk of misinformation from inaccurate citations. The article adds to the discourse by naming individual models and reinforcing the need for caution when using AI for news-related queries.

Reasons for Inaccuracy

Several factors contribute to the citation inaccuracies observed:

  • Training Data Limitations: AI models are trained on large datasets, but these may not include real-time or up-to-date news information, leading to outdated or incorrect citations.
  • Probabilistic Nature of AI: These models generate responses based on probability, which can result in hallucinations—fabricated details like URLs or sources that do not exist. This is particularly evident in models like Grok 3, which had a 94% error rate.
  • Web Scraping and Misinterpretation: AI search engines often rely on web scraping to gather information, which can be error-prone if sources are not verified or if the model misinterprets the content. For instance, secondary sources noted that AI tools sometimes return broken links, further eroding user trust (a minimal link-validation sketch follows this list).
  • Confidence Without Accuracy: A notable issue is that many AI search engines, such as ChatGPT, provide responses with high confidence, using hedging language in only 15 out of 134 incorrect citations, according to the Nieman Journalism Lab. This can mislead users into trusting inaccurate information, as noted by MIT Technology Review, which discussed how citations can give an "air of correctness" that may not be warranted.
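
One of these failures is directly checkable: whether a cited URL resolves at all. As a concrete illustration of the validation idea, here is a minimal sketch using only the Python standard library; the HEAD-request probe and the function name are assumptions for illustration, not any vendor's actual pipeline, and some servers reject HEAD requests, so a production check would need fallbacks.

```python
# Minimal sketch: probe whether a citation's URL actually resolves,
# the kind of check that would catch fabricated or broken links.

import urllib.error
import urllib.request

def url_resolves(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, ValueError):
        # Covers DNS failures, refused connections, 4xx/5xx responses
        # (HTTPError is a URLError subclass), and malformed URLs.
        return False

# A fabricated citation of the kind the study describes would
# typically fail this probe by returning a 404 error page.
```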

Implications for Users and Publishers

The implications of these findings are significant:

  • Misinformation Risks: Users relying on AI search engines for news may encounter and propagate misinformation, especially given the high error rates. For example, a user querying recent news might receive a fabricated URL, leading to an error page instead of the correct source.
  • Erosion of Trust: The inaccuracies erode trust in both the AI tools and the news sources they cite, potentially damaging the credibility of journalism. This is particularly concerning as nearly one in four Americans uses AI models for news, according to the CJR study.
  • Impact on Publishers: News organizations face a dilemma: blocking AI crawlers might lead to a complete loss of attribution, while allowing them results in widespread reuse without driving traffic back to their websites. Mark Howard, chief operating officer at Time magazine, expressed concerns to CJR about ensuring transparency and control over how content appears in AI-generated searches, highlighting the tension between visibility and revenue loss.
  • Unexpected Detail: Premium models did not outperform cheaper ones; Grok 3, offered as a paid tier, was wrong on 94% of queries versus Perplexity's 37%. This suggests that cost does not necessarily correlate with accuracy, challenging assumptions about premium AI services.

Comparative Analysis with Traditional Search Engines

Traditional search engines, like Google, act as intermediaries, guiding users to news websites and other quality content. In contrast, generative AI search tools parse and repackage information, often cutting off traffic flow to original sources, as noted by Futurism. This shift can exacerbate citation issues, as AI models prioritize conversational outputs over verifiable links, leading to the observed inaccuracies.

Recommendations for Improvement

To address these issues, several strategies can be implemented:

  • Improved Evaluation Metrics: Developers should establish robust metrics to assess citation accuracy, ensuring models are tested against real-world news queries.
  • Transparency and User Education: AI search engines should clearly communicate their limitations and encourage users to verify information, perhaps by integrating tools for fact-checking.
  • Collaboration with Publishers: AI companies could work with news organizations to ensure proper attribution and traffic flow, potentially through APIs or partnerships that respect publisher exclusion requests (see the robots.txt sketch after this list).
  • Regulatory Oversight: Governments and regulatory bodies may need to set standards for AI search engine accuracy and transparency, ensuring users are protected from misinformation.
  • Model-Specific Improvements: Given the high error rates for models like Grok 3, developers should prioritize enhancing source verification and reducing hallucinations, possibly by integrating real-time web crawling with better validation checks.
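
On the publisher-collaboration point, the most basic form of respecting exclusion requests is honoring a site's robots.txt file. Below is a minimal sketch using Python's standard-library robotparser; the crawler name and URLs are placeholders, not any real vendor's user agent.

```python
# Minimal sketch: check a site's robots.txt before fetching an article.
# "ExampleAIBot" is a placeholder crawler name, not a real user agent.

from urllib import robotparser

def may_crawl(page_url: str, robots_url: str,
              user_agent: str = "ExampleAIBot") -> bool:
    """Return True only if the site's robots.txt permits this agent."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches and parses the live robots.txt
    return rp.can_fetch(user_agent, page_url)

# Usage: gate any fetch-and-repackage step on the publisher's wishes.
# if may_crawl("https://example.com/story",
#              "https://example.com/robots.txt"):
#     ...  # fetch the page and attribute it properly
```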

Conclusion

The research from CJR and analyticsindiamag.com highlights a critical challenge in the adoption of AI search engines for news consumption: their struggle with accurate citations. With error rates exceeding 60% and some models like Grok 3 reaching 94%, these tools pose risks of misinformation and trust erosion. As AI continues to evolve, addressing these issues is essential to maintain the integrity of the information ecosystem. Users should remain cautious, verifying information from multiple sources, while developers and regulators work toward more reliable AI search solutions.

Table: Error Rates of AI Search Engines

Below is a table summarizing the error rates from the CJR study, based on available data:

AI Search Engine         Error Rate (%)   Notes
Perplexity               37               Relatively better performer
Grok 3                   94               Highest error rate, significant issue with broken URLs
ChatGPT Search           67               High error rate, low use of hedging language
Gemini                   Not specified    Known for fabricated URLs
Others (e.g., Copilot)   Varies           Copilot declined many answers

This table illustrates the variability in performance, with Grok 3 standing out as particularly unreliable.