Hallucination Index

Ranking AI models by their tendency to fabricate information. Lower percentage = more reliable.

Source: Vectara Research · Last updated: February 17, 2026 · 50 models tested

Methodology

Models are tested by asking them to summarize documents and then checking how often the summary contains information not present in the original. The Hallucination Rate is the percentage of summaries containing fabricated content; Factual Accuracy is its complement (100% minus the Hallucination Rate). The Answer Rate shows how often the model provided a response at all, since some models refuse certain queries.

| Rank | Model | Hallucination Rate | Factual Accuracy | Answer Rate | Rating |
|------|-------|--------------------|------------------|-------------|--------|
| #1 | AntGroup Finix S1 32B | 1.8% | 98.2% | 99.5% | Excellent |
| #2 | Gemini 2.5 Flash Lite | 3.3% | 96.7% | 99.5% | Excellent |
| #3 | Microsoft Phi-4 | 3.7% | 96.3% | 80.7% | Excellent |
| #4 | Llama 3.3 70B Instruct | 4.1% | 95.9% | 99.5% | Excellent |
| #5 | Snowflake Arctic Instruct | 4.3% | 95.7% | 62.7% | Excellent |
| #6 | Gemma 3 12B | 4.4% | 95.6% | 97.4% | Excellent |
| #7 | Mistral Large 2411 | 4.5% | 95.5% | 99.9% | Excellent |
| #8 | Qwen 3 8B | 4.8% | 95.2% | 99.9% | Excellent |
| #9 | Amazon Nova Pro | 5.1% | 94.9% | 99.3% | Good |
| #10 | Amazon Nova 2 Lite | 5.1% | 94.9% | 99.6% | Good |
| #11 | Mistral Small 2501 | 5.1% | 94.9% | 97.9% | Good |
| #12 | IBM Granite 4.0 H Small | 5.2% | 94.8% | 100% | Good |
| #13 | AI21 Jamba Mini 2 | 5.3% | 94.7% | 99.6% | Good |
| #14 | DeepSeek V3.2 Exp | 5.3% | 94.7% | 96.6% | Good |
| #15 | Qwen 3 14B | 5.4% | 94.6% | 99.7% | Good |
| #16 | Amazon Nova Micro | 5.5% | 94.5% | 99.1% | Good |
| #17 | DeepSeek V3.1 | 5.5% | 94.5% | 97.8% | Good |
| #18 | GPT-4.1 | 5.6% | 94.4% | 99.9% | Good |
| #19 | Qwen 3 4B | 5.7% | 94.3% | 98.5% | Good |
| #20 | Grok 3 | 5.8% | 94.2% | 93% | Good |
| #21 | Qwen 3 32B | 5.9% | 94.1% | 99.8% | Good |
| #22 | Amazon Nova Lite | 6.1% | 93.9% | 99% | Good |
| #23 | DeepSeek V3 | 6.1% | 93.9% | 97.5% | Good |
| #24 | DeepSeek V3.2 | 6.3% | 93.7% | 97.2% | Good |
| #25 | Gemma 3 4B | 6.4% | 93.6% | 96.8% | Good |
| #26 | Gemini 2.5 Pro | 7% | 93% | 99.1% | Fair |
| #27 | Gemini 2.5 Flash | 7.8% | 92.2% | 99% | Fair |
| #28 | Llama 4 Maverick | 8.2% | 91.8% | 100% | Fair |
| #29 | GPT-5.2 Low | 8.4% | 91.6% | 100% | Fair |
| #30 | Claude Haiku 4.5 | 9.8% | 90.2% | 99.5% | Fair |
| #31 | Claude Sonnet 4 | 10.3% | 89.7% | 98.6% | Mediocre |
| #32 | GPT-5 Nano | 10.5% | 89.5% | 100% | Mediocre |
| #33 | Claude Sonnet 4.6 | 10.6% | 89.4% | 99.1% | Mediocre |
| #34 | GPT-5.2 High | 10.8% | 89.2% | 100% | Mediocre |
| #35 | Claude Opus 4.5 | 10.9% | 89.1% | 98.7% | Mediocre |
| #36 | DeepSeek R1 | 11.3% | 88.7% | 97% | Mediocre |
| #37 | Claude Opus 4.1 | 11.8% | 88.2% | 91.5% | Mediocre |
| #38 | Claude Opus 4.6 | 12.2% | 87.8% | 99.8% | Mediocre |
| #39 | GPT-5 Mini | 12.9% | 87.1% | 99.9% | Mediocre |
| #40 | Gemini 3 Flash Preview | 13.5% | 86.5% | 99.8% | Poor |
| #41 | Gemini 3 Pro Preview | 13.6% | 86.4% | 99.4% | Poor |
| #42 | GPT-5 High | 15.1% | 84.9% | 99.9% | Poor |
| #43 | Grok 4.1 Fast | 17.8% | 82.2% | 98.5% | Poor |
| #44 | o4-mini Low | 18.6% | 81.4% | 98.7% | Poor |
| #45 | o4-mini High | 18.6% | 81.4% | 99.2% | Poor |
| #46 | Grok 4 Fast | 20.2% | 79.8% | 99.5% | Bad |
| #47 | Mistral Medium | 22.7% | 77.3% | 99.7% | Bad |
| #48 | o3-pro | 23.3% | 76.7% | 100% | Bad |
| #49 | Phi-4 Mini Instruct | 23.5% | 76.5% | 92.5% | Bad |
| #50 | Ministral 3-3B | 24.2% | 75.8% | 74.3% | Bad |
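When comparing rows, hallucination rate alone can mislead: a refusal-heavy model (e.g. Snowflake Arctic Instruct at a 62.7% answer rate) looks cleaner than it is. A short sketch of one way to read the table, using a handful of rows from above and an answer-rate cutoff of 95%, which is an arbitrary threshold chosen here for illustration:

```python
# A few (model, hallucination %, answer %) rows copied from the table above.
rows = [
    ("AntGroup Finix S1 32B", 1.8, 99.5),
    ("Microsoft Phi-4", 3.7, 80.7),
    ("Llama 3.3 70B Instruct", 4.1, 99.5),
    ("Snowflake Arctic Instruct", 4.3, 62.7),
]

# Keep only models that answer at least 95% of queries, then rank the
# survivors by hallucination rate (lowest first).
reliable = sorted((r for r in rows if r[2] >= 95.0), key=lambda r: r[1])
for model, halluc, answer in reliable:
    print(f"{model}: {halluc}% hallucination, {answer}% answered")
```

With this cutoff, Phi-4 and Arctic drop out despite their low hallucination rates, leaving Finix and Llama 3.3 at the top.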

Key Insights

Best Performers

AntGroup Finix S1 32B leads with just a 1.8% hallucination rate. Smaller, specialized models often outperform larger general-purpose ones.

Reasoning Models Struggle

Models optimized for reasoning (o3-pro, o4-mini) show higher hallucination rates. Longer chains of thought may introduce more opportunities for error.

Size Isn't Everything

Larger models don't automatically hallucinate less. Training data quality and fine-tuning methodology matter more than parameter count.