Hallucination Index
Ranking AI models by their tendency to fabricate information. Lower percentage = more reliable.
Methodology
Models are tested by asking them to summarize documents and then checking how often the summary contains information not present in the original. The Hallucination Rate is the percentage of summaries containing fabricated content, and Factual Accuracy is its complement (100% minus the Hallucination Rate). The Answer Rate shows how often the model actually provided a response, since some models refuse certain queries.
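The scoring described above can be sketched in a few lines. This is a minimal illustration, not the leaderboard's actual pipeline; the `answered` and `hallucinated` field names are assumptions, and the hallucination check itself (comparing a summary against its source) is treated as already done.

```python
# Hypothetical per-query records; field names are illustrative only:
#   {"answered": bool, "hallucinated": bool}
# "hallucinated" is assumed to be True when the summary contains
# content not present in the source document.

def score_model(results):
    """Return (hallucination_rate, factual_accuracy, answer_rate) in percent."""
    answered = [r for r in results if r["answered"]]
    answer_rate = 100.0 * len(answered) / len(results)
    # Hallucination rate is measured only over summaries actually produced,
    # so a model that refuses often is judged on fewer summaries.
    hallucinated = sum(r["hallucinated"] for r in answered)
    hallucination_rate = 100.0 * hallucinated / len(answered)
    factual_accuracy = 100.0 - hallucination_rate
    return hallucination_rate, factual_accuracy, answer_rate
```

Note that because the rate is computed only over answered queries, a model with a low Answer Rate is graded on a smaller sample of summaries.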
| Rank | Model | Hallucination Rate | Factual Accuracy | Answer Rate | Rating |
|---|---|---|---|---|---|
| #1 | AntGroup Finix S1 32B | 1.8% | 98.2% | 99.5% | Excellent |
| #2 | Gemini 2.5 Flash Lite | 3.3% | 96.7% | 99.5% | Excellent |
| #3 | Microsoft Phi-4 | 3.7% | 96.3% | 80.7% | Excellent |
| #4 | Llama 3.3 70B Instruct | 4.1% | 95.9% | 99.5% | Excellent |
| #5 | Snowflake Arctic Instruct | 4.3% | 95.7% | 62.7% | Excellent |
| #6 | Gemma 3 12B | 4.4% | 95.6% | 97.4% | Excellent |
| #7 | Mistral Large 2411 | 4.5% | 95.5% | 99.9% | Excellent |
| #8 | Qwen 3 8B | 4.8% | 95.2% | 99.9% | Excellent |
| #9 | Amazon Nova Pro | 5.1% | 94.9% | 99.3% | Good |
| #10 | Amazon Nova 2 Lite | 5.1% | 94.9% | 99.6% | Good |
| #11 | Mistral Small 2501 | 5.1% | 94.9% | 97.9% | Good |
| #12 | IBM Granite 4.0 H Small | 5.2% | 94.8% | 100% | Good |
| #13 | AI21 Jamba Mini 2 | 5.3% | 94.7% | 99.6% | Good |
| #14 | DeepSeek V3.2 Exp | 5.3% | 94.7% | 96.6% | Good |
| #15 | Qwen 3 14B | 5.4% | 94.6% | 99.7% | Good |
| #16 | Amazon Nova Micro | 5.5% | 94.5% | 99.1% | Good |
| #17 | DeepSeek V3.1 | 5.5% | 94.5% | 97.8% | Good |
| #18 | GPT-4.1 | 5.6% | 94.4% | 99.9% | Good |
| #19 | Qwen 3 4B | 5.7% | 94.3% | 98.5% | Good |
| #20 | Grok 3 | 5.8% | 94.2% | 93.0% | Good |
| #21 | Qwen 3 32B | 5.9% | 94.1% | 99.8% | Good |
| #22 | Amazon Nova Lite | 6.1% | 93.9% | 99.0% | Good |
| #23 | DeepSeek V3 | 6.1% | 93.9% | 97.5% | Good |
| #24 | DeepSeek V3.2 | 6.3% | 93.7% | 97.2% | Good |
| #25 | Gemma 3 4B | 6.4% | 93.6% | 96.8% | Good |
| #26 | Gemini 2.5 Pro | 7.0% | 93.0% | 99.1% | Fair |
| #27 | Gemini 2.5 Flash | 7.8% | 92.2% | 99.0% | Fair |
| #28 | Llama 4 Maverick | 8.2% | 91.8% | 100% | Fair |
| #29 | GPT-5.2 Low | 8.4% | 91.6% | 100% | Fair |
| #30 | Claude Haiku 4.5 | 9.8% | 90.2% | 99.5% | Fair |
| #31 | Claude Sonnet 4 | 10.3% | 89.7% | 98.6% | Mediocre |
| #32 | GPT-5 Nano | 10.5% | 89.5% | 100% | Mediocre |
| #33 | Claude Sonnet 4.6 | 10.6% | 89.4% | 99.1% | Mediocre |
| #34 | GPT-5.2 High | 10.8% | 89.2% | 100% | Mediocre |
| #35 | Claude Opus 4.5 | 10.9% | 89.1% | 98.7% | Mediocre |
| #36 | DeepSeek R1 | 11.3% | 88.7% | 97.0% | Mediocre |
| #37 | Claude Opus 4.1 | 11.8% | 88.2% | 91.5% | Mediocre |
| #38 | Claude Opus 4.6 | 12.2% | 87.8% | 99.8% | Mediocre |
| #39 | GPT-5 Mini | 12.9% | 87.1% | 99.9% | Mediocre |
| #40 | Gemini 3 Flash Preview | 13.5% | 86.5% | 99.8% | Poor |
| #41 | Gemini 3 Pro Preview | 13.6% | 86.4% | 99.4% | Poor |
| #42 | GPT-5 High | 15.1% | 84.9% | 99.9% | Poor |
| #43 | Grok 4.1 Fast | 17.8% | 82.2% | 98.5% | Poor |
| #44 | o4-mini Low | 18.6% | 81.4% | 98.7% | Poor |
| #45 | o4-mini High | 18.6% | 81.4% | 99.2% | Poor |
| #46 | Grok 4 Fast | 20.2% | 79.8% | 99.5% | Bad |
| #47 | Mistral Medium | 22.7% | 77.3% | 99.7% | Bad |
| #48 | o3-pro | 23.3% | 76.7% | 100% | Bad |
| #49 | Phi-4 Mini Instruct | 23.5% | 76.5% | 92.5% | Bad |
| #50 | Ministral 3-3B | 24.2% | 75.8% | 74.3% | Bad |
Key Insights
Best Performers
AntGroup Finix leads with a hallucination rate of just 1.8%. Smaller, specialized models often outperform larger general-purpose ones.
Reasoning Models Struggle
Models optimized for reasoning (o3-pro, o4-mini) show higher hallucination rates; longer chains of thought may introduce more opportunities for error to creep into a summary.
Size Isn't Everything
Larger models don't automatically mean fewer hallucinations. Training data quality and fine-tuning methodology matter more.