Hallucination Index
Ranking AI models by their tendency to fabricate information. Lower percentage = more reliable.
Methodology
Models are tested by asking them to summarize documents and then checking how often the summary contains information not present in the original. The Hallucination Rate is the percentage of summaries containing fabricated content, and Factual Accuracy is its complement (100% minus the Hallucination Rate). The Answer Rate shows how often the model actually provided a response, since some models refuse certain queries.
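The scoring described above can be sketched in a few lines. This is a minimal illustration, not the leaderboard's actual pipeline; the `answered` and `hallucinated` field names are assumptions, and the hallucination check itself (comparing a summary against its source) is treated as already done.

```python
# Hypothetical per-query records; field names are illustrative only:
#   {"answered": bool, "hallucinated": bool}
# "hallucinated" is assumed to be True when the summary contains
# content not present in the source document.

def score_model(results):
    """Return (hallucination_rate, factual_accuracy, answer_rate) in percent."""
    answered = [r for r in results if r["answered"]]
    answer_rate = 100.0 * len(answered) / len(results)
    # Hallucination rate is measured only over summaries actually produced,
    # so a model that refuses often is judged on fewer summaries.
    hallucinated = sum(r["hallucinated"] for r in answered)
    hallucination_rate = 100.0 * hallucinated / len(answered)
    factual_accuracy = 100.0 - hallucination_rate
    return hallucination_rate, factual_accuracy, answer_rate
```

Note that because the rate is computed only over answered queries, a model with a low Answer Rate is graded on a smaller sample of summaries.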
| Rank | Model | Hallucination Rate | Factual Accuracy | Answer Rate | Rating |
|---|---|---|---|---|---|
| #1 | AntGroup Finix S1 32B | 1.8% | 98.2% | 99.5% | Excellent |
| #2 | Gemini 2.5 Flash Lite | 3.3% | 96.7% | 99.5% | Excellent |
| #3 | Microsoft Phi-4 | 3.7% | 96.3% | 80.7% | Excellent |
| #4 | Llama 3.3 70B Instruct | 4.1% | 95.9% | 99.5% | Excellent |
| #5 | Snowflake Arctic Instruct | 4.3% | 95.7% | 62.7% | Excellent |
| #6 | Gemma 3 12B | 4.4% | 95.6% | 97.4% | Excellent |
| #7 | Mistral Large 2411 | 4.5% | 95.5% | 99.9% | Excellent |
| #8 | Qwen 3 8B | 4.8% | 95.2% | 99.9% | Excellent |
| #9 | Amazon Nova Pro | 5.1% | 94.9% | 99.3% | Good |
| #10 | Amazon Nova 2 Lite | 5.1% | 94.9% | 99.6% | Good |
| #11 | Mistral Small 2501 | 5.1% | 94.9% | 97.9% | Good |
| #12 | IBM Granite 4.0 H Small | 5.2% | 94.8% | 100% | Good |
| #13 | AI21 Jamba Mini 2 | 5.3% | 94.7% | 99.6% | Good |
| #14 | DeepSeek V3.2 Exp | 5.3% | 94.7% | 96.6% | Good |
| #15 | Qwen 3 14B | 5.4% | 94.6% | 99.7% | Good |
| #16 | Amazon Nova Micro | 5.5% | 94.5% | 99.1% | Good |
| #17 | DeepSeek V3.1 | 5.5% | 94.5% | 97.8% | Good |
| #18 | GPT-4.1 | 5.6% | 94.4% | 99.9% | Good |
| #19 | Qwen 3 4B | 5.7% | 94.3% | 98.5% | Good |
| #20 | Grok 3 | 5.8% | 94.2% | 93.0% | Good |
| #21 | Qwen 3 32B | 5.9% | 94.1% | 99.8% | Good |
| #22 | Amazon Nova Lite | 6.1% | 93.9% | 99.0% | Good |
| #23 | DeepSeek V3 | 6.1% | 93.9% | 97.5% | Good |
| #24 | DeepSeek V3.2 | 6.3% | 93.7% | 97.2% | Good |
| #25 | Gemma 3 4B | 6.4% | 93.6% | 96.8% | Good |
| #26 | Gemini 2.5 Pro | 7.0% | 93.0% | 99.1% | Fair |
| #27 | Gemini 2.5 Flash | 7.8% | 92.2% | 99.0% | Fair |
| #28 | Llama 4 Maverick | 8.2% | 91.8% | 100% | Fair |
| #29 | GPT-5.2 Low | 8.4% | 91.6% | 100% | Fair |
| #30 | Claude Haiku 4.5 | 9.8% | 90.2% | 99.5% | Fair |
| #31 | Claude Sonnet 4 | 10.3% | 89.7% | 98.6% | Mediocre |
| #32 | GPT-5 Nano | 10.5% | 89.5% | 100% | Mediocre |
| #33 | Claude Sonnet 4.6 | 10.6% | 89.4% | 99.1% | Mediocre |
| #34 | GPT-5.2 High | 10.8% | 89.2% | 100% | Mediocre |
| #35 | Claude Opus 4.5 | 10.9% | 89.1% | 98.7% | Mediocre |
| #36 | DeepSeek R1 | 11.3% | 88.7% | 97.0% | Mediocre |
| #37 | Claude Opus 4.1 | 11.8% | 88.2% | 91.5% | Mediocre |
| #38 | Claude Opus 4.6 | 12.2% | 87.8% | 99.8% | Mediocre |
| #39 | GPT-5 Mini | 12.9% | 87.1% | 99.9% | Mediocre |
| #40 | Gemini 3 Flash Preview | 13.5% | 86.5% | 99.8% | Poor |
| #41 | Gemini 3 Pro Preview | 13.6% | 86.4% | 99.4% | Poor |
| #42 | GPT-5 High | 15.1% | 84.9% | 99.9% | Poor |
| #43 | Grok 4.1 Fast | 17.8% | 82.2% | 98.5% | Poor |
| #44 | o4-mini Low | 18.6% | 81.4% | 98.7% | Poor |
| #45 | o4-mini High | 18.6% | 81.4% | 99.2% | Poor |
| #46 | Grok 4 Fast | 20.2% | 79.8% | 99.5% | Bad |
| #47 | Mistral Medium | 22.7% | 77.3% | 99.7% | Bad |
| #48 | o3-pro | 23.3% | 76.7% | 100% | Bad |
| #49 | Phi-4 Mini Instruct | 23.5% | 76.5% | 92.5% | Bad |
| #50 | Ministral 3-3B | 24.2% | 75.8% | 74.3% | Bad |
Key Insights
Best Performers
AntGroup Finix leads with a hallucination rate of just 1.8%. Smaller, specialized models often outperform larger general-purpose ones.
Reasoning Models Struggle
Models optimized for reasoning (o3-pro, o4-mini) show higher hallucination rates; longer chains of thought may introduce more opportunities for error to creep into a summary.
Size Isn't Everything
Larger models don't automatically mean fewer hallucinations. Training data quality and fine-tuning methodology matter more.