Report: Factual Consistency Rate – Model Comparison
Factual Consistency Rate is a metric that measures how faithfully a language model's responses reflect the source material they are based on. This is particularly important in systems that use retrieval-augmented generation (RAG), where the model is expected to summarize, rephrase, or answer questions grounded in known content. A higher score indicates that the model tends to stay closer to the source content, while a lower score suggests more hallucinations: instances where the model fabricates information.
Interpretation of the Chart
The chart presents the Factual Consistency Rate for a range of language models:
| Model | Factual Consistency Rate |
| --- | --- |
| EGPT A1 | 0.908 |
| EGPT 2 | 0.869 |
| EGPT 32B (preview) | 0.867 |
| Llama 3.3 70B Instruct | 0.840 |
| Cogito 32B Instruct | 0.833 |
| EGPT 0.6.4 | 0.830 |
| Mistral Small 24B Instruct | 0.829 |
Method
The scores were generated using a hallucination evaluation model. All language models were quantized and evaluated on the same dataset of over 1,000 examples. For each example, the language model generates a summary of the given source. The evaluation model then scores each source–summary pair, assigning higher scores to pairs that are factually consistent. The values shown in the chart are the average score across the dataset.
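The averaging step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for this report: `score_pair` stands in for the hallucination evaluation model (which the report does not name), and `toy_scorer` is a made-up placeholder included only so the example runs.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple


def factual_consistency_rate(
    pairs: Iterable[Tuple[str, str]],
    score_pair: Callable[[str, str], float],
) -> float:
    """Average the per-example scores over (source, summary) pairs."""
    scores = [score_pair(source, summary) for source, summary in pairs]
    if not scores:
        raise ValueError("need at least one (source, summary) pair")
    return mean(scores)


def toy_scorer(source: str, summary: str) -> float:
    """Hypothetical stand-in for the evaluation model:
    the fraction of summary words that also appear in the source."""
    source_words = set(source.lower().split())
    summary_words = summary.lower().split()
    if not summary_words:
        return 0.0
    return sum(w in source_words for w in summary_words) / len(summary_words)


# Two illustrative (source, summary) pairs; not taken from the report's dataset.
examples = [
    ("the cat sat on the mat", "the cat sat"),  # every word is supported
    ("the cat sat on the mat", "the dog sat"),  # "dog" is not in the source
]
rate = factual_consistency_rate(examples, toy_scorer)
print(f"Factual Consistency Rate: {rate:.3f}")
```

A real evaluation would replace `toy_scorer` with a call to the evaluation model and pass in the full dataset of source–summary pairs; the averaging itself stays the same.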
Analysis
- EGPT A1 achieves the highest score with an average Factual Consistency Rate of 0.908, about 0.04 ahead of the next model (EGPT 2 at 0.869). This suggests that its summaries stay closest to the source material among the models tested.
- The EGPT series (versions 0.6.4, 2, 32B preview, and A1) spans 0.830 to 0.908. EGPT 2 (0.869) and EGPT 32B (preview) (0.867) are nearly tied, EGPT 0.6.4 scores noticeably lower at 0.830, and A1 stands out as the strongest of the series.
- Llama 3.3 70B Instruct (0.840) and Cogito 32B Instruct (0.833) trail EGPT A1, EGPT 2, and EGPT 32B (preview), but score slightly above EGPT 0.6.4 and still maintain a good level of factual consistency.
- Mistral Small 24B Instruct has the lowest score in this comparison (0.829), but it is only 0.001 below EGPT 0.6.4 and 0.011 below Llama 3.3 70B Instruct, so the four lowest-scoring models are tightly clustered.
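The gaps discussed in this analysis can be checked with a short script. The sketch below simply re-enters the values from the chart and computes each model's distance from the top score; the variable names are illustrative.

```python
# Scores copied from the chart in this report.
scores = {
    "EGPT A1": 0.908,
    "EGPT 2": 0.869,
    "EGPT 32B (preview)": 0.867,
    "Llama 3.3 70B Instruct": 0.840,
    "Cogito 32B Instruct": 0.833,
    "EGPT 0.6.4": 0.830,
    "Mistral Small 24B Instruct": 0.829,
}

# Rank models from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_name, best_score = ranked[0]

# Gap between each model and the best one, rounded to the chart's precision.
gaps = {name: round(best_score - score, 3) for name, score in ranked}

for name, score in ranked:
    print(f"{name:<28} {score:.3f}  gap to best: {gaps[name]:.3f}")
```

The spread between the highest and lowest model is 0.079. The report does not give per-example variance, so these gaps say nothing about whether the smaller differences are statistically meaningful.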
Conclusion
Factual Consistency Rate is a critical quality metric for language models deployed in information-sensitive applications such as customer support, healthcare advice, and legal text generation. In this comparison, EGPT A1 stands out as the most reliable model for generating fact-based responses, followed by EGPT 2 and EGPT 32B (preview), then Llama 3.3 70B Instruct and Cogito 32B Instruct, with EGPT 0.6.4 and Mistral Small 24B Instruct close behind.
For use cases where factual correctness is essential, models with a Factual Consistency Rate above 0.86 should be strongly considered; in this comparison, those are EGPT A1, EGPT 2, and EGPT 32B (preview). It is also recommended to combine these models with monitoring tools, evaluation pipelines, and human review where the domain calls for it.
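Applying the 0.86 threshold is a simple filter. The sketch below uses a hypothetical helper, `shortlist`, with the values from the chart; the threshold is a parameter so it can be tightened or relaxed per domain.

```python
from typing import Dict, List


def shortlist(scores: Dict[str, float], threshold: float = 0.86) -> List[str]:
    """Return the models whose score is strictly above the threshold,
    ordered from highest to lowest score."""
    selected = [(name, s) for name, s in scores.items() if s > threshold]
    selected.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in selected]


# Scores copied from the chart in this report.
chart_scores = {
    "EGPT A1": 0.908,
    "EGPT 2": 0.869,
    "EGPT 32B (preview)": 0.867,
    "Llama 3.3 70B Instruct": 0.840,
    "Cogito 32B Instruct": 0.833,
    "EGPT 0.6.4": 0.830,
    "Mistral Small 24B Instruct": 0.829,
}

recommended = shortlist(chart_scores)
print(recommended)
```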

