Report: Factual Consistency Rate – Model Comparison
Factual Consistency Rate is a metric that measures how factually accurate a language model's responses are with respect to verifiable sources. This is particularly important in systems that use retrieval-augmented generation (RAG), where the model is expected to summarize, rephrase, or answer questions grounded in known content. A higher score indicates that the model stays close to the source content, while a lower score may suggest hallucinations: instances where the model fabricates information.
The chart presents the Factual Consistency Rate for a range of language models:
| Model | Factual Consistency Rate |
| --- | --- |
| EGPT A1 | 0.908 |
| EGPT 2 | 0.869 |
| EGPT 32B (preview) | 0.867 |
| Llama 3.3 70B Instruct | 0.840 |
| Cogito 32B Instruct | 0.833 |
| EGPT 0.6.4 | 0.830 |
| Mistral Small 24B Instruct | 0.829 |
The scores were generated using a hallucination evaluation model. All language models were quantized and evaluated on the same dataset, consisting of over 1,000 examples. Each language model generates a summary based on a given source from the dataset. The evaluation model then assigns higher scores to source–summary pairs that are factually consistent. The values shown in the plot represent the average score across the dataset.
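The averaging step described above can be sketched in a few lines. This is only an illustration of the arithmetic; the per-pair scores below are made up, and the real evaluation model assigns its own scores to each source–summary pair.

```python
# Sketch of how an average Factual Consistency Rate is computed.
# Each element is the evaluation model's consistency score for one
# source-summary pair; these values are illustrative only.
pair_scores = [0.95, 0.88, 1.00, 0.76, 0.93]

def factual_consistency_rate(scores):
    """Average per-pair consistency score across the dataset."""
    return sum(scores) / len(scores)

print(round(factual_consistency_rate(pair_scores), 3))  # 0.904
```

In the actual benchmark this average is taken over more than 1,000 source–summary pairs per model, which is what the values in the table represent.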
EGPT A1 achieves the highest score with an average Factual Consistency Rate of 0.908. This indicates a very low rate of factual errors and strong reliability.
The EGPT series overall (versions 0.6.4, 2, 32B, and A1) consistently scores high, showing clear improvements across versions. A1 stands out as the most refined iteration.
Other models such as Llama 3.3 70B Instruct and Cogito 32B Instruct trail slightly behind the EGPT models but still maintain a good level of factual consistency (around 0.83–0.84).
Mistral Small 24B Instruct has the lowest score in this comparison (0.829), though the gap to its nearest neighbors is marginal: just 0.001 below EGPT 0.6.4 and 0.004 below Cogito 32B Instruct.
Factual Consistency Rate is a critical quality metric for language models deployed in information-sensitive applications such as customer support, healthcare advice, and legal text generation. In this comparison, EGPT A1 stands out as the most reliable model for generating fact-based responses, followed closely by other EGPT variants, then Llama and Cogito.
For use cases where factual correctness is essential, models with a Factual Consistency Rate above 0.86 should be strongly considered. It is also recommended to combine these models with monitoring tools, evaluation pipelines, and human review when needed, depending on the domain.
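Applying the 0.86 threshold suggested above can be expressed as a simple filter. The helper below is a hypothetical illustration (not part of any evaluation tooling), using the scores from the table in this report:

```python
# Factual Consistency Rates from the comparison table above.
scores = {
    "EGPT A1": 0.908,
    "EGPT 2": 0.869,
    "EGPT 32B (preview)": 0.867,
    "Llama 3.3 70B Instruct": 0.840,
    "Cogito 32B Instruct": 0.833,
    "EGPT 0.6.4": 0.830,
    "Mistral Small 24B Instruct": 0.829,
}

def shortlist(scores, threshold=0.86):
    """Return models at or above the threshold, best first.

    Hypothetical helper for illustration; the 0.86 cutoff matches
    the recommendation in this report.
    """
    return sorted(
        (model for model, score in scores.items() if score >= threshold),
        key=scores.get,
        reverse=True,
    )

print(shortlist(scores))
# ['EGPT A1', 'EGPT 2', 'EGPT 32B (preview)']
```

Under this cutoff, only the three top EGPT variants qualify; a lower threshold would admit the Llama and Cogito models as well.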