Report: Factual Consistency Rate – Model Comparison


Last updated 8 days ago


Factual Consistency Rate is a metric that measures how factually accurate a language model's responses are with respect to verifiable sources. This is particularly important in systems that use retrieval-augmented generation (RAG), where the model is expected to summarize, rephrase, or answer questions grounded in known content. A higher score indicates that the model stays close to the factual content of the source, while a lower score may suggest hallucinations – instances where the model fabricates information.

Interpretation of the Results

The table below presents the Factual Consistency Rate for a range of language models:

| Model | Factual Consistency Rate |
| --- | --- |
| EGPT A1 | 0.908 |
| EGPT 2 | 0.869 |
| EGPT 32B (preview) | 0.867 |
| Llama 3.3 70B Instruct | 0.840 |
| Cogito 32B Instruct | 0.833 |
| EGPT 0.6.4 | 0.830 |
| Mistral Small 24B Instruct | 0.829 |

Method

The scores were generated using a hallucination evaluation model. All language models were quantized and evaluated on the same dataset of over 1,000 examples. Each language model generates a summary of a given source text from the dataset, and the evaluation model then assigns higher scores to source–summary pairs that are factually consistent. The values shown in the table represent the average score across the dataset.
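The evaluation loop described above can be sketched as follows. Note that `generate_summary` and `consistency_score` are placeholder callables standing in for the language model under test and the hallucination evaluation model; they are assumptions for illustration, not Ebbot's actual implementation.

```python
def factual_consistency_rate(dataset, generate_summary, consistency_score):
    """Average per-pair consistency score over a dataset of source texts.

    dataset           -- iterable of source texts
    generate_summary  -- callable: source -> model-generated summary
    consistency_score -- callable: (source, summary) -> float in [0, 1],
                         where higher means more factually consistent
    """
    scores = []
    for source in dataset:
        summary = generate_summary(source)          # model under test
        scores.append(consistency_score(source, summary))  # evaluation model
    return sum(scores) / len(scores)                # dataset-level average
```

In practice the two callables would wrap calls to the candidate LLM and the evaluation model; the reported number for each model is the returned average.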

Analysis

  1. EGPT A1 achieves the highest score with an average Factual Consistency Rate of 0.908. This indicates a very low rate of factual errors and strong reliability.

  2. The EGPT series overall (versions 0.6.4, 2, 32B, and A1) consistently scores high, showing clear improvements across versions. A1 stands out as the most refined iteration.

  3. Other models such as Llama 3.3 70B Instruct and Cogito 32B Instruct trail slightly behind the EGPT models but still maintain a good level of factual consistency (around 0.83–0.84).

  4. Mistral Small 24B Instruct has the lowest score in this comparison (0.829), though the difference is relatively minor compared to the rest.

Conclusion

Factual Consistency Rate is a critical quality metric for language models deployed in information-sensitive applications such as customer support, healthcare advice, and legal text generation. In this comparison, EGPT A1 stands out as the most reliable model for generating fact-based responses, followed closely by other EGPT variants, then Llama and Cogito.

For use cases where factual correctness is essential, models with a Factual Consistency Rate above 0.86 should be strongly considered. It is also recommended to combine these models with monitoring tools, evaluation pipelines, and human review when needed, depending on the domain.
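As a worked example of that threshold, the comparison above can be filtered programmatically; the dictionary below simply restates the reported scores from the table:

```python
# Reported Factual Consistency Rates from the comparison above.
fcr = {
    "EGPT A1": 0.908,
    "EGPT 2": 0.869,
    "EGPT 32B (preview)": 0.867,
    "Llama 3.3 70B Instruct": 0.840,
    "Cogito 32B Instruct": 0.833,
    "EGPT 0.6.4": 0.830,
    "Mistral Small 24B Instruct": 0.829,
}

THRESHOLD = 0.86  # cut-off suggested for fact-critical use cases

# Keep only models above the threshold, highest score first.
candidates = sorted(
    (model for model, score in fcr.items() if score > THRESHOLD),
    key=fcr.get,
    reverse=True,
)
print(candidates)  # ['EGPT A1', 'EGPT 2', 'EGPT 32B (preview)']
```

With the reported scores, only the three top EGPT variants clear the 0.86 bar, matching the recommendation above.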