Report: Language Translation Ability – Model Comparison

Evaluating the linguistic abilities of a large language model (LLM) is a complex task that can be approached in various ways. This report focuses on one such method: having the model translate text and then comparing the result to a reference translation, or the "gold standard."

Like many evaluation methods, this approach has its limitations. For instance, a single sentence can be translated in multiple valid ways, making it difficult, if not impossible, to determine a single correct version. Additionally, each metric depends on its own settings and parameters, which can influence the results. While these challenges can't be fully avoided, it's important to acknowledge and understand them.

Method

The method is based on a dataset of approximately 300 short English texts, which were translated into various languages using different quantized models. Each text was also translated by ChatGPT-4.1, whose output served as the reference or gold standard. All other translations were then evaluated against this gold standard using two different metrics: SacreBLEU and METEOR.

SacreBLEU, derived from the BLEU (Bilingual Evaluation Understudy) metric, measures the overlap of n-grams between the generated translation and the reference. It assigns a similarity score between 0 and 1, with higher scores indicating closer matches. This metric tends to favor exact word matches.
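As an illustration (the report does not specify the exact tooling used), a sentence-level score can be computed with the open-source sacrebleu Python package. Note that the package reports scores on a 0–100 scale, so the value is divided by 100 below to match the 0–1 range described above; the example sentences are placeholders.

```python
# Hedged sketch: scoring one model translation against a gold-standard
# reference with the sacrebleu package (pip install sacrebleu).
import sacrebleu

reference = "Katten sover på mattan."                  # placeholder gold-standard translation
hypothesis = "Katten ligger och sover på mattan."      # placeholder model output

# sentence_bleu takes the hypothesis string and a list of reference strings.
result = sacrebleu.sentence_bleu(hypothesis, [reference])

# sacrebleu reports BLEU on a 0-100 scale; divide by 100 for the 0-1 range used here.
print(result.score / 100)
```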

METEOR (Metric for Evaluation of Translation with Explicit Ordering) is a more comprehensive evaluation metric. It takes into account factors such as n-gram overlap, precision, recall, word order, synonyms, and stemming. METEOR also produces a score between 0 and 1, where higher values indicate greater similarity. Because of its flexibility, METEOR can assign high scores to translations that differ slightly in wording but are semantically equivalent, making it suitable for evaluating broader aspects of text generation quality beyond simple translation.
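A minimal sketch of a METEOR computation, assuming NLTK's implementation (again, not necessarily the tooling used for this report). Recent NLTK versions expect pre-tokenized input, and synonym matching relies on WordNet data.

```python
# Hedged sketch: METEOR for one hypothesis/reference pair using NLTK
# (pip install nltk). Inputs must be tokenized; plain whitespace splitting
# is used here for brevity.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # required for synonym matching

reference = "Katten sover på mattan.".split()              # placeholder gold standard
hypothesis = "Katten ligger och sover på mattan.".split()  # placeholder model output

# First argument is a list of tokenized references, second is the tokenized hypothesis.
score = meteor_score([reference], hypothesis)  # already in the 0-1 range
print(score)
```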

The metrics were calculated for all translated texts. The mean over all samples is shown in the graphs below.
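Schematically, the corpus-level numbers in the graphs correspond to averaging the per-sample scores. A minimal sketch, assuming the sentence-level scoring shown in the examples above and a hypothetical list of (model translation, gold translation) pairs:

```python
# Hedged sketch: averaging per-sample metric scores over the evaluation set.
# `pairs` is a hypothetical list of (model_translation, gold_translation) strings.
from statistics import mean

def mean_score(pairs, score_fn):
    """Apply a sentence-level metric to every pair and return the mean."""
    return mean(score_fn(hyp, ref) for hyp, ref in pairs)

# Example usage with the SacreBLEU scoring from above (normalized to 0-1):
# mean_score(pairs, lambda hyp, ref: sacrebleu.sentence_bleu(hyp, [ref]).score / 100)
```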

Conclusion

The graphs indicate that our fine-tuning of the base models has led to a slight improvement in linguistic capabilities across both metrics. However, these improvements are generally small, and it is challenging to quantify the exact impact of the fine-tuning. Nonetheless, it can be concluded that the fine-tuning has not negatively affected the overall linguistic capabilities. Furthermore, the graphs provide insight into the languages in which our models perform most effectively.