Report: Language Translation Ability – Model Comparison
Evaluating the linguistic abilities of a large language model (LLM) is a complex task that can be approached in various ways. This report focuses on one such method: having the model translate text and then comparing the result to a reference translation, or the "gold standard."
Like many evaluation methods, this approach has its limitations. For instance, a single sentence can be translated in multiple valid ways, making it difficult, if not impossible, to determine a single correct version. Additionally, each metric depends on its own settings and parameters, which can affect the results. While these challenges can't be fully avoided, it's important to acknowledge and understand them.
The method is based on a dataset of approximately 300 short English texts, which were translated into various languages using different quantized models. Each text was also translated by ChatGPT-4.1, whose output served as the reference or gold standard. All other translations were then evaluated against this gold standard using two different metrics: SacreBLEU and METEOR.
SacreBLEU, derived from the BLEU (BiLingual Evaluation Understudy) metric, measures the overlap of n-grams between the generated translation and the reference. It assigns a similarity score between 0 and 1, with higher scores indicating closer matches. This metric tends to favor exact word matches.
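As a concrete illustration, the snippet below is a minimal sketch of how such a score can be computed with the `sacrebleu` Python package. The example sentences are placeholders, and note that the library reports BLEU on a 0–100 scale, so the value is divided by 100 to match the 0–1 range used in this report.

```python
# Minimal sketch of a SacreBLEU computation (placeholder texts, not our data).
import sacrebleu

hypotheses = ["Das ist ein Beispielsatz."]       # model translation (placeholder)
references = [["Dies ist ein Beispielsatz."]]    # gold-standard translation (placeholder)

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score / 100)  # sacrebleu reports 0-100; normalised here to 0-1
```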
METEOR (Metric for Evaluation of Translation with Explicit Ordering) is a more comprehensive evaluation metric. It takes into account factors such as n-gram overlap, precision, recall, word order, synonyms, and stemming. METEOR also produces a score between 0 and 1, where higher values indicate greater similarity. Because of its flexibility, METEOR can assign high scores to translations that differ slightly in wording but are semantically equivalent, making it suitable for evaluating broader aspects of text generation quality beyond simple translation.
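For comparison, here is a minimal sketch of a METEOR computation using NLTK's implementation. The sentences are placeholders; NLTK expects pre-tokenised input and the WordNet data for synonym matching.

```python
# Minimal sketch of a METEOR computation with NLTK (placeholder texts).
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed for synonym matching

reference = "this is a small example sentence".split()   # tokenised gold standard
hypothesis = "this is a little example sentence".split()  # tokenised model output

score = meteor_score([reference], hypothesis)  # 0-1, higher means more similar
print(score)
```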
The metrics were calculated for all translated texts, and the mean over all samples is shown in the graphs below.
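As an illustration of this aggregation step, the hypothetical sketch below averages per-text scores by model and language with pandas; the column names are assumptions for illustration, not the actual structure of our results.

```python
# Hypothetical aggregation of per-text scores into per-model/per-language means.
# Column names ("model", "language", "sacrebleu", "meteor") are assumptions.
import pandas as pd

scores = pd.DataFrame([
    {"model": "model-a", "language": "de", "sacrebleu": 0.42, "meteor": 0.61},
    {"model": "model-a", "language": "fr", "sacrebleu": 0.39, "meteor": 0.58},
    {"model": "model-b", "language": "de", "sacrebleu": 0.45, "meteor": 0.63},
])

means = scores.groupby(["model", "language"])[["sacrebleu", "meteor"]].mean()
print(means)
```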
The graphs indicate that our finetunings of the base models have led to a slight improvement in linguistic capabilities across both metrics. However, these improvements are generally minimal, and it is challenging to quantify the exact impact of the finetunings. Nonetheless, it can be concluded that the finetunings haven't negatively affected the overall linguistic capabilities. Furthermore, the graphs provide insight into the languages in which our models perform most effectively.