Model Comparisons

Evaluation Overview

This page presents our comprehensive evaluation framework designed to assess model performance across safety, reliability, grounding, and operational behavior. The tests go beyond standard accuracy metrics and focus on real-world robustness, including language compliance, tool-calling safety, hallucination resistance, RAG groundedness, and adversarial alignment. Together, these benchmarks provide a transparent and structured view of how models perform under both normal and high-risk conditions.


Benchmarks

  1. Language test: Ability to answer in a specific language

  2. Reference test: Ability to reference the source it used for its answer

  3. Tool test: Ability to trigger tools

  4. Safety test: Ability to avoid unwanted topics

  5. Hallucination test: Ability to only give information from customer knowledge and prompt


1. Language Test

Purpose

This test evaluates whether the model responds in the correct language based on the given instructions. It does not assess grammar, fluency, or content quality. The focus is strictly on whether the response language matches the required condition. The test includes multiple languages, such as English, Swedish, French, Japanese, and Indonesian.

Method

  1. Category Assignment Each test case is assigned to one of the two categories:

  • Same as User - The model must respond in the same language as the user’s most recent message.

  • Prompt Instructed - The model must respond only in the language explicitly specified in the prompt instructions, even if it differs from the language of the user’s last message.

  1. Evaluation Criteria

  • For Same as User: The response language must match the language of the user’s last message.

  • For Prompt Instructed: The response language must match the language specified in the prompt instructions, regardless of the user’s message language.

  1. Assessment Scope

  • Only language compliance is evaluated.

  • Grammar, tone, and overall response quality are not part of this assessment.


2. Reference Validation & RAG Performance Evaluation

To ensure high-quality and reliable outputs, our evaluation framework is structured into two primary categories: reference-test and no-context test, along with an overall RAG performance assessment.

2.1. Reference Tests

These tests verify that the model correctly cites and relies on the appropriate source material when generating responses.

  • Each test evaluates whether the model references the correct source(s).

  • The number of expected sources may vary depending on the scenario.

  • The focus is on validating accurate and appropriate source attribution.

2.2. No-Context Tests

These tests assess the model’s ability to recognize when no relevant information is available.

  • The model should not fabricate or reference any sources when sufficient context is not provided.

  • This ensures responsible behavior and reduces the risk of hallucinated references.

2.3. RAG Performance

Our evaluation methodology includes a comprehensive RAG Score that measures the overall effectiveness of a Retrieval-Augmented Generation (RAG) system across both retrieval and response generation stages.

Total RAG Score: Faithfulness

We define our Total RAG Score based on Faithfulness.

This metric evaluates:

  • Whether the generated response is fully grounded in the retrieved source material

  • Whether the model references only the provided sources

  • Whether the answer avoids unsupported claims or fabricated information

By focusing on faithfulness, we ensure that responses remain accurate, verifiable, and aligned strictly with the retrieved data.

This structured evaluation framework enables transparent measurement of source accuracy, reliability, and overall system performance.


3. Tool-Calling Evaluation

This evaluation assesses how the model behaves when tools are available and may need to be triggered. Unlike traditional benchmarks, higher accuracy is not always the goal. In tool-calling environments, we deliberately prioritize safety and user experience over raw activation rates.

False Positives vs. False Negatives

In this benchmark, we intentionally prefer:

  • False Negatives over False Positives

Why?

  • False Positive (unnecessary tool call)

    • Can interrupt the user experience

    • May escalate incorrectly

    • Can trigger irreversible or disruptive actions

  • False Negative (tool not triggered when it could have been)

    • Conversation continues safely

    • No unintended escalation

    • Lower operational risk

In practical deployments, avoiding incorrect tool activation is often more important than maximizing activation frequency.

Conversational Safety Behavior

In many scenarios, the model responds with a clarifying follow-up question before triggering a tool.

From a real-world product perspective, this is the correct and preferred behavior.

However, in automated evaluation, such clarification is often marked as incorrect, even though it reflects safer and more responsible decision-making.

Important Note on Scoring

In this benchmark, a higher numerical accuracy does not necessarily indicate better performance.

The objective is not to maximize tool triggers, but to ensure:

  • Responsible activation

  • Safe escalation behavior

  • Protection of user experience

  • Operational reliability

This evaluation framework prioritizes real-world robustness over simplistic accuracy metrics.

3.1. Tool Test

The Tool Test measures how well the models handle situations where tool usage (function calling) is available.

3.2. No tool-call

Test where the model should not make a tool-call.

Why EGPT May Be Preferred Over Gemini

Although Gemini achieves higher accuracy in the Tool Test, qualitative evaluation reveals an important behavioral difference:

Gemini tends to invoke functions even when it is not necessary. In other words, it overuses tool calls in situations where a direct answer would have been more appropriate.

This behavior can introduce:

  • Unnecessary system complexity

  • Increased latency

  • Higher operational costs

  • Reduced predictability in production environments

In contrast, EGPT demonstrates more conservative and controlled function-calling behavior. While its raw Tool Test accuracy appears lower, it is less likely to trigger tools when they are not required. This makes EGPT more reliable in scenarios where correct judgment about when to call a function is critical.

Practical Implication

If the goal is:

  • Maximum raw tool-task accuracy → Gemini performs best.

  • Controlled, disciplined function-calling behavior → EGPT may be the better choice.

In production systems, avoiding unnecessary tool calls can be just as important as achieving high accuracy, particularly in environments where stability, cost control, and predictable behavior are priorities.


4. Safety & Alignment Evaluation

Restricted Content Testing

This evaluation focuses on questions that the model should not answer, including topics such as:

  • War-related operational guidance

  • Criminal activity

  • Harmful or illegal instructions

  • Other high-risk content categories

The purpose is to ensure the model consistently chooses safe and policy-aligned responses.


DPO Test (Direct Preference Optimization)

Objective: Can the model prefer a safe response over an unsafe one when presented with standard prompts?

  • Includes a broad range of typical high-risk scenarios

  • Evaluates whether the model selects the aligned, policy-compliant answer

  • Designed to measure baseline safety alignment


DPO_ADV Test (Adversarial Safety Evaluation)

Objective: Can the model still prioritize the safe response when the prompt is adversarially crafted to bypass guardrails?

  • Prompts are intentionally designed to manipulate or trick the model

  • Tests resistance to jailbreak attempts

  • Measures robustness under pressure


Evaluation Focus

These tests assess:

  • Alignment strength

  • Resistance to unsafe generation

  • Robustness against prompt manipulation

  • Consistency in policy-compliant decision-making

The goal is not just compliance under normal conditions, but resilience under adversarial pressure.


5. Hallucination Evaluation

Important Context

This test does not measure how often a model hallucinates in general usage.

Instead, it is intentionally built around known high-risk scenarios, conversations that have previously caused hallucination issues in earlier evaluations. These are edge cases specifically designed to stress the model and expose weaknesses.

The purpose is to test robustness under pressure, not to estimate overall hallucination frequency.

Purpose

The Hallucination Test evaluates how well different LLM models maintain factual consistency and contextual grounding in challenging conversational situations.

These scenarios are known to increase the likelihood that models:

  • Introduce information not present in the system prompt

  • Add unsupported assumptions

  • Confidently generate fabricated details

Test Design

We selected conversations that historically triggered hallucinations. These difficult cases simulate real-world edge conditions where:

  • Context may be incomplete

  • The prompt may implicitly invite speculation

  • The model is tempted to “fill in gaps”

Evaluation Objective

This benchmark measures:

  • Resistance to speculative generation

  • Reliability in high-risk conversational scenarios

Because the dataset is intentionally constructed from known failure cases, results should be interpreted as a measure of robustness under stress, not as a general hallucination rate.

Last updated

Was this helpful?