Model Comparisons
Evaluation Overview
This page presents our comprehensive evaluation framework designed to assess model performance across safety, reliability, grounding, and operational behavior. The tests go beyond standard accuracy metrics and focus on real-world robustness, including language compliance, tool-calling safety, hallucination resistance, RAG groundedness, and adversarial alignment. Together, these benchmarks provide a transparent and structured view of how models perform under both normal and high-risk conditions.
Benchmarks
Language test: Ability to answer in a specific language
Reference test: Ability to reference the source it used for its answer
Tool test: Ability to trigger tools
Safety test: Ability to avoid unwanted topics
Hallucination test: Ability to only give information from customer knowledge and prompt
1. Language Test
Purpose
This test evaluates whether the model responds in the correct language based on the given instructions. It does not assess grammar, fluency, or content quality. The focus is strictly on whether the response language matches the required condition. The test includes multiple languages, such as English, Swedish, French, Japanese, and Indonesian.
Method
Category Assignment Each test case is assigned to one of the two categories:
Same as User - The model must respond in the same language as the user’s most recent message.
Prompt Instructed - The model must respond only in the language explicitly specified in the prompt instructions, even if it differs from the language of the user’s last message.
Evaluation Criteria
For Same as User: The response language must match the language of the user’s last message.
For Prompt Instructed: The response language must match the language specified in the prompt instructions, regardless of the user’s message language.
Assessment Scope
Only language compliance is evaluated.
Grammar, tone, and overall response quality are not part of this assessment.

2. Reference Validation & RAG Performance Evaluation
To ensure high-quality and reliable outputs, our evaluation framework is structured into two primary categories: reference-test and no-context test, along with an overall RAG performance assessment.
2.1. Reference Tests
These tests verify that the model correctly cites and relies on the appropriate source material when generating responses.
Each test evaluates whether the model references the correct source(s).
The number of expected sources may vary depending on the scenario.
The focus is on validating accurate and appropriate source attribution.

2.2. No-Context Tests
These tests assess the model’s ability to recognize when no relevant information is available.
The model should not fabricate or reference any sources when sufficient context is not provided.
This ensures responsible behavior and reduces the risk of hallucinated references.

2.3. RAG Performance
Our evaluation methodology includes a comprehensive RAG Score that measures the overall effectiveness of a Retrieval-Augmented Generation (RAG) system across both retrieval and response generation stages.
Total RAG Score: Faithfulness
We define our Total RAG Score based on Faithfulness.
This metric evaluates:
Whether the generated response is fully grounded in the retrieved source material
Whether the model references only the provided sources
Whether the answer avoids unsupported claims or fabricated information
By focusing on faithfulness, we ensure that responses remain accurate, verifiable, and aligned strictly with the retrieved data.
This structured evaluation framework enables transparent measurement of source accuracy, reliability, and overall system performance.

3. Tool-Calling Evaluation
This evaluation assesses how the model behaves when tools are available and may need to be triggered. Unlike traditional benchmarks, higher accuracy is not always the goal. In tool-calling environments, we deliberately prioritize safety and user experience over raw activation rates.
False Positives vs. False Negatives
In this benchmark, we intentionally prefer:
False Negatives over False Positives
Why?
False Positive (unnecessary tool call)
Can interrupt the user experience
May escalate incorrectly
Can trigger irreversible or disruptive actions
False Negative (tool not triggered when it could have been)
Conversation continues safely
No unintended escalation
Lower operational risk
In practical deployments, avoiding incorrect tool activation is often more important than maximizing activation frequency.
Conversational Safety Behavior
In many scenarios, the model responds with a clarifying follow-up question before triggering a tool.
From a real-world product perspective, this is the correct and preferred behavior.
However, in automated evaluation, such clarification is often marked as incorrect, even though it reflects safer and more responsible decision-making.
Important Note on Scoring
In this benchmark, a higher numerical accuracy does not necessarily indicate better performance.
The objective is not to maximize tool triggers, but to ensure:
Responsible activation
Safe escalation behavior
Protection of user experience
Operational reliability
This evaluation framework prioritizes real-world robustness over simplistic accuracy metrics.
3.1. Tool Test
The Tool Test measures how well the models handle situations where tool usage (function calling) is available.

3.2. No tool-call
Test where the model should not make a tool-call.

Why EGPT May Be Preferred Over Gemini
Although Gemini achieves higher accuracy in the Tool Test, qualitative evaluation reveals an important behavioral difference:
Gemini tends to invoke functions even when it is not necessary. In other words, it overuses tool calls in situations where a direct answer would have been more appropriate.
This behavior can introduce:
Unnecessary system complexity
Increased latency
Higher operational costs
Reduced predictability in production environments
In contrast, EGPT demonstrates more conservative and controlled function-calling behavior. While its raw Tool Test accuracy appears lower, it is less likely to trigger tools when they are not required. This makes EGPT more reliable in scenarios where correct judgment about when to call a function is critical.
Practical Implication
If the goal is:
Maximum raw tool-task accuracy → Gemini performs best.
Controlled, disciplined function-calling behavior → EGPT may be the better choice.
In production systems, avoiding unnecessary tool calls can be just as important as achieving high accuracy, particularly in environments where stability, cost control, and predictable behavior are priorities.
4. Safety & Alignment Evaluation
Restricted Content Testing
This evaluation focuses on questions that the model should not answer, including topics such as:
War-related operational guidance
Criminal activity
Harmful or illegal instructions
Other high-risk content categories
The purpose is to ensure the model consistently chooses safe and policy-aligned responses.
DPO Test (Direct Preference Optimization)
Objective: Can the model prefer a safe response over an unsafe one when presented with standard prompts?
Includes a broad range of typical high-risk scenarios
Evaluates whether the model selects the aligned, policy-compliant answer
Designed to measure baseline safety alignment
DPO_ADV Test (Adversarial Safety Evaluation)
Objective: Can the model still prioritize the safe response when the prompt is adversarially crafted to bypass guardrails?
Prompts are intentionally designed to manipulate or trick the model
Tests resistance to jailbreak attempts
Measures robustness under pressure
Evaluation Focus
These tests assess:
Alignment strength
Resistance to unsafe generation
Robustness against prompt manipulation
Consistency in policy-compliant decision-making
The goal is not just compliance under normal conditions, but resilience under adversarial pressure.

5. Hallucination Evaluation
Important Context
This test does not measure how often a model hallucinates in general usage.
Instead, it is intentionally built around known high-risk scenarios, conversations that have previously caused hallucination issues in earlier evaluations. These are edge cases specifically designed to stress the model and expose weaknesses.
The purpose is to test robustness under pressure, not to estimate overall hallucination frequency.
Purpose
The Hallucination Test evaluates how well different LLM models maintain factual consistency and contextual grounding in challenging conversational situations.
These scenarios are known to increase the likelihood that models:
Introduce information not present in the system prompt
Add unsupported assumptions
Confidently generate fabricated details
Test Design
We selected conversations that historically triggered hallucinations. These difficult cases simulate real-world edge conditions where:
Context may be incomplete
The prompt may implicitly invite speculation
The model is tempted to “fill in gaps”
Evaluation Objective
This benchmark measures:
Resistance to speculative generation
Reliability in high-risk conversational scenarios
Because the dataset is intentionally constructed from known failure cases, results should be interpreted as a measure of robustness under stress, not as a general hallucination rate.

Last updated
Was this helpful?

