Agent Evaluation transforms call quality assurance from a reactive, manual process into a continuous, automated intelligence layer across every customer interaction.
Organisations deploying AI voice agents at scale face a fundamental quality assurance challenge: how do you maintain consistent standards, meet regulatory obligations, and drive continuous agent improvement when call volumes exceed the capacity of human reviewers?
Agent Evaluation addresses this directly. It scores every call, immediately after it ends, against a configurable set of quality criteria. Results are available in real time. No sampling. No lag. No reviewer bias.
Key Capabilities
- Automated scoring of 100% of calls against configurable quality criteria
- Composite scoring with weighted criteria to reflect business priorities
- Independent threshold configuration per criterion and per evaluation
- Real-time Dashboard with agent performance trends and criteria health
The Business Case
The business case for automated call evaluation is grounded in four strategic outcomes: quality at scale, risk reduction, operational efficiency, and customer experience consistency.
Quality at Scale
Traditional quality assurance processes evaluate between 2 and 5 percent of call volume. This sampling rate is insufficient to detect systemic issues, identify individual agent performance trends, or provide meaningful coaching data. Agent Evaluation scores every call, providing a complete and statistically reliable view of quality across the entire agent population.
Organisations that transition from sampled QA to 100% automated evaluation typically identify coaching opportunities three to four times faster, because patterns become visible across full call populations rather than small samples.
Compliance and Risk Management
Regulated industries require evidence that compliance obligations are being met on every customer interaction. Agent Evaluation provides a complete, timestamped audit record of every call scored against compliance criteria including identity verification, privacy handling, and accurate information disclosure.
Operational Efficiency
By automating the scoring process, quality assurance teams can redirect their time from listening to calls toward higher-value activities: analysing trends, designing coaching programmes, refining evaluation criteria, and working directly with agents on improvement.
Consistent Customer Experience
When quality standards are enforced by configuration rather than by individual reviewer judgment, the customer experience is more consistent. Every agent, on every call, is held to the same objective standard.
| Dimension | Traditional Manual QA | Agent Evaluation |
|---|---|---|
| Coverage | 2 to 5% of calls reviewed | 100% of calls scored automatically |
| Visibility | Patterns only visible in sampled data | Aggregate trends visible in real time |
| Speed | Results available in days or weeks | Results available immediately after each call |
| Consistency | Inconsistent standards across reviewers | Identical rubric applied uniformly to every call |
| Effort | Significant supervisor time investment | Fully automated with no human scoring required |
Use Cases
Agent Evaluation is purpose-built for enterprise teams operating AI voice agents in customer-facing roles.
Continuous Quality Assurance
Deploy a standing evaluation for each voice agent in production. Every call is scored the moment it ends, and results are immediately visible in the Dashboard. Quality leadership can monitor composite score trends daily, identify declining performance early, and respond before issues affect customer satisfaction metrics.
This use case replaces the traditional weekly or monthly QA cycle with a continuous real-time signal. Teams typically configure one evaluation per agent, with criteria and thresholds aligned to their internal quality standards.
Regulatory Compliance Monitoring
Configure evaluations with compliance-specific criteria such as Privacy Compliance, No Misrepresentation, and No Offensive Content. Assign high weights to these criteria to ensure they drive the composite score. Every call is checked automatically, providing a complete audit trail for regulatory review.
Agent Benchmarking
Run the same evaluation template across multiple agent versions or agent populations to compare performance objectively. Composite scores and per-criterion breakdowns provide a standardised basis for benchmarking that is not affected by reviewer subjectivity or sampling variance.
Use the Agent filter on the Dashboard to compare individual agent performance within the same evaluation, or create separate evaluation templates to apply different standards to different agent roles.
Targeted Coaching and Development
Use evaluation results to structure coaching conversations. When an agent scores below the configured threshold on a specific criterion, the scoring rationale and transcript evidence provide a concrete, objective basis for feedback. This removes subjectivity from the coaching process and keeps conversations focused on specific, documented behaviours.
New Agent Validation
Evaluate newly deployed agent versions from their first call without requiring human oversight. Monitor composite score trends in the early deployment period and compare performance against established agent benchmarks to determine whether additional tuning is required before broader rollout.
Multi-Market Quality Standards
Create separate evaluation templates for different markets, regions, or compliance contexts. Each template can apply different criteria, different weights, and different thresholds to reflect the quality standards that apply in each context. A single agent can be evaluated against multiple templates simultaneously.
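To make the multi-template idea concrete, here is a minimal sketch of two market-specific templates expressed as plain Python data. The template names, criteria choices, and field names are illustrative assumptions; actual templates are created in the Configuration tab, not in code.

```python
# Illustrative only: names, criteria, weights, and thresholds are hypothetical.
uk_template = {
    "name": "Customer Support Quality UK",
    "criteria": [
        {"name": "Privacy Compliance", "weight": 40, "pass": 85, "fail": 60},
        {"name": "Empathy",            "weight": 30, "pass": 75, "fail": 50},
        {"name": "Call Closing",       "weight": 30, "pass": 70, "fail": 50},
    ],
}
us_template = {
    "name": "Customer Support Quality US",
    "criteria": [
        {"name": "No Misrepresentation", "weight": 50, "pass": 90, "fail": 70},
        {"name": "Empathy",              "weight": 25, "pass": 75, "fail": 50},
        {"name": "Call Closing",         "weight": 25, "pass": 70, "fail": 50},
    ],
}

# One agent, two templates: each call is scored once per template, so the same
# interaction can pass one market's standard while failing another's.
agent_templates = {"support-agent-01": [uk_template, us_template]}
```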
How Agent Evaluation Works
Agent Evaluation operates as a fully automated pipeline. Once an evaluation is configured and activated, no manual intervention is required for calls to be scored and results to be surfaced.
Evaluation Pipeline
- The voice agent completes a customer call and the recording is captured by the platform.
- The recording is transmitted to the configured transcription provider, which converts audio to text.
- The transcript is passed to the evaluator model along with the system prompt, evaluation criteria, and scoring rubrics.
- The evaluator model scores each criterion independently, producing a score from 0 to 100 for each.
- Criterion scores are combined using their configured weights to produce a composite call score visible in the Dashboard and Activities log.
The evaluator model is entirely independent of the voice agent. It functions as an objective, automated reviewer that reads the transcript and applies the scoring rubric configured by the administrator. It has no access to the live call and no ability to influence agent behaviour.
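The pipeline can be summarised in a few lines of Python. This is a minimal sketch, not the platform's implementation; `transcribe` and `score_criterion` are stand-ins for the transcription provider and the evaluator model.

```python
def transcribe(recording: bytes) -> str:
    """Stand-in for the configured transcription provider (step 2)."""
    return "<call transcript>"

def score_criterion(system_prompt: str, scoring_prompt: str, transcript: str) -> int:
    """Stand-in for the evaluator model scoring one criterion (step 4)."""
    return 80

def evaluate_call(recording: bytes, system_prompt: str, criteria: list[dict]) -> dict:
    transcript = transcribe(recording)
    # Step 4: each criterion is scored independently on a 0-100 scale.
    scores = {c["name"]: score_criterion(system_prompt, c["scoring_prompt"], transcript)
              for c in criteria}
    # Step 5: criterion scores combine by configured weight into the composite.
    total_weight = sum(c["weight"] for c in criteria)
    composite = sum(scores[c["name"]] * c["weight"] for c in criteria) / total_weight
    return {"composite": round(composite, 1), "criteria": scores}
```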
Composite Score Calculation
Each call receives a composite score representing overall call quality. The composite score is calculated as the weighted average of all criterion scores for that call.
Example: An evaluation has three criteria with weights of 50, 30, and 20 (summing to 100). The criterion scores for a specific call are 88, 71, and 52. The composite score is calculated as (88 × 50 + 71 × 30 + 52 × 20) ÷ 100 = 75.7, which rounds to 76.
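In code form, the calculation is a single weighted average. This sketch assumes nothing beyond the formula above:

```python
def composite_score(scores: list[float], weights: list[float]) -> float:
    """Weighted average of criterion scores for one call."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# The example from the text: weights 50/30/20, scores 88/71/52.
print(composite_score([88, 71, 52], [50, 30, 20]))  # 75.7, displayed as 76
```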
Criterion-Level Scoring
Each criterion is scored on a scale of 0 to 100. Criterion scores are entirely independent of each other. A high score on one criterion does not influence the score on another. The only relationship between criterion scores is their weighted contribution to the composite score.
Dashboard
The Dashboard provides executive-level visibility into call quality, agent performance, and criteria health. It is designed to surface actionable insights without requiring users to navigate individual call records.
Filters
| Filter | Options and Default |
|---|---|
| Date Range | Today, Last 7 Days (default), This Month, Last 3 Months |
| Agent | All agents or a specific voice agent |
All metrics on the Dashboard update dynamically when filters are changed.
Call Overview
| Metric | Definition | Operational Significance |
|---|---|---|
| Total Calls | Number of calls evaluated in the selected period | Confirms evaluation coverage. A lower-than-expected count may indicate a paused evaluation or a recording integration issue. |
| Avg Composite Score | Weighted average composite score across all evaluated calls | The primary quality indicator. A declining trend over consecutive periods signals a systemic performance issue. |
Criteria Performance
The Criteria Performance table shows how each criterion is performing across all evaluated calls in the selected period, sorted from highest to lowest average score.
| Column | Description |
|---|---|
| Rank | Criteria ordered by average score. The lowest-ranked criterion represents the most significant quality gap in the current period. |
| Criterion | The quality dimension being measured. |
| Avg Score | Average score for this criterion across all evaluated calls in the selected period. |
| Distribution Bar | A colour-coded bar showing the proportion of calls scoring in the Pass band (green), Review band (amber), and Fail band (red). |
A criterion with a high average score alongside a significant red band in the distribution bar indicates an inconsistency problem rather than a training gap. The agent performs well in most cases but fails in specific scenarios — often pointing to edge cases that require targeted attention rather than broad retraining.
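A small sketch makes this pattern easy to detect programmatically. The threshold values are illustrative defaults, not platform constants:

```python
def distribution(scores: list[int], fail: int = 60, passing: int = 80) -> dict:
    """Share of calls in the Fail (red), Review (amber), and Pass (green) bands."""
    n = len(scores)
    return {
        "fail":   sum(s < fail for s in scores) / n,
        "review": sum(fail <= s < passing for s in scores) / n,
        "pass":   sum(s >= passing for s in scores) / n,
    }

# High average (mean 82.6) yet a 20% red band: the inconsistency pattern
# described above, pointing at edge cases rather than a broad training gap.
print(distribution([95, 92, 90, 91, 94, 40, 93, 90, 45, 96]))
# {'fail': 0.2, 'review': 0.0, 'pass': 0.8}
```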
Agent Performance
- Top Performing displays the three agents with the highest average composite scores. Each row is clickable and navigates to Activities filtered to that agent.
- Needs Attention displays agents with the lowest average composite scores, representing the priority coaching queue.
Every element of the Dashboard is interactive. Clicking a KPI card, a row in the Criteria table, or an agent panel navigates directly to the Activities tab with the corresponding filter applied.
Activities
The Activities tab is the complete operational record of every call processed by Agent Evaluation.
Call Table
| Column | Description |
|---|---|
| Call ID | Unique identifier for the call. Click to open the call detail drawer. A flag icon appears when no agent audio is detected. |
| Agent | The voice agent that handled the call. |
| Evaluation | The evaluation template applied to this call. |
| Score | The composite score, displayed as a colour-coded badge. |
| Sentiment | Detected customer sentiment: Positive, Neutral, or Negative. |
| Duration | Length of the call recording. |
| Timestamp | Date and time the evaluation was completed. |
| Status | Pass, Review, Fail, or Skipped. |
| Actions | View Details opens the call detail drawer. Copy Call ID copies the identifier to the clipboard. |
Call Status Definitions
| Status | Definition | Recommended Action |
|---|---|---|
| Pass | Composite score meets the configured quality standard | No immediate action required. Monitor for trend changes over time. |
| Review | Composite score falls in the borderline band between pass and fail thresholds | Recommend human review. Open the criteria breakdown to identify underperforming dimensions. |
| Fail | Composite score falls below the configured minimum threshold | Prioritise for agent coaching. Use the per-criterion breakdown to identify specific improvement areas. |
| Skipped | Call duration was below the configured minimum threshold | No quality action required. Review skip threshold configuration if volume is unexpectedly high. |
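The status logic maps directly to code. A minimal sketch, assuming illustrative threshold values (the real ones come from each evaluation template):

```python
def call_status(composite: float, duration_s: float,
                fail: float = 60.0, passing: float = 80.0,
                skip_below: float = 10.0) -> str:
    """Classify a call as Pass, Review, Fail, or Skipped."""
    if duration_s < skip_below:
        return "Skipped"  # too short to evaluate; excluded from quality metrics
    if composite >= passing:
        return "Pass"
    if composite < fail:
        return "Fail"
    return "Review"       # borderline band between the fail and pass thresholds

print(call_status(84.0, 120))  # Pass
print(call_status(70.0, 120))  # Review
print(call_status(55.0, 120))  # Fail
print(call_status(90.0, 6))    # Skipped
```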
Call Detail Drawer
Selecting any Call ID opens a side panel with:
- Composite score with status classification
- Call metadata including duration, detected sentiment, and evaluation timestamp
- Per-criterion breakdown showing the individual score for each criterion with a progress bar relative to configured thresholds
- Flagged status indicator when the call has been manually flagged for human review
Manual Recording Upload
To evaluate recordings not automatically captured through the integration:
- Select Upload Recordings from the Activities toolbar
- Select the evaluation template to apply
- Upload the recording file — supported formats: ZIP, MP4, WAV, MP3 (max 50 MB)
- Select Upload and Evaluate — results appear in the Activities table once processing is complete
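If your deployment also exposes an HTTP endpoint for this operation, a scripted upload might look like the sketch below. The URL, form field names, and auth header are hypothetical assumptions; the documented flow uses the Activities toolbar UI.

```python
import requests

# Hypothetical endpoint and field names -- adjust to your actual API, if any.
with open("call-recording.wav", "rb") as f:
    resp = requests.post(
        "https://api.example.com/agent-evaluation/recordings",
        headers={"Authorization": "Bearer <token>"},
        data={"evaluation_id": "eval-123"},   # the template to apply
        files={"recording": f},               # ZIP/MP4/WAV/MP3, max 50 MB
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # results appear in Activities once processing completes
```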
Configuration
The Configuration tab is the central management interface for evaluation templates.
Evaluation Table
| Column | Description |
|---|---|
| Name | The evaluation template name and its system-assigned unique identifier. |
| Agent | The voice agent assigned to this evaluation template. |
| LLM Model | The evaluator model configured to score transcripts. |
| Criteria | The number of quality criteria defined in this evaluation template. |
| Calls Run | The cumulative count of calls scored since activation. |
| Status | Active (currently scoring new calls) or Paused (inactive). |
Template Actions
| Action | Behaviour |
|---|---|
| Edit | Opens the setup wizard to modify any aspect of the evaluation template. The Voice Agent field is locked while Active. |
| Start / Pause | Activates or deactivates the evaluation. A paused evaluation does not score new calls. All historical results are preserved. |
| Archive | Removes the template from the active list. All historical scoring data remains accessible. |
The Voice Agent assigned to an evaluation template cannot be changed while the template is set to Active. Pause the evaluation before modifying the agent assignment. This constraint ensures scoring continuity is maintained for active call populations.
Creating an Evaluation
Select Create Evaluation in the Configuration tab to launch the three-step configuration wizard.
Step 1: Setup
Evaluation Name — Assign a name that clearly communicates purpose and scope. Recommended naming conventions include agent role, quality focus, and period. Examples: Customer Support Quality Q2 2026 or Outbound Sales Compliance EMEA.
Voice Agent — Select the voice agent whose calls this evaluation will score. All calls completed by the selected agent will be evaluated automatically once activated.
The Voice Agent field is locked once the evaluation is set to Active. To reassign an evaluation to a different agent, the evaluation must first be paused.
Call Settings:
- Include Past Call Recordings — When disabled (default), only calls made after activation are scored. When enabled, all existing recordings for the assigned agent are also scored.
- Skip Calls Shorter Than — The minimum call duration for evaluation eligibility. Calls below this threshold are classified as Skipped. Default: 10 seconds.
Evaluator Model — Select the language model that will score each criterion. A primary model and a fallback model must both be specified. The temperature parameter controls scoring determinism — a value of 0.3 is recommended for consistent, repeatable results.
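As a configuration sketch, the model settings might be represented like this. The field and model names are illustrative; the actual values are set in Step 1 of the wizard, not in code:

```python
from dataclasses import dataclass

@dataclass
class EvaluatorConfig:
    primary_model: str        # scores transcripts under normal operation
    fallback_model: str       # takes over if the primary model is unavailable
    temperature: float = 0.3  # low temperature -> consistent, repeatable scores

config = EvaluatorConfig(primary_model="model-a", fallback_model="model-b")
```

A temperature near 0 makes the evaluator behave deterministically, which matters when the same rubric must yield the same score for the same transcript.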
Step 2: QA Instructions
The system prompt is the instruction set provided to the evaluator model. It defines the evaluation context, the standards to apply, and the required output format. A default prompt is pre-populated.
Keep the system prompt focused on evaluation principles and context. Criterion-specific scoring guidance belongs in the individual Scoring Prompt fields configured in Step 3. Separating these concerns produces more consistent and predictable scoring outcomes.
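The separation of concerns can be pictured as two distinct inputs to the evaluator. This message structure is an illustrative assumption, not the platform's actual payload format:

```python
# The system prompt carries evaluation context; each criterion supplies its rubric.
system_prompt = (
    "You are a call quality evaluator. Score the requested criterion from 0 to "
    "100 based solely on the transcript. Return only the numeric score."
)
empathy_rubric = (
    "90-100: emotions acknowledged, warm and natural tone throughout. "
    "60-89: generally empathetic with occasional flat responses. "
    "0-59: customer emotions ignored or dismissed."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",
     "content": f"Criterion: Empathy\nRubric: {empathy_rubric}\n\nTranscript:\n<transcript text>"},
]
```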
Step 3: Criteria
Criteria define the quality dimensions evaluated on every call. Each criterion is scored independently. Select Add Criteria to add a criterion to the evaluation. Selecting a preset auto-populates the scoring rubric with an industry-standard prompt. Custom criteria can be defined from scratch.
Criteria and Scoring
Criterion Configuration Fields
| Field | Description |
|---|---|
| Criterion Preset | Fifteen standard presets are available. The Custom option allows fully bespoke criteria definition. |
| Weight | A relative value determining this criterion’s proportional contribution to the composite score. The platform calculates percentage contributions automatically. |
| Pass Threshold | The minimum score (0–100) required for this criterion to be classified as Passed. |
| Fail Threshold | The score below which this criterion is classified as Failed. |
| Review Range | The band between the fail and pass thresholds, calculated automatically by the platform. |
| Scoring Prompt | The rubric provided to the evaluator model describing expected performance at each score band: 90–100, 60–89, and 0–59. |
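These fields translate naturally into a small data structure. A sketch with illustrative names; note that the Review Range is derived, never set directly:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: int           # relative contribution to the composite score
    pass_threshold: int   # score >= pass_threshold -> Passed
    fail_threshold: int   # score <  fail_threshold -> Failed
    scoring_prompt: str   # rubric for the 90-100 / 60-89 / 0-59 bands

    @property
    def review_range(self) -> range:
        # Computed automatically from the two thresholds, as on the platform.
        return range(self.fail_threshold, self.pass_threshold)

empathy = Criterion("Empathy", weight=10, pass_threshold=80, fail_threshold=60,
                    scoring_prompt="90-100: ...; 60-89: ...; 0-59: ...")
print(empathy.review_range)  # range(60, 80) -- scores in this band need review
```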
Standard Criterion Library
Fifteen criteria are available as presets, each with a pre-validated scoring rubric:
| Criterion | Quality Dimension Evaluated |
|---|---|
| Empathy | Whether the agent acknowledged customer emotions appropriately and maintained a warm, natural tone. |
| Language Switch | Whether the agent detected and adapted to the customer’s preferred language without friction. |
| Turn Taking | Whether the conversation was balanced and natural, with appropriate pauses and no agent interruptions. |
| Context Awareness | Whether the agent leveraged available customer history and avoided requesting information already on record. |
| Recovery From Errors | Whether the agent acknowledged and corrected errors promptly, maintaining customer confidence. |
| Intent Recognition | Whether the customer’s intent was correctly identified on the first attempt without unnecessary rerouting. |
| Greeting Accuracy | Whether the call opening included all required elements: brand identification, agent identification, professional tone, and offer of assistance. |
| Intent Confirmation | Whether the agent confirmed their understanding of the customer’s request before taking action. |
| Probing Questions | Whether the agent used targeted, open-ended questions to establish the full context required for resolution. |
| Call Closing | Whether the agent summarised agreed next steps, confirmed customer satisfaction, and completed the call professionally. |
| No Offensive Content | Whether the call was free from offensive, discriminatory, or non-compliant language throughout. |
| No Misrepresentation | Whether all information provided by the agent was accurate and no misleading claims were made. |
| Privacy Compliance | Whether identity verification procedures were followed correctly and sensitive data was handled appropriately. |
| Confidence Level | Whether the agent projected competence and authority, remained composed under pressure, and responded decisively. |
| Input Metadata Consistency | Whether all required post-call fields were accurately completed and consistent with the content of the interaction. |
Threshold Configuration
Pass and Fail thresholds are configured independently for each criterion within each evaluation template.
Pass and Fail thresholds apply at the criterion level, not at the composite score level. A call may have a high composite score while one criterion falls below its configured Fail threshold, if other criteria with higher weights offset it. Review the per-criterion breakdown in the call detail drawer when investigating calls with borderline composite scores.
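A worked instance of this caveat, using illustrative weights and a Fail threshold of 60 for Privacy Compliance:

```python
weights = {"Intent Recognition": 50, "Empathy": 30, "Privacy Compliance": 20}
scores  = {"Intent Recognition": 95, "Empathy": 90, "Privacy Compliance": 40}

composite = sum(scores[c] * w for c, w in weights.items()) / sum(weights.values())
print(composite)  # 82.5 -- a healthy composite masking a failed criterion
```

The per-criterion breakdown, not the composite alone, is what reveals the Privacy Compliance failure.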
Weight Configuration
Criterion weights determine the relative contribution of each criterion to the composite score. Compliance-critical criteria such as Privacy Compliance, No Offensive Content, and No Misrepresentation should typically be assigned higher weights in regulated contexts.
Example: If Privacy Compliance has a weight of 40 and Empathy has a weight of 10, a fail on Privacy Compliance has four times as much impact on the composite score as a fail on Empathy.
Call Settings
Include Past Call Recordings
| Configuration | Behaviour |
|---|---|
| Disabled (default) | Only calls completed after the evaluation template is activated will be scored. |
| Enabled | All existing recordings for the assigned agent will be scored in addition to all subsequent calls. Enabling this setting will increase processing time and cost in proportion to the volume of historical recordings available. |
Skip Calls Shorter Than
This setting establishes the minimum call duration threshold for evaluation eligibility. Calls below this duration are automatically classified as Skipped and excluded from all quality metrics.
The default skip threshold of 10 seconds is appropriate for most deployments. If your agent population handles very short, transactional interactions, consider lowering this threshold. If your environment produces a high volume of test calls, consider raising it.
Configuration range: 0 to 300 seconds. Default: 10 seconds.
Glossary
| Term | Definition |
|---|---|
| Agent Evaluation | The UnleashX module that automatically scores calls made by AI voice agents against configurable quality criteria. |
| Composite Score | The weighted average of all criterion scores for a single call, representing the overall quality of that interaction. |
| Criterion | A single, defined quality dimension evaluated on a call. Examples include Empathy, Call Closing, and Privacy Compliance. |
| Criteria Score | The score from 0 to 100 assigned to a single criterion for a single call by the evaluator model. |
| Evaluation Template | A saved configuration defining which criteria to score, the associated rubrics, thresholds, evaluator model, and agent assignment. |
| Evaluator Model | The language model that reads call transcripts and assigns criterion scores. Fully independent of the voice agent. |
| Fail Threshold | The score below which a criterion is classified as Failed for a given call. |
| Pass Threshold | The minimum score required for a criterion to be classified as Passed for a given call. |
| Review | A status indicating a criterion or composite score falls in the band between the configured fail and pass thresholds. |
| Scoring Prompt | The rubric provided to the evaluator model describing expected performance at each score band for a specific criterion. |
| Skipped | A call status applied when call duration falls below the configured minimum threshold. Excluded from all quality calculations. |
| Temperature | An evaluator model parameter controlling output determinism. Lower values produce more consistent results. Recommended: 0.3. |
| Transcription Provider | The third-party service that converts call audio recordings to text prior to evaluation scoring. |
| Weight | A relative numerical value assigned to a criterion that determines its proportional contribution to the composite score. |