Agent Evaluation transforms call quality assurance from a reactive, manual process into a continuous, automated intelligence layer across every customer interaction.
Organisations deploying AI voice agents at scale face a fundamental quality assurance challenge: how do you maintain consistent standards, meet regulatory obligations, and drive continuous agent improvement when call volumes exceed the capacity of human reviewers?
Agent Evaluation addresses this directly. It scores every call, immediately after it ends, against a configurable set of quality criteria. Results are available in real time. No sampling. No lag. No reviewer bias.
Key Capabilities
- Automated scoring of 100% of calls against configurable quality criteria
- Composite scoring with weighted criteria to reflect business priorities
- Independent threshold configuration per criterion and per evaluation
- Real-time Dashboard with agent performance trends and criteria health
The business case for automated call evaluation is grounded in four strategic outcomes: quality at scale, risk reduction, operational efficiency, and customer experience consistency.
Quality at Scale
Traditional quality assurance processes evaluate between 2 and 5 percent of call volume. This sampling rate is insufficient to detect systemic issues, identify individual agent performance trends, or provide meaningful coaching data. Agent Evaluation scores every call, providing a complete and statistically reliable view of quality across the entire agent population.
Organisations that transition from sampled QA to 100% automated evaluation typically identify coaching opportunities three to four times faster, because patterns become visible across full call populations rather than small samples.
Compliance and Risk Management
Regulated industries require evidence that compliance obligations are being met on every customer interaction. Agent Evaluation provides a complete, timestamped audit record of every call scored against compliance criteria including identity verification, privacy handling, and accurate information disclosure.
Operational Efficiency
By automating the scoring process, quality assurance teams can redirect their time from listening to calls toward higher-value activities: analysing trends, designing coaching programmes, refining evaluation criteria, and working directly with agents on improvement.
Consistent Customer Experience
When quality standards are enforced by configuration rather than by individual reviewer judgment, the customer experience is more consistent. Every agent, on every call, is held to the same objective standard. This is particularly valuable in large deployments where multiple agent versions, multiple evaluations, and multiple teams are operating simultaneously.
| Dimension | Traditional Manual QA | Agent Evaluation |
|---|---|---|
| Coverage | 2 to 5% of calls reviewed | 100% of calls scored automatically |
| Visibility | Patterns only visible in sampled data | Aggregate trends visible in real time across 100% of calls |
| Speed | Results available in days or weeks | Results available immediately after each call |
| Consistency | Inconsistent standards across reviewers and shifts | Identical rubric applied uniformly to every call |
| Effort | Significant supervisor time investment | Fully automated with no human scoring required |
3. Use Cases
Agent Evaluation is purpose-built for enterprise teams operating AI voice agents in customer-facing roles. The following use cases represent the most common deployment patterns.
Continuous Quality Assurance
Deploy a standing evaluation for each voice agent in production. Every call is scored the moment it ends, and results are immediately visible in the Dashboard. Quality leadership can monitor composite score trends daily, identify declining performance early, and respond before issues affect customer satisfaction metrics.
This use case replaces the traditional weekly or monthly QA cycle with a continuous real-time signal. Teams typically configure one evaluation per agent, with criteria and thresholds aligned to their internal quality standards.
Regulatory Compliance Monitoring
Configure evaluations with compliance-specific criteria such as Privacy Compliance, No Misrepresentation, and No Offensive Content. Assign high weights to these criteria to ensure they drive the composite score. Every call is checked automatically, providing a complete audit trail for regulatory review.
Run the same evaluation template across multiple agent versions or agent populations to compare performance objectively. Composite scores and per-criterion breakdowns provide a standardised basis for benchmarking that is not affected by reviewer subjectivity or sampling variance.
Use the Agent filter on the Dashboard to compare individual agent performance within the same evaluation, or create separate evaluation templates to apply different standards to different agent roles.
Targeted Coaching and Development
Use evaluation results to structure coaching conversations. When an agent scores below the configured threshold on a specific criterion, the scoring rationale and transcript evidence provide a concrete, objective basis for feedback. This removes subjectivity from the coaching process and enables coaching conversations to be focused on specific, documented behaviours.
New Agent Validation
Evaluate newly deployed agent versions from their first call without requiring human oversight. Monitor composite score trends in the early deployment period and compare performance against established agent benchmarks to determine whether additional tuning is required before broader rollout.
Multi-Market Quality Standards
Create separate evaluation templates for different markets, regions, or compliance contexts. Each template can apply different criteria, different weights, and different thresholds to reflect the quality standards that apply in each context. A single agent can be evaluated against multiple templates simultaneously.
4. How Agent Evaluation Works
Agent Evaluation operates as a fully automated pipeline. Once an evaluation is configured and activated, no manual intervention is required for calls to be scored and results to be surfaced.
Evaluation Pipeline
- The voice agent completes a customer call and the recording is captured by the platform.
- The recording is transmitted to the configured transcription provider, which converts audio to text.
- The transcript is passed to the evaluator model along with the system prompt, evaluation criteria, and scoring rubrics.
- The evaluator model scores each criterion independently, producing a score from 0 to 100 for each.
- Criterion scores are combined using their configured weights to produce a composite call score visible in the Dashboard and Activities log.
The evaluator model is entirely independent of the voice agent. It functions as an objective, automated reviewer that reads the transcript and applies the scoring rubric configured by the administrator. It has no access to the live call and no ability to influence agent behaviour.
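The stages above can be pictured as a small function chain. The sketch below is illustrative only: `evaluate_call`, `fake_transcribe`, `fake_score`, and the data shapes are hypothetical stand-ins rather than UnleashX APIs, and the stubbed scorer returns fixed values so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    weight: float
    scoring_prompt: str


def evaluate_call(
    recording: bytes,
    criteria: list[Criterion],
    system_prompt: str,
    transcribe: Callable[[bytes], str],
    score_criterion: Callable[[str, str, Criterion], float],
) -> dict:
    """Run the stages in order: transcribe, score each criterion, combine."""
    transcript = transcribe(recording)
    # Each criterion is scored independently of the others.
    scores = {c.name: score_criterion(transcript, system_prompt, c) for c in criteria}
    total_weight = sum(c.weight for c in criteria)
    composite = sum(scores[c.name] * c.weight for c in criteria) / total_weight
    return {"transcript": transcript, "scores": scores, "composite": composite}


# Stubs standing in for the transcription provider and the evaluator model.
def fake_transcribe(recording: bytes) -> str:
    return recording.decode("utf-8")


def fake_score(transcript: str, system_prompt: str, criterion: Criterion) -> float:
    return {"Empathy": 80.0, "Call Closing": 60.0}[criterion.name]


result = evaluate_call(
    b"Hello, how can I help?",
    [Criterion("Empathy", 3, "..."), Criterion("Call Closing", 1, "...")],
    "You are a QA reviewer.",
    fake_transcribe,
    fake_score,
)
print(result["composite"])  # 75.0
```

Passing the transcriber and scorer in as functions mirrors the separation described above: the evaluator only ever sees the transcript, never the live call.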
Composite Score Calculation
Each call receives a composite score representing overall call quality. The composite score is calculated as the weighted average of all criterion scores for that call.
Example: An evaluation has three criteria with weights of 50, 30, and 20. The criterion scores for a specific call are 88, 71, and 52. The composite score is calculated as (88 multiplied by 50, plus 71 multiplied by 30, plus 52 multiplied by 20), divided by the total weight of 100. This gives 7,570 divided by 100, or 75.7, which rounds to a composite score of 76.
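The same calculation can be written as a short function. This is a minimal sketch of a weighted average; the function name `composite_score` is hypothetical, and rounding to the nearest whole number is an assumption made to match the figure in the example.

```python
def composite_score(scores: list[float], weights: list[float]) -> float:
    """Weighted average of criterion scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


value = composite_score([88, 71, 52], [50, 30, 20])
print(value)         # 75.7
print(round(value))  # 76
```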
Criterion-Level Scoring
Each criterion is scored on a scale of 0 to 100. Criterion scores are entirely independent of each other. A high score on one criterion does not influence the score on another. The only relationship between criterion scores is their weighted contribution to the composite score.
5. Dashboard
The Dashboard provides executive-level visibility into call quality, agent performance, and criteria health. It is designed to surface actionable insights without requiring users to navigate individual call records.
Filters
| Filter | Options and Default |
|---|---|
| Date Range | Today, Last 7 Days (default), This Month, Last 3 Months |
| Agent | All agents or a specific voice agent |
All metrics on the Dashboard update dynamically when filters are changed. Selecting a specific agent narrows all metrics to that agent’s call population only.
Call Overview
Two summary cards provide the top-line quality signal for the selected period:
| Metric | Definition | Operational Significance |
|---|---|---|
| Total Calls | Number of calls evaluated in the selected period | Confirms evaluation coverage. A lower-than-expected count may indicate a paused evaluation or a recording integration issue. |
| Avg Composite Score | Weighted average composite score across all evaluated calls | The primary quality indicator. A declining trend over consecutive periods signals a systemic performance issue requiring investigation. |
The Criteria Performance table is the most operationally significant section of the Dashboard. It shows how each criterion is performing across all evaluated calls in the selected period, sorted from highest to lowest average score.
| Column | Description |
|---|---|
| Rank | Criteria ordered by average score. The lowest-ranked criterion represents the most significant quality gap in the current period. |
| Criterion | The quality dimension being measured. |
| Avg Score | Average score for this criterion across all evaluated calls in the selected period. |
| Distribution Bar | A colour-coded bar showing the proportion of calls scoring in the Pass band (green), Review band (amber), and Fail band (red) for this criterion. Provides a signal on consistency as well as average performance. |
A criterion with a high average score alongside a significant red band in the distribution bar indicates an inconsistency problem rather than a training gap. The agent performs well in most cases but fails in specific scenarios. This pattern often points to edge cases or specific call types that require targeted attention, rather than broad retraining.
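To make this pattern concrete, the sketch below computes the average and the band shares for one criterion. The scores and the thresholds of 80 and 50 are invented for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def band(score: float, pass_threshold: float, fail_threshold: float) -> str:
    """Place a criterion score in the Pass, Review, or Fail band."""
    if score >= pass_threshold:
        return "Pass"
    if score < fail_threshold:
        return "Fail"
    return "Review"


def distribution(scores, pass_threshold, fail_threshold):
    """Share of calls in each band, as fractions that sum to 1."""
    counts = Counter(band(s, pass_threshold, fail_threshold) for s in scores)
    return {b: counts[b] / len(scores) for b in ("Pass", "Review", "Fail")}


# Eight strong calls and two very weak ones (illustrative data).
scores = [95, 96, 94, 97, 95, 96, 98, 95, 30, 25]
average = sum(scores) / len(scores)
shares = distribution(scores, pass_threshold=80, fail_threshold=50)
print(average)  # 82.1
print(shares)   # {'Pass': 0.8, 'Review': 0.0, 'Fail': 0.2}
```

The average of 82.1 looks healthy on its own, but a fifth of calls sit in the Fail band, which is the signal the distribution bar is meant to expose.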
Two panels provide agent-level context:
- Top Performing displays the three agents with the highest average composite scores for the selected period. Each row is clickable and navigates to Activities filtered to that agent.
- Needs Attention displays agents with the lowest average composite scores. These agents represent the priority coaching queue. Each row is clickable and navigates to Activities filtered to that agent.
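The two panels amount to ranking agents by average composite score. The sketch below is one way to express that, with assumptions stated plainly: the input shape, the function name `rank_agents`, and the size of the Needs Attention list (`bottom_n`) are hypothetical; only the top-three figure comes from the text above.

```python
from collections import defaultdict
from statistics import mean


def rank_agents(calls, top_n=3, bottom_n=2):
    """calls: (agent_name, composite_score) pairs for evaluated calls."""
    by_agent = defaultdict(list)
    for agent, score in calls:
        by_agent[agent].append(score)
    averages = {agent: mean(scores) for agent, scores in by_agent.items()}
    # Highest averages first for Top Performing, lowest first for Needs Attention.
    top_performing = sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top_n]
    needs_attention = sorted(averages.items(), key=lambda item: item[1])[:bottom_n]
    return top_performing, needs_attention


calls = [
    ("Agent A", 90.0), ("Agent A", 80.0),
    ("Agent B", 70.0),
    ("Agent C", 95.0),
    ("Agent D", 40.0), ("Agent D", 60.0),
]
top, attention = rank_agents(calls)
print(top)        # [('Agent C', 95.0), ('Agent A', 85.0), ('Agent B', 70.0)]
print(attention)  # [('Agent D', 50.0), ('Agent B', 70.0)]
```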
Every element of the Dashboard is interactive. Clicking a KPI card, a row in the Criteria table, or an agent in either panel navigates directly to the Activities tab with the corresponding filter applied. This reduces the time from insight to investigation to a single click.
6. Activities
The Activities tab is the complete operational record of every call processed by Agent Evaluation. It provides call-level detail, filtering, and access to per-criterion scoring breakdowns for individual calls.
Call Table
| Column | Description |
|---|---|
| Call ID | Unique identifier for the call. Click to open the call detail drawer. A flag icon appears when no agent audio is detected in the recording. |
| Agent | The voice agent that handled the call. |
| Evaluation | The evaluation template applied to this call. |
| Score | The composite score for this call, displayed as a colour-coded badge. |
| Sentiment | Detected customer sentiment: Positive, Neutral, or Negative. |
| Duration | Length of the call recording. |
| Timestamp | Date and time the evaluation was completed. |
| Status | Pass, Review, Fail, or Skipped. See Call Status Definitions below. |
| Actions | View Details opens the call detail drawer. Copy Call ID copies the identifier to the clipboard. |
Call Status Definitions
| Status | Definition | Recommended Action |
|---|---|---|
| Pass | Composite score meets the configured quality standard | No immediate action required. Monitor for trend changes over time. |
| Review | Composite score falls in the borderline band between pass and fail thresholds | Recommend human review of the call detail. Open the criteria breakdown to identify which dimensions are underperforming. |
| Fail | Composite score falls below the configured minimum threshold | Prioritise for agent coaching. Use the per-criterion breakdown to identify specific improvement areas with transcript evidence. |
| Skipped | Call duration was below the configured minimum threshold | No quality action required. Skipped calls are excluded from all scoring calculations. Review the skip threshold configuration if the volume of skipped calls is unexpectedly high. |
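The status definitions above can be read as a simple decision order. The sketch below is an interpretation, not the platform's implementation: the function name and the composite cut-offs of 80 and 60 are hypothetical, since this page does not state where composite-level cut-offs are configured. Only the 10-second default skip threshold comes from the documentation.

```python
from typing import Optional


def call_status(
    duration_seconds: float,
    composite: Optional[float],
    skip_shorter_than: float = 10,  # documented default
    pass_cutoff: float = 80,        # hypothetical value
    fail_cutoff: float = 60,        # hypothetical value
) -> str:
    # Duration is checked first: skipped calls are excluded from scoring.
    if duration_seconds < skip_shorter_than or composite is None:
        return "Skipped"
    if composite >= pass_cutoff:
        return "Pass"
    if composite < fail_cutoff:
        return "Fail"
    return "Review"


print(call_status(5, None))   # Skipped
print(call_status(120, 85))   # Pass
print(call_status(120, 70))   # Review
print(call_status(120, 59.9)) # Fail
```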
Call Detail Drawer
Selecting any Call ID opens a side panel with the complete evaluation record for that call:
- Composite score with status classification.
- Call metadata including duration, detected sentiment, and evaluation timestamp.
- Per-criterion breakdown showing the individual score for each criterion, with a progress bar indicating where the score sits relative to the configured pass and fail thresholds.
- Flagged status indicator when the call has been manually flagged for human review.
No Agent Audio Indicator
When no agent audio is detected in a call recording, a flag indicator appears next to the Call ID. This signals a potential recording or integration issue rather than an agent performance concern. Calls with this indicator should be investigated.
Manual Recording Upload
To evaluate recordings that were not automatically captured through the integration:
- Select Upload Recordings from the Activities toolbar.
- Select the evaluation template to apply to the recordings.
- Upload the recording file. Supported formats are ZIP, MP4, WAV, and MP3 with a maximum file size of 50 MB.
- Select Upload and Evaluate. Results will appear in the Activities table once processing is complete.
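Before uploading in bulk, it can help to pre-check files locally against the documented limits. The sketch below is a hypothetical helper, not part of the product; it assumes that 1 MB means 1024 × 1024 bytes and that format is determined by file extension, neither of which is confirmed by this page.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".zip", ".mp4", ".wav", ".mp3"}
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # assumes 1 MB = 1024 * 1024 bytes


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds 50 MB")
    return problems


print(validate_upload("call-0412.WAV", 12_000_000))        # []
print(validate_upload("call-0413.ogg", 60 * 1024 * 1024))  # two problems
```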
7. Configuration
The Configuration tab is the central management interface for evaluation templates. Administrators create, modify, activate, and archive evaluation templates from this view.
Evaluation Table
| Column | Description |
|---|---|
| Name | The evaluation template name and its system-assigned unique identifier. |
| Agent | The voice agent assigned to this evaluation template. |
| LLM Model | The evaluator model configured to score transcripts for this evaluation. |
| Criteria | The number of quality criteria defined in this evaluation template. |
| Calls Run | The cumulative count of calls scored by this evaluation template since activation. |
| Status | Active indicates the evaluation is currently scoring new calls. Paused indicates it is inactive. |
Template Actions
| Action | Behaviour |
|---|---|
| Edit | Opens the setup wizard to modify any aspect of the evaluation template. The Voice Agent field is locked while the evaluation is Active. |
| Start or Pause | Activates or deactivates the evaluation. A paused evaluation does not score new calls. All historical results and configuration are preserved. |
| Archive | Removes the template from the active list. Archived templates are not deleted. All historical scoring data remains accessible. |
The Voice Agent assigned to an evaluation template cannot be changed while the template is set to Active. Administrators must pause the evaluation before modifying the agent assignment. This constraint ensures that scoring continuity is maintained for active call populations.
8. Creating an Evaluation
Select Create Evaluation in the Configuration tab to launch the three-step configuration wizard. Each step must be completed and validated before progression to the next.
Step 1: Setup
Evaluation Name — Assign a name that clearly communicates the purpose and scope of the evaluation. Recommended naming conventions include agent role, quality focus, and period. Examples: Customer Support Quality Q2 2026 or Outbound Sales Compliance EMEA.
Voice Agent — Select the voice agent whose calls this evaluation will score. All calls completed by the selected agent will be evaluated automatically once the evaluation is activated.
The Voice Agent field is locked once the evaluation is set to Active. To reassign an evaluation to a different agent, the evaluation must first be paused.
Call Settings — Two settings control call inclusion in the evaluation:
- Include Past Call Recordings determines whether historical recordings are scored. When disabled (the default), only calls made after the evaluation is activated are scored. When enabled, all existing recordings for the assigned agent are also scored.
- Skip Calls Shorter Than defines the minimum call duration for evaluation eligibility. Calls below this threshold are classified as Skipped and excluded from all calculations. The default value is 10 seconds.
Evaluator Model — Select the language model that will read transcripts and score each criterion. A primary model and a fallback model must both be specified. The fallback model is invoked automatically in the event of primary model unavailability.
The transcription provider converts audio recordings to text prior to scoring. The temperature parameter controls scoring determinism. A value of 0.3 is recommended for consistent, repeatable evaluation results.
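The primary and fallback arrangement can be sketched as a try-then-retry pattern. Everything named here is hypothetical: `EvaluatorSettings`, `ModelUnavailable`, the model names, and the `call_model` callable are illustrative stand-ins, and the platform's actual failover conditions are not described on this page. Only the recommended temperature of 0.3 comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorSettings:
    primary_model: str
    fallback_model: str
    temperature: float = 0.3  # recommended value from the documentation


class ModelUnavailable(Exception):
    """Raised by the hypothetical client when a model cannot be reached."""


def score_with_fallback(settings, prompt, call_model):
    """Try the primary model first; use the fallback only if it is unavailable."""
    try:
        return call_model(settings.primary_model, prompt, settings.temperature)
    except ModelUnavailable:
        return call_model(settings.fallback_model, prompt, settings.temperature)


# Fake client that simulates an outage of the primary model.
def fake_call_model(model, prompt, temperature):
    if model == "primary-model":
        raise ModelUnavailable(model)
    return {"model": model, "temperature": temperature, "score": 72}


settings = EvaluatorSettings(primary_model="primary-model", fallback_model="fallback-model")
result = score_with_fallback(settings, "Score this transcript.", fake_call_model)
print(result["model"])  # fallback-model
```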
Step 2: QA Instructions
The system prompt is the instruction set provided to the evaluator model. It defines the evaluation context, the standards to apply, and the required output format. A default prompt aligned with general call quality standards is pre-populated.
Administrators should customise the system prompt to reflect organisational tone of voice, industry-specific context, regulatory requirements, and any agent-specific evaluation considerations. The system prompt applies to all criteria within the evaluation.
Keep the system prompt focused on evaluation principles and context. Criterion-specific scoring guidance belongs in the individual Scoring Prompt fields configured in Step 3. Separating these concerns produces more consistent and predictable scoring outcomes.
Step 3: Criteria
Criteria define the quality dimensions evaluated on every call. Each criterion is scored independently. Select Add Criteria to add a criterion to the evaluation. Selecting a preset criterion auto-populates the scoring rubric with an industry-standard prompt. Custom criteria can be defined from scratch.
New criteria are added to the top of the criteria list in an expanded state for immediate configuration. All other criteria collapse automatically to maintain a clean workspace.
9. Criteria and Scoring
Criteria are the foundational unit of Agent Evaluation. They define precisely what quality means within a given evaluation context, and how it is measured at the call level.
Criterion Configuration Fields
| Field | Description |
|---|---|
| Criterion Preset | Fifteen standard presets are available. Selecting a preset populates the scoring prompt with a pre-validated rubric. The Custom option allows fully bespoke criteria definition. |
| Weight | A relative value determining this criterion’s proportional contribution to the composite score. Weights are not required to sum to 100. The platform calculates percentage contributions automatically based on the total weight across all criteria. |
| Pass Threshold | The minimum score (0 to 100) required for this criterion to be classified as Passed on a given call. |
| Fail Threshold | The score below which this criterion is classified as Failed on a given call. |
| Review Range | The band between the fail and pass thresholds, calculated automatically by the platform. Calls scoring in this range are classified as Review. |
| Scoring Prompt | The rubric provided to the evaluator model describing expected performance at each score band: 90 to 100, 60 to 89, and 0 to 59. This prompt is specific to the criterion and supplements the general system prompt. |
Standard Criterion Library
The following fifteen criteria are available as presets. Each includes a pre-validated scoring rubric aligned with industry standards for voice AI quality assessment.
| Criterion | Quality Dimension Evaluated |
|---|---|
| Empathy | Whether the agent acknowledged customer emotions appropriately and maintained a warm, natural tone throughout the interaction. |
| Language Switch | Whether the agent detected and adapted to the customer’s preferred language without friction or quality degradation. |
| Turn Taking | Whether the conversation was balanced and natural, with appropriate pauses and no agent interruptions. |
| Context Awareness | Whether the agent leveraged available customer history and avoided requesting information already on record. |
| Recovery From Errors | Whether the agent acknowledged and corrected errors promptly, maintaining customer confidence throughout. |
| Intent Recognition | Whether the customer’s intent was correctly identified on the first attempt without unnecessary rerouting or clarification. |
| Greeting Accuracy | Whether the call opening included all required elements: brand identification, agent identification, professional tone, and offer of assistance. |
| Intent Confirmation | Whether the agent confirmed their understanding of the customer’s request before taking action. |
| Probing Questions | Whether the agent used targeted, open-ended questions to establish the full context required for resolution. |
| Call Closing | Whether the agent summarised agreed next steps, confirmed customer satisfaction, and completed the call professionally. |
| No Offensive Content | Whether the call was free from offensive, discriminatory, or non-compliant language throughout the interaction. |
| No Misrepresentation | Whether all information provided by the agent was accurate and no misleading claims were made. |
| Privacy Compliance | Whether identity verification procedures were followed correctly and sensitive data was handled in accordance with applicable protocols. |
| Confidence Level | Whether the agent projected competence and authority, remained composed under pressure, and responded decisively. |
| Input Metadata Consistency | Whether all required post-call fields were accurately completed and consistent with the content of the interaction. |
Threshold Configuration
Pass and Fail thresholds are configured independently for each criterion within each evaluation template. This design allows organisations to apply different quality standards to the same criterion depending on the context.
Pass and Fail thresholds apply at the criterion level, not at the composite score level. A call may have a high composite score while one criterion falls below its configured Fail threshold, if other criteria with higher weights offset it. Administrators should review the per-criterion breakdown in the call detail drawer when investigating calls with borderline composite scores.
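A small worked case shows how a failed criterion can hide behind a healthy composite. The scores, weights, and thresholds below are invented for illustration.

```python
criteria = {
    # name: (score, weight, pass_threshold, fail_threshold)
    "Privacy Compliance": (95, 40, 90, 70),
    "No Misrepresentation": (92, 40, 90, 70),
    "Empathy": (45, 20, 80, 50),
}

total_weight = sum(weight for _, weight, _, _ in criteria.values())
composite = sum(score * weight for score, weight, _, _ in criteria.values()) / total_weight

# A criterion is Failed when its score is below its own Fail threshold.
failed = [name for name, (score, _, _, fail) in criteria.items() if score < fail]

print(composite)  # 83.8
print(failed)     # ['Empathy']
```

The composite of 83.8 is carried by the two heavily weighted criteria, while Empathy at 45 sits below its Fail threshold of 50. This is why the per-criterion breakdown matters when reviewing a call.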
Weight Configuration
Criterion weights determine the relative contribution of each criterion to the composite score. The platform calculates percentage contributions automatically from the raw weight values provided.
Compliance-critical criteria such as Privacy Compliance, No Offensive Content, and No Misrepresentation should typically be assigned higher weights in regulated contexts to ensure that compliance failures drive the composite score down appropriately.
Example: If Privacy Compliance has a weight of 40 and Empathy has a weight of 10, a given drop in the Privacy Compliance score will move the composite score four times as much as the same drop in the Empathy score. This reflects the relative business priority of those two dimensions.
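Because weights are relative, their percentage contributions follow from dividing each weight by the total. The sketch below illustrates this with weights that deliberately do not sum to 100; the function name and the third criterion's weight are assumptions added for the example.

```python
def weight_percentages(weights: dict[str, float]) -> dict[str, float]:
    """Convert raw weights into percentage contributions to the composite score."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    return {name: 100 * w / total for name, w in weights.items()}


shares = weight_percentages({"Privacy Compliance": 40, "Empathy": 10, "Call Closing": 30})
print(shares)
# {'Privacy Compliance': 50.0, 'Empathy': 12.5, 'Call Closing': 37.5}
print(shares["Privacy Compliance"] / shares["Empathy"])  # 4.0
```

The four-to-one ratio between the two criteria is preserved regardless of what other criteria are present or what the weights sum to.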
10. Call Settings
Call Settings are defined in Step 1 of the evaluation setup wizard and determine which calls are included in evaluation scoring. These settings apply to the evaluation template and cannot be overridden at the call level.
Include Past Call Recordings
| Configuration | Behaviour |
|---|---|
| Disabled (default) | Only calls completed after the evaluation template is saved and activated will be scored. Existing recordings for the assigned agent are not processed. |
| Enabled | All existing recordings for the assigned agent will be scored in addition to all subsequent calls. Enabling this setting will increase processing time and evaluation cost in proportion to the volume of historical recordings available. |
Skip Calls Shorter Than
This setting establishes the minimum call duration threshold for evaluation eligibility. Calls below this duration are automatically classified as Skipped and excluded from all quality metrics, including composite score calculations and criteria performance aggregations.
The skip threshold is designed to prevent incomplete interactions, such as dropped calls, silent calls, and test calls, from distorting quality measurements. Skipped calls remain visible in the Activities log for operational transparency, displayed with reduced opacity and a Skipped status label.
The default skip threshold of 10 seconds is appropriate for most deployments. If your agent population handles very short, transactional interactions, consider lowering this threshold to ensure those calls are captured in quality metrics. If your environment produces a high volume of accidental or test calls, consider raising the threshold.
Configuration range: 0 to 300 seconds. Default: 10 seconds.
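The effect of the skip threshold on aggregate metrics can be sketched as follows. The helper names and the call data are hypothetical; the 0 to 300 range and the 10-second default come from this page. The score attached to the 4-second call is what it would have received had it been evaluated, included only to show the distortion.

```python
MIN_SKIP_SECONDS = 0
MAX_SKIP_SECONDS = 300
DEFAULT_SKIP_SECONDS = 10


def validate_skip_threshold(seconds: int) -> int:
    """Reject values outside the documented configuration range."""
    if not MIN_SKIP_SECONDS <= seconds <= MAX_SKIP_SECONDS:
        raise ValueError("skip threshold must be between 0 and 300 seconds")
    return seconds


def average_composite(calls, skip_shorter_than=DEFAULT_SKIP_SECONDS):
    """calls: (duration_seconds, composite_score) pairs. Skipped calls are excluded."""
    threshold = validate_skip_threshold(skip_shorter_than)
    scored = [score for duration, score in calls if duration >= threshold]
    return sum(scored) / len(scored) if scored else None


calls = [(4, 10.0), (45, 80.0), (120, 90.0)]
print(average_composite(calls))                       # 85.0
print(average_composite(calls, skip_shorter_than=0))  # 60.0
```

With the default threshold the dropped 4-second call is excluded and the average is 85.0; with the threshold set to 0 it is included and drags the average down to 60.0.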
11. Glossary
| Term | Definition |
|------|-----------|
| Agent Evaluation | The UnleashX module that automatically scores calls made by AI voice agents against configurable quality criteria. |
| Composite Score | The weighted average of all criterion scores for a single call, representing the overall quality of that interaction. |
| Criterion | A single, defined quality dimension evaluated on a call. Examples include Empathy, Call Closing, and Privacy Compliance. |
| Criteria Score | The score from 0 to 100 assigned to a single criterion for a single call by the evaluator model. |
| Evaluation Template | A saved configuration defining which criteria to score, the associated rubrics, thresholds, evaluator model, and agent assignment. |
| Evaluator Model | The language model that reads call transcripts and assigns criterion scores. Fully independent of the voice agent. |
| Fail Threshold | The score below which a criterion is classified as Failed for a given call. |
| Pass Threshold | The minimum score required for a criterion to be classified as Passed for a given call. |
| Review | A status indicating a criterion or composite score falls in the band between the configured fail and pass thresholds. |
| Scoring Prompt | The rubric provided to the evaluator model describing expected performance at each score band for a specific criterion. |
| Skipped | A call status applied when call duration falls below the configured minimum threshold. Skipped calls are excluded from all quality calculations. |
| Temperature | An evaluator model parameter controlling output determinism. Lower values produce more consistent, repeatable scoring results. Recommended value: 0.3. |
| Transcription Provider | The third-party service that converts call audio recordings to text prior to evaluation scoring. |
| Weight | A relative numerical value assigned to a criterion that determines its proportional contribution to the composite score. |