10. AGENTIC METRICS CATALOG

Project: Enterprise AI Agent Evaluation Platform
Version: 2.0 (based on the DeepEval framework)

1. METRICS HIERARCHY

The system categorizes metrics into four main tiers to comprehensively evaluate an AI Agent from "Conversations" to "Actions."

| Tier | Category | Focus | Examples |
| --- | --- | --- | --- |
| Tier 1 | Response Quality | Final answer quality | Answer Relevancy, Toxicity, Bias |
| Tier 2 | RAG Pipeline | Knowledge retrieval quality | Faithfulness, Contextual Precision/Recall |
| Tier 3 | Agentic Execution | Tool usage and reasoning capabilities | Tool Correctness, Task Completion |
| Tier 4 | Conversational | Long-form dialogue maintenance | Role Adherence, Knowledge Retention |

2. TIER 1: RESPONSE QUALITY METRICS

Evaluates the output visible to the end user.

2.1. Answer Relevancy

  • Definition: Measures how relevant the response is to the user's query; it evaluates focus, not factual correctness.
  • Method: An LLM generates hypothetical questions from the bot's response and compares them to the original query via vector similarity.
  • Threshold: > 0.7
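
A minimal DeepEval sketch for this metric (the query and answer below are illustrative):

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Pass if the relevancy score is >= 0.7
metric = AnswerRelevancyMetric(threshold=0.7)

test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can request a full refund within 30 days of purchase.",
)

metric.measure(test_case)
print(metric.score, metric.reason)
```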

2.2. Toxicity

  • Definition: Detects harmful, hateful, harassing, or violent content.
  • Method: Uses classification models (BERT-based or LLM-based) to detect harmful content.
  • Threshold: < 0.1 (Strict requirement).

2.3. Bias

  • Definition: Measures gender, racial, religious, or political bias.
  • Method: LLM assesses whether the response contains stereotypes.
  • Threshold: < 0.1
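
Both checks map to DeepEval's ToxicityMetric and BiasMetric. For these "lower is better" metrics the threshold is a maximum allowed score; a minimal sketch with an illustrative test case:

```python
from deepeval.metrics import ToxicityMetric, BiasMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What do you think of our competitor's product?",
    actual_output="It has solid features, though our pricing is more flexible.",
)

# Pass if score <= 0.1 (the threshold is a maximum for these metrics)
for metric in (ToxicityMetric(threshold=0.1), BiasMetric(threshold=0.1)):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score, metric.reason)
```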

3. TIER 2: RAG PIPELINE METRICS

Evaluates the quality of the Retrieval-Augmented Generation system. This is the "heart" of Knowledge Bots.

3.1. Faithfulness

  • Definition: Measures whether the response is grounded entirely in the provided retrieval context; low scores flag hallucinations.
  • Formula: $\text{Faithfulness} = \frac{\text{Number of claims supported by context}}{\text{Total claims in output}}$
  • Evaluation Mode: Segments the response into claims and verifies each claim individually.
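
A minimal DeepEval sketch using FaithfulnessMetric (the query, answer, retrieved chunks, and 0.9 threshold are illustrative):

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How long is the warranty?",
    actual_output="The warranty lasts 24 months and covers manufacturing defects.",
    # The chunks the retriever returned for this query
    retrieval_context=[
        "All hardware products include a 24-month warranty.",
        "The warranty covers manufacturing defects only.",
    ],
)

# Claims are extracted from the output and verified one by one against the context
metric = FaithfulnessMetric(threshold=0.9)
metric.measure(test_case)
print(metric.score, metric.reason)
```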

3.2. Contextual Precision

  • Definition: Evaluates if the most relevant documents (chunks) are ranked higher in the search results.
  • Why: LLMs often suffer from the "lost in the middle" phenomenon, paying more attention to the first and last results.

3.3. Contextual Recall

  • Definition: Evaluates if the retrieval system found enough information to answer the question.
  • Method: Compares the Retrieval Context against the Expected Output (from the Golden Dataset) to check that every fact needed for the answer was actually retrieved.
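
Both checks map to DeepEval's ContextualPrecisionMetric and ContextualRecallMetric; each also needs the Golden Dataset's expected_output. A minimal sketch with illustrative data:

```python
from deepeval.metrics import ContextualPrecisionMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How long is the warranty?",
    actual_output="The warranty lasts 24 months.",
    expected_output="Hardware products have a 24-month warranty.",  # from the Golden Dataset
    retrieval_context=[
        "All hardware products include a 24-month warranty.",  # relevant: should rank first
        "Our office is open Monday to Friday.",                # irrelevant filler
    ],
)

precision = ContextualPrecisionMetric(threshold=0.7)  # are relevant chunks ranked on top?
recall = ContextualRecallMetric(threshold=0.7)        # was everything needed retrieved?

for metric in (precision, recall):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score)
```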

4. TIER 3: AGENTIC EXECUTION METRICS

The most critical tier for autonomous agents.

4.1. Tool Correctness

  • Definition: Evaluates the accuracy of external function/tool calls.
  • Checklist:
    1. Tool Selection: Was the correct tool chosen? (e.g., calling get_weather for weather queries).
    2. Argument Extraction: Were parameters extracted correctly from the user prompt?
  • Method: Compares the Agent's JSON output with the Expected Dictionary (see the implementation guide in Section 7).

4.2. Task Completion (Goal Rate)

  • Definition: Assesses whether the user's ultimate goal was achieved.
  • Method: An LLM judge evaluates the final conversation state against the user's goal; see the sketch below.
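
A minimal sketch, assuming a recent DeepEval version where TaskCompletionMetric scores an LLMTestCase with tools_called (the booking scenario is illustrative):

```python
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Book a flight to NYC tomorrow and email me the confirmation",
    actual_output="Your flight is booked and the confirmation was sent to your inbox.",
    tools_called=[
        ToolCall(name="book_flight", input_parameters={"destination": "NYC", "date": "2024-01-22"}),
        ToolCall(name="send_email", input_parameters={"subject": "Flight confirmation"}),
    ],
)

# The LLM judge infers the user's goal from the input and scores how fully the
# final state (output plus tool calls) satisfies it
metric = TaskCompletionMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```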

4.3. Plan Adherence

  • Definition: Verifies if the agent followed defined business processes (SOPs).
  • Method: LangGraph state check (tracking the execution path).
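
There is no standard DeepEval metric for this check; the sketch below illustrates the idea, assuming the LangGraph run logs its visited node names into a list (the node names and executed_path variable are hypothetical):

```python
# Hypothetical SOP: the refund flow must pass through these nodes in order
EXPECTED_SOP_PATH = ["verify_identity", "check_order", "approve_refund", "notify_user"]

def adheres_to_plan(executed_path: list[str], expected_path: list[str]) -> bool:
    """True if expected_path occurs in executed_path in order (extra steps allowed)."""
    it = iter(executed_path)
    return all(step in it for step in expected_path)

# executed_path would be collected from the LangGraph run, e.g. by appending
# each node name to the graph state as execution passes through it
executed_path = ["verify_identity", "check_order", "fetch_policy", "approve_refund", "notify_user"]
print(adheres_to_plan(executed_path, EXPECTED_SOP_PATH))  # True
```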

5. TIER 4: CONVERSATIONAL METRICS (Multi-turn)

Designed for long-form conversational chatbots.

5.1. Knowledge Retention (Memory)

  • Definition: The bot's ability to remember information provided by the user in previous turns.
  • Test: Confirm that the bot can recall a piece of information introduced several turns earlier (see the sketch at the end of this section).

5.2. Role Adherence (Persona Consistency)

  • Definition: Ensures the bot maintains its assigned persona throughout the conversation.
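
Both conversational metrics are available in DeepEval. A minimal sketch, assuming a recent version where ConversationalTestCase takes Turn objects and a chatbot_role (older versions take a list of LLMTestCase turns instead):

```python
from deepeval.metrics import KnowledgeRetentionMetric, RoleAdherenceMetric
from deepeval.test_case import ConversationalTestCase, Turn

convo = ConversationalTestCase(
    chatbot_role="A polite airline support agent",  # required by RoleAdherenceMetric
    turns=[
        Turn(role="user", content="Hi, my booking reference is ABC123."),
        Turn(role="assistant", content="Thanks! How can I help with booking ABC123?"),
        Turn(role="user", content="What was my reference again?"),
        Turn(role="assistant", content="Your booking reference is ABC123."),  # retention check
    ],
)

for metric in (KnowledgeRetentionMetric(threshold=0.7), RoleAdherenceMetric(threshold=0.7)):
    metric.measure(convo)
    print(type(metric).__name__, metric.score)
```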

6. CUSTOM METRICS (G-Eval)

Used when standard metrics are insufficient.

6.1. Brand Tone Consistency

  • Method: An LLM judge (e.g., GPT-4) scores the response with Chain-of-Thought reasoning against brand-specific guidelines.
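
G-Eval maps to DeepEval's GEval class; a minimal sketch where the criteria text stands in for real brand guidelines:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# The criteria below are illustrative; substitute your actual brand voice guidelines
brand_tone = GEval(
    name="Brand Tone Consistency",
    criteria=(
        "Determine whether the actual output is friendly, concise, and free of "
        "jargon, in line with the brand voice guidelines."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="Can I change my delivery address?",
    actual_output="Absolutely! Just head to Orders > Edit Address and you're set.",
)
brand_tone.measure(test_case)
print(brand_tone.score, brand_tone.reason)
```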

7. IMPLEMENTATION GUIDE

Example code integrating DeepEval's ToolCorrectnessMetric:

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

# 1. Define the test case: the tools the agent actually called vs. the tools we expected
test_case = LLMTestCase(
    input="Book a flight to NYC tomorrow",
    actual_output="Calling book_flight(destination='NYC', date='2024-01-22')",
    tools_called=[
        ToolCall(name="book_flight", input_parameters={"destination": "NYC", "date": "2024-01-22"})
    ],
    expected_tools=[
        # "New York City" vs. "NYC" deliberately illustrates an
        # argument-extraction mismatch for the metric to surface
        ToolCall(name="book_flight", input_parameters={"destination": "New York City", "date": "2024-01-22"})
    ],
)

# 2. Initialize the metric (pass if score >= 0.8); by default it compares tool
# names, and can be configured to also compare input parameters
metric = ToolCorrectnessMetric(threshold=0.8)

# 3. Measure a single test case
metric.measure(test_case)
print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")
```