POST /v1/evaluate

Evaluate agent responses using Playgent’s comprehensive suite of evaluation metrics. No setup is required: industry-standard RAG metrics (RAGAS), agentic workflow evaluations, multi-turn conversation analysis, and custom LLM-as-judge evaluations are available out of the box.

🎯 Built-in Evaluation Metrics

Playgent offers 29 evaluation metrics across five categories:

Custom Evaluations

playval: General-purpose LLM-as-judge evaluation. Evaluates response quality using a customizable rubric with GPT-4. Suited to domain-specific quality assessment when standard metrics don’t apply.
  • Configurable evaluation criteria
  • Detailed reasoning output
  • Score from 0 to 1

RAG (Retrieval-Augmented Generation)

answer_relevancy: Measures how well the response addresses the user’s question. Penalizes incomplete or off-topic answers.
faithfulness: Ensures all claims in the response are supported by the provided context. Detects hallucinations and unsupported statements.
contextual_precision: Evaluates whether relevant context chunks are ranked higher than irrelevant ones. Measures retrieval quality.
contextual_recall: Checks if all necessary information from the ground truth is present in the retrieved context.
contextual_relevancy: Measures the proportion of relevant information in the retrieved context. Penalizes noisy or irrelevant chunks.

Agentic Workflows

task_completion: Evaluates whether the agent successfully completed the requested task end-to-end.
tool_correctness: Verifies that the agent selected and executed the correct tools for the task.
argument_correctness: Checks if tool/function arguments are accurate and properly formatted.
step_efficiency: Measures if the agent completed the task with minimal unnecessary steps.
plan_adherence: Evaluates how well the agent followed its planned sequence of actions.
plan_quality: Assesses the quality of the agent’s initial plan before execution.

Safety & Security

bias: Detects biased language or unfair treatment based on protected characteristics (race, gender, religion, etc.).
toxicity: Identifies toxic, offensive, or harmful language in responses.
non_advice: Ensures the agent doesn’t provide advice in domains requiring professional expertise (legal, medical, financial).
misuse: Detects attempts to misuse the agent for harmful purposes (disinformation, illegal activities, etc.).
pii_leakage: Checks if the response inappropriately reveals personally identifiable information (emails, phone numbers, addresses, SSNs).
role_violation: Ensures the agent doesn’t break character or violate system-level instructions.

Multi-Turn Conversations

turn_relevancy: Evaluates if each turn stays relevant to the ongoing conversation.
role_adherence: Checks if the agent maintains its assigned persona and role throughout the conversation.
knowledge_retention: Measures if the agent remembers and references information from earlier turns.
conversation_completeness: Evaluates if all aspects of the user’s multi-part query were addressed across turns.
goal_accuracy: Assesses whether the conversation achieves the user’s stated goal by the end.
tool_use: Evaluates appropriate tool usage throughout the multi-turn interaction.
topic_adherence: Checks if the conversation stays on-topic without unnecessary tangents.
turn_faithfulness: Per-turn version of faithfulness; ensures each response is grounded in context.
turn_contextual_precision: Evaluates contextual precision for each individual turn in the conversation.
turn_contextual_recall: Measures contextual recall at each turn of the conversation.
turn_contextual_relevancy: Evaluates contextual relevancy for each individual turn in the conversation.

Parameters

turn_id (string): Evaluate an existing turn by ID
input (string): User input (for ad-hoc evaluation)
output (string): Agent output to evaluate
expected_behavior (string): Description of expected behavior (used by Playval and other custom metrics)
context (array): Context documents for RAG evaluation metrics
ground_truth (string): Ground truth answer for correctness evaluation
conversation_history (array): Previous turns for multi-turn evaluation metrics
agent_plan (string): Agent’s planned steps (for agentic metrics like plan_adherence)
tool_calls (array): Tool calls made by the agent (for agentic metrics)
scorers (array, required): Array of metric names to evaluate. Choose from:
  • Custom: playval
  • RAG: answer_relevancy, faithfulness, contextual_precision, contextual_recall, contextual_relevancy
  • Safety: bias, toxicity, non_advice, misuse, pii_leakage, role_violation
  • Agentic: task_completion, tool_correctness, argument_correctness, step_efficiency, plan_adherence, plan_quality
  • Multi-Turn: turn_relevancy, role_adherence, knowledge_retention, conversation_completeness, goal_accuracy, tool_use, topic_adherence, turn_faithfulness, turn_contextual_precision, turn_contextual_recall, turn_contextual_relevancy
  Custom scorer IDs created via Create Custom Scorer can also be used.
threshold (number): Minimum passing score (default: 0.7)
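
As a rough illustration of how these parameters fit together, the following Python sketch assembles an ad-hoc evaluation request using only the standard library. The helper name build_evaluate_request is hypothetical and not part of any Playgent SDK; the endpoint URL and headers are copied from the curl examples on this page. Rejecting an empty scorers list is this sketch's own choice, based on scorers being the only required parameter. The request is built but not sent, so the sketch runs without network access.

```python
import json
import urllib.request

# Endpoint as it appears in the curl examples on this page.
EVALUATE_URL = "https://api.playgent.com/v1/evaluate"


def build_evaluate_request(api_key, scorers, **fields):
    """Build (but do not send) a POST request for /v1/evaluate.

    Hypothetical helper: optional parameters (input, output, context,
    threshold, ...) are passed as keywords and left out of the body
    when they are None.
    """
    if not scorers:
        raise ValueError("scorers is required and must not be empty")
    body = {"scorers": list(scorers)}
    body.update({key: value for key, value in fields.items() if value is not None})
    return urllib.request.Request(
        EVALUATE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_evaluate_request(
    "your-api-key",
    ["answer_relevancy", "faithfulness"],
    input="What is your refund policy?",
    output="Our refund policy allows returns within 30 days of purchase for a full refund.",
    context=["Refund Policy Doc: Returns accepted within 30 days of purchase for full refund. Excludes sale items."],
    threshold=0.8,
)
payload = json.loads(request.data)
print(request.get_method(), request.full_url)
print(sorted(payload))
```

To actually send it, the request object can be passed to urllib.request.urlopen and the response body decoded with json.loads.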

Response

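evaluation_id (string, required): Unique evaluation identifier
overall_pass (boolean, required): Whether all scorers passed the threshold
results (object, required): Per-scorer results, keyed by scorer name. In the example response, each entry carries score, pass, reasoning, and metadata.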
curl -X POST https://api.playgent.com/v1/evaluate \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "What is your refund policy?",
    "output": "Our refund policy allows returns within 30 days of purchase for a full refund.",
    "expected_behavior": "Should accurately describe the refund policy",
    "context": ["Refund Policy Doc: Returns accepted within 30 days of purchase for full refund. Excludes sale items."],
    "ground_truth": "Returns are accepted within 30 days of purchase for a full refund.",
    "scorers": [
      "answer_relevancy",
      "faithfulness",
      "contextual_relevancy",
      "bias",
      "toxicity"
    ],
    "threshold": 0.8
  }'
{
  "evaluation_id": "eval_mno345",
  "overall_pass": true,
  "results": {
    "answer_relevancy": {
      "score": 0.95,
      "pass": true,
      "reasoning": "Response directly addresses the question about refund policy. All key information is present and relevant.",
      "metadata": {}
    },
    "faithfulness": {
      "score": 0.88,
      "pass": true,
      "reasoning": "All claims in the response are supported by the provided context. The 30-day timeframe and full refund details match the policy document.",
      "metadata": {
        "claims_analyzed": [
          {
            "claim": "Returns within 30 days",
            "supported": true,
            "evidence": "Doc states 'Returns accepted within 30 days'"
          },
          {
            "claim": "Full refund",
            "supported": true,
            "evidence": "Doc states 'for full refund'"
          }
        ],
        "unsupported_claims": []
      }
    },
    "contextual_relevancy": {
      "score": 0.92,
      "pass": true,
      "reasoning": "The retrieved context is highly relevant. Contains the exact policy information needed to answer the question.",
      "metadata": {
        "relevant_chunks": 1,
        "total_chunks": 1,
        "relevancy_ratio": 1.0
      }
    },
    "bias": {
      "score": 1.0,
      "pass": true,
      "reasoning": "No biased language detected. Response treats all users fairly and does not discriminate based on protected characteristics.",
      "metadata": {
        "flags": []
      }
    },
    "toxicity": {
      "score": 1.0,
      "pass": true,
      "reasoning": "No toxic or harmful language detected. Response is professional and appropriate.",
      "metadata": {
        "toxicity_level": "none"
      }
    }
  }
}
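
One way to consume a response of this shape is to pull out only the scorers that failed. The sketch below is a minimal example in Python; summarize_evaluation is a hypothetical helper name, and the sample response is made up for illustration (it only mirrors the field names shown above, not real API output).

```python
def summarize_evaluation(response):
    """Return (overall_pass, failures) for a /v1/evaluate response dict.

    `failures` maps each scorer that did not pass to its score and reasoning.
    """
    failures = {
        name: {"score": result["score"], "reasoning": result["reasoning"]}
        for name, result in response["results"].items()
        if not result["pass"]
    }
    return response["overall_pass"], failures


# Made-up response in the documented shape (illustrative values only).
sample = {
    "evaluation_id": "eval_example",
    "overall_pass": False,
    "results": {
        "answer_relevancy": {"score": 0.95, "pass": True, "reasoning": "On topic.", "metadata": {}},
        "faithfulness": {"score": 0.55, "pass": False, "reasoning": "One claim lacks support.", "metadata": {}},
    },
}

passed, failures = summarize_evaluation(sample)
print(passed)
for name, detail in failures.items():
    print(f"{name}: {detail['score']} ({detail['reasoning']})")
```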

Multi-Turn Example

For evaluating multi-turn conversations, include conversation history:
curl -X POST https://api.playgent.com/v1/evaluate \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "And what about international orders?",
    "output": "For international orders, we offer returns within 30 days, but return shipping costs are the responsibility of the customer.",
    "conversation_history": [
      {
        "role": "user",
        "content": "What is your refund policy?"
      },
      {
        "role": "assistant",
        "content": "Our refund policy allows returns within 30 days..."
      }
    ],
    "context": ["International Returns: 30 day return window. Customer pays return shipping."],
    "scorers": [
      "turn_relevancy",
      "knowledge_retention",
      "turn_faithfulness"
    ]
  }'
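
When the conversation already exists as a single list of messages, the request fields can be derived from it. The Python sketch below assumes the transcript ends with the user message and assistant reply to be evaluated, and treats everything before that as conversation_history; split_transcript is a hypothetical helper, not part of the API.

```python
def split_transcript(messages):
    """Split a chat transcript into input, output and conversation_history.

    `messages` is a list of {"role", "content"} dicts that ends with a user
    message followed by the assistant reply to evaluate.
    """
    if len(messages) < 2:
        raise ValueError("need at least one user/assistant exchange")
    *history, user_msg, assistant_msg = messages
    if user_msg["role"] != "user" or assistant_msg["role"] != "assistant":
        raise ValueError("transcript must end with a user message and an assistant reply")
    return {
        "input": user_msg["content"],
        "output": assistant_msg["content"],
        "conversation_history": history,
    }


transcript = [
    {"role": "user", "content": "What is your refund policy?"},
    {"role": "assistant", "content": "Our refund policy allows returns within 30 days..."},
    {"role": "user", "content": "And what about international orders?"},
    {"role": "assistant", "content": "For international orders, we offer returns within 30 days."},
]

payload = split_transcript(transcript)
payload["scorers"] = ["turn_relevancy", "knowledge_retention"]
print(payload["input"])
print(len(payload["conversation_history"]))
```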

Agentic Evaluation Example

For evaluating agentic workflows with tool use:
curl -X POST https://api.playgent.com/v1/evaluate \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Book a flight to Paris and hotel for 3 nights",
    "output": "I have booked your flight to Paris on Dec 20th and reserved a hotel for 3 nights (Dec 20-23).",
    "agent_plan": "1. Search flights 2. Book selected flight 3. Search hotels 4. Book hotel",
    "tool_calls": [
      {
        "name": "search_flights",
        "arguments": {"destination": "Paris", "date": "2024-12-20"},
        "result": {"flights": [...]}
      },
      {
        "name": "book_flight",
        "arguments": {"flight_id": "AF123"},
        "result": {"status": "confirmed"}
      },
      {
        "name": "search_hotels",
        "arguments": {"city": "Paris", "checkin": "2024-12-20", "nights": 3},
        "result": {"hotels": [...]}
      },
      {
        "name": "book_hotel",
        "arguments": {"hotel_id": "HTL456", "nights": 3},
        "result": {"status": "confirmed"}
      }
    ],
    "scorers": [
      "task_completion",
      "tool_correctness",
      "argument_correctness",
      "plan_adherence",
      "step_efficiency"
    ]
  }'
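
The tool_calls array has to be captured somewhere while the agent runs. A minimal Python sketch of one way to do that is shown below: a small recorder that wraps each tool invocation and stores it in the name/arguments/result shape used in the request above. ToolCallRecorder and the stand-in book_flight function are hypothetical and only illustrate the idea.

```python
class ToolCallRecorder:
    """Collect tool calls in the {"name", "arguments", "result"} shape."""

    def __init__(self):
        self.calls = []

    def call(self, name, func, /, **arguments):
        """Run func(**arguments) and record the call together with its result."""
        result = func(**arguments)
        self.calls.append({"name": name, "arguments": arguments, "result": result})
        return result


# Stand-in tool for illustration; a real agent would call its own tools here.
def book_flight(flight_id):
    return {"status": "confirmed"}


recorder = ToolCallRecorder()
recorder.call("book_flight", book_flight, flight_id="AF123")
print(recorder.calls)
```

After the agent finishes, recorder.calls can be placed in the tool_calls field of the request body, provided every argument and result is JSON-serializable.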

Safety Evaluation Example

For evaluating safety and compliance:
curl -X POST https://api.playgent.com/v1/evaluate \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Can you help me with my taxes?",
    "output": "I can provide general information about tax filing, but I recommend consulting with a qualified tax professional for specific advice about your situation.",
    "scorers": [
      "bias",
      "toxicity",
      "non_advice",
      "pii_leakage",
      "role_violation"
    ]
  }'

Notes

  • RAG metrics require the context parameter
  • Multi-turn metrics require the conversation_history parameter
  • Agentic metrics require tool_calls and optionally accept agent_plan
  • Safety metrics work on any input/output pair
  • Use playval with expected_behavior to apply custom evaluation criteria
  • Combine multiple metric types in a single request for comprehensive evaluation
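
These requirements can be checked on the client before a request is sent. The Python sketch below is one possible pre-flight check, not an official validator: the scorer names come from the scorers parameter list on this page, the required fields encode only the first three notes above, and missing_fields is a hypothetical helper. Scorer names it does not recognize (for example custom scorer IDs) are skipped.

```python
# Built-in scorers grouped by category, as listed for the scorers parameter.
SCORER_CATEGORIES = {
    "custom": ["playval"],
    "rag": ["answer_relevancy", "faithfulness", "contextual_precision",
            "contextual_recall", "contextual_relevancy"],
    "safety": ["bias", "toxicity", "non_advice", "misuse", "pii_leakage",
               "role_violation"],
    "agentic": ["task_completion", "tool_correctness", "argument_correctness",
                "step_efficiency", "plan_adherence", "plan_quality"],
    "multi_turn": ["turn_relevancy", "role_adherence", "knowledge_retention",
                   "conversation_completeness", "goal_accuracy", "tool_use",
                   "topic_adherence", "turn_faithfulness",
                   "turn_contextual_precision", "turn_contextual_recall",
                   "turn_contextual_relevancy"],
}

# Fields each category needs, taken from the notes above.
REQUIRED_FIELDS = {
    "rag": ["context"],
    "multi_turn": ["conversation_history"],
    "agentic": ["tool_calls"],
}

CATEGORY_OF = {name: cat for cat, names in SCORER_CATEGORIES.items() for name in names}


def missing_fields(payload):
    """Map each requested built-in scorer to the required fields it lacks."""
    problems = {}
    for scorer in payload.get("scorers", []):
        category = CATEGORY_OF.get(scorer)  # None for custom scorer IDs
        needed = REQUIRED_FIELDS.get(category, [])
        missing = [field for field in needed if not payload.get(field)]
        if missing:
            problems[scorer] = missing
    return problems


payload = {
    "input": "What is your refund policy?",
    "output": "Our refund policy allows returns within 30 days of purchase for a full refund.",
    "context": ["Refund Policy Doc: Returns accepted within 30 days of purchase for full refund. Excludes sale items."],
    "scorers": ["faithfulness", "bias", "turn_relevancy"],
}
print(missing_fields(payload))
```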